
Translating Item Analyses into Meaningful Discussions

January 7, 2016

As testing professionals, we often communicate statistical information about the items on examination forms to stakeholders involved in the examination process. This information covers a range of topics, from item difficulty and discrimination to simple multiple-choice option frequencies. While it is clear to those of us in the field of certification and licensure what these statistical properties mean, they can be less clear to stakeholders who encounter them only during an exam committee meeting or a board meeting. It is our responsibility as testing professionals to demystify these numbers and lead a meaningful discussion of the relevance and importance of the data that examinations yield.

Examinations produce a wealth of data that can help stakeholders improve their items and examinations. While examinations can take many forms, for the purposes of this discussion let's focus on multiple-choice items. Data from multiple-choice items can be statistically analyzed at both the item level and the exam level; this post focuses on item-level analyses. More specifically, the statistics typically associated with item review are item difficulty, item discrimination, and option frequencies.

Item difficulty, or the p-value of an item, helps us understand how hard or easy an item has been in the past. This value is reported as a proportion, but it is often explained as the percentage of examinees who answered the item correctly. This can be useful when reviewing data from a past administration. More specifically, examining item difficulty after each exam administration allows us to see possible changes in the difficulty of an item. For example, let's say item 3469 has historically performed well and, on average, examinees have answered it correctly 70% of the time. However, an item analysis conducted after the most recent administration shows that only 20% of examinees are now answering the item correctly. This drop from 70% to 20% correct indicates that something is going on with this item, and it should be reviewed by subject matter experts (SMEs) to determine why candidates have suddenly stopped answering it correctly. By flagging the item and using the data to start a discussion, SMEs may be able to determine whether the content has changed since the item was first published, whether new references have been published that make a different option the correct answer, or whether the item was simply mis-keyed. All of these are reasons to have a psychometrician facilitate a review with SMEs.
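For readers who want to see the arithmetic, the sketch below shows one way this p-value check might be computed. It is a minimal illustration in Python: the response data, the second item ID, and the 0.20-point flagging threshold are made up for this example, not taken from an actual program.

```python
# Minimal sketch: compute item p-values from scored (0/1) responses and flag
# items whose current p-value has dropped sharply from its historical value.
# Item IDs, responses, and the 0.20 drop threshold are illustrative only.

scored_responses = {
    # item_id: list of 0/1 scores, one per examinee
    "item_3469": [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "item_1042": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
}

historical_p_values = {"item_3469": 0.70, "item_1042": 0.85}

def p_value(scores):
    """Proportion of examinees answering the item correctly."""
    return sum(scores) / len(scores)

for item_id, scores in scored_responses.items():
    current_p = p_value(scores)
    drop = historical_p_values[item_id] - current_p
    flag = "REVIEW" if drop >= 0.20 else "ok"
    print(f"{item_id}: p = {current_p:.2f} "
          f"(historical {historical_p_values[item_id]:.2f}) {flag}")
```

In this made-up data, item_3469 drops from 0.70 to 0.20 and is flagged for SME review, mirroring the example above.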

Item discrimination, or the point-biserial correlation, indicates the relationship between examinees' overall performance on the exam and whether they answered the item correctly. This, too, is useful when reviewing data from past exam administrations. By examining the discrimination values of items used on past forms, we can determine whether each multiple-choice item actually distinguishes candidates who have the knowledge we are assessing from those who do not. We want candidates to answer an item correctly because they have the knowledge, not because they were able to deduce the correct answer from clues we may have given them in the item. When an item discrimination value is low, or more importantly negative, it tells us that examinees who are doing poorly on the exam, or failing it, are getting the item correct more often than those who are doing well. This is the opposite of what we would expect: those who have the knowledge should be able to answer the item correctly, and those who do not should not. This typically happens for one of two reasons: either there is something correct in a distractor that is pulling in examinees who have the knowledge, or the distractors are so implausible that examinees without the knowledge can eliminate them and deduce the correct answer from the options that remain. For example, let's say item 2388 has an item discrimination value of -0.15 after the most recent exam administration. This negative value indicates that, among the candidates who answered the item correctly, more are people who performed poorly on the exam overall. We want to see the opposite: the people answering the question incorrectly should be the people who are not doing well on the exam overall, since that indicates they do not have the knowledge needed to pass. A psychometrician would therefore flag this item for review with SMEs. Further digging might also show that no examinees are selecting some of the options, say C and D. This could indicate to SMEs that these options are not plausible in the given context and are therefore helping examinees without the knowledge figure out what the correct answer could be.
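The point-biserial statistic itself is simply the Pearson correlation between the 0/1 item score and examinees' total scores, often with the item's own point removed from the total. Below is a minimal sketch, assuming made-up scores for a handful of examinees on item 2388; it is meant only to show how a negative discrimination value can arise, not to reproduce any real analysis.

```python
# Minimal sketch: point-biserial discrimination for one item, computed as the
# Pearson correlation between the 0/1 item score and each examinee's total
# score on the rest of the exam. Data below are made up for illustration.
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# One row per examinee: 0/1 score on item 2388 and total score on the exam.
item_scores  = [1, 1, 0, 0, 1, 0, 1, 0]
total_scores = [42, 38, 88, 91, 45, 84, 40, 79]

# Exclude the item itself from the total so the item is not correlated with itself.
rest_scores = [t - i for t, i in zip(total_scores, item_scores)]

r_pb = pearson(item_scores, rest_scores)
print(f"point-biserial for item 2388: {r_pb:.2f}")  # negative here: flag for SME review
```

In this illustrative data, the examinees answering the item correctly are the ones with low total scores, so the correlation comes out negative and the item would be flagged.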

Hand in hand with item difficulty and discrimination are the multiple-choice option frequencies. These frequencies are simply the percentage of candidates who chose each answer option, and they are often used in direct relation with both item difficulty and discrimination. Option frequencies allow us to compare all of the answer options and investigate why candidates chose each one. For example, in the item discrimination example above, item 2388 had a negative discrimination value of -0.15, and we discovered that no examinees were selecting options C and D. We were able to determine this because we reviewed the option frequencies. We can also compare frequencies across answer options; this comparison often helps us determine why someone might think a distractor was more correct than the keyed answer. Perhaps upon further review we find that a frequently chosen distractor is not completely correct, but contains something that examinees who are doing well on the exam might think is correct.
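Option frequencies are straightforward to tabulate from raw response data. The sketch below uses a made-up response set for item 2388 and an assumed key of A, purely to show how a frequency table can reveal that options C and D are drawing no examinees at all.

```python
# Minimal sketch: option (distractor) frequencies for one multiple-choice item.
# Raw responses and the keyed answer are illustrative, not real exam data.
from collections import Counter

responses = ["A", "B", "A", "B", "B", "A", "A", "B", "A", "B", "A", "A"]  # item 2388
key = "A"
options = ["A", "B", "C", "D"]

counts = Counter(responses)
n = len(responses)

for option in options:
    pct = 100 * counts.get(option, 0) / n
    marker = " (key)" if option == key else ""
    print(f"Option {option}: {pct:5.1f}%{marker}")
# Options C and D at 0% suggest they are not functioning as plausible distractors.
```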

The examples I have provided are only the beginning of the many item and test analyses that can be useful in improving exam items. Analyzing examination data and using that information to improve items helps maintain the integrity of examination programs by ensuring that examinations actually measure what we intend them to measure. However, analyzing data from exam administrations is only useful when it is coupled with explanations of the analysis that make sense to all stakeholders. An item analysis is only the first step in improving examination items; we must first help stakeholders understand the analysis before they can use the information to make informed decisions.
