Monday, November 30, 2015

What Makes a Test Item Good?

When I was a graduate student grading statistics tests, I learned that the software we were using produced some interesting statistics about each item. For example, it could tell you what percentage of students answered each item correctly and what proportion of students selected each answer choice. Without any formal training in testing theory, it was a mystery to me what this information meant at a practical level. There was one question, for example, that only 4% of the students got correct. Did this mean that it was a bad item? How could you tell whether the item was bad or simply difficult? For another question, the majority of the students selected one of the distractors. Was this a good item? Or was it just tricky? Now, after some training, I realize that testing theory has the answers.

When trying to determine whether an item is good or bad, the main deciding factor is the item’s ability to discriminate between students who have mastered the material and those who have not. Simply put, bad items do not discriminate. The key is to look at who is getting the item correct and who is getting it wrong. If only the high-performing students are getting the answer right, then the item is simply difficult. On the other hand, if mid- and low-performing students are also getting the answer correct, or if they are the only ones getting it correct while the high-performing students are getting it wrong, then there is something wrong with the item. With these patterns, the item is not doing what it is intended to do, which is to measure mastery of the content.

So how do you figure out who is getting the item correct? One of the most accepted ways of evaluating an item is to calculate a correlation. The technical term for the kind of correlation you are calculating is a point biserial. Instead of correlating two variables that are both on a continuous scale, you are correlating one variable on a continuous scale with one that has only two possible values: correct or incorrect. At a high level, you are correlating the response on a single question with each student’s overall test score. The overall test score gives you an indication of whether the student is high-performing or low-performing. If an item is functioning well, then it will correlate with the overall test scores of your students.
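To make this concrete, here is a minimal sketch of the calculation in Python. The function name and example data are mine, not a standard library routine: the point biserial is just the Pearson correlation between a 0/1 item score and the total test score, and it can be computed directly from the mean total score of the students who got the item right versus those who got it wrong.

```python
from statistics import mean, pstdev

def point_biserial(item_correct, total_scores):
    """Correlation between a 0/1 item score and the total test score.

    item_correct: list of 0s and 1s, one per student (1 = answered correctly).
    total_scores: each student's overall test score, in the same order.
    Assumes at least one student got the item right, at least one got it
    wrong, and that the total scores are not all identical.
    """
    p = mean(item_correct)        # proportion of students who got the item right
    q = 1 - p                     # proportion who got it wrong
    s = pstdev(total_scores)      # population SD of the total scores
    # Mean total score of students who got the item right vs. wrong.
    m1 = mean(t for c, t in zip(item_correct, total_scores) if c == 1)
    m0 = mean(t for c, t in zip(item_correct, total_scores) if c == 0)
    return (m1 - m0) / s * (p * q) ** 0.5

# Here the three highest scorers got the item right, so the
# point biserial is strongly positive (about 0.96).
point_biserial([1, 1, 1, 0, 0, 0], [90, 85, 80, 60, 55, 50])
```

In practice, testing software often correlates the item with the total score excluding that item (a "corrected" point biserial) so that the item's own point does not inflate the correlation, but the idea is the same.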

Often scoring services such as GradeHub will provide point biserials for you.

A high point biserial reflects the fact that the item is doing a good job of discriminating your high-performing students from your low-performing students. Values for point biserials can range from -1.00 to 1.00. Values of 0.15 or higher mean that the item is performing well: generally, the students with high scores are answering it correctly and the students with low scores are getting it wrong. Point biserials around zero indicate that there is no clear pattern. For example, some high-performing students may be getting the item right, but so are some low-performing students.

Point biserials that are negative signify a big problem. With this pattern, the high-performing students are getting the answer wrong and the low- and/or mid-performing students are getting it right. This pattern is the complete opposite of what makes an item good. Researchers have recommended removing items that have a negative point biserial (Kaplan & Saccuzzo, 2013) or even a point biserial below 0.15 (Varma, 2006). Clearly, deleting half the items on your test because of negative point biserials is not a desirable outcome, but awareness of item discrimination issues can help you make some tough decisions about what cutoff you want to choose and can help you improve future tests.
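Putting those recommendations into practice, a small helper can flag the items that fall below whatever cutoff you settle on. This is an illustrative sketch (the function name is mine), defaulting to the 0.15 cutoff suggested by Varma (2006):

```python
def flag_items(point_biserials, cutoff=0.15):
    """Return the indices of items whose point biserial falls below the cutoff.

    point_biserials: one point-biserial value per test item, in item order.
    """
    return [i for i, r in enumerate(point_biserials) if r < cutoff]

# Item 1 (near zero) and item 2 (negative) are flagged for review.
flag_items([0.45, 0.02, -0.20, 0.31])  # → [1, 2]
```

Flagged items are candidates for review, not automatic deletion: a near-zero value may just mean the item needs a rewritten distractor, while a clearly negative value usually points to a keying error or a genuinely misleading question.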

References
Kaplan, R. M., & Saccuzzo, D. P. (2013). Psychological testing: Principles, applications, and issues (8th ed.). Belmont, CA: Cengage.

Varma, S. (2006). Preliminary item statistics using point-biserial correlation and p-values. Morgan Hill, CA: Educational Data Systems, Inc.

Research Does Not Support Multiple Choice Items with "All of the Above"

Although writing items that include a choice of “all of the above” is a common practice, research has suggested that this type of multiple choice item should be avoided. Here is why: the correct answer for this type of item is so often “all of the above,” that students have an increased probability of selecting the correct answer because of good test-taking abilities as opposed to knowledge of the material on the test. Students can select “all of the above” without even reading the question and have a high chance of getting the answer right (Mueller, 1975; Poundstone, 2014).

Even if the answer is not “all of the above,” the student need only find one choice that is not true to eliminate “all of the above” as the correct option.

There is also some evidence that items with “all of the above” have a decreased ability to discriminate high-performing students from low-performing students (Mueller, 1975). In a review of 20 textbooks listing advice about item writing, avoiding “all of the above” was the most frequently mentioned piece of advice (Kaplan & Saccuzzo, 2013).

References
Kaplan, R. M., & Saccuzzo, D. P. (2013). Psychological testing: Principles, applications, and issues (8th ed.). Belmont, CA: Cengage.

Mueller, D. J. (1975). An assessment of the effectiveness of complex alternatives in multiple choice achievement test items. Educational and Psychological Measurement, 35, 135-141.

Poundstone, W. (2014). Rock breaks scissors: A practical guide to outguessing and outwitting almost everybody. New York: Little, Brown and Company.