Monday, November 30, 2015

What Makes a Test Item Good?

When I was a graduate student grading statistics tests, I learned that the software we were using produced some interesting statistics about each item. For example, it could tell you what percentage of students got an answer right and what proportion of students selected each choice for each item. Without any formal training in testing theory, I found it a mystery what this information meant at a practical level. There was one question, for example, that only 4% of the students got correct. Did this mean that it was a bad item? How could you tell whether the item was bad or whether it was simply difficult? For another question, the majority of the students selected one of the distractors. Was this a good item? Or was it just tricky? Now, after some training, I realize that testing theory has the answers.

When trying to determine whether an item is good or bad, the main deciding factor is the item’s ability to discriminate students who have mastered the material from those who have not. Simply put, bad items do not discriminate. The key is to look at who is getting the item correct and who is getting it wrong. If only the high-performing students are getting the answer right, then the item is simply difficult. On the other hand, if mid- and low-performing students are also getting the answer correct, or if they are the only ones getting the item correct while the high-performing students are getting it wrong, then something is wrong with the item. With these patterns, the item is not doing what it is intended to do, which is to measure mastery of the content.

So how do you figure out who is getting the item correct? One of the most accepted ways of evaluating an item is to calculate a correlation. The technical term for the kind of correlation you are calculating is a point biserial, a special kind of correlation. Instead of correlating two things that are both on a continuous scale, you are correlating one thing that is on a continuous scale with one thing that has only two possible values: correct or incorrect. At a high level, you are correlating responses on a single question with the students’ overall test scores. The overall test score gives you an indication of whether a student is high-performing or low-performing. If an item is functioning well, then it will correlate with the overall test scores of your students.

Often scoring services such as GradeHub will provide point biserials for you.
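If yours does not, the calculation is straightforward to do yourself. Here is a minimal sketch in Python, assuming a simple 0/1 scoring matrix with one row per student and one column per item (the function and variable names are just illustrative):

```python
import numpy as np

def point_biserials(responses):
    """Correlate each item (scored 0/1) with the overall test score.

    The point biserial is mathematically just a Pearson correlation in
    which one of the two variables happens to be dichotomous.
    """
    responses = np.asarray(responses, dtype=float)
    totals = responses.sum(axis=1)            # each student's overall score
    stats = {}
    for item in range(responses.shape[1]):
        item_scores = responses[:, item]
        if item_scores.std() == 0:            # everyone right (or everyone wrong)
            stats[item] = float("nan")        # correlation is undefined
        else:
            stats[item] = np.corrcoef(item_scores, totals)[0, 1]
    return stats

# Example: five students, three items (1 = correct, 0 = incorrect)
answers = [[1, 1, 0],
           [1, 0, 0],
           [1, 1, 1],
           [0, 0, 1],
           [1, 1, 1]]
print(point_biserials(answers))
```

Some scoring tools report a “corrected” point biserial that correlates each item with the total score excluding that item; the simpler version sketched above matches the description in this post.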

A high point biserial indicates that the item is doing a good job of discriminating your high-performing students from your low-performing students. Values for point biserials can range from -1.00 to 1.00. Values of 0.15 or higher generally mean that the item is performing well: the students with high scores are answering it correctly and the students with low scores are getting it wrong. Point biserials around zero indicate that there is no clear pattern. For example, some high-performing students may be getting the item right, but so are some low-performing students.

Point biserials that are negative signify a big problem. With this pattern, the high-performing students are getting the answer wrong while the low- and/or mid-performing students are getting it right. This is the complete opposite of what makes an item good. Researchers have recommended removing items that have a negative point biserial (Kaplan & Saccuzzo, 2013) or even a point biserial below 0.15 (Varma, 2006). Clearly, deleting half the items on your test because of negative point biserials is not a desirable outcome, but awareness of item discrimination issues can help you make some tough decisions about what cut-off to use and can help you improve future tests.
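If you want to turn those recommendations into a rough triage of your items, here is a small sketch (the 0.15 cut-off is the one suggested by Varma, 2006; the bucket names and the example values are just for illustration):

```python
import math

def triage_items(pbis, cutoff=0.15):
    """Sort items into rough buckets based on their point biserials."""
    remove, review, keep = [], [], []
    for item, r in pbis.items():
        if math.isnan(r):
            review.append(item)   # undefined (no variance): inspect by hand
        elif r < 0:
            remove.append(item)   # negative discrimination: candidate for removal
        elif r < cutoff:
            review.append(item)   # weak discrimination: worth a closer look
        else:
            keep.append(item)
    return remove, review, keep

# Example with made-up point biserials for four items
print(triage_items({0: 0.42, 1: 0.08, 2: -0.21, 3: float("nan")}))
# -> ([2], [1, 3], [0])
```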

References
Kaplan, R. M., & Saccuzzo, D. P. (2013). Psychological testing: Principles, applications and issues (8th ed.). Belmont, CA: Cengage.

Varma, S. (2006). Preliminary item statistics using point-biserial correlation and p-values. Morgan Hill, CA: Educational Data Systems, Inc.

Research Does Not Support Multiple Choice Items with "All of the Above"

Although writing items that include a choice of “all of the above” is a common practice, research has suggested that this type of multiple choice item should be avoided. Here is why: the correct answer for this type of item is so often “all of the above” that students have an increased probability of selecting the correct answer because of good test-taking abilities rather than knowledge of the material on the test. Students can select “all of the above” without even reading the question and have a high chance of getting the answer right (Mueller, 1975; Poundstone, 2014).

Even if the answer is not “all of the above,” the student need only find one choice that is not true to eliminate “all of the above” as the correct option.

There is also some evidence that items with “all of the above” have a decreased ability to discriminate high-performing students from low-performing students (Mueller, 1975). In a review of 20 textbooks listing advice about item writing, avoiding “all of the above” was the most frequently mentioned piece of advice (Kaplan & Saccuzzo, 2013).

References
Kaplan, R. M., & Saccuzzo, D. P. (2013). Psychological testing: Principles, applications and issues (8th ed.). Belmont, CA: Cengage.

Mueller, D. J. (1975). An assessment of the effectiveness of complex alternatives in multiple choice achievement test items. Educational and Psychological Measurement, 35, 135-141.

Poundstone, W. (2014). Rock breaks scissors: A practical guide to outguessing and outwitting almost everybody. New York: Little, Brown and Company.

Tuesday, June 2, 2015

Fairness

The third factor related to the quality of test scores is fairness. If test items, or the test as a whole, are biased against (or in favor of) a person or subgroup for reasons unrelated to what is being tested, then the test may be considered unfair. The technical term for this is construct-irrelevant variance, meaning there is extra noise in the test scores caused by something other than the skills and knowledge you are trying to measure.

For example, a test may include items that reference topics that are taboo for a specific subgroup of test takers, which can distract those students and lead them to miss answers they would otherwise get right. In this case, the test has introduced construct-irrelevant variance.

Another example is when items include unduly complicated language beyond the technical vocabulary needed to demonstrate content mastery. In this case, the test is not only measuring knowledge of the content domain but also proficiency in the English language. In some cases, these kinds of items may be biased against non-native speakers.

Fairness also includes providing reasonable accommodations for students with documented disabilities. Usually, guidelines provided by your institution can help you navigate these situations, but general awareness of the topic can help you know what to expect.

Monday, June 1, 2015

Reliability

One indicator of test quality is reliability. Reliability is how consistent test scores are. Ideally, if a student gets an A on your exam on Monday, then the student should get an A on the same test on Tuesday (assuming that the student’s knowledge of the material remains constant). Granted, there are a variety of reasons why the student might not get an A on Tuesday that have nothing to do with knowledge of the material and everything to do with the student’s internal mental state, such as not getting enough sleep or being distracted by an upcoming vacation. But assuming that the student’s knowledge of the material is constant AND the student is in an identical mental state on both days, scores could still differ due to measurement error inherent in the test itself. Some tests are simply more reliable than others.

It is well known that the length of a test contributes to reliability. A test with only a few items can gather only a small amount of information regarding what the student does and does not know. A test with more items or questions allows you to gather more information and therefore will most likely be more reliable. In fact, there is a mathematical relation between the number of items a test has and the reliability of that test (Brown, 1910; Cronbach, 1951; Spearman, 1910). A similar concept applies to the number of elements that are used to determine a final grade (such as grades from papers, scores on tests, completed assignments). Centers of teaching and learning at universities encourage instructors to consider enough elements to ensure that a final grade has a high degree of precision (http://cte.illinois.edu/testing/exam/course_grades.html).
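To give a sense of that mathematical relation, here is a minimal sketch of two classic formulas: the Spearman-Brown prophecy formula (from the 1910 Spearman and Brown papers cited above), which predicts how reliability changes when you lengthen a test, and Cronbach’s alpha (Cronbach, 1951), which estimates reliability from the item scores themselves. The numbers in the example are made up for illustration:

```python
import numpy as np

def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by `length_factor`.

    For example, length_factor=2 doubles the number of comparable items.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

def cronbach_alpha(responses):
    """Cronbach's alpha: an internal-consistency estimate of reliability."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                          # number of items
    item_vars = responses.var(axis=0, ddof=1)       # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# A quiz with reliability 0.60, doubled in length with comparable items:
print(round(spearman_brown(0.60, 2), 2))    # -> 0.75

# Alpha for a tiny 0/1 scoring matrix (rows = students, columns = items):
answers = [[1, 1, 0, 1],
           [1, 0, 0, 0],
           [1, 1, 1, 1],
           [0, 0, 1, 0],
           [1, 1, 1, 1]]
print(round(cronbach_alpha(answers), 2))
```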

Another factor influencing reliability is consistency in scoring. This issue arises most often when scoring essays, where the amount of credit given depends on the scorer’s subjective judgment. A scorer could be inconsistent and assign a paper an A one day and the same paper a B the next day. Also, if multiple people are scoring a test, two graders could give different scores to the exact same response. All these differences within individual graders and between graders degrade the reliability of a test.
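One rough way to gauge consistency between graders, assuming you can have two graders score the same set of essays, is to look at how closely their scores agree. A small sketch with made-up scores:

```python
import numpy as np

# Made-up scores from two graders for the same six essays
grader_a = np.array([88, 92, 75, 60, 95, 81])
grader_b = np.array([85, 90, 78, 55, 96, 84])

# Correlation between the two graders: values near 1.0 mean they rank the
# essays similarly; noticeably lower values point to inconsistent scoring.
print(round(np.corrcoef(grader_a, grader_b)[0, 1], 2))

# Average absolute difference: how far apart the two graders are, in points.
print(round(float(np.mean(np.abs(grader_a - grader_b))), 1))
```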

There is actually a significant amount of interplay between reliability and validity. Basically, no matter how much work you invest in ensuring that the content of the test matches your content domain and that the tasks are measuring the construct of interest, if your test cannot provide consistent scores, then your test will lack evidence of validity. You cannot have a valid test with low reliability. In another scenario, you could have a test with extremely high reliability, but this does not guarantee that the test is valid. You need evidence of both validity and high reliability for a test to be considered good quality.

In order to create an exam that is reliable, then, the challenge is to create an exam with enough items to provide a sufficient amount of information about the student’s mastery of the content and to ensure consistency in scoring.

Thursday, May 28, 2015

Validity

According to the Standards (AERA, APA & NCME, 2014), validity is “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.” In other words, validity is how well a test is measuring what it is intended to measure. For example, if you are intending to measure students’ knowledge of the Cold War and you give students a test about the Civil War, clearly you cannot make conclusions about the students’ understanding of the Cold War from the test scores. In this case, the test lacks evidence of validity.

The formal definition in the Standards is a bit long because assessment professionals think that it is misleading to say that a test is valid. First, the test is simply a vehicle for helping you to understand what is inside your students’ heads. It is not really the test that is of interest, but rather the test scores. In fact, it is not even the test scores themselves that you should be concerned with, but rather the meaning underlying those scores. That is why the formal definition refers to interpretations of test scores. Second, it is not responsible just to declare that something is valid. It is more scientific to consider evidence. So when assessment professionals talk about validity, they discuss the quantity and quality of evidence that supports interpreting test scores for an intended purpose.

Regardless of these nuances in the definition of validity, the point is that validity is a critical factor when determining whether or not a test is of good quality, and this concept applies to any test and to any score user, regardless of whether that person is an assessment professional or classroom instructor.

Construct Validity
There are different kinds of validity evidence. One kind that is relevant for academic exams is how well you are measuring the construct you are trying to measure. A construct is a mental concept or characteristic, such as achievement in basic chemistry. Unfortunately, a construct is not something physical that you can measure with a ruler or weigh on a scale. Instead, you have to probe what is inside your students’ heads indirectly by having them answer questions, write essays, or perform tasks. What becomes tricky is determining whether the things you are asking your students to do for an exam actually measure the construct of interest. For the most part, you will be interested in measuring achievement of skills and knowledge in a specific content domain – that’s your construct. Common ways of measuring academic achievement are selected-response tests (e.g., multiple choice) and essays. These approaches are widely accepted, so you can rest assured that partaking in the tradition of administering these types of exams affords you some evidence of construct validity.

However, problems can arise when a test does not sufficiently measure a construct. For example, if a test is intended to measure driving ability but includes only a written component, then the test is not measuring the individual’s skill of actually driving a moving vehicle and suffers from construct underrepresentation. The content of the topics covered in a driving class may be well represented on the written test, but the written test by itself will still lack evidence of construct validity without a performance component. In some cases, a multiple-choice exam may not be able to show what a student knows and is capable of as well as an essay. Selection of the style of test influences evidence of construct validity.

Content Validity
Another kind of validity evidence is based on test content. Is your test covering the right material? Students should be given the opportunity to learn the content covered on a test before the test is given. Many students challenge exams that do not reflect what was covered in class or what the professor conveyed were the most important concepts. Such exams lack evidence of content validity, and that is a legitimate complaint. Fortunately, it is a problem that can be avoided by ensuring that the exam covers the appropriate content.

Thankfully, you can make good decisions during the test development process that can help you to avoid threats to both content and construct validity.

Sensitivity
The quality of the items in a test impacts validity as well. If the exam is riddled with ambiguous questions that high-performing students get wrong and low-performing students get right, then the test is measuring something other than the material covered in the course and will lack evidence of validity.

The test needs to be able to correctly identify those students who have mastered your content and distinguish them from those students who have not. What would be the point of testing if the test cannot tell you who should get an A and who should get an F? In the medical field, this concept of measuring the degree to which a test can correctly identify something of interest is called sensitivity. Of course, the term is mostly used for tests that can correctly detect a disease or medical condition. But the concept extends to educational testing as well. Which students should pass, and which students should not?
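To make the borrowed concept concrete: in the medical sense, sensitivity is simply the proportion of true “positives” that a test catches. A toy sketch, with hypothetical mastery labels (in practice you rarely have such a gold standard for your own students):

```python
def sensitivity(passed_test, truly_mastered):
    """Proportion of students who truly mastered the material that the test passes."""
    true_positives = sum(p and m for p, m in zip(passed_test, truly_mastered))
    false_negatives = sum((not p) and m for p, m in zip(passed_test, truly_mastered))
    return true_positives / (true_positives + false_negatives)

# Six students: did they pass the exam, and did they truly master the content?
passed   = [True, True, False, True, False, False]
mastered = [True, True, True, False, False, False]
print(round(sensitivity(passed, mastered), 2))   # 2 of the 3 masters passed -> 0.67
```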

In sum, a test that has sufficient evidence of validity will cover the appropriate content; it will do so in a relevant way; and it will be able to discriminate students who have mastered the content from those who have not.

Wednesday, May 20, 2015

What Makes a Test Good?

There are three essential factors that contribute to how good or bad a test is. These factors apply to all tests regardless of whether they are large-scale standardized tests or small custom tests created for a course. The three factors of good quality are validity, reliability, and fairness.

How do we know that these are the factors that are important? Well, three professional organizations came together to define formal criteria for developing tests and testing practices (among other things). The three organizations were the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). These sponsoring organizations published a book entitled Standards for Educational and Psychological Testing (2014), referred to as the Standards. In the Standards, the three concepts that are defined as foundational for testing are validity, reliability, and fairness.

The authors fully acknowledge that many of the practices described in the Standards are not feasible in the classroom because, after all, instructors are not publishing their tests for public use. However, they do state that “the core expectations of validity, reliability/precision, and fairness should be considered in the development of such tests” (p. 183).

In academics, usually the main purpose of giving an exam is to determine a student’s level of mastery of the material and skills covered in a course. Although tests can be used for many other reasons (licensure, placement, progress monitoring), the focus here will be on measuring content mastery, with the ultimate goal of helping to assign grades.

Therefore, with the goal of content mastery in mind, the requirements of a test are quite simple. First, the test needs to include relevant material and skills from the course (validity). Second, it needs to be able to discriminate those students who have acquired the relevant knowledge, skills, and abilities from those who have not (validity). Third, it needs to be consistent and not award different grades to two people with the same knowledge (reliability). Finally, it needs to be equally accessible to all students enrolled in the course and free from bias (fairness). If a test meets these requirements, then an instructor has evidence that the test is good and defensible.