One indicator of test quality is reliability: how consistent test scores are. Ideally, if a student earns an A on your exam on Monday, that student should earn an A on the same exam on Tuesday (assuming the student’s knowledge of the material remains constant). Granted, the student might not get an A on Tuesday for reasons unrelated to knowledge of the material, such as not getting enough sleep or being distracted by an upcoming vacation. But even if the student’s knowledge is constant AND the student is in an identical mental state on both days, scores could still differ because of measurement error inherent in the test itself. Some tests are simply more reliable than others.
It is well known that the length of a test contributes to reliability. A test with only a few items can gather only a small amount of information regarding what the student does and does not know. A test with more items or questions allows you to gather more information and therefore will most likely be more reliable. In fact, there is a mathematical relation between the number of items a test has and the reliability of that test (Brown, 1910; Cronbach, 1951; Spearman, 1910). A similar concept applies to the number of elements that are used to determine a final grade (such as grades from papers, scores on tests, completed assignments). Centers of teaching and learning at universities encourage instructors to consider enough elements to ensure that a final grade has a high degree of precision (http://cte.illinois.edu/testing/exam/course_grades.html).
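The relation between test length and reliability cited above (Brown, 1910; Spearman, 1910) is commonly known as the Spearman–Brown prophecy formula. As a rough sketch (the function name and example values here are illustrative, not from the original), it predicts the reliability of a test lengthened by some factor:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test whose length is multiplied by
    `length_factor`, via the Spearman-Brown prophecy formula:
        r' = k * r / (1 + (k - 1) * r)
    where r is the current reliability and k is the length factor."""
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )

# Doubling a test with reliability 0.70:
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```

The formula captures the intuition in the paragraph above: adding items raises reliability, though with diminishing returns as reliability approaches 1.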
Another factor influencing reliability is consistency in scoring. This issue arises most often when scoring essays, where the amount of credit given depends on the scorer’s subjective judgment. A scorer could be inconsistent, assigning a paper an A one day and the same paper a B the next. Likewise, if multiple people are scoring a test, two graders could give different scores to the exact same response. These inconsistencies, both within individual graders and between graders, degrade the reliability of a test.
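One standard way to quantify agreement between two graders (not discussed in the original, but widely used for this purpose) is Cohen’s kappa, which corrects raw percent agreement for agreement expected by chance. A minimal sketch, with hypothetical grade lists:

```python
def cohens_kappa(grades_a, grades_b):
    """Cohen's kappa for two graders' categorical scores:
        kappa = (p_o - p_e) / (1 - p_e)
    where p_o is observed agreement and p_e is chance agreement."""
    assert len(grades_a) == len(grades_b)
    n = len(grades_a)
    # Observed agreement: proportion of responses graded identically.
    p_o = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    # Chance agreement: product of each grader's marginal proportions.
    categories = set(grades_a) | set(grades_b)
    p_e = sum(
        (grades_a.count(c) / n) * (grades_b.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two graders score the same four essays.
grader_1 = ["A", "B", "A", "C"]
grader_2 = ["A", "B", "B", "C"]
print(round(cohens_kappa(grader_1, grader_2), 2))  # 0.64
```

A kappa near 1 indicates the graders are nearly interchangeable; a kappa near 0 means their agreement is no better than chance, a sign that scoring criteria need tightening.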
There is considerable interplay between reliability and validity. No matter how much work you invest in ensuring that the content of the test matches your content domain and that the tasks measure the construct of interest, if your test cannot provide consistent scores, it will lack evidence of validity. You cannot have a valid test with low reliability. Conversely, a test could have extremely high reliability without being valid. In short, reliability is necessary but not sufficient: for a test to be considered good quality, you need evidence of both validity and high reliability.
The challenge in creating a reliable exam, then, is to include enough items to provide sufficient information about the student’s mastery of the content and to ensure consistency in scoring.