Thursday, May 28, 2015

Validity

According to the Standards (AERA, APA & NCME, 2014), validity is “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.” In other words, validity is how well a test is measuring what it is intended to measure. For example, if you are intending to measure students’ knowledge of the Cold War and you give students a test about the Civil War, clearly you cannot make conclusions about the students’ understanding of the Cold War from the test scores. In this case, the test lacks evidence of validity.

The formal definition in the Standards is a bit long because assessment professionals think that it is misleading to say that a test is valid. First, the test is simply a vehicle for helping you to understand what is inside your students’ heads. It is not really the test that is of interest, but rather the test scores. In fact, it is not even the test scores themselves that you should be concerned with, but rather the meaning underlying those scores. That is why the formal definition refers to interpretations of test scores. Second, it is not responsible just to declare that something is valid. It is more scientific to consider evidence. So when assessment professionals talk about validity, they discuss the quantity and quality of evidence that supports interpreting test scores for an intended purpose.

Regardless of these nuances in the definition of validity, the point is that validity is a critical factor when determining whether or not a test is of good quality, and this concept applies to any test and to any score user, regardless of whether that person is an assessment professional or classroom instructor.

Construct Validity
There are different kinds of validity evidence. One kind that is relevant for academic exams is how well you are measuring the construct you are trying to measure. A construct is a mental concept or characteristic, such as achievement in basic chemistry. Unfortunately, a construct is not something physical that you can measure with a ruler or weigh on a scale. Instead, you have to probe what is inside your students’ heads indirectly by having them answer questions, write essays, or perform tasks. What becomes tricky is determining whether the things you are asking your students to do on an exam actually measure the construct of interest. For the most part, you will be interested in measuring achievement of skills and knowledge in a specific content domain – that’s your construct. Common ways of measuring academic achievement are selected-response tests (e.g., multiple choice) and essays. These approaches are widely accepted, so following the tradition of administering these types of exams affords you some evidence of construct validity.

However, problems can arise when a test does not sufficiently measure a construct. For example, if a test is intended to measure driving ability but includes only a written component, then the test is not measuring the individual’s skill at actually driving a moving vehicle and suffers from construct underrepresentation. The content of the topics covered in a driving class may be well represented on the written test, but without a performance component the written test will still lack evidence of construct validity. Likewise, a multiple-choice exam may not reveal what a student knows and can do as well as an essay would. The choice of test format therefore influences evidence of construct validity.

Content Validity
Another kind of validity evidence is based on test content. Is your test covering the right material? Students should be given the opportunity to learn the content covered on a test before the test is given. Many students challenge exams that do not reflect what was covered in class or what the professor conveyed were the most important concepts. That complaint is legitimate: such exams lack evidence of content validity. Fortunately, it is a problem that can be avoided by ensuring that the exam covers the appropriate content.
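One practical safeguard, standard in test development though not detailed above, is a table of specifications: a blueprint that allocates exam items in proportion to the emphasis each topic received in class. Here is a minimal sketch in Python; the topics, weights, and exam length are all hypothetical.

```python
# Minimal sketch of a table of specifications (test blueprint).
# Topics and weights are hypothetical; each weight reflects the share
# of class time or emphasis that topic received.
blueprint = {
    "Origins of the Cold War": 0.30,
    "Containment and the Truman Doctrine": 0.25,
    "Cuban Missile Crisis": 0.25,
    "Detente and collapse of the USSR": 0.20,
}

total_items = 40  # planned length of the exam

for topic, weight in blueprint.items():
    n_items = round(weight * total_items)
    print(f"{topic}: {n_items} items")
```

Writing items against a blueprint like this, rather than writing whatever questions come to mind, leaves you with documented evidence of content validity.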

Thankfully, you can make good decisions during the test development process that can help you to avoid threats to both content and construct validity.

Sensitivity
The quality of the items in a test affects validity as well. If the exam is riddled with ambiguous questions that high-performing students get wrong and low-performing students get right, then the test is measuring something other than the material covered in the course and will lack evidence of validity.
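Item analysis can flag such questions. One classical statistic from measurement textbooks is the upper-lower discrimination index: the proportion of top scorers who answer an item correctly minus the proportion of bottom scorers who do. A negative value signals exactly the pathology described above. A minimal sketch with made-up response data:

```python
# Minimal sketch of an upper-lower item discrimination index.
# scores: each student's total test score; item: 1 if the student
# answered this particular item correctly, 0 otherwise (made-up data).
scores = [95, 90, 88, 85, 80, 60, 55, 50, 45, 40]
item   = [0,  0,  1,  0,  1,  1,  1,  0,  1,  1]

# Sort students by total score, then compare the top and bottom thirds.
ranked = sorted(zip(scores, item), reverse=True)
k = len(ranked) // 3
upper = [correct for _, correct in ranked[:k]]
lower = [correct for _, correct in ranked[-k:]]

d = sum(upper) / k - sum(lower) / k
print(f"Discrimination index: {d:+.2f}")  # negative -> item likely flawed
```

By convention, an index of roughly +0.3 or higher is considered acceptable; values near zero or negative suggest the item is ambiguous or mis-keyed and is eroding the validity of the total score.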

The test needs to be able to correctly identify those students who have mastered your content and distinguish them from those students who have not. What would be the point of testing if the test cannot tell you who should get an A and who should get an F? In the medical field, this concept of measuring the degree to which a test can correctly identify something of interest is called sensitivity. Of course, the term is mostly used for tests that can correctly detect a disease or medical condition. But the concept extends to educational testing as well. Which students should pass, and which students should not?
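In the medical sense, sensitivity is the true-positive rate: of the people who actually have the condition, what fraction does the test flag? Translated to a pass/fail classroom decision, a minimal sketch follows; the “true” mastery status here is hypothetical, since in practice it must be approximated by some external criterion.

```python
# Minimal sketch of sensitivity for a pass/fail exam decision.
# truly_mastered: whether the student has actually mastered the material
# (hypothetical; in practice approximated by an external criterion).
# passed_exam: whether the exam awarded a passing score.
students = [
    # (truly_mastered, passed_exam)
    (True,  True),
    (True,  True),
    (True,  False),  # false negative: mastery missed by the exam
    (False, False),
    (False, True),   # false positive: pass without mastery
    (False, False),
]

true_pos  = sum(1 for mastered, passed in students if mastered and passed)
false_neg = sum(1 for mastered, passed in students if mastered and not passed)

sensitivity = true_pos / (true_pos + false_neg)
print(f"Sensitivity: {sensitivity:.2f}")  # 2 / (2 + 1) = 0.67
```

The mirror-image statistic, specificity, asks the complementary question: of the students who have not mastered the material, what fraction does the exam correctly fail?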

In sum, a test that has sufficient evidence of validity will cover the appropriate content; it will do so in a relevant way; and it will distinguish students who have mastered the content from those who have not.

Wednesday, May 20, 2015

What Makes a Test Good?

There are three essential factors that contribute to how good or bad a test is. These factors apply to all tests regardless of whether they are large-scale standardized tests or small custom tests created for a course. The three factors of good quality are validity, reliability, and fairness.

How do we know that these are the factors that are important? Well, three professional organizations came together to define formal criteria for developing tests and testing practices (among other things). The three organizations were the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). These sponsoring organizations published a book entitled Standards for Educational and Psychological Testing (2014), referred to as the Standards. In the Standards, the three concepts that are defined as foundational for testing are validity, reliability, and fairness.

The authors fully acknowledge that many of the practices described in the Standards are not feasible in the classroom because, after all, instructors are not publishing their tests for public use. However, they do state that “the core expectations of validity, reliability/precision, and fairness should be considered in the development of such tests” (p. 183).

In academics, usually the main purpose of giving an exam is to determine a student’s level of mastery of the material and skills covered in a course. Although tests can be used for many other reasons (licensure, placement, progress monitoring), the focus here will be on measuring content mastery, with the ultimate goal of helping to assign grades.

Therefore, with the goal of content mastery in mind, the requirements of a test are quite simple. First, the test needs to include relevant material and skills from the course (validity). Second, it needs to distinguish those students who have acquired the targeted knowledge, skills, and abilities from those who have not (validity). Third, it needs to be consistent and not award different grades to two people with the same knowledge (reliability). Finally, it needs to be equally accessible to all students enrolled in the course and free from bias (fairness). If a test meets these requirements, then an instructor has evidence that the test is good and defensible.
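Of these requirements, reliability in particular is routinely quantified. One standard internal-consistency estimate, not discussed in this post but worth knowing, is Cronbach’s alpha: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. A minimal sketch with made-up item scores:

```python
from statistics import pvariance

# Minimal sketch of Cronbach's alpha with made-up data.
# Rows are students, columns are item scores (e.g., 0/1 for right/wrong).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

k = len(responses[0])                                  # number of items
item_vars = [pvariance(col) for col in zip(*responses)]
total_var = pvariance([sum(row) for row in responses])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")  # about 0.70 for this data
```

Values above roughly 0.7 are conventionally taken as adequate for classroom exams, though the appropriate threshold depends on the stakes of the decision.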

Purpose of Blog

The purpose of this blog is to help instructors and teaching assistants at academic institutions create quality exams. This is part of a larger project, which will eventually include a book. Given that it will take quite a while to actually publish a book, I wanted to disseminate this information sooner rather than later, because the benefits of creating quality exams are significant and straightforward. With better exams, instructors can more accurately assess student knowledge, reduce student complaints, and avoid lawsuits.

A quick overview of how to develop effective tests is needed because professors and TAs are hired or admitted into their departments for their expertise in their fields and are often never exposed to testing theory and its applications. I certainly never was, and yet as a teaching assistant I was given the responsibility of creating exam questions and scoring exams for hundreds of students. I could have benefited from a bit of knowledge about testing at that time.

Now, more than a decade later, having received postdoctoral training in tests and measurements and having worked in the field of assessment for over 10 years, I want to transfer my knowledge of testing so that anyone in an academic environment responsible for assessing students can create the best exams possible.

Granted, many books have been published on the topic. However, most are lengthy textbooks costing hundreds of dollars. There are also many websites that discuss the creation of exams, but like many of the textbooks, they often offer prescriptive advice without ever explaining why the advice is sound. In contrast, the aim of this blog is not only to introduce busy academics to the guidelines but also to convey why those guidelines are worth following. Thanks for reading.