The formal definition in the Standards is a bit long because assessment professionals think that it is misleading to say that a test is valid. First, the test is simply a vehicle for helping you to understand what is inside your students’ heads. It is not really the test that is of interest, but rather the test scores. In fact, it is not even the test scores themselves that you should be concerned with, but rather the meaning underlying those scores. That is why the formal definition refers to interpretations of test scores. Second, it is not responsible just to declare that something is valid. It is more scientific to consider evidence. So when assessment professionals talk about validity, they discuss the quantity and quality of evidence that supports interpreting test scores for an intended purpose.
Regardless of these nuances in the definition of validity, the point is that validity is a critical factor in determining whether a test is of good quality, and the concept applies to any test and any score user, whether that person is an assessment professional or a classroom instructor.
Construct Validity
There are different kinds of validity evidence. One kind that is relevant for academic exams is how well you are measuring the construct you are trying to measure. A construct is a mental concept or characteristic, such as achievement in basic chemistry. Unfortunately, a construct is not something physical that you can measure with a ruler or weigh on a scale. Instead, you have to probe what is inside your students’ heads indirectly by having them answer questions, write essays, or perform tasks. What becomes tricky is determining whether the things you ask your students to do on an exam will actually measure the construct of interest. For the most part, you will be interested in measuring achievement of skills and knowledge in a specific content domain – that is your construct. Common ways of measuring academic achievement are selected-response tests (e.g. multiple choice) and essays. These approaches are widely accepted, so following the tradition of administering these types of exams affords you some evidence of construct validity.
However, problems can arise when a test does not sufficiently measure a construct. For example, if a test is intended to measure driving ability but includes only a written component, then the test is not measuring the individual’s skill of actually driving a moving vehicle and suffers from construct underrepresentation. The content of the topics covered in a driving class may be well represented on the written test, but the written test by itself will still lack evidence of construct validity without a performance component. In some cases, a multiple-choice exam may not be able to show what a student knows and is capable of as well as an essay. Selection of the style of test influences evidence of construct validity.
Content Validity
Another kind of validity evidence is based on test content. Is your test covering the right material? Students should be given the opportunity to learn the content covered on a test before the test is given. Many students challenge exams that do not reflect what was covered in class or what the professor conveyed were the most important concepts. Such exams lack evidence of content validity, and the complaint is legitimate. Fortunately, it is a problem that can be avoided by ensuring that the exam covers the appropriate content.
Thankfully, you can make good decisions during the test development process that can help you to avoid threats to both content and construct validity.
Sensitivity
The quality of the items in a test impacts validity as well. If the exam is riddled with ambiguous questions that high-performing students get wrong and low-performing students get right, then the test is measuring something different from the material covered in the course and will lack evidence of validity.
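One common way to quantify this item-level problem (a technique from classical test theory, not named in the text above) is the item discrimination index: the proportion of top-scoring students who answer an item correctly minus the proportion of bottom-scoring students who do. A minimal sketch with hypothetical item scores:

```python
# Sketch of the classic item discrimination index, using made-up 0/1 item
# scores. A strongly negative value flags exactly the pathology described
# above: an item that high performers miss and low performers get right.

def discrimination_index(top_group_correct, bottom_group_correct):
    """Each argument is a list of 0/1 scores on one item for that group."""
    p_top = sum(top_group_correct) / len(top_group_correct)
    p_bottom = sum(bottom_group_correct) / len(bottom_group_correct)
    return p_top - p_bottom

# A healthy item: most high scorers are right, most low scorers are wrong.
good = discrimination_index([1, 1, 1, 0, 1], [0, 0, 1, 0, 0])  # 0.8 - 0.2 = 0.6

# An ambiguous item of the kind described above: the pattern is reversed.
bad = discrimination_index([0, 0, 1, 0, 0], [1, 1, 1, 0, 1])   # 0.2 - 0.8 = -0.6

print(good, bad)
```

Items with indices near zero or below are candidates for revision or removal, since they contribute noise rather than evidence about mastery.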
The test needs to be able to correctly identify those students who have mastered your content and distinguish them from those students who have not. What would be the point of testing if the test cannot tell you who should get an A and who should get an F? In the medical field, this concept of measuring the degree to which a test can correctly identify something of interest is called sensitivity. Of course, the term is mostly used for tests that can correctly detect a disease or medical condition. But the concept extends to educational testing as well. Which students should pass, and which students should not?
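In the medical setting, sensitivity has a simple definition: of all the people who truly have the condition, what fraction does the test correctly detect? Carried over to the exam analogy, with hypothetical pass/fail counts, it can be sketched as:

```python
# Minimal sketch of sensitivity (and its counterpart, specificity) applied
# to the exam analogy. All numbers below are hypothetical.

def sensitivity(true_positives, false_negatives):
    # Of all students who truly mastered the content, what fraction passed?
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    # Of all students who truly did not master it, what fraction failed?
    return true_negatives / (true_negatives + false_positives)

# Suppose 40 students truly mastered the content and the exam passes 36 of
# them; 60 did not master it, and the exam fails 51 of them.
sens = sensitivity(36, 4)   # 36 / 40 = 0.9
spec = specificity(51, 9)   # 51 / 60 = 0.85

print(sens, spec)
```

A highly sensitive exam rarely fails a student who has mastered the content; a highly specific one rarely passes a student who has not. A useful exam needs both.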
In sum, a test that has sufficient evidence of validity will cover the appropriate content; it will do so in a relevant way; and it will be able to discriminate students who have mastered the content from those who have not.