One indicator of test quality is reliability: how consistent test scores are. Ideally, if a student earns an A on your exam on Monday, that student should earn an A on the same exam on Tuesday (assuming the student’s knowledge of the material remains constant). Granted, the student might not get an A on Tuesday for reasons unrelated to knowledge of the material, such as not getting enough sleep or being distracted by an upcoming vacation. But even if the student’s knowledge is constant AND the student is in an identical mental state on both days, scores could still differ because of measurement error inherent in the test itself. Some tests are simply more reliable than others.
It is well known that the length of a test contributes to reliability. A test with only a few items can gather only a small amount of information regarding what the student does and does not know. A test with more items or questions allows you to gather more information and therefore will most likely be more reliable. In fact, there is a mathematical relation between the number of items a test has and the reliability of that test (Brown, 1910; Cronbach, 1951; Spearman, 1910). A similar concept applies to the number of elements that are used to determine a final grade (such as grades from papers, scores on tests, completed assignments). Centers of teaching and learning at universities encourage instructors to consider enough elements to ensure that a final grade has a high degree of precision (http://cte.illinois.edu/testing/exam/course_grades.html).
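The relation between test length and reliability cited above (Brown, 1910; Spearman, 1910) is commonly known as the Spearman–Brown prophecy formula. As a rough sketch (the function name and example values here are illustrative, not from the original), it predicts the reliability of a test lengthened by some factor:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test whose length is multiplied by
    `length_factor`, via the Spearman-Brown prophecy formula:
        r' = k * r / (1 + (k - 1) * r)
    where r is the current reliability and k is the length factor."""
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )

# Doubling a test with reliability 0.70:
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```

The formula captures the intuition in the paragraph above: adding items raises reliability, though with diminishing returns as reliability approaches 1.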
Another factor influencing reliability is consistency in scoring. This issue arises most often when scoring essays, where the amount of credit given depends on the scorer’s subjective judgment. A scorer could be inconsistent, assigning a paper an A one day and the same paper a B the next. Likewise, if multiple people are scoring a test, two graders could give different scores to the exact same response. These inconsistencies, both within individual graders and between graders, degrade the reliability of a test.
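One standard way to quantify agreement between two graders (not discussed in the original, but widely used for this purpose) is Cohen’s kappa, which corrects raw percent agreement for agreement expected by chance. A minimal sketch, with hypothetical grade lists:

```python
def cohens_kappa(grades_a, grades_b):
    """Cohen's kappa for two graders' categorical scores:
        kappa = (p_o - p_e) / (1 - p_e)
    where p_o is observed agreement and p_e is chance agreement."""
    assert len(grades_a) == len(grades_b)
    n = len(grades_a)
    # Observed agreement: proportion of responses graded identically.
    p_o = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    # Chance agreement: product of each grader's marginal proportions.
    categories = set(grades_a) | set(grades_b)
    p_e = sum(
        (grades_a.count(c) / n) * (grades_b.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two graders score the same four essays.
grader_1 = ["A", "B", "A", "C"]
grader_2 = ["A", "B", "B", "C"]
print(round(cohens_kappa(grader_1, grader_2), 2))  # 0.64
```

A kappa near 1 indicates the graders are nearly interchangeable; a kappa near 0 means their agreement is no better than chance, a sign that scoring criteria need tightening.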
There is considerable interplay between reliability and validity. No matter how much work you invest in ensuring that the content of the test matches your content domain and that the tasks measure the construct of interest, if your test cannot provide consistent scores, it will lack evidence of validity. You cannot have a valid test with low reliability. Conversely, a test could have extremely high reliability without being valid. In short, reliability is necessary but not sufficient: for a test to be considered good quality, you need evidence of both validity and high reliability.
The challenge in creating a reliable exam, then, is to include enough items to provide sufficient information about the student’s mastery of the content and to ensure consistency in scoring.