By Duane Swacker
Error Concerns in Educational Assessment
The study of error is not only in the highest degree prophylactic, but it serves as a stimulating introduction to the study of truth. Walter Lippmann
Wilson notes, “To estimate error is to imply what is without error; and what is without error is determined by what we define as true, by the assumptions of the frame of reference that forms our epistemological base.” In other words, depending upon the point of view of the assessment frame (as described above), different sorts of errors plague assessment accuracy and validity. Moreover, when we confuse and conflate assessment frames, which practically speaking is guaranteed to happen, we compound the errors and thus compromise the accuracy and validity of any assessment of student learning and work. Wilson points out thirteen sources of error (there are more) in the process of making, using and disseminating the results of standardized testing (and its precursor, educational standards), any one of which renders the results and the conclusions drawn from the tests invalid. Let’s list and discuss those sources, and then examine each frame of assessment in relation to some of the errors, the consequences students suffer from them, and the accompanying harms to students in relation to the fundamental purpose of education.
The thirteen types of errors are (Wilson’s descriptions in italics):
1. Temporal errors: “Practically, temporal errors are indicated by the differences in assessment description when the assessment occurs at different times.” In other words, different scores obtained by test takers on the same or psychometrically similar tests taken at different times constitute a form of error. In psychometrics this is a reliability issue.
2. Contextual errors: “Practically, contextual errors include all those differences in performance and its assessment that occur when the context of the assessment event changes.” The context changes often: picture one student taking the test in a hard old wooden desk, with no air conditioning on a 95-degree day, the window open and all the noises of the urban environment pouring in, versus another student sitting in a comfortable chair, on carpeted flooring that reduces noise, in a closed, silent, air-conditioned room. Any differences in scores that accrue from any number of such contextual differences are considered errors.
3. Construction errors: “Practically, construction errors are indicated by all those differences in assessment description when the same construct is assessed independently by different people in different ways.” In psychometrics this would be another reliability issue, as different types of assessments may yield different results; think of the difference in student performance on the same test taken on a computer versus with pen and paper, both of which might even be considered psychometrically reliable.
4. Labelling errors: “Practically, labelling errors are indicated by the range of meanings given to the label by all those who use it before, during or after the assessment event.” Think of end-of-term grading. A student might consider a “D” grade perfectly fine, whereas his teacher, and probably his parents, would not agree and would consider it less than acceptable.
5. Attachment errors: “Practically, attachment errors are indicated by the specification of those elements and boundaries of the assessment event that have become lost in the assessment description.” Think of the girl in the introduction who proudly proclaimed “I’m an ‘A’ student.” The ‘A’ is a description of her prior interactions with the subject matter/curriculum, not of her. We cannot logically “attach” the label to the subject matter/curriculum, nor can we “attach” it to the student. The label describes the interactions that she had with the assessment. Attachment errors are some of the most egregious because the student internalizes those labels, many of them negative and damning, through subjectivization.
6. Frame of reference errors: “Practically, frame of reference errors are indicated by specifying the frame in which the assessment is supposedly based, and indicating any slides or confusions that occur during the assessment events.” The different assumptions about assessment made in the four frames logically do not allow for sliding between frames in assessing student learning. Think of having a prospective driver take a multiple choice test about driving and then folding that score into the results of the actual driving portion of a driver certification test.
7. Instrument errors: “Practically, many aspects of instrument error are covered by other category errors. To avoid unnecessary overlap, I will limit the practical indicator of instrumental error to those errors implicit in the construction of the measuring instrument itself; what is conventionally called standard error of the estimate.” In other words, a statistic that is supposed to be a guarantor of quality, letting us know how much measurement error a test has so that student results are not mis-categorized, instead becomes an argument for the supposed validity of the results. The standard error of estimate might best be considered a psychometric fudge to even out discrepancies in results, and it is hardly ever published, certainly not for the test taker.
8. Categorization errors: “Practically, categorisation errors are all those differences in assessment description that occur when particular data is compared with a particular standard to produce a categorisation of the assessed person. Categorisation errors derive from confusions about the definition of standard of acceptability, from differences in the meaning of what is being assessed and in the magnitude of its measurement, and in the variability of the judgment process in which the comparison with the standard is made.” Categorization errors get to the heart of the invalidities involved in attempting to assess student responses in relation to an educational standard. Since the making of educational standards has not followed the established protocols of standards organizations, the individual standard itself is invalid, and any attempt to use it as a basis for assessing student responses is invalid. A standard implies a unit of measure that can be applied with more or less accuracy, and there is no basic unit of student learning; to invent one out of thin air and use it as a basis for “measuring” student learning is invalid. The abuse of the meanings of standards and measurement by the psychometric community is unethical and will be further discussed in Chapter 6.
9. Comparability errors: “Practically, comparability errors are indicated by constructing different aggregates according to the competing models. The differences that these produce indicate the comparability error. Comparability errors include all those confusions about meaning and privileging that inhabit the addition of test scores, grades or criteria related statements.” Again this hints at psychometric reliability issues in the comparability of test scores from two different tests that supposedly measure the same construct. Not only that, but since tests almost always assess multiple constructs, the question of which constructs get priority, both in placement on the test and in points assigned, indicates comparability errors and greatly affects the validity of the results. Think of two different Spanish assessments that cover the same grammar and vocabulary constructs in different ways, perhaps a written and an oral test. Which construct takes precedence in each test, and which section garners the lion’s share of points?
10. Prediction errors: “Explicit or implicit in most assessments is the claim that they relate to some future performance, that they predict a particular product from some future event, a quality of some future action. Practically, prediction error is indicated by the differences between what is predicted by the assessment data, and what is later assessed as the case in the predicted event.” The SAT or ACT as a predictor of college success indicates prediction error, given its minimal correlation with later college performance.
11. Logical type errors: “Logical type errors occur whenever there is confusion between statements about a class of events, and statements about individual items of that class. Practically, logical type errors are made explicit when the explicit and implicit truth claims of a particular assessment are examined and any logical type errors are made explicit. Such exposure may invalidate such claims.” Many mistake the overall score to mean much more than it is: a conglomeration of assessments of multiple constructs that in essence says nothing about the student’s learning of each construct. To say that a student scored 73% on a German test of Chapter 1 vocabulary and grammar says nothing about what the student learned. It is just a statement of the percentage of answers correct, not of the student’s level of learning and knowledge.
12. Value errors: “Practically, value errors are indicated by making explicit the value positions explicit or implicit in the various phases of the assessment event, including its consequences, and specifying any contradiction or confusion (difference) that is evident.” Value errors are those that privilege certain values in the teaching and learning process. Think of the Native Americans who in the past were forced to change their names, learn English and study common school subjects instead of being allowed to live and learn in their own culture. Less obvious, perhaps, is the current push in some states to eliminate bilingual education.
13. Consequential errors: “Practically, at a simplistic level, consequential errors are indicated by the differential positive and negative effects that individual teachers and students attribute to the assessment process: At a more profound level it involves an explication of the very construction of their individuality, and all of the potentially violating consequences of those constructions.” How might student learning change when the teacher uses many weeks for standardized test preparation so that the students may garner high scores, instead of focusing on the curriculum and allowing the teaching and learning process to continue unabated throughout the year? How will the students internalize that process of learning to pick the correct answer instead of learning subject matter? Consequential errors are some of the most atrocious and intolerable of errors, turning the educational practices of standards and standardized testing into educational malpractices.
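Wilson’s seventh category, instrument error, can be made concrete with a little arithmetic. Classical test theory computes the standard error of measurement as SEM = SD × √(1 − reliability). The sketch below is purely illustrative; the standard deviation of 15 and reliability of 0.91 are assumed, typical-sounding figures, not data from any particular test:

```python
import math

def score_band(observed, sd=15.0, reliability=0.91, z=1.96):
    """Return the SEM and a 95% band around an observed score,
    using the classical formula SEM = SD * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1 - reliability)
    return sem, (observed - z * sem, observed + z * sem)

sem, (low, high) = score_band(100)
# SEM = 4.5; the band runs from about 91.2 to 108.8
print(f"SEM = {sem:.1f}; 'true score' plausibly between {low:.1f} and {high:.1f}")
```

Even with a quite respectable reliability of 0.91, a reported score of 100 only locates the “true score” somewhere between roughly 91 and 109, a spread of almost nine points on each side that is almost never printed next to the single number the student and parents actually see.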
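The scale of the tenth category, prediction error, can also be made concrete: a correlation coefficient “explains” only its square’s worth of the variance in the outcome, so even a respectable-sounding r leaves most of the outcome unaccounted for. The admission scores and first-year GPAs below are invented solely for illustration, not real SAT or ACT data:

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented numbers for illustration only -- not real admissions data.
admission_scores = [1000, 1100, 1200, 1300]
first_year_gpas = [2.6, 3.4, 2.8, 3.2]

r = pearson_r(admission_scores, first_year_gpas)
print(f"r = {r:.2f}; variance explained = {r * r:.0%}")
```

With r around 0.42 the test “explains” about 18% of the variance in the outcome; the other 82% is precisely the gap between prediction and later reality that Wilson names.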
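The eleventh category, logical type error, is just as easy to demonstrate. In the hypothetical item-level results below (the students, items and scores are all invented), two students earn identical overall scores while having learned nearly opposite things:

```python
# Hypothetical item-level results (1 = correct, 0 = incorrect).
# Items test two constructs: vocabulary and grammar.
student_a = {"vocabulary": [1, 1, 1, 1, 1], "grammar": [0, 1, 0, 0, 0]}
student_b = {"vocabulary": [0, 1, 0, 0, 0], "grammar": [1, 1, 1, 1, 1]}

for name, answers in [("A", student_a), ("B", student_b)]:
    correct = sum(sum(items) for items in answers.values())
    total = sum(len(items) for items in answers.values())
    by_construct = {c: f"{sum(items)}/{len(items)}" for c, items in answers.items()}
    print(f"Student {name}: {correct}/{total} overall; by construct: {by_construct}")
```

Both students report 6/10 overall, yet one knows the vocabulary and not the grammar and the other the reverse; the single aggregate score erases the distinction entirely, which is exactly the confusion between a class (the total) and the items of that class (the constructs).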
In the Judge frame, error in assessment is the difference between a teacher’s evaluations of the same work at different times: temporal errors. It has long been accepted that evaluator reliability over time in assessing students’ work is less than adequate. No teacher is infallible in grading student work, no matter how much training in rubrics or whatever other grading scheme is used. Course grades as a proxy for percentage of total points earned form another source of error, by abridging the complex situation of the classroom teaching and learning process into a single concept, the grade. Much valuable information about student learning is lost in this reduction, which constitutes error in assessment: a logical type error. By definition the Judge frame is subjective, which is not necessarily a bad thing, but it becomes one when the judge/teacher claims more accuracy than is warranted. And most teachers do claim more accuracy than is warranted. In doing so the teacher is being unjust to the students through lacking “fidelity to truth” in the teacher’s claimed expertise. And in being unjust he/she, as part of the state, “fails in its chief design” to “promote the welfare of the individual.”
The psychometric fudge in the General frame attempts to alleviate the error between the “true score” and the “estimated score”. The statistical manipulations of data are just that and have little to no relation to the realities of the teaching and learning process. As Wilson states in the same chapter, “Their theoretical elegance has hidden their inapplicability to most practical learning and teaching situations; the mystification of their statistical constructs has hidden from teachers, students and public alike the enormous extent of rank order inaccuracies and grade confusion, and the arbitrary nature of all cutoffs and [supposed] standards.” Even though there are massive epistemological and ontological errors (see the above-mentioned thirteen, to name just some) throughout the whole process of educational standards and standardized testing, the many supporters of those two malpractices insist that the results are objective and accurate.
With Wilson having destroyed the credibility of the assertions of objectivity and accuracy made by those in the General camp, one can only surmise that they are not acting with a “fidelity to truth” attitude. Just a cursory look at anecdotal evidence shows consequential and labelling errors and points to serious harms to the students involved in the current standardized testing craze: students soiling themselves, students refusing to participate after a few minutes of frustration, students apoplectic about the effects of their scores on their teacher’s evaluation, the negative internalization of the results by students, the discrimination of ranking and sorting that rewards some and punishes others; the list goes on and on. There is no doubt that these standardized testing regimes are a case of the state not “promot(ing) the welfare of the individual . . . and is usurpation and oppression and the state fails in its chief design.”
According to the enthusiasts of the Specific frame, there really shouldn’t be any error, since all of the learning and behavioral objectives are specifically enumerated in advance and the student responses purportedly indicate mastery of the learning or behavioral construct. For example, X percentage of correct answers on a multiple choice test means that a test taker is certified to be able to do the construct being tested. Think of the written driver’s test taken as part of obtaining a driver’s license. One usually has to get 70% of the answers correct. Does that mean the test taker knows all it takes to be an adequate and safe driver? Not necessarily, and it is a logical type error to believe so, because that written test is not a test of driving but of knowledge of just some of what goes into driving a vehicle. Even during the actual driving test (it too is in the Specific camp), the assessor can evaluate only a small part of what it takes to drive a vehicle in all circumstances. The driving tests hint at a number of Wilson’s errors, comparability being just one: that of determining what the actual construct being tested is (construct validity) and what types and quantity of activities and questions are sufficient (construct representation) to objectively say that the construct has been met. Think of the prediction error in asserting that the young driver is capable of driving safely by awarding him/her the license: “Newly licensed teens: Crash risk is particularly high during the first months of licensure.”
Ask a group of ten teachers how to assess a given learning objective or construct and more likely than not you’ll get ten different answers. Choosing questions and correct answers in the Specific frame is not as simple as it first appears; value errors are many. And as it is, many tests in the Specific frame end up being tests of minimum competency, as that is the easiest fashion in which to design a test. So while some students may be able to answer one set of questions satisfactorily, they may not be able to answer others that may be more important in assessing the same construct.
Which students are privileged and which ones hindered by that fact? We don’t know but we know it happens so that some are allowed to drive, to move up a grade, to begin a program of study while others aren’t. Considering the amount of error(s) involved in just the construct representation side of the Specific frame, should the state in the form of public schools be discriminating against some through faulty assessing mechanisms? From what we know of the purpose of government in this country and from the fundamental purpose of public education as outlined above the answer has to be an unmitigated NO lest the state “fail in its chief design” and discriminate against some and not “promote the welfare of the individual so that each person may savor the right to life, liberty, the pursuit of happiness, and the fruits of their own industry.”
The Responsive frame has no problem unearthing any errors of thought, ideas, actions, and responses of the student in his/her learning process. In the Responsive frame, assessment is meant to further the learning (and teaching) process, in a fashion similar to what is called a formative evaluation these days; it is not meant to be a pronouncement, a judgment of the student, as in a summative evaluation.
Wilson says it well: “Within such a frame there is no question of a right judgment, of a correct classification, of a true score. The response might be sensitive or insensitive, sophisticated or ingenuous, informed or uninformed. The verbalisation of that response might be honest or manipulative, its fullness expressed or repressed, its clarity widened or obscured. It still belongs undeniably to the assessor, and the expectation is not towards a conformity of judgment, but a diversity of reaction. The lowest common factor of agreement is replaced by the highest common multiple of difference. The subject of assessment is no longer reduced to an object by the limiting reductionism of a single number, but is expanded by the hopefully helpful feedback of diverse and stimulating and expansive response. . . This frame is, in fact, a necessary part of any educational processes that value diversity and freedom of students, and thus includes this broad equity concept of fairness and justice.” Here we have the ideal frame to fulfill the purpose of public education “to promote the welfare of the individual so that each person may savor the right to life, liberty, the pursuit of happiness, and the fruits of their own industry.”
As cogently stated over 2,000 years ago by Marcus Tullius Cicero, “Any man can make mistakes, but only an idiot persists in his error.” Considering that Noel Wilson has proven that the whole enterprise of making, using and disseminating the results of the educational malpractices that are educational standards and standardized testing is epistemologically and ontologically lacking in fidelity to truth, would it be fair to say that currently there are many idiots in the lawmaker houses, in the state departments of education, in district central offices, and in the schools themselves? I know my answer; do you know yours?
Read on to more fully understand the abuse and misuse of the English language that has occurred in educational standards and standardized testing discourse. Much of that discourse falsely proclaims an objectivity and scientific rigor that isn’t there. I will focus on the two most fundamental concepts of those malpractices, ‘standards’ and ‘measurement’, to show just how shockingly atrocious and scandalous that usage is.
- See Chapter 13, “The Four Faces of Error,” and Chapter 17, “Error and the Reconceptualization of Validity,” in “Educational Standards and the Problem of Error.”
- By evaluations I am not including multiple choice, true/false or matching questions (although errors can be made even in grading these simple tests) but rather more complex grading involving rubrics or other multiple-point schemes.
- Wilson laid out these concerns in his 1997 dissertation, “Educational Standards and the Problem of Error,” which the folks at the APA, the AERA and the NCME, the supposed keepers of the holy grail of standardized testing, have chosen to ignore and not respond to, at least not in their latest, 2014 version of the “Standards for Educational and Psychological Testing,” where one would think they would address Wilson’s concerns. I believe they have had ample time to find out about and address those issues. Professional irresponsibility?