Creating A Valid Process For Using Teacher Value-Added Measures

** Reprinted here in the Washington Post

Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an excellent, accessible review of the technical and practical issues surrounding these models.

Now that the election is over, the Obama Administration and policymakers nationally can return to governing. Of all the education-related decisions that have to be made, the future of teacher evaluation has to be front and center.

In particular, how should “value-added” measures be used in teacher evaluation? President Obama’s Race to the Top initiative expanded the use of these measures, which attempt to identify how much each teacher contributes to student test scores. In doing so, the initiative embraced and expanded the controversial reliance on standardized tests that started under President Bush’s No Child Left Behind.

In many respects, The Race was well designed. It addresses an important problem - the vast majority of teachers report receiving limited quality feedback on instruction. As a competitive grants program, it was voluntary for states to participate (though involuntary for many districts within those states). The Administration also smartly embraced the idea of multiple measures of teacher performance.

But they also made one decision that I think was a mistake. They encouraged—or required, depending on your vantage point—states to lump value-added or other growth model estimates together with other measures. The raging debate since then has been over what percentage of teachers’ final ratings should be given to value-added versus the other measures. I believe there is a better way to approach this issue, one that focuses on teacher evaluations not as a measure, but rather as a process.

The idea of combining the measures has some advantages. For example, as I wrote in my book about value-added measures, combined measures have greater reliability and probably better validity as well. But there is also one major issue: Teachers by and large do not like or trust value-added measures. There are some good reasons for this: The measures are not very reliable and therefore bounce around from year to year in ways that have nothing to do with actual performance. There is more debate about whether the measures are, in any given year, providing useful information about “true” teacher performance (i.e., whether they are valid).

The larger problem is that policymakers have tended to look at the teacher evaluation problem like measurement experts rather than school leaders. Measurement experts naturally want valid and reliable measures, ones that accurately capture teacher effectiveness. School leaders, on the other hand, can and should be more concerned about whether the entire process leads to valid and reliable conclusions about teacher effectiveness. The process includes measures, but also clear steps, checks and balances, and opportunities to identify and fix evaluation mistakes. It is that process, perhaps as much as the measures themselves, that instills trust in the system among educators. But the idea of combining multiple measures has short-circuited discussion about how the multiple measures, and especially value-added, could be used to create a better process.

One possible process comes from the medical profession. It is common for doctors to “screen” for major diseases, using procedures that identify all the people who do have the disease but also flag some who do not (the latter being false positives). Those who test positive on the screening test are given another “gold standard” test that is more expensive but almost perfectly accurate. Doctors do not average the screening test together with the gold standard test to create a combined index. Instead, the two pieces are considered in sequence.

Ineffective teachers could be identified the same way.

Value-added measures could become the educational equivalent of screening tests. They are generally inexpensive and somewhat inaccurate. As in medicine, a value-added score, combined with some additional information, should lead us to engage in additional classroom observations to identify truly low-performing teachers and to provide feedback to help those teachers improve. If all else fails, within a reasonable amount of time, after continued observation, administrators could counsel the teacher out or pursue a formal dismissal procedure.

The most obvious problem with this approach is that value-added measures, unlike the medical screening tests, do not capture all potential low-performers. They are statistically noisy, for example, and so many low-performers will get high scores by chance. For this reason, value-added would not be the sole screener. Instead, some other measure could also be used as a screener. If teachers failed on either measure, then that would be a reason for collecting additional information. (This approach also solves another problem discussed later.)
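The either-measure rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields, the cutoff values, and the sample scores are all hypothetical and do not come from any actual evaluation system. A teacher without a value-added score (for example, one in a non-tested grade) is screened on the observation alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TeacherRecord:
    """Hypothetical record holding the two screening measures."""
    name: str
    value_added: Optional[float]   # None if no value-added score exists
    screening_observation: float   # score from one initial observation


def needs_follow_up(t: TeacherRecord, va_cutoff: float, obs_cutoff: float) -> bool:
    """Flag a teacher if EITHER screener falls below its cutoff."""
    low_va = t.value_added is not None and t.value_added < va_cutoff
    low_obs = t.screening_observation < obs_cutoff
    return low_va or low_obs


def screen(teachers: List[TeacherRecord],
           va_cutoff: float = -1.0,    # assumed cutoff, illustration only
           obs_cutoff: float = 2.0):   # assumed cutoff, illustration only
    """Return the names of teachers who should receive additional observations."""
    return [t.name for t in teachers if needs_follow_up(t, va_cutoff, obs_cutoff)]


# Made-up sample data.
teachers = [
    TeacherRecord("A", value_added=-1.4, screening_observation=3.1),
    TeacherRecord("B", value_added=0.6, screening_observation=1.5),
    TeacherRecord("C", value_added=0.2, screening_observation=3.4),
    TeacherRecord("D", value_added=None, screening_observation=1.8),
]
flagged = screen(teachers)
print(flagged)  # ['A', 'B', 'D']
```

Being flagged here triggers nothing more than a closer look; in this sketch, as in the proposal, no personnel decision follows from the screen itself.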

There is a second way in which value-added could be used as a screener – not of teachers, but of their teacher evaluators. To explain how, I need to say more about the “other” measures in an evaluation system. Almost every school system that has moved to alternative teacher evaluations has chosen to also use classroom observations by peers, master teachers, and/or school principals. The Danielson Framework, PLATO, and others are now household names among educators. Classroom observations have many advantages: They allow the observer to take account of the local context. They yield information that is more useful to teachers for improving practice. And we can increase their reliability by observing teachers more often.

The difficulty is that these measures, too, have validity and reliability issues. Two observers can look at the same classroom and see different things. That problem is more likely when the observers vary in their training. Also, some observers might know teachers’ value-added scores and let those color their views during the observations - they might think, “I already know this teacher is not very good so I will give her a low score.”

Value-added measures might actually be used to fix these problems with classroom observations. To see how, note that researchers have found consistent, positive correlations between value-added and classroom observation scores. They are far from perfect correlations (mainly because of statistical noise), but they provide a benchmark against which we can compare (validate, if you will) the scores across individual observers. Inaccurate classroom observation scores would likely show up as low correlations with value-added. Conversely, if observers were having their scores influenced by value-added, then the correlations might be very high, which might also be a red flag.

In these cases, an additional observer might be used to make sure the information is accurate. In other words, value-added can screen the performance of not only teachers, but observers as well. Used in these ways, value-added would be a key part of the system but without being the determining factor in personnel decisions.
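As a rough sketch of how observers might be screened, the Python below computes, for each observer, the correlation between the observation scores that observer gave and the same teachers' value-added scores, then flags correlations that look unusually low or unusually high. The thresholds (0.1 and 0.9), the function names, and the sample scores are assumptions made for illustration; a real system would need to set such bounds empirically and would have far more data per observer.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / (sd_x * sd_y)


def review_observer(obs_scores, va_scores, low=0.1, high=0.9):
    """Classify one observer; `low` and `high` are assumed thresholds."""
    r = pearson(obs_scores, va_scores)
    if r < low:
        return "low: scores may be inaccurate, add a second observer"
    if r > high:
        return "high: scores may be influenced by value-added, add a second observer"
    return "ok"


# Made-up value-added scores for five teachers, plus the observation
# scores each of three hypothetical observers gave those same teachers.
value_added = [-1.0, -0.5, 0.0, 0.5, 1.0]
observers = {
    "X": [2.0, 2.5, 3.0, 3.5, 4.0],  # tracks value-added almost exactly
    "Y": [3.0, 2.0, 3.5, 2.0, 3.0],  # unrelated to value-added
    "Z": [2.0, 3.0, 2.5, 3.0, 3.5],  # positive but imperfect relationship
}
results = {name: review_observer(scores, value_added)
           for name, scores in observers.items()}
for name, verdict in results.items():
    print(name, "->", verdict)
```

With these made-up numbers, observer X is flagged as suspiciously high, observer Y as low, and observer Z passes. In each flagged case the consequence is a second observer, not a judgment about the first.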

This screening approach would solve a host of problems.

1. The screening approach maintains the new and important focus on teacher evaluation and the use of student test scores in those systems. The NEA and AFT themselves have been rightly critical of traditional-style evaluation systems because they provide so little useful feedback to teachers. Screening with value-added places the emphasis on formative, feedback-based measures such as observations.
2. The screening approach represents a “feedback loop” in which both value-added and observations are used to ensure that the other is functioning well - i.e., observations are used to verify the identification of low-performing teachers based on value-added (and help them improve), while value-added is used to identify observers whose performance may be lacking. All measures have their flaws and value-added can help address these.
3. The screening approach ensures that value-added measures are never the primary determinants of high-stakes personnel decisions. Rather, in this alternative proposal, value-added would only serve to trigger a closer look at a teacher’s performance, but the actual decisions would be based on classroom observations by experts. These have much greater support among teachers and provide more useful feedback.
4. The screening approach helps schools focus their evaluation resources where they count: on low-performing teachers and low-performing classroom observers. This is crucial in these tough economic and fiscal times, during which schools must allocate resources carefully.
5. The screening approach can be applied to all teachers, not just those in tested grades and subjects. A common criticism of value-added is that it cannot be applied to all teachers. With the approach I am proposing, only the initial screening process would differ (e.g., a single classroom observation that all teachers would receive) and the remainder of the process could be based on a more standard set of measures (additional classroom observations).
6. The screening approach, because it works in all grades and subjects, avoids the unfortunate response, in states such as Florida, of expanding testing to every grade and subject. Teaching to the test is a real problem and this will make it worse. Value-added could serve to test screeners even of non-tested grades and subjects as long as those same screeners have some teachers in tested classrooms.
7. The screening approach ensures that there is enough information that educational leaders will be able to sleep at night knowing they are making the best possible personnel decisions - that their tough choices will not be overturned by lawsuits alleging arbitrary and capricious firings.

Since I started with a medical analogy, some might want to call this a “triage” approach. This term fits in some ways but not in others. In both cases, the focus is on allocating resources in cost-effective ways. The higher-performing teachers get less attention just as healthier patients do. On the other hand, there is a difference between this approach and medical triage, as the latter entails devoting few resources to those who are least likely to make it. Here, by contrast, part of the point is to collect more information on struggling teachers so that personnel decisions can be made with confidence and in keeping with legal requirements.

The screening approach certainly wouldn’t solve all the problems with the new teacher evaluation systems. The choice of additional measures beyond value-added, and the implementation of these measures, are critical. So are the ways in which the evaluations are used in personnel decisions.

Value-added measures have played a valuable role in sparking this important debate, but they need not do all the heavy lifting for our reformed teacher evaluation systems. We need more than a number; we need a process for identifying low-performing teachers and helping them get better.

- Douglas Harris


Ed--My argument is more of a semantic one at this point. The idea of only using VAMs after observations isn’t functionally different from a system where VAMs make up X% of the evaluation, with X being some number low enough that VAMs can’t be solely responsible for dismissal. If in fact there are lots of systems where high test score variance can lead teachers to be dismissed based on VAMs alone when the designers did not intend it that way, then this system does fix that, but I’m skeptical of how often that occurs in practice. That’s why I think point #3 in the post is less a consequence of an original system and more a result of ensuring a low enough VAM component. This matters because at this point it may be easier to convince people to have lower VAM components than to adopt a new type of sequential system.

I've heard this said many times lately, but it is true - teaching (and education) cannot be simplified into an algorithm. The diversity of any given classroom on any given day, let alone any given year, will always make the evaluation process a messy one. VAM is yet another effort to try and neaten up a process that really defies simplification. It also puts an even heavier weight on standardized tests, one that they were never meant to bear. The push for VAM is part of a larger catch-22 created by NCLB. Most states adopting it ignore the studies citing the ineffectiveness of such measures, because in the larger picture, including VAM will allow them the opportunity to apply for waivers from the more punitive elements of NCLB. It's a sad cycle.

"The measures are not very reliable and therefore bounce around from year to year in ways that have nothing to do with actual performance." This quote is later followed by step one of the process - "The screening approach maintains the new and important focus on teacher evaluation and the use of student test scores in those systems." How is making anything that is not very reliable a reasonable first step in a teacher's evaluation no matter what weight it is given?

There is also this statement - "The most obvious problem with this approach is that value-added measures, unlike the medical screening tests, do not capture all potential low-performers. They are statistically noisy, for example, and so many low-performers will get high scores by chance." Again, if the measures are not reliable, just what percentage of low performers are captured and just how many high performers get low scores by chance? If we are right to be so critical of traditional-style evaluations, how does adding an unreliable component help, since neither the measures nor the traditional-style evaluations are apparently reliable in identifying low performers?

Actually, step one conflates two different ideas, the identification of low performers and useful feedback to teachers. If the traditional style evaluation does not provide useful feedback to a teacher, how do unreliable measures that have nothing to do with performance help?

In the guise of seeming reasonable, this screening process in step two tries to give legitimacy to using value added measures in conjunction with observations by suggesting two unreliable processes can somehow provide a "‘feedback loop’ in which both value-added and observations are used to ensure that the other is functioning well - i.e., observations are used to verify the identification of low-performing teachers based on value-added (and help them improve), while value-added is used to identify observers whose performance may be lacking." This presupposes value added measures can be tied directly to teacher activities an observer should be able to see. Furthermore, we have now added another layer of complexity to the process in which a teacher is considered a poor performer because of an unreliable measure and an observer is possibly considered a poor observer whose performance might be lacking because the observer's feedback is not aligned with that same unreliable measure. Unless we have observers for the observers, how does this situation not bias an observer's judgement toward validating that value added measure?

Maybe value added measures could be useful as a starting point in reflecting on, conversing about, or focusing observations on current practices though all those things could take place without value added measures. The real question is why step three does not say the screening approach ensures that value-added measures are never used as determinants of high-stakes personnel decisions?