The Details Matter In Teacher Evaluations
Throughout the process of reforming teacher evaluation systems over the past 5-10 years, perhaps the most contentious and widely discussed issue was the importance, or weights, assigned to different components. Specifically, there was a great deal of debate about the proper weight to assign to test-based teacher productivity measures, such as estimates from value-added and other growth models.
Some commentators, particularly those more enthusiastic about test-based accountability, argued that the new teacher evaluations somehow were not meaningful unless value-added or growth model estimates constituted a substantial proportion of teachers’ final evaluation ratings. Skeptics of test-based accountability, on the other hand, tended toward a rather different viewpoint – that test-based teacher performance measures should play little or no role in the new evaluation systems. Moreover, virtually all of the discussion of these systems’ results, once they were finally implemented, focused on the distribution of final ratings, particularly the proportions of teachers rated “ineffective.”
A recent working paper by Matthew Steinberg and Matthew Kraft directly addresses and informs this debate. Their very straightforward analysis shows just how consequential these weighting decisions are for the distribution of final ratings, as are choices of where to set the cutpoints for the final rating categories (e.g., how many points a teacher needs in order to be rated “effective” rather than “ineffective”).
Steinberg and Kraft use data from the Measures of Effective Teaching (MET) project, and, after adjusting the measures for comparability, they simulate how different weights and cutpoints would affect the final results of evaluations in eight large school districts. The simulated evaluation systems include observations, value-added scores, and student surveys. They also examine discrepancies in results caused by observations conducted by principals versus those done by external observers.
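The basic machinery being simulated is easy to sketch. The weights, component scores, rating labels, and cutpoints in the snippet below are illustrative placeholders rather than the paper’s actual parameters or any district’s rules; the point is only to show how a weighted composite is assembled and then sorted into a rating category.

```python
# Illustrative sketch of how a multi-measure evaluation score is assembled.
# All weights, scores, and cutpoints below are hypothetical placeholders,
# not the actual MET data or any district's parameters.

def composite_score(observation, value_added, student_survey,
                    weights=(0.50, 0.455, 0.045)):
    """Weighted average of component scores, each already rescaled to 0-100."""
    w_obs, w_va, w_survey = weights
    return w_obs * observation + w_va * value_added + w_survey * student_survey

def final_rating(score, cutpoints=(40, 65, 85)):
    """Sort a composite score into one of four rating categories."""
    ineffective_max, developing_max, effective_max = cutpoints
    if score < ineffective_max:
        return "ineffective"
    elif score < developing_max:
        return "developing"
    elif score < effective_max:
        return "effective"
    return "highly effective"

# Example: strong observation scores, middling value-added, decent surveys.
score = composite_score(observation=82, value_added=55, student_survey=70)
print(round(score, 1), final_rating(score))  # 69.2 effective
```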
(Note that the results discussed below apply only to teachers in tested grades and subjects – i.e., those who receive value-added scores.)
Before discussing the weights and cutpoints, it bears mentioning that MET teachers' observation scores, and thus evaluation ratings, varied quite a bit by “type” of observer – i.e., whether the observer was "internal" (e.g., the principal) or "external" (e.g., a trained master teacher). For example, in a system in which 50 percent of teachers’ ratings are based on classroom observations, the difference in the teacher “effective rates” (proportion of teachers rated “effective,” or its equivalent, or above) between internal and external observers is as small as three percentage points (under the final ratings cutpoints used in NYC and Miami’s systems), but as large as 40 percentage points (under the system used in Gwinnett County, GA). Regardless of how one interprets these differences in terms of “accuracy,” they are not only important and policy-relevant, but also something for which Steinberg and Kraft account in their analysis (by adjusting observation scores upward).
Teacher “effective rates” tend to increase as observations are assigned more weight, but this relationship is not necessarily linear. As a general rule, a larger proportion of teachers are rated “effective” or higher in systems that assign more weight to observation scores. This, again, is due to the fact that observation scores tend to be higher than those from value-added models (in no small part because value-added models impose a distribution on results).
The relationship, however, is far from linear, because of variation between districts in the cutpoints used to sort teachers' scores into final ratings categories. In NYC and Miami, for example, teachers must be awarded a higher percentage of total points in order to be rated “effective,” compared with the other five districts. As a result, the increase in total points that accrues from a boost in the observation weight does not push many teachers over the “effective” line until the observation weight gets pretty high. The teacher “effective rate” only begins to increase sharply in these districts when the observation weight reaches about 70 percent. The definitions of ratings categories mediate the impact of weighting on results.
Conversely, evaluation systems with the same component weights can produce very different results depending on where they set the ratings cutpoints. This finding, which basically states that how many teachers fall into each rating category depends on how you define the categories, is hardly shocking. More surprising, perhaps, is just how much these definitions matter. For example, if weights are held constant (50 percent observations, 45.5 percent value-added, and 4.5 percent student surveys), the teacher “effective rate” under different districts’ scoring cutpoints varies from three percent in NYC and Miami to about 85 percent in Philadelphia and Fairfax County, VA.
In other words, depending on the final ratings cutpoints, the exact same weights and the same data can produce results that vary from virtually no teachers being rated “effective” to the vast majority of teachers receiving that designation. And these cutpoints are not hypothetical scenarios – they are the ones actually chosen by these eight large districts.
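To see how these two design choices interact, here is a rough simulation using made-up numbers: the score distributions and the two “effective” cutpoints below are invented for the sketch, not the MET data or the actual district rules. Synthetic observation and value-added scores are combined under a range of observation weights, and the share of teachers clearing the cutpoint is computed under a lenient and a demanding scheme.

```python
# Rough illustration of the weight/cutpoint interaction using synthetic scores.
import random

random.seed(1)
N = 10_000
# Observation scores tend to run higher than value-added scores (both 0-100 here).
observation = [min(100, max(0, random.gauss(75, 10))) for _ in range(N)]
value_added = [min(100, max(0, random.gauss(50, 15))) for _ in range(N)]

def effective_rate(obs_weight, effective_cutpoint):
    """Share of teachers at or above the 'effective' cutpoint for a given weight."""
    va_weight = 1 - obs_weight
    composites = (obs_weight * o + va_weight * v
                  for o, v in zip(observation, value_added))
    return sum(c >= effective_cutpoint for c in composites) / N

for cutpoint in (60, 75):  # a lenient vs. a demanding cutpoint scheme
    rates = [effective_rate(w / 100, cutpoint) for w in range(10, 100, 20)]
    print(f"cutpoint {cutpoint}:", ["{:.0%}".format(r) for r in rates])
```

In this toy version, the same pattern described above emerges: under the lenient cutpoint, the “effective rate” climbs steadily as the observation weight increases, whereas under the demanding cutpoint it barely moves until the observation weight gets quite large.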
The general findings discussed above are important, as they show not only how crucial choices regarding both weights and cutpoints are for final results, but also how interconnected those choices are. As mentioned above, virtually all of the debate about evaluation design focused on the weights, particularly those assigned to value-added and other growth model estimates, whereas virtually all of the reactions to these systems’ first rounds of results concentrated on how many teachers were rated “ineffective.”
Yet the actual impact of the weights on the distribution of results depends a great deal on how teachers’ final scores are sorted into ratings categories, and vice-versa. (This, by the way, also invokes the important distinction, discussed here, between nominal and effective weights.)
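To make the nominal-versus-effective distinction concrete, here is a hypothetical back-of-the-envelope calculation. The weights and standard deviations are invented for the illustration; the point is simply that a component with more spread in its scores exerts more influence on the composite than its nominal weight alone would suggest.

```python
# Back-of-the-envelope illustration of nominal vs. effective weights.
# Standard deviations below are invented; the key point is only that
# components with more spread drive more of the variation in the composite.

# Nominal weights: 60% observation, 40% value-added.
w_obs, w_va = 0.60, 0.40
# Hypothetical spreads of the two component scores (same 0-100 scale).
sd_obs, sd_va = 8.0, 20.0

# Treating the components as independent for simplicity, each one's share of
# composite-score variance is (weight * sd)^2 over the total.
var_obs = (w_obs * sd_obs) ** 2
var_va = (w_va * sd_va) ** 2
total = var_obs + var_va

print(f"Observation: nominal {w_obs:.0%}, effective {var_obs / total:.0%}")
print(f"Value-added: nominal {w_va:.0%}, effective {var_va / total:.0%}")
# Despite a 40 percent nominal weight, value-added accounts for roughly
# three-quarters of the variation in composite scores in this example.
```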
Again, these inter-measure relationships sound obvious, but they didn’t always seem salient in the policymaking process in some (but not all) places.
For instance, some states predetermined the ratings cutpoints but allowed districts to choose their own component weights, thus severely constraining districts’ ability to assign weights based on (perceived) educational significance. Similarly, as results from the new systems became available and showed distributions that some commentators considered implausible, most attention was paid to the component weights (e.g., that of value-added), and comparatively little to the ratings cutpoints (or, for that matter, to how the components were implemented, including internal versus external observers). Making things worse, in many states all of these decisions had to be worked out over a very short time period, without even the benefit of a pilot year.
None of this is meant to imply that there were obviously right or wrong decisions that states and districts failed to identify. On the contrary, at the time when many of the new systems were being designed, there was virtually no research available to guide these design choices, and, even now, design entails a massive amount of judgment and uncertainty. Most decisions are more about tradeoffs than about right or wrong.
That said, this analysis by Steinberg and Kraft is an example of how the growing body of evidence on evaluations can assist states and districts in their attempts to improve their new systems, a process which is arguably far more important than initial design. As was the case with NCLB (Davidson et al. 2015), design decisions that may not have seemed so important at first can have a serious impact on results. The details may be tiresome and unsexy, but they often determine the influence of policy.