On Teacher Evaluations, Between Myth And Fact Lies Truth

Controversial proposals for new teacher evaluation systems have generated a tremendous amount of misinformation. It has come from both “sides," ranging from minor misunderstandings to gross inaccuracies. Ostensibly to address some of these misconceptions, the advocacy group Students First (SF) recently released a "myth/fact sheet” on evaluations.

Despite the need for oversimplification inherent in “myth/fact” sheets, the genre can be useful, especially for topics such as evaluation, about which there is much confusion. When advocacy groups produce them, however, the myths and facts sometimes take the form of “arguments we don’t like versus arguments we do like."

This SF document falls into that trap. In fact, several of its claims are a little shocking. I would still like to discuss the sheet, not because I enjoy picking apart the work of others (I don’t), but rather because I think elements of both the “myths” and “facts” in this sheet could be recast as "dual myths” in a new sheet. That is, this document helps to illustrate how, in many of our most heated education debates, the polar opposite viewpoints that receive the most attention are often both incorrect, or at least severely overstated, and usually serve to preclude more productive, nuanced discussions.

Let’s take all four of SF’s “myth/fact” combinations in turn.

Myth and Fact

Myth #1

Evaluations based on student achievement penalize the teachers who work with students who are far behind grade-level.

Fact

Meaningful educator evaluations include objective measures of student academic growth, as opposed to achievement on any single assessment, as one of several factors. In this way, teachers are recognized for their impact on student learning over the course of the school year regardless of where their students begin; teachers whose students enter their classroom several grade levels behind can demonstrate they are effective if they are able to increase their students' achievement beyond a certain rate. While their students may not yet be brought up to their current grade level, the learning gains these teachers were able to make in that year are significant, and should be recognized as such.

I'm almost tempted to leave this one alone. Putting aside the opinion-dressed-as-fact that "meaningful" evaluations must include "objective measures of student academic growth," it’s fair to argue, with caution, that the value-added components of new teacher evaluations would not necessarily “penalize” teachers whose students begin the year far behind, as the models attempt to control for prior achievement (though the same cannot be said as strongly for other components, such as observations).

However, whether the models account for lower-scoring students covers only a small slice of the real issue here. Rather, as I’ve noted before, the better guiding question is whether they can account, to an acceptable degree, for all of the factors that are outside of teachers’ control.

So, by altering SF's "myths" and "facts" to cover this wider range, we have my proposal for a new “combined myth":

MYTHS: Value-added models will penalize any teacher who teaches disadvantaged or lower-performing children AND value-added models fully account for factors outside of teachers’ control

Of course, neither claim is necessarily true or false.

Well-designed value-added models can, on the whole, go a long way toward controlling for the many test-influencing factors outside teachers’ control, to no small extent because prior achievement helps pick up on these factors. But it is inevitable that even the best models will penalize some teachers and reward others unfairly (this is true of almost any measure). In addition, the estimates from some types of models are, at the aggregate level, associated with student characteristics such as subsidized lunch eligibility.
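To make the "controlling for prior achievement" idea concrete, here is a rough sketch in Python using entirely made-up, simulated data (this is not any state's actual model, and the variable names and noise levels are purely illustrative). The point is simply that, when prior scores are in the model, teachers whose students start out far behind are not automatically penalized.

```python
# A minimal, purely illustrative sketch of a value-added-style regression.
# Nothing here reflects any real evaluation system; all data are simulated.
import numpy as np

rng = np.random.default_rng(0)

n_teachers, n_per_class = 20, 50
teacher_effects = rng.normal(0, 3, n_teachers)   # "true" effects (unknowable in practice)
class_means = rng.normal(0, 10, n_teachers)      # some classes start far behind others

teacher_ids, prior_scores, current_scores = [], [], []
for t in range(n_teachers):
    prior = rng.normal(class_means[t], 8, n_per_class)                 # prior-year scores
    current = 0.8 * prior + teacher_effects[t] + rng.normal(0, 5, n_per_class)
    teacher_ids.extend([t] * n_per_class)
    prior_scores.extend(prior)
    current_scores.extend(current)

teacher = np.array(teacher_ids)
prior = np.array(prior_scores)
current = np.array(current_scores)

# Design matrix: prior achievement plus one dummy per teacher (no intercept).
X = np.column_stack([prior] + [(teacher == t).astype(float) for t in range(n_teachers)])
coefs, *_ = np.linalg.lstsq(X, current, rcond=None)
estimated_effects = coefs[1:]

# Because prior achievement is controlled for, teachers of low-scoring classes
# are not automatically penalized: the estimates track the simulated effects.
print(np.corrcoef(teacher_effects, estimated_effects)[0, 1])
```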

The key here is to avoid black and white statements and acknowledge that there will be mistakes.

For now, the better approach requires: Considering that there are different types of errors (e.g., false negative, false positive); assessing the risk of misclassification versus that of alternative measures (or alternative specifications of the same measure); and constantly checking estimates from all these measures for evidence of bias. These are serious, complicated challenges, and neither alarmist rhetoric nor casual dismissal reflects the reality of the situation.
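For readers who want to see those error types in miniature, here is a toy simulation (the noise level and the 10 percent cutoff are invented for illustration, not drawn from any real system). It shows how flagging "low performers" from a noisy estimate produces both false positives and false negatives.

```python
# An illustrative simulation of the two error types discussed above:
# flagging a teacher who is not actually low-performing (false positive)
# and missing one who is (false negative). All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
true_effect = rng.normal(0, 1, n)             # hypothetical "true" performance
estimate = true_effect + rng.normal(0, 1, n)  # noisy value-added estimate

truly_low = true_effect < np.quantile(true_effect, 0.10)  # truly low-performing
flagged = estimate < np.quantile(estimate, 0.10)          # flagged by the estimate

false_positive = np.mean(flagged & ~truly_low)   # flagged, but not truly low
false_negative = np.mean(~flagged & truly_low)   # truly low, but not flagged
print(f"flagged-but-not-low: {false_positive:.3f}, low-but-not-flagged: {false_negative:.3f}")
```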

Myth and Fact

Myth #2

Value-added measures fluctuate from year to year; basing evaluations on this measure of student growth is unfair and unreliable.

Fact

There is a growing body of research that shows value-added measurement has high predictive power in estimating teacher effects on student achievement, and that in conjunction with other measures, this component offers an important, accurate piece of information in assessing teacher performance. No evaluation system is perfect. But for too long, teacher evaluations have been virtually meaningless to teachers and principals. Moving to an evaluation that is multi-dimensional and based on measures that matter - like student academic growth - is imperative.

Right off the bat, seeing the phrase “value-added measures fluctuate from year to year” in the “myths” column is pretty amazing.

To say the least, this "myth" is a fact. Moreover, the second part – “basing evaluations on this measure…is unfair and unreliable” – is not a “myth” at all. It is, at best, a judgment call.

In the other column, some of the “facts” are nothing more than opinions, and, in a couple of cases, questionable ones at that (e.g., that value-added is “accurate," and that using it in evaluations is “imperative”).

That said, let's modify the second part of SF's "myth" (unfair/unreliable) and combine it with their misleading "fact" ("high predictive power") to construct a dual myth:

MYTHS: Value-added is too unreliable to be useful AND value-added is a very strong predictor of who will be a "good teacher" the next year

The “predictive power” of value-added - for instance, its stability over time - really cannot be called "high." On the whole, it tends to be quite modest. Also, just to be clear, the claim that value-added can predict “teacher effects on student achievement” is better phrased as "value-added can predict itself." Let's not make it sound grander than it is.

At the same time, however, that value-added estimates are not particularly stable over time doesn’t preclude their usefulness. First, as discussed here, even a perfect model – one that fully captured teachers’ causal effects on student testing progress – would be somewhat unstable due to simple random error (e.g., from small samples). Second, all measures of any quality fluctuate between years (and that includes classroom observations). Third, some of the fluctuation is "real" - performance does actually vary between years.

So, we cannot expect perfect stability from any measure, and more stable does not always mean better, but the tendency of these estimates to fluctuate between (and within) years is most definitely an important consideration. The design of evaluations can help - e.g., using more years of data can improve stability, as can accounting for error directly and using alternative measures with which growth model estimates can be compared and combined. And we should calibrate the stakes with the precision of the information - e.g., the bar for dismissing teachers is much higher than that for, say, targeting professional development.
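Here is one more toy simulation, again with invented numbers, that illustrates both points at once: single-year estimates correlate only modestly from one year to the next, while averaging over multiple years yields a somewhat more stable signal.

```python
# A small sketch of why single-year value-added estimates fluctuate and why
# averaging across years helps. The noise levels are made up for illustration.
import numpy as np

rng = np.random.default_rng(2)
n_teachers, n_years = 5_000, 3
true_effect = rng.normal(0, 1, n_teachers)
yearly_estimates = true_effect[:, None] + rng.normal(0, 1.2, (n_teachers, n_years))

# Year-to-year correlation of single-year estimates is modest...
r_single = np.corrcoef(yearly_estimates[:, 0], yearly_estimates[:, 1])[0, 1]

# ...while a two-year average predicts the next year's estimate somewhat better.
two_year_avg = yearly_estimates[:, :2].mean(axis=1)
r_avg = np.corrcoef(two_year_avg, yearly_estimates[:, 2])[0, 1]

print(f"year 1 vs. year 2: r = {r_single:.2f}; two-year average vs. year 3: r = {r_avg:.2f}")
```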

But, overall, there is a lot of simple human judgment involved here, and not much room for blanket statements.

Myth and Fact

Myth #3

Evaluations that tie into personnel decisions - like compensation, hiring and dismissal decisions - will make teachers want to collaborate less because they will feel like they are in competition with each other.

Fact

The goal of strong teacher evaluation is to create a tool that recognizes excellence, supports development, and addresses consistent ineffectiveness. Teachers who are effective should always be recognized as such. Principals must also be evaluated, based on multiple measures including school-wide growth. The goal is not to create unhealthy competition, but instead a school culture where teachers are continually receiving quality feedback regarding their instruction and principals become instructional leaders accountable to developing an effective staff and successful schools.

Again, it’s rather surprising to see the very real concern that using evaluations in high-stakes decisions may impede collaboration in the “myths” column. No matter what you think of these new systems, their impact on teacher teamwork and other types of behavioral outcomes is entirely uncertain, and very important. Relegating these concerns to the “myths” column is just odd.

Also, the “facts” here consist of a series of statements about what SF thinks new teacher evaluation systems should accomplish. For example, I have no doubt that “the goal [of new evaluations] is not to create unhealthy competition,” but that doesn’t mean it won’t happen. Facts are supposed to be facts.

Anyway, there’s a good “combined myth” in here, one that addresses the overly certain predictions on both ends of the spectrum:

MYTHS: New teacher evaluations will destroy teacher collaboration and create unhealthy competition AND the new evaluations will provide teachers with useful feedback and create a "professional culture"

You can hear both these arguments rather frequently. Yet the mere existence of new evaluations will do absolutely nothing. The systems might be successful in providing feedback and encouraging collaboration, or they might end up pitting teachers against each other and poisoning the professional environment in schools.

It all depends on how carefully they are designed and how well they are implemented. Both "myths" are likely to be realized, as these outcomes will vary within and between states and districts.

And, given the sheer speed at which many states have required these new systems to be up and running, there is serious cause for concern. In a few places, the time and resources allotted have been so insufficient that any successful implementation will only be attributable to the adaptability and skill of principals and teachers.

One more thing (not at all specific to Students First or this document): we need to be careful not to overdo it with all these "feel-good" statements about new evaluations providing feedback and encouraging strong culture. I don’t mean to imply that these are not among the goals. What I’m saying instead is that statements by many policy makers and advocates, not to mention the very design of many of the systems themselves, make it very clear that one big purpose of the new evaluations (in some cases, perhaps the primary stated purpose) is to make high-stakes decisions, including dismissal and pay.

We are all well aware of this, teachers most of all, and it serves nobody to bury it in flowery language. Let’s have an honest conversation.

Myth and Fact

Myth #4

For teachers who have a principal who is unreliable or petty, an evaluation based on principal observations could subject them to an unfair rating.

Fact

The most important component of strong teacher evaluations is that they are based on multiple measures that align to clear expectations, an instructional rubric, and objective measures of student growth. When principals observe their staff according to a clear rubric, and when an evaluation framework is made up of multiple measures, there should be little room for unfair bias. Further, principals should be evaluated as well, based off a similar framework that considers the effective management and development of their teaching staff.

One last time: It is surreal to see the possibility that personal issues or incompetence "could subject [teachers] to an unfair rating" on their observations in the “myths” column. How anyone could think this a "myth" is beyond me.

And the primary "fact" - that there “should be little room for unfair bias” - can only be called naïve.

Let's change the "could subject [teachers] to an unfair rating" in SF's "myth" to a "would subject," and recast it, along with their "fact," as a combined myth:

MYTHS: Classroom observations will be an easy way for petty or incompetent principals to rate teachers unfairly AND observations will be relatively unbiased tools for assessing classroom performance

As we all know, there is plenty of room for unfairness, bias or inaccuracy, even in well-designed, well-implemented teacher observations. This is among the most important concerns about these measures. But there's also the potential for observations, at least on the whole, to be useful tools.

What the available research suggests is that observers, whether principals or peers, must be thoroughly trained, must observe teachers multiple times throughout the year, and must be subject to challenge and validation. Put simply, observations must be executed with care (and there is serious doubt about whether some states are putting forth the time and resources to ensure this).

Like evaluations in general, observations by themselves are neither bad nor good. It’s how you design, implement and use them that matters. Downplaying this - e.g., characterizing the possibility of inaccuracy/bias as a "myth" - is precisely the opposite of the appropriate attitude.

***

In summary, the design and implementation of teacher evaluations is complicated, and there are few if any cut-and-dried conclusions that can be drawn at this point. Suggesting otherwise stifles desperately needed discussion, and that suggestion is by far the biggest myth of all.

- Matt Di Carlo

Permalink

"The key here is to avoid black and white statements and acknowledge that there will be mistakes."

Students First admitting an error?

And I'll flap my wings and fly to the moon, too.

Permalink

Mr. Di Carlo, as is usual and correct for him, is clinically precise in his analysis and as such does not delve into the motivations, the reasons for being of this latest construct from Students First, a partisan advocacy/lobbying group. One must place "Myths vs. Facts" in the context of their policy positions and prior conduct to have a true picture of the reasons for its publication. Doing so, one comes to the inescapable conclusion that this is just another in a long series of deceptive sales pitches meant to misrepresent those who oppose Students First's policies on a far more factual basis than Students First is able to bring to bear in their defense. As has been seen so regularly in the past, factual information turns out to be the enemy of this "data driven" organization. Placed in the context of their habitual dependence on disinformation and their regular use of straw man misrepresentations of those who oppose them, Students First's motivations for this latest diversionary sales pitch become clear and are entirely consistent with previous instances of such behavior. A clear example of this strategy is evident in their leader Michelle Rhee's March 6th Op Ed in the Seattle Times, where she made no factual statements whatsoever on the position of Seattle teachers and their motivations for the boycott of the MAP test. The refutation of the Op Ed is equal to Mr. Di Carlo's precision in his own work even as it surpasses it in scope, as it must do to lay bare the goals behind Rhee's cherry-picking of issue fragments to misrepresent in the furtherance of the reformer agenda. http://prosserjohn.tumblr.com/post/44848476440/michelle-rhee-is-wrong

Permalink

Matt, here's the thing:

The NY Growth model DOES show a bias with regards to previous scores or economic distress (p.33):

http://schoolfinance101.files.wordpress.com/2012/11/growth-model-11-12-…

So we've got at least one instance of a VAM (I think this qualifies as a VAM) that DOES penalize teachers who work with lower-scoring kids.

Will ALL VAMs penalize ANY teacher working with lower-scoring kids? Obviously, we can't say that.

But is there a bias - at least in the NY model? Yes. So I don't think it's fair to say this is a "myth." There is evidence to at least be concerned.

Permalink

My findings so far are that, in almost every case where polarized opinions are being generated, there are reasons why people present their perspectives with some hyperbole - proponents on both sides are trying to sway people to their side of the argument, and unfortunately, it seems to me to be rare that reasoned arguments sway opinion. Instead, people are more likely to change their opinion when presented with highly emotional arguments, which leads people on both sides of a debate to make more and more ridiculous claims in an effort to produce greater and greater emotional responses from the people whose opinion they are trying to change.

An important goal of teaching should therefore be to teach people how to dissect arguments (especially in important cases, like whether or not we should be evaluating teacher performance using factors like their students' performance on external assessments), to look for claims within those arguments that can be evaluated for accuracy separately from the argument itself, and then to judge how well these arguments align with the chosen facts.

Thanks, Matt, for working on this highly important goal of teaching!