How Many Teachers Does It Take To Close An Achievement Gap?

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Over the weekend, New York Times columnist Nick Kristof made a persuasive argument that teachers should be paid more. In making his case, he also put forth a point that you’ve probably heard before: “One Los Angeles study found that having a teacher from the 25 percent most effective group of teachers for four years in a row would be enough to eliminate the black-white achievement gap."

This is an instance of what we might call the "X consecutive teachers” argument (sometimes it’s three, sometimes four or five). It is often invoked to support, directly or indirectly, specific policy prescriptions, such as merit pay, ending tenure, or, in this case, higher salaries (see also here and here). To his credit, Kristof’s use of the argument is on the cautious side, but there are plenty of examples in which it is used as evidence supporting particular policies.

Actually, the day after the column ran, in a 60 Minutes segment featuring “The Equity Project,” a charter school that pays its teachers $125,000 a year, the school’s principal was asked how he planned to narrow the achievement gap at his school. His reply was: “The difference between a great teacher and a mediocre or poor teacher is several grade levels of achievement in a given year. A school that focuses all of its energy and its resources on fantastic teaching can bridge the achievement gap.”

Indeed, it is among the most common arguments in our education policy debate today.  In reality, however, it is little more than a stylistic riff on empirical research findings, and a rough one at that. It is not at all useful when it comes to choosing between different policy options.
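To see why it is a rough riff, it helps to lay out the back-of-the-envelope arithmetic the argument rests on. The sketch below uses made-up numbers (they are not taken from the Los Angeles study Kristof cites) simply to show the structure of the claim, and how sensitive it is to the assumption that each year's effect persists in full:

```python
# The rough arithmetic behind the "X consecutive teachers" argument.
# All numbers are invented for illustration; none come from the
# Los Angeles study Kristof cites.

gap_in_sd = 0.9        # assumed size of the achievement gap, in std. deviations
boost_per_year = 0.25  # assumed annual edge of a top-quartile teacher
                       # over an average one, in std. deviations

# The claim implicitly assumes each year's effect persists in full
# and that effects add up linearly across years:
years_needed = gap_in_sd / boost_per_year
print(f"Years of top-quartile teaching needed: {years_needed:.1f}")

# If instead each year's effect fades (say, only 40 percent carries
# over into each subsequent year), four years no longer suffice:
persistence = 0.4
cumulative = sum(boost_per_year * persistence ** (3 - t) for t in range(4))
print(f"Effect after 4 consecutive years with fade-out: {cumulative:.2f} SD")
```

Under full persistence, the arithmetic works out to a few years; allow even moderate fade-out, and four consecutive top-quartile teachers no longer come close to closing the assumed gap. The policy conclusion hinges entirely on which assumptions you feed in.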

Students First, Facts Later

On Wednesday, Michelle Rhee’s new organization, StudentsFirst, rolled out its first big policy campaign: It’s called “Save Great Teachers,” and it is focused on ending so-called “seniority-based layoffs.”

Rhee made several assertions at the initial press conference and in an accompanying op-ed in the Atlanta Journal-Constitution (and one on CNN.com). At least three of these claims address the empirical research on teacher layoffs and quality. Two are false; the other is misleading. If history is any guide, she is certain to repeat these “findings” many times in the coming months.

As discussed in a previous post, I actually support the development of a better alternative to seniority-based layoffs, but I am concerned that the debate is proceeding as if we already have one (most places don't), and that there's quite a bit of outrage-inspiring misinformation flying around on this topic. So, in the interest of keeping the discussion honest, as well as highlighting a few issues that bear on the layoff debate generally, I want to try to correct Rhee preemptively.

Value-Added: Theory Versus Practice

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

About two weeks ago, the National Education Policy Center (NEPC) released a review of last year’s Los Angeles Times (LAT) value-added analysis – with a specific focus on the technical report upon which the paper’s articles were based (done by RAND’s Richard Buddin). In line with prior research, the critique’s authors – Derek Briggs and Ben Domingue – redid the LAT analysis and found that, although teachers’ scores vary widely, the LAT estimates change under different model specifications, are error-prone, and conceal systematic bias stemming from non-random classroom assignment. They were also, for reasons yet unknown, unable to replicate the results.

Since then, the Times has issued two responses. The first was a quickly published article, which claimed (including in the headline) that the LAT results were confirmed by Briggs/Domingue – even though the review reached the opposite conclusions. The basis for this claim, according to the piece, was that both analyses showed wide variation in teachers’ effects on test scores (see NEPC’s reply to this article). Then, a couple of days ago, there was another response, this time on the Times’ ombudsman-style blog. This piece quotes the paper’s Assistant Managing Editor, David Lauter, who stands by the paper’s findings and the earlier article, arguing that the biggest question is:

...whether teachers have a significant impact on what their students learn or whether student achievement is all about ... factors outside of teachers’ control. ... The Colorado study comes down on our side of that debate. ... For parents and others concerned about this issue, that’s the most significant finding: the quality of teachers matters.
Saying “teachers matter” is roughly equivalent to saying that teacher effects vary widely: the more teachers vary in their effectiveness, controlling for other relevant factors, the more they can be said to “matter” as a factor explaining student outcomes. Since both analyses found such variation, the Times claims that the NEPC review confirms their “most significant finding.”
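One way to see why "wide variation" by itself confirms so little: estimation error inflates the spread of any set of value-added scores, so two analyses can each find wide variation while disagreeing substantially about which teachers are effective. The simulation below is a minimal sketch with invented parameters; it is not a claim about the actual LAT or NEPC models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 1000

# Invented parameters: true teacher effects plus two independent
# sets of estimation error, standing in for two different analyses.
true_effects = rng.normal(0.0, 0.10, n_teachers)
est_a = true_effects + rng.normal(0.0, 0.10, n_teachers)
est_b = true_effects + rng.normal(0.0, 0.10, n_teachers)

# Both sets of estimates show "wide variation" (wider, in fact, than
# the true effects, because error adds to the spread)...
print("SD of true effects:", round(float(true_effects.std()), 3))
print("SD of estimates (A, B):",
      round(float(est_a.std()), 3), round(float(est_b.std()), 3))

# ...yet the two analyses often disagree about who is effective.
top_a = est_a > np.quantile(est_a, 0.75)
top_b = est_b > np.quantile(est_b, 0.75)
overlap = (top_a & top_b).sum() / top_a.sum()
print("Share of A's top quartile also in B's top quartile:",
      round(float(overlap), 2))
```

Both sets of estimates "vary widely," but that alone tells us nothing about whether they identify the same teachers, which is the question that matters for any practical use of the scores.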

The review’s authors had a much different interpretation (see their second reply). All the back and forth may seem frustrating; it has mostly focused on somewhat technical issues, such as model selection, sample comparability, and research protocol (with some ethical charges thrown in for good measure). These are essential matters, but there is also a simpler reason for the divergent interpretations, one that is critically important and arises constantly in our debates about value-added.

A Quality-Based Look At Seniority-Based Layoffs

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Eliminating seniority-based layoffs is a policy idea that is making the rounds these days, with proponents making special appeals to cash-strapped states and districts desperately looking for ways to save money while minimizing decreases in the quality of services.  Mayors, editorial boards, and others have joined in the chorus.

There are a few existing high-quality simulations that compare seniority-based layoffs with one alternative – laying off teachers based on their value-added scores (most recently, one analysis of Washington State and another using data from New York City; both are worth reading). Unsurprisingly, the simulations show that the two policies would not lay off the same teachers, and that seniority-based layoffs would save less money for the same number of dismissals (since the least experienced teachers are paid less). In addition, the teachers laid off based on seniority have higher average value-added scores than those laid off based on the value-added scores themselves (as would inevitably be the case, since the latter policy selects directly on that measure).
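The logic of these simulations is easy to reproduce in miniature. The sketch below uses an invented salary schedule and an assumed weak relationship between experience and value-added, so the magnitudes mean nothing; only the direction of the comparison is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, n_layoffs = 10_000, 500

# Invented inputs: years of experience, a salary schedule that rises
# with experience, and value-added scores only weakly related to it.
experience = rng.integers(0, 30, n_teachers)
salary = 40_000 + 1_500 * experience
value_added = 0.02 * experience + rng.normal(0.0, 1.0, n_teachers)

# Policy 1: seniority-based, lay off the least experienced teachers.
by_seniority = np.argsort(experience)[:n_layoffs]
# Policy 2: "quality"-based, lay off the lowest value-added teachers.
by_value_added = np.argsort(value_added)[:n_layoffs]

for name, idx in [("seniority", by_seniority), ("value-added", by_value_added)]:
    print(f"{name:>12}: payroll saved = ${int(salary[idx].sum()):,}, "
          f"mean VA of laid-off teachers = {float(value_added[idx].mean()):+.2f}")
```

By construction, the seniority policy dismisses the cheapest teachers (so it saves less per dismissal), and the value-added policy dismisses the lowest scorers (so its laid-off group has lower average scores). Both results follow mechanically from how each policy selects.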

Based in part on these and other analyses, critics have a pretty solid argument on the surface: Seniority makes us “fire good teachers” simply because they don’t have enough experience, and we can fire fewer teachers if we use “quality” instead of seniority. 

To be clear: I think that there is a sound case for exploring alternatives to seniority-based layoffs, but many of the recent arguments for so-called “quality-based” layoffs have been so simplistic and reactionary that they may actually serve to deter serious conversations about how to change these practices.

The 5-10 Percent Solution

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post.

In the world of education policy, the following assertion has become ubiquitous: If we just fire the bottom 5-10 percent of teachers, our test scores will be at the level of the highest-performing nations, such as Finland. Michelle Rhee likes to make this claim. So does Bill Gates.

The source and sole support for this claim is a calculation by economist Eric Hanushek, which he sketches out roughly in a chapter of the edited volume Creating a New Teaching Profession (published by the Urban Institute). The chapter is called "Teacher Deselection" (“deselection” is a polite way of saying “firing”). Hanushek is a respected economist who has been researching education for over 30 years. He is willing to say some of the things that many other market-based reformers believe and say privately, but won’t always admit to in public.

So, would systematically firing large proportions of teachers every year based solely on their students’ test scores improve overall scores over time? Of course it would, at least to some degree. When you repeatedly select (or, in this case, deselect) on a measurable variable, even when the measurement is imperfect, you can usually change that outcome overall.
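A minimal simulation makes the point, assuming (purely for illustration) that the measure is extremely noisy, that the bottom eight percent or so are deselected each year, and that replacements are drawn from the same distribution as the existing workforce:

```python
import numpy as np

rng = np.random.default_rng(2)
workforce = rng.normal(0.0, 1.0, 10_000)  # true (unobserved) effectiveness

for year in range(1, 6):
    # The measure is the truth plus a lot of noise (invented reliability).
    measured = workforce + rng.normal(0.0, 1.0, len(workforce))
    keep = measured > np.quantile(measured, 0.08)  # deselect the bottom 8%
    # Simplifying assumption: replacements are drawn from the same
    # distribution as the original workforce.
    replacements = rng.normal(0.0, 1.0, int((~keep).sum()))
    workforce = np.concatenate([workforce[keep], replacements])
    print(f"year {year}: mean true effectiveness = {float(workforce.mean()):+.3f}")
```

Even with a very unreliable measure, mean true effectiveness creeps upward year after year. The question is not whether deselection "works" in this mechanical sense, but whether the gains could plausibly be Finland-sized, and at what cost.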

But anyone who says that firing the bottom 5-10 percent of teachers is all we have to do to boost our scores to Finland-like levels is selling magic beans—and not only because of cross-national poverty differences or the inherent limitations of most tests as valid measures of student learning (we’ll put these very real concerns aside for this post).

The War On Error

The debate on the use of value-added models (VAM) in teacher evaluations has reached an impasse of sorts. Opponents of VAM use contend that the imprecision is too high for the measures to be used in evaluation; supporters argue that current systems are inadequate, and that, while all measures entail error, this doesn’t preclude using the estimates.

This back-and-forth may be missing the mark, and it is not particularly useful in the states and districts that are already moving ahead. The more salient issue, in my view, is less about the amount of error than about how it is dealt with when the estimates are used (along with other measures) in evaluation systems.

Teachers certainly understand that some level of imprecision is inherent in any evaluation method—indeed, many will tell you about colleagues who shouldn’t be in the classroom, but receive good evaluation ratings from principals year after year. Proponents of VAM often point to this tendency of current evaluation systems to give “false positive” ratings as a reason to push forward quickly. But moving so carelessly that we disregard the error in current VAM estimates—and possible methods to reduce its negative impacts—is no different than ignoring false positives in existing systems.
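What "dealing with" error might look like in practice: one common approach (sketched below, not a description of any particular system) is to treat a teacher's value-added estimate as meaningfully above or below average only when the confidence interval around it excludes the average. The numbers here are invented for illustration:

```python
# Invented value-added estimates and standard errors, for illustration.
estimates = [0.30, 0.05, -0.22, -0.04]
std_errors = [0.10, 0.12, 0.09, 0.11]
z = 1.96  # roughly a 95 percent confidence interval

for est, se in zip(estimates, std_errors):
    low, high = est - z * se, est + z * se
    if low > 0:
        label = "above average"
    elif high < 0:
        label = "below average"
    else:
        label = "cannot be distinguished from average"
    print(f"estimate {est:+.2f} (95% CI {low:+.2f} to {high:+.2f}): {label}")
```

Under a rule like this, only teachers whose estimates are clearly distinguishable from average would be treated as such; the rest would be judged primarily on the system's other measures. The design choice is how wide to set the interval, which is a policy decision as much as a statistical one.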

Teacher Value-Added Scores: Publish And Perish

On the heels of the Los Angeles Times’ August decision to publish a database of teachers’ value-added scores, New York City newspapers are poised to do the same, with the hearing scheduled for late November.

Here’s a proposition: Those who support the use of value-added models (VAM) for any purpose should be lobbying against the release of teachers’ names and value-added scores.

The reason? Publishing the names directly compromises the accuracy of an already-compromised measure. Those who blindly advocate for publication – often saying things like “what’s the harm?" – betray their lack of knowledge about the importance of the models’ core assumptions, and the implications they carry for the accuracy of results. Indeed, the widespread publication of these databases may even threaten VAM’s future utility in public education.
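To make the concern about core assumptions concrete: value-added models generally assume that, conditional on the controls, students are not sorted to teachers on unmeasured characteristics. The simulation below is a deliberately crude sketch of how publication could break that assumption; the "lobbying" mechanism and all parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers, class_size = 500, 25

true_eff = rng.normal(0.0, 0.10, n_teachers)
rating_error = rng.normal(0.0, 0.10, n_teachers)
published = true_eff + rating_error  # the published scores contain error

def next_year_gains(sorting: bool) -> np.ndarray:
    gains = np.empty(n_teachers)
    for t in range(n_teachers):
        students = rng.normal(0.0, 0.20, class_size)  # unmeasured growth
        if sorting:
            # Crude stand-in for families maneuvering toward teachers
            # with high published scores: class composition shifts
            # with the published rating.
            students += 0.5 * published[t]
        gains[t] = true_eff[t] + students.mean()
    return gains

for label, sorting in [("no sorting", False), ("sorting on published scores", True)]:
    g = next_year_gains(sorting)
    corr = np.corrcoef(g, rating_error)[0, 1]
    print(f"{label}: corr(next year's gains, rating error) = {corr:+.2f}")
```

Once families can maneuver toward teachers with high published scores, a teacher's future measured gains start to reflect the error in the published score itself, entrenching that error rather than correcting it.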

Teacher Quality On The Red Carpet; Accuracy Swept Under The Rug

The media campaign surrounding “Waiting for Superman," which has already drawn considerable coverage, only promises to get bigger. While I would argue – at least in theory – that increased coverage of education is a good thing, it also means that this is a critically important (and, in some respects, dangerous) time. Millions of new people will be tuning in, believing that they are hearing serious discussions about the state of public education in America and “what the research says” about how it can be improved.

It’s therefore a sure bet that what I’ve called the “teacher effects talking point” will be making regular appearances. It goes something like this: Teachers are the most important schooling factor influencing student achievement. This argument provides much of the empirical backbone for the current push toward changes in teacher personnel policies. It is an important finding based on high-quality research, one with clear policy implications. It is also potentially very misleading.

The same body of evidence that shows that teachers are the most important within-school factor influencing test score gains also demonstrates that non-school factors matter a great deal more. The first wave of high-profile articles in our newly energized education debate is not merely failing to provide this context; it is ignoring it completely. Deliberately or not, these articles are publishing incorrect information dressed up as empirical fact, spreading it to a mass audience new to the topic, to the detriment of us all.
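The arithmetic behind this point is simple enough to write down. The shares below are invented purely to illustrate the logic; they are not estimates from the research literature:

```python
# Invented shares of the variation in student achievement, chosen only
# to illustrate the logic; they are not estimates from the literature.
share_non_school = 0.60  # out-of-school factors (family, poverty, etc.)
share_school = 0.20      # all school-based factors combined
share_other = 0.20       # unexplained variation and measurement error

teacher_share_of_school = 0.50  # teachers as the largest school factor

teacher_share_of_total = share_school * teacher_share_of_school
print(f"Teachers' share of total variation:     {teacher_share_of_total:.0%}")
print(f"Non-school factors' share of variation: {share_non_school:.0%}")
# Teachers can be the single largest *school* factor while still
# explaining far less of the overall variation than non-school factors.
```

Whatever the true shares turn out to be, the structure of the claim is the same: "largest within-school factor" is a statement about one slice of the pie, not about the whole pie.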

Are Value-Added Models Objective?

In recent discussions about teacher evaluation, some people try to distinguish between "subjective" measures (such as principal and peer observations) and "objective" measures (usually referring to value-added estimates of teachers’ effects on student test scores).

In practical usage, objectivity refers to the relative absence of bias from human judgment ("pure" objectivity being unattainable). Value-added models are called "objective" because they use standardized testing data and a single tool for analyzing them: All students in a given grade/subject take the same test and all teachers’ "effects" in a given district or state are estimated by the same model. Put differently, all teachers are treated the same (at least those 25 percent or so who teach grades and subjects that are tested), and human judgment is relatively absent.

By this standard, are value-added models objective? No. And it is somewhat misleading to suggest that they are.
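A concrete way to see the human judgment hiding inside the "single tool" is to fit two defensible specifications to the same data and compare the rankings they produce. Everything in the sketch below is simulated and invented; the point is only that the choice between specifications, a choice made by people, changes who looks effective:

```python
import numpy as np

rng = np.random.default_rng(4)
n_teachers, class_size = 300, 25

teacher_eff = rng.normal(0.0, 0.10, n_teachers)

# Simulate classrooms with non-random sorting: later-indexed teachers
# get students with higher prior scores, and priors also predict gains.
prior, gain, teacher_id = [], [], []
for t in range(n_teachers):
    p = rng.normal(t / n_teachers - 0.5, 0.5, class_size)
    g = teacher_eff[t] + 0.3 * p + rng.normal(0.0, 0.3, class_size)
    prior.append(p)
    gain.append(g)
    teacher_id.append(np.full(class_size, t))
prior, gain, teacher_id = map(np.concatenate, (prior, gain, teacher_id))

# Spec A: a teacher's "effect" is the raw average gain of the class.
va_a = np.array([gain[teacher_id == t].mean() for t in range(n_teachers)])

# Spec B: first adjust gains for prior scores, then average by teacher.
slope = np.polyfit(prior, gain, 1)[0]
residual = gain - slope * prior
va_b = np.array([residual[teacher_id == t].mean() for t in range(n_teachers)])

# Same data, two defensible specifications, different answers:
ranks_a = np.argsort(np.argsort(va_a))
ranks_b = np.argsort(np.argsort(va_b))
print("rank correlation between specs:",
      round(float(np.corrcoef(ranks_a, ranks_b)[0, 1]), 2))
top_a, top_b = set(np.argsort(va_a)[-30:]), set(np.argsort(va_b)[-30:])
print("overlap in each spec's top 30 teachers:", len(top_a & top_b))
```

Which controls to include, how to handle prior scores, whether to adjust for classroom composition: each is a judgment call, and each moves teachers up and down the rankings. Calling the output "objective" obscures where those judgments were made.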