Certainty And Good Policymaking Don't Mix

Using value-added and other types of growth model estimates in teacher evaluations has probably been the most controversial and oft-discussed issue in education policy over the past few years.

Many people (including a large proportion of teachers) are opposed to using student test scores in their evaluations, as they feel that the measures are not valid or reliable, and that they will incentivize perverse behavior, such as cheating or competition between teachers. Advocates, on the other hand, argue that student performance is a vital part of teachers’ performance evaluations, and that the growth model estimates, while imperfect, represent the best available option.

I am sympathetic to both views. In fact, in my opinion, there are only two unsupportable positions in this debate: Certainty that using these measures in evaluations will work; and certainty that it won’t. Unfortunately, that’s often how the debate has proceeded – two deeply entrenched sides convinced of their absolutist positions, and resolved that any nuance in, or compromise of, their views will only preclude the success of their efforts. You’re either with them or against them. The problem is that it’s the nuance – the details – that determines policy effects.

Let’s be clear about something: I'm not aware of a shred of evidence – not a shred – that the use of growth model estimates in teacher evaluations improves the performance of either teachers or students.

Our Annual Testing Data Charade

Every year, around this time, states and districts throughout the nation release their official testing results. Schools are closed and reputations are made or broken by these data. But this annual tradition is, in some places, becoming a charade.

Most states and districts release two types of assessment data every year (by student subgroup, school and grade): Average scores (“scale scores”); and the percent of students who meet the standards to be labeled proficient, advanced, basic and below basic. The latter type – the rates – are of course derived from the scores – that is, they tell us the proportion of students whose scale scores were at or above the minimum necessary to be considered proficient, advanced, etc.
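To make that derivation concrete, here is a minimal sketch in Python; the cut scores and student scores are hypothetical, not any state’s actual values:

```python
import numpy as np

# Hypothetical cut scores for one grade/subject (illustrative only).
CUTS = {"below basic": 0, "basic": 600, "proficient": 650, "advanced": 700}

def performance_rates(scale_scores):
    """Return the share of students at each performance level,
    given raw scale scores and the cut-score thresholds above."""
    scores = np.asarray(scale_scores)
    labels = list(CUTS)            # ordered lowest to highest
    cuts = list(CUTS.values())
    # np.digitize assigns each score to the highest cut it meets or exceeds.
    bins = np.digitize(scores, cuts[1:])   # 0=below basic ... 3=advanced
    return {label: float(np.mean(bins == i)) for i, label in enumerate(labels)}

# Five students; two are at or above the proficiency cut (one of them advanced).
print(performance_rates([590, 610, 645, 655, 720]))
# {'below basic': 0.2, 'basic': 0.4, 'proficient': 0.2, 'advanced': 0.2}
```

Note the asymmetry: everything the rates can tell you is already contained in the scores, but the reverse is not true, because the rates throw away how far above or below the cut each student falls.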

Both types of data are cross-sectional. They don’t follow individual students over time, but rather give a “snapshot” of aggregate performance among two different groups of students (for example, third graders in 2010 compared with third graders in 2011). Calling the change in these results “progress” or “gains” is inaccurate; they are cohort changes, and might just as well be chalked up to differences in the characteristics of the students (especially when changes are small). Even averaged across an entire school or district, there can be huge differences in the groups compared between years – not only is there often considerable student mobility in and out of schools/districts, but every year, a new cohort enters at the lowest tested grade, while a whole other cohort exits at the highest tested grade (except for those retained).
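A toy example (hypothetical scores, in Python) makes the distinction concrete: the published grade-level “change” can be negative even when every tracked student improved, because the two years’ figures describe different children.

```python
import numpy as np

# Simulated third-grade scores in one school. 2010's third graders become
# 2011's fourth graders; a NEW cohort of third graders arrives in 2011.
third_2010 = np.array([600, 620, 640, 660, 680])   # mean 640
third_2011 = np.array([590, 605, 630, 650, 665])   # mean 628 (different kids)

# Meanwhile, every 2010 third grader actually improved in fourth grade:
fourth_2011 = third_2010 + 15

print("cohort 'change':", third_2011.mean() - third_2010.mean())          # -12.0
print("true growth of tracked students:", (fourth_2011 - third_2010).mean())  # +15.0
```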

For these reasons, any comparisons between years must be done with extreme caution, but the most common way – simply comparing proficiency rates between years – is in many respects the worst. A closer look at this year’s New York City results illustrates this perfectly.
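As a simulated illustration of the problem (a hypothetical cut score and score distributions, not the actual NYC data), the sketch below shows how a proficiency rate can move substantially even when average performance is completely flat, simply because scores shift around the cut point:

```python
import numpy as np

rng = np.random.default_rng(0)
CUT = 650  # hypothetical proficiency cut score

# Two simulated cohorts with identical average scores (640);
# the second cohort is merely bunched a bit more tightly below the cut.
cohort_2010 = rng.normal(loc=640, scale=40, size=10_000)
cohort_2011 = rng.normal(loc=640, scale=25, size=10_000)

for year, scores in [("2010", cohort_2010), ("2011", cohort_2011)]:
    rate = np.mean(scores >= CUT)
    print(f"{year}: mean score = {scores.mean():.1f}, proficient = {rate:.1%}")

# Mean scores are identical, but the proficiency rate drops by several
# points: a "decline" that reflects the shape of the distribution around
# the cut, not how much the typical student knows.
```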

Teachers' Preparation Routes And Policy Views

In a previous post, I lamented the scarcity of survey data measuring what teachers think of different education policy reforms. A couple of weeks ago, the National Center for Education Information (NCEI) released the results of their teacher survey (conducted every five years), which provides a useful snapshot of teachers’ opinions toward different policies (albeit not at the level of detail that one might wish).

There are too many interesting results to review in one post, and I encourage you to take a look at the full set yourself. There was, however, one thing about the survey tabulations that I found particularly striking, and that was the high degree to which policy opinions differed between traditionally-certified teachers and those who entered teaching through alternative certification (alt-cert).

In the figure below, I reproduce data from the NCEI report’s battery of questions about whether teachers think different policies would “improve education." Respondents are divided by preparation route – traditional and alternative.

Test-Based Teacher Evaluations Are The Status Quo

We talk a lot about the “status quo” in our education debates. For instance, there is a common argument that the failure to use evidence of “student learning” (in practice, usually defined in terms of test scores) in teacher evaluations represents the “status quo” in this (very important) area.

Now, the implication that “anything is better than the status quo” is a rather massive fallacy in public policy, as it assumes that the benefits of alternatives will outweigh their costs, and that there is no chance the replacement policy will have a negative impact (almost always an unsafe assumption). But, in the case of teacher evaluations, the “status quo” is no longer what people seem to think.

Not counting Puerto Rico and Hawaii, the ten largest school districts in the U.S. are (in order): New York City; Los Angeles; Chicago; Dade County (FL); Clark County (NV); Broward County (FL); Houston; Hillsborough (FL); Orange County (FL); and Palm Beach County (FL). Together, they serve about eight percent of all K-12 public school students in the U.S., and over one in ten of the nation’s low-income children.

Although details vary, every single one of them is either currently using test-based measures of effectiveness in its evaluations, or is in the process of designing/implementing these systems (most due to statewide legislation).

Merit Pay: The End Of Innocence?

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

The current teacher salary scale has come under increasing fire, and for good reason. Systems in which people are treated more or less the same suffer from two basic problems. First, there will always be a number of "free riders." Second, and relatedly, some people may feel their contributions aren’t sufficiently recognized. So, what are good alternatives? I am not sure; but based on decades’ worth of economic and psychological research, measures such as merit pay are not it.

Although individual pay for performance (or merit pay) is a widespread practice among U.S. businesses, the research on its effectiveness shows it to be of limited utility (see here, here, here, and here), mostly because its benefits are easily swamped by unintended consequences. Indeed, psychological research indicates that a focus on financial rewards may serve to (a) reduce intrinsic motivation, (b) heighten stress to the point that it impairs performance, and (c) promote a narrow focus that reduces performance on every dimension except the one being measured.

In 1971, a research psychologist named Edward Deci published a paper concluding that, while verbal reinforcement and positive feedback tend to strengthen intrinsic motivation, monetary rewards tend to weaken it. In 1999, Deci and his colleagues published a meta-analysis of 128 studies (see here), again concluding that, when people do things in exchange for external rewards, their intrinsic motivation tends to diminish. That is, once a certain activity is associated with a tangible reward, such as money, people will be less inclined to participate in the task when the reward is not present. Deci concluded that extrinsic rewards make it harder for people to sustain self-motivation.

Attracting The "Best Candidates" To Teaching

** Also posted here on "Valerie Strauss' Answer Sheet" in the Washington Post

One of the few issues that all sides in the education debate agree upon is the desirability of attracting “better people” into the teaching profession. While this certainly includes the possibility of using policy to lure career-switchers, most of the focus is on attracting “top” candidates right out of college or graduate school.

The common way to identify these “top” candidates is by their pre-service (especially college) characteristics and performance. Most commonly, people call for attracting teachers from the “top third” of graduating classes, an outcome frequently cited as the norm in high-performing nations such as Finland. Now, it bears noting that “attracting better people," like “improving teacher quality," is a policy goal, not a concrete policy proposal – it tells us what we want, not how to get it. And how to make teaching more enticing for “top” candidates is still very much an open question (as is the equally important question of how to improve the performance of existing teachers).

In order to answer that question, we need to have some idea of whom we’re pursuing – who are these “top” candidates, and what do they want? I sometimes worry that our conception of this group – in terms of the “top third” and similar constructions – doesn’t quite square with the evidence, and that this misconception might actually be misdirecting, rather than focusing, our policy discussions.

Comparing Teacher Turnover In Charter And Regular Public Schools

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

A couple of weeks ago, a new working paper on teacher turnover in Los Angeles got a lot of attention, and for good reason. Teacher turnover, which tends to be alarmingly high in lower-income schools and districts, has been identified as a major impediment to improvements in student achievement.

Unfortunately, some of the media coverage of this paper has tended to miss the mark. Mostly, we have seen horserace stories focusing on the fact that many charter schools have very high teacher turnover rates, much higher than most regular public schools in LA. The problem is that, as a group, charter school teachers differ significantly from their counterparts in regular public schools. For instance, they tend to be younger and/or less experienced than public school teachers overall; and younger, less experienced teachers tend to exhibit higher levels of turnover across all types of schools. So, if there is more overall churn in charter schools, this may simply be a result of the demographics of the teaching force or other factors, rather than any direct effect of charter schools per se (e.g., more difficult working conditions).

But the important results in this paper aren’t about the amount of turnover in charters versus regular public schools, which can be measured very easily, but rather the likelihood that similar teachers in these two types of schools will exit.
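For readers who want the intuition behind that distinction, here is a minimal sketch in Python using simulated data (the variable names are hypothetical, and the paper’s actual models are far richer): the raw exit-rate gap between sectors can be large even when charter status itself has no effect at all, once the experience difference is built in.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000

# Simulated teachers: the charter workforce is much less experienced.
charter = rng.binomial(1, 0.3, n)
experience = np.where(charter == 1,
                      rng.integers(0, 8, n),    # charter: 0-7 years
                      rng.integers(0, 25, n))   # regular: 0-24 years

# Exit probability falls with experience; charter status has NO direct
# effect here, so any turnover gap is purely compositional.
p_exit = 1 / (1 + np.exp(-(0.5 - 0.15 * experience)))
exited = rng.binomial(1, p_exit)

df = pd.DataFrame({"exited": exited, "charter": charter,
                   "experience": experience})

# Raw comparison: charters look far "churnier."
print(df.groupby("charter")["exited"].mean())

# Conditional comparison: holding experience constant, the estimated
# charter coefficient should be close to zero.
model = smf.logit("exited ~ charter + experience", data=df).fit(disp=0)
print(model.params)
```

The point of the exercise is not that charter schools have no effect on turnover in reality; it is that only the second, conditional comparison can speak to that question.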

Melodramatic

At a press conference earlier this week, New York City Mayor Michael Bloomberg announced the city’s 2011 test results. Wall Street Journal reporter Lisa Fleisher, who was on the scene, tweeted Mayor Bloomberg’s remarks. According to Fleisher, the mayor claimed that there was a “dramatic difference” between his city’s testing progress from 2010 to 2011 and that of the rest of the state.

Putting aside the fact that the results do not measure “progress” per se, but rather cohort changes – a comparison of cross-sectional data that measures the aggregate performance of two different groups of students – I must say that I was a little astounded by this claim. Fleisher was also kind enough to tweet a photograph that the mayor put on the screen in order to illustrate the “dramatic difference” between the gains of NYC students relative to their non-NYC counterparts across the state.  Here it is:

Again, Niche Reforms Are Not The Answer

Our guest author today is David K. Cohen, John Dewey Collegiate Professor of Education and professor of public policy at the University of Michigan, and a member of the Shanker Institute’s board of directors.

A recent response to my previous post on these pages helps to underscore one of my central points: If there is no clarity about what it will take to improve schools, it will be difficult to design a system that can do it. In a recent essay in the Sunday New York Times Magazine, Paul Tough wrote that education reformers who advocated "no excuses" schooling were now making excuses for reformed schools' weak performance. He explained why: "Most likely for the same reason that urban educators from an earlier generation made excuses: successfully educating large numbers of low-income kids is very, very hard."

In his post criticizing my initial essay, "What does it mean to ‘fix the system’?," the Fordham Institute’s Chris Tessone told the story of how Newark Public Schools tried to meet the requirements of a federal school turnaround grant. The terms of the grant required that each of three failing high schools replace at least half of its staff. The schools, he wrote, met this requirement largely by swapping a portion of their staffs with one another, a process which Tessone and school administrators refer to as the “dance of the lemons.” Would such replacement be likely to solve the problem?

Even if all of the replaced teachers had been weak (which we do not know), I doubt that such replacement could have done much to help.

If Gifted And Talented Programs Don't Boost Scores, Should We Eliminate Them?

In education policy debates, the phrase “what works” is sometimes used to mean “what increases test scores." Among those of us who believe that testing data have a productive role to play in education policy (even if we disagree on the details of that role), there is a constant struggle to interpret test-based evidence properly and put it in context. This effort to craft and maintain a framework for using assessment data productively is very important but, despite the careless claims of some public figures, it is also extremely difficult.

Equally important and difficult is the need to apply that framework consistently. For instance, a recent working paper from the National Bureau of Economic Research (NBER) looked at the question of whether gifted and talented (GT) programs boost student achievement. The researchers found that GT programs (and magnet schools as well) have little discernible impact on students’ test score gains. Another recent NBER paper reached the same conclusion about the highly-selective “exam schools” in New York and Boston. Now, it’s certainly true that high-quality research on the test-based effect of these programs is still somewhat scarce, and these are only two (as yet unpublished) analyses, but their conclusions are certainly worth noting.

Still, let’s speculate for a moment: Let’s say that, over the next few years, several other good studies also reached the same conclusion. Would anyone, based on this evidence, be calling for the elimination of GT programs? I doubt it. Yet, if we faithfully applied the standards by which we sometimes judge other policy interventions, we would have to make the case for getting rid of GT programs.