Ohio's New School Rating System: Different Results, Same Flawed Methods

Without question, designing school and district rating systems is a difficult task, and Ohio was somewhat ahead of the curve in attempting to do so (and they're also great about releasing a ton of data every year). As part of its application for ESEA waivers, the state recently announced a newly-designed version of its long-standing system, with the changes slated to go into effect in 2014-15. State officials told reporters that the new scheme is a “more accurate reflection of … true [school and district] quality."

In reality, however, despite its best intentions, Ohio has perpetuated a troubled system by making less-than-substantive changes whose primary purpose seems to be giving lower grades to more schools, so that the results square with preconceptions about the distribution of “true quality." It’s not a better system in terms of measurement - both the new and old schemes consist of mostly the same inappropriate components, and the ratings differentiate schools based largely on student characteristics rather than school performance.

So, whether or not the aggregate results seem more plausible is not particularly important, since the manner in which they're calculated is still deeply flawed. And demonstrating this is very easy.

Beware Of Anecdotes In The Value-Added Debate

A recent New York Times "teacher diary" presents the compelling account of a New York City teacher whose value-added rating was at the 6th percentile in 2009 – one of the lowest scores in the city – and at the 96th percentile the following year, one of the highest. Similar articles - for example, about teachers with errors in their rosters or scores that conflict with their colleagues' and principals' opinions - have been published since the release of the city’s teacher data reports (also see here). These accounts provoke a lot of outrage and disbelief, and that makes sense – they can sound absurd.

Stories like these can be useful as illustrations of larger trends and issues - in this case, of the unfairness of publishing the NYC scores, most of which are based on samples that are too small to provide meaningful information. But, in the debate over using these estimates in actual policy, we need to be careful not to focus too much on anecdotes. For every one NYC teacher whose value-added rank changed by more than 90 points between 2009 and 2010, there are almost 100 teachers whose ranks changed by 10 points or fewer (and percentile ranks overstate the actual size of all these differences). Moreover, even if the models yielded perfect measures of test-based teacher performance, there would still be many implausible fluctuations between years - those that are unlikely to reflect "real" change - due to nothing more than random error.*
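To make that last point concrete, here is a minimal simulation sketch. It is not the model New York City actually uses, and the number of teachers and the reliability figure below are made-up assumptions; the sketch simply holds every teacher's "true" effect fixed across two years and lets independent estimation error do the rest.

```python
# Illustrative simulation (not the NYC model): how much year-to-year
# percentile movement can come from estimation error alone, assuming a
# perfectly stable "true" teacher effect and a hypothetical reliability.
import numpy as np

rng = np.random.default_rng(0)

n_teachers = 10_000       # made-up count, for illustration only
reliability = 0.35        # hypothetical share of variance that is "true" signal

true_effect = rng.normal(0, np.sqrt(reliability), n_teachers)
noise_sd = np.sqrt(1 - reliability)

def percentile_ranks(scores):
    """Convert raw scores to percentile ranks on a 0-100 scale."""
    order = scores.argsort().argsort()
    return 100.0 * order / (len(scores) - 1)

# Two years of estimates: same true effect, independent error each year.
year1 = percentile_ranks(true_effect + rng.normal(0, noise_sd, n_teachers))
year2 = percentile_ranks(true_effect + rng.normal(0, noise_sd, n_teachers))

swing = np.abs(year1 - year2)
print(f"median year-to-year swing: {np.median(swing):.0f} percentile points")
print(f"share swinging 50+ points: {(swing >= 50).mean():.1%}")
print(f"share swinging 90+ points: {(swing >= 90).mean():.2%}")
```

Even with every "true" effect held constant, a run like this produces a nontrivial share of teachers whose percentile ranks move by 50 points or more between years, which is the sense in which large single-teacher swings, by themselves, tell us very little.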

The reliability of value-added estimates, like that of all performance measures (including classroom observations), is an important issue, and is sometimes dismissed by supporters in a cavalier fashion. There are serious concerns here, and no absolute answers. But none of this can be examined or addressed with anecdotes.

The Persistence Of Both Teacher Effects And Misinterpretations Of Research About Them

In a new National Bureau of Economic Research working paper on teacher value-added, researchers Raj Chetty, John Friedman and Jonah Rockoff present results from their analysis of an incredibly detailed dataset linking teachers and students in one large urban school district. The data include students’ testing results between 1991 and 2009, as well as proxies for future student outcomes, mostly from tax records, including college attendance (whether they were reported to have paid tuition or received scholarships), childbearing (whether they claimed dependents) and eventual earnings (as reported on the returns). Needless to say, the actual analysis includes only those students for whom testing data were available, and who could be successfully linked with teachers (with the latter group of course limited to those teaching math or reading in grades 4-8).

The paper caused a remarkable stir last week, and for good reason: It’s one of the most dense, important and interesting analyses on this topic in a very long time. Much of the reaction, however, was less than cautious, specifically the manner in which the research findings were interpreted to support actual policy implications (also see Bruce Baker’s excellent post).

What this paper shows – using an extremely detailed dataset and sophisticated, thoroughly-documented methods – is that teachers matter, perhaps in ways that some didn’t realize. What it does not show is how to measure and improve teacher quality, which are still open questions. This is a crucial distinction, one which has been discussed on this blog numerous times (also here and here), as it is frequently obscured or outright ignored in discussions of how research findings should inform concrete education policy.

The Year In Research On Market-Based Education Reform: 2011 Edition

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

If 2010 was the year of the bombshell in research in the three “major areas” of market-based education reform – charter schools, performance pay, and value-added in evaluations – then 2011 was the year of the slow, sustained march.

Last year, the landmark Race to the Top program was accompanied by a set of extremely consequential research reports, ranging from the policy-related importance of the first experimental study of teacher-level performance pay (the POINT program in Nashville) and the preliminary report of the $45 million Measures of Effective Teaching project, to the political controversy of the Los Angeles Times’ release of teachers’ scores from its commissioned analysis of Los Angeles testing data.

In 2011, on the other hand, as new schools opened and states and districts went about the hard work of designing and implementing new evaluation and compensation systems, the research almost seemed to adapt to the situation. There were few (if any) "milestones," but rather a steady flow of papers and reports focused on the finer-grained details of actual policy.*

Nevertheless, a review of this year's research shows that one thing remained constant: Despite all the lofty rhetoric, what we don’t know about these interventions outweighs what we do know by an order of magnitude.

Education Advocacy Organizations: An Overview

Our guest author today is Ken Libby, a graduate student studying educational foundations, policy and practice at the University of Colorado at Boulder.

Education advocacy organizations (EAOs) come in a variety of shapes and sizes. Some focus on specific issues (e.g. human capital decisions, forms of school choice, class size) while others approach policy more broadly (e.g. changing policy environments, membership decisions). Proponents of these organizations claim they exist, at least in part, to provide a counterbalance to various other powerful interest groups.

In just the past few years, Stand for Children, Democrats for Education Reform (DFER), 50CAN, and StudentsFirst have emerged as well-organized, well-funded groups capable of influencing education policy. While these four groups support some of the same legislation - most notably teacher evaluations based in part on test scores and the expansion of school choice - each group has some distinct characteristics that are worth noting.

One thing’s for sure: The proliferation of EAOs, especially during the past five or six years, is playing a critical role in certain education policy decisions and discussions. They are not, contrary to some of the rhetoric, dominating powerhouses, but they aren’t paper tigers either.

A Few Other Principles Worth Stating

Last week, a group of around 25 education advocacy organizations, including influential players such as Democrats for Education Reform and The Education Trust, released a "statement of principles" on the role of teacher quality in the reauthorization of the Elementary and Secondary Education Act (ESEA). The statement, which is addressed to the chairs and ranking members of the Senate and House committees handling the reauthorization, lays out some guidelines for teacher-focused policy in ESEA (a draft of the full legislation was released this week; summary here).

Most of the statement is the standard fare from proponents of market-based reform, some of which I agree with in theory, if not in practice. What struck me as remarkable was the framing argument presented in the statement's second sentence:

Research shows overwhelmingly that the only way to close achievement gaps – both gaps between U.S. students and those in higher-achieving countries and gaps within the U.S. between poor and minority students and those more advantaged – and transform public education is to recruit, develop and retain great teachers and principals.
This assertion is false.

In Ohio, Charter School Expansion By Income, Not Performance

For over a decade, Ohio law has dictated where charter schools can open. Expansion was unlimited in Lucas County (the “pilot district” for charters) and in the “Ohio 8” urban districts (Akron, Canton, Cincinnati, Cleveland, Columbus, Dayton, Toledo, and Youngstown). But, in any given year, charters could open up in any other district that was classified as a “challenged district," as measured by whether the district received a state “report card” rating of “academic watch” or “academic emergency." This is a performance-based standard.

Under this system, there was of course very rapid charter proliferation in Lucas County and the “Ohio 8” districts. Only a small number of other districts (around 20-30 per year) “met” the performance-based standard. In other words, the state’s current charter law was supposed to “open up” districts for charter schools when those districts are not doing well.

Starting next year, the state is adding a fourth criterion: Any district with a “performance index” in the bottom five percent for the state will also be open for charter expansion. Although this may seem like a logical addition, in reality, the change offends basic principles of both fairness and educational measurement.

Value-Added: Theory Versus Practice

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

About two weeks ago, the National Education Policy Center (NEPC) released a review of last year’s Los Angeles Times (LAT) value-added analysis – with a specific focus on the technical report upon which the paper’s articles were based (done by RAND’s Richard Buddin). In line with prior research, the critique’s authors – Derek Briggs and Ben Domingue – redid the LAT analysis and found that teachers’ scores vary widely, but that the LAT estimates change under different model specifications, are error-prone, and conceal systematic bias arising from non-random classroom assignment. They were also, for reasons yet unknown, unable to replicate the results.
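To give a sense of why specification choices can matter so much for rankings, here is a minimal simulated sketch. It is not the LAT/Buddin model or the reviewers' respecification; the counts and effect sizes are made-up assumptions, and the two "specifications" are constructed directly rather than estimated, purely for brevity. The point is only that, when classroom assignment is non-random, including or omitting a single classroom-level adjustment can move many teachers' ranks substantially.

```python
# Illustrative sketch (simulated data, not the LA Times/Buddin models): how
# teacher rankings can shift when a value-added model does or does not
# adjust for a classroom characteristic tied to non-random assignment.
# All quantities (counts, effect sizes) are made-up assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 300, 25

true_effect = rng.normal(0, 1, n_teachers)     # "real" teacher effects
disadvantage = rng.normal(0, 1, n_teachers)    # classroom-level factor
sampling_error = rng.normal(0, 1 / np.sqrt(class_size), n_teachers)

# Spec A omits the classroom factor, so its influence is loaded onto the
# teacher estimates; Spec B adjusts for it.
spec_a = true_effect - 0.8 * disadvantage + sampling_error
spec_b = true_effect + sampling_error

rank_a = 100 * spec_a.argsort().argsort() / (n_teachers - 1)
rank_b = 100 * spec_b.argsort().argsort() / (n_teachers - 1)

shift = np.abs(rank_a - rank_b)
print(f"median rank shift between specifications: {np.median(shift):.0f} points")
print(f"share shifting 20+ percentile points: {(shift >= 20).mean():.0%}")
```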

Since then, the Times has issued two responses. The first was a quickly-published article, which claimed (including in the headline) that the LAT results were confirmed by Briggs/Domingue – even though the review reached the opposite conclusions. The basis for this claim, according to the piece, was that both analyses showed wide variation in teachers’ effects on test scores (see NEPC’s reply to this article). Then, a couple of days ago, there was another response, this time on the Times’ ombudsman-style blog. This piece quotes the paper’s Assistant Managing Editor, David Lauter, who stands by the paper’s findings and the earlier article, arguing that the biggest question is:

...whether teachers have a significant impact on what their students learn or whether student achievement is all about ... factors outside of teachers’ control. ... The Colorado study comes down on our side of that debate. ... For parents and others concerned about this issue, that’s the most significant finding: the quality of teachers matters.
Saying “teachers matter” is roughly equivalent to saying that teacher effects vary widely - the more teachers vary in their effectiveness, controlling for other relevant factors, the more they can be said to “matter” as a factor explaining student outcomes. Since both analyses found such variation, the Times claims that the NEPC review confirms its “most significant finding."
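As a rough sketch of that equivalence (generic notation, not the specific model used by either the Times' analyst or the reviewers):

```latex
% A generic value-added decomposition (illustrative notation only):
A_{ij} = \beta X_{ij} + \tau_j + \varepsilon_{ij}
% A_{ij}: student i's score with teacher j
% X_{ij}: prior achievement and other controls
% \tau_j: the teacher effect; \varepsilon_{ij}: everything left unexplained

% "Teacher effects vary widely" is the claim that the teacher share of the
% post-control variance is substantial:
\frac{\mathrm{Var}(\tau_j)}{\mathrm{Var}(\tau_j) + \mathrm{Var}(\varepsilon_{ij})} \gg 0
```

Both analyses support the claim that this variance component is large. Whether any particular set of estimated teacher effects is accurate and unbiased enough to publish or act on is a separate question.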

The review’s authors had a much different interpretation (see their second reply). This may seem frustrating. All the back and forth has mostly focused on somewhat technical issues, such as model selection, sample comparability, and research protocol (with some ethical charges thrown in for good measure). These are essential matters, but there is also an even simpler reason for the divergent interpretations, one that is critically important and arises constantly in our debates about value-added.

Premises, Presentation And Predetermination In The Gates MET Study

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

The National Education Policy Center today released a scathing review of last month’s preliminary report from the Gates Foundation-funded Measures of Effective Teaching (MET) project. The critique was written by Jesse Rothstein, a highly-respected Berkeley economist and author of an elegant and oft-cited paper demonstrating how non-random classroom assignment biases value-added estimates (also see the follow-up analysis).

Very quickly on the project: Over two school years (this year and last), MET researchers, working in six large districts—Charlotte-Mecklenburg, Dallas, Denver, Hillsborough County (FL), Memphis, and New York City—have been gathering an unprecedented collection of data on teachers and students, grades 4-8. Using a variety of assessments, videotapes of classroom instruction, and surveys (student surveys are featured in the preliminary report), the project is attempting to address some of the heretofore under-addressed issues in the measurement of teacher quality (especially non-random classroom assignment and how different classroom practices lead to different outcomes, neither of which are part of this preliminary report). The end goal is to use the information to guide the creation of more effective teacher evaluation systems that incorporate high-quality multiple measures.

Despite my disagreements with some of the Gates Foundation’s core views about school reform, I think that they deserve a lot of credit for this project. It is heavily resourced, the research team is top-notch, and the issues they’re looking at are huge. Done correctly, the study is very, very important.

But Rothstein’s general conclusion about the initial MET report is that the results “do not support the conclusions drawn from them." Very early in the review, the following assertion also jumps off the page: “there are troubling indications that the Project’s conclusions were predetermined."