If Newspapers Are Going To Publish Teachers' Value-Added Scores, They Need To Publish Error Margins Too

It seems as though New York City newspapers are going to receive the value-added scores of the city’s public school teachers, and publish them in an online database, as was the case in Los Angeles.*

In my opinion, the publication will not only serve no useful educational purpose, but will also be a grossly unfair infringement on the privacy of teachers. I have also argued previously that putting the estimates online may bias future results by exacerbating the non-random assignment of students to teachers (parents requesting [or not requesting] specific teachers based on published ratings), though it's worth noting that the city is now using a different model.

That said, I don’t think there’s any way to avoid publication, given that about a dozen newspapers will receive the data, and it’s unlikely that every one of them will decline to publish. So, in addition to expressing my firm opposition, I would offer what I consider to be an absolutely necessary suggestion: If newspapers are going to publish the estimates, they need to publish the error margins too.
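
To illustrate why this matters, here is a minimal sketch – with invented numbers, not the city's actual model or data – of how a value-added point estimate and its standard error combine into an error margin:

```python
# A minimal sketch of why error margins matter for published value-added
# scores. All numbers below are invented for illustration.

def confidence_interval(estimate, std_error, z=1.96):
    """Return an approximate 95% confidence interval for an estimate."""
    margin = z * std_error
    return estimate - margin, estimate + margin

# Three hypothetical teachers: a point estimate (in percentile terms,
# where the district average is 50) and the standard error of that estimate.
teachers = {
    "Teacher A": (62, 12),  # looks above average
    "Teacher B": (50, 10),  # looks exactly average
    "Teacher C": (41, 14),  # looks below average
}

for name, (est, se) in teachers.items():
    low, high = confidence_interval(est, se)
    distinguishable = not (low <= 50 <= high)
    print(f"{name}: estimate {est}, 95% CI ({low:.0f}, {high:.0f}), "
          f"distinguishable from average: {distinguishable}")
```

Despite point estimates that look quite different, none of these three hypothetical teachers can be statistically distinguished from the district average – and margins of roughly this width are not unusual for estimates based on a single year of data. Publishing the point estimates without the error margins invites readers to see differences that the data cannot support.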

Guessing About NAEP Results

Every two years, the release of data from the National Assessment of Educational Progress (NAEP) generates a wave of research and commentary trying to explain short- and long-term trends. For instance, there have been a bunch of recent attempts to “explain” the increase in aggregate NAEP scores during the late 1990s and 2000s. Some analyses postulate that the accountability provisions of No Child Left Behind (NCLB) were responsible, while more recent arguments have focused on the “effect” (or lack thereof) of newer market-based reforms – for example, looking to NAEP data to “prove” or “disprove” the idea that changes in teacher personnel and other policies have (or have not) generated “gains” in student test scores.

The basic idea here is that, for every increase or decrease in cross-sectional NAEP scores over a given period of time (both for all students and especially for subgroups such as minority and low-income students), there must be “something” in our education system that explains it. In many (but not all) cases, these discussions consist of little more than speculation. Discernible trends in NAEP test score data are almost certainly due to a combination of factors, and it’s unlikely that any one policy or set of policies is dominant enough to be identified as “the one.” Now, there’s nothing necessarily wrong with speculation, so long as it is clearly identified as such and conclusions are presented accordingly. But I find it curious that some people involved with these speculative arguments seem a bit too willing to assume that schooling factors – rather than changes in cohorts’ circumstances outside of school – are the primary driver of NAEP trends.

So, let me try a little bit of illustrative speculation of my own: I might argue that changes in the economic conditions of American schoolchildren and their families are the most compelling explanation for changes in NAEP.

A Case For Value-Added In Low-Stakes Contexts

Most of the controversy surrounding value-added and other test-based models of teacher productivity centers on the high-stakes use of these estimates. This is unfortunate – no matter what you think about these methods in the high-stakes context, they have a great deal of potential to improve instruction.

When supporters of value-added and other growth models talk about low-stakes applications, they tend to assert that the data will inspire and motivate teachers who are completely unaware that they’re not raising test scores. In other words, confronted with value-added evidence that their performance is subpar (at least insofar as tests are an indication), teachers will rethink their approach. I don’t find this very compelling. Value-added data will not help teachers – even those who believe in their utility – unless they know why their students’ performance appears to be comparatively low. It’s rather like telling a baseball player that they’re not getting hits, or telling a chef that the food is bad – it’s not constructive.

Granted, a big problem is that value-added models are not actually designed to tell us why teachers get different results – i.e., whether certain instructional practices are associated with better student performance. But the data can be made useful in this context; the key is to present the information to teachers in the right way, and rely on their expertise to use it effectively.

A Big Open Question: Do Value-Added Estimates Match Up With Teachers' Opinions Of Their Colleagues?

A recent article about the implementation of new teacher evaluations in Tennessee details some of the complicated issues with which state officials, teachers and administrators are dealing in adapting to the new system. One of these issues is somewhat technical – whether the various components of the evaluations, most notably principal observations and test-based productivity measures (e.g., value-added), tend to “match up.” That is, whether teachers who score high on one measure tend to do similarly well on the other (see here for more on this issue).

In discussing this type of validation exercise, the article notes:

If they don't match up, the system's usefulness and reliability could come into question, and it could lose credibility among educators.

Value-added and other test-based measures of teacher productivity may have a credibility problem among many (but definitely not all) teachers, but I don’t think it’s due to – or can be helped much by – whether or not these estimates match up with observations or other measures being incorporated into states’ new systems. I’m all for this type of research (see here and here), but I’ve never seen what I think would be an extremely useful study for addressing the credibility issue among teachers: One that looked at the relationship between value-added estimates and teachers’ opinions of each other.
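
For what it's worth, the “matching up” question is, at its core, a simple association between two sets of scores for the same teachers. A minimal sketch of such a check – with invented data, and a plain rank correlation rather than any state's actual procedure – might look like this:

```python
# Hypothetical sketch of a "match up" check: how similarly do two
# evaluation components rank the same teachers? All data are invented.
from scipy.stats import spearmanr

value_added = [0.8, -0.2, 1.5, 0.1, -1.1, 0.4, -0.6, 0.9]  # VA estimates
peer_rating = [4.2, 3.1, 4.8, 3.5, 2.4, 3.0, 2.9, 4.0]     # colleagues' ratings (1-5)

rho, p_value = spearmanr(value_added, peer_rating)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
```

Existing comparisons between value-added and observation scores tend to find modest positive correlations; whether teachers' assessments of their colleagues would match up more or less closely is precisely the open question.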

A Look Inside Principals' Decisions To Dismiss Teachers

Despite all the heated talk about how to identify and dismiss low-performing teachers, there’s relatively little research on how administrators choose whom to dismiss, whether various dismissal options might actually serve to improve performance, and other aspects of this area. A paper by economist Brian Jacob, released as a working paper in 2010 and published late last year in the journal Educational Evaluation and Policy Analysis, helps fill at least one of these voids by providing one of the few recent glimpses into administrators’ actual dismissal decisions.

Jacob exploits a change in Chicago Public Schools (CPS) personnel policy that took effect for the 2004-05 school year, one which strengthened principals’ ability to dismiss probationary teachers, allowing non-renewal for any reason, with minimal documentation. He was able to link these personnel records to student test scores, teacher and school characteristics and other variables, in order to examine the characteristics that principals might be considering, directly or indirectly, in deciding who would and would not be dismissed.

Jacob’s findings are intriguing, suggesting a more complicated situation than is sometimes acknowledged in the ongoing debate over teacher dismissal policy.

Fundamental Flaws In The IFF Report On D.C. Schools

A new report, commissioned by District of Columbia Mayor Vincent Gray and conducted by the Chicago-based consulting organization IFF, was supposed to provide guidance on how the District might act and invest strategically in school improvement, including optimizing the distribution of students across schools, many of which are either over- or under-enrolled.

Needless to say, this is a monumental task. Not only does it entail identifying high- and low-performing schools, but also developing plans for improving them. Even the most rigorous effort to achieve these goals, especially in a large city like D.C., would be to some degree speculative and error-prone.

This is not a rigorous effort. IFF’s final report is polished and attractive, with lovely maps and color-coded tables presenting a lot of summary statistics. But there’s no emperor underneath those clothes. The report's data and analysis are so deeply flawed that its (rather non-specific) recommendations should not be taken seriously.

The Perilous Conflation Of Student And School Performance

Unlike many of my colleagues and friends, I personally support the use of standardized testing results in education policy – even, with caution and in a limited role, in high-stakes decisions. That said, I also think that the focus on test scores has gone way too far, and that they are being used unwisely – in many cases to a degree at which I believe the policies will not only fail to generate improvement, but may even risk harm.

In addition, of course, tests have a very productive low-stakes role to play on the ground – for example, when teachers and administrators use the results for diagnosis and to inform instruction.

Frankly, I would be a lot more comfortable with the role of testing data – whether in policy, on the ground, or in our public discourse – but for the relentless flow of misinterpretation from both supporters and opponents. In my experience (which I acknowledge may not be representative of reality), by far the most common mistake is the conflation of student and school performance, as measured by testing results.

Consider the following three stylized arguments, which you can hear in some form almost every week:

Schools' Effectiveness Varies By What They Do, Not What They Are

There may be a mini-trend emerging in certain types of charter school analyses, one that seems a bit trivial but has interesting implications that bear on the debate about charter schools in general. It pertains to how charter effects are presented.

Usually, when researchers estimate the effect of some intervention, the main finding is the overall impact, perhaps accompanied by a breakdown by subgroups and supplemental analyses. In the case of charter schools, this would be the estimated overall difference in performance (usually testing gains) between students attending charters versus their counterparts in comparable regular public schools.
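
To make the contrast concrete, the conventional presentation boils down to something like the following sketch – invented data, not figures from either report – in which the headline number is the overall difference in gains, with subgroups as a secondary breakdown:

```python
# Hypothetical sketch of how an overall charter effect is usually
# reported: the mean difference in test-score gains between charter
# students and comparable regular-public-school students, with a
# subgroup breakdown as a supplement. All data are invented.
from statistics import mean

records = [
    # (sector, subgroup, gain in test-score standard deviations)
    ("charter", "low_income", 0.08), ("charter", "low_income", 0.12),
    ("charter", "other", 0.02), ("charter", "other", -0.01),
    ("regular", "low_income", 0.03), ("regular", "low_income", 0.05),
    ("regular", "other", 0.04), ("regular", "other", 0.01),
]

def effect(rows):
    """Mean charter gain minus mean regular-public-school gain."""
    charter = [g for sector, _, g in rows if sector == "charter"]
    regular = [g for sector, _, g in rows if sector == "regular"]
    return mean(charter) - mean(regular)

print(f"Overall effect: {effect(records):+.3f} SD")
for grp in ("low_income", "other"):
    subset = [r for r in records if r[1] == grp]
    print(f"  {grp}: {effect(subset):+.3f} SD")
```

Even in this toy example, a small positive overall effect conceals a larger gain for one subgroup and a loss for another – which is why the way results are rolled out matters.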

Two relatively recent charter school reports, however – both generally well-done given their scope and available data – have taken a somewhat different approach, at least in the “public roll-out” of their results.

Getting Ready For The Common Core

Our guest author today, Susan B. Neuman, is a professor in Educational Studies at the University of Michigan specializing in early literacy development, and a former U.S. Assistant Secretary for Elementary and Secondary Education. She and her colleagues at the University of Michigan have also partnered with the Albert Shanker Institute in sponsoring a summer institute for early childhood educators, focusing specifically on oral language development and the ways it can support and help build strong content knowledge. For more information, see here.

States are now working intently on developing plans that will make new, common standards a reality. A recent report from Education First and the Editorial Projects in Education Research Center concludes that all but one of the 47 states adopting the Common Core State Standards are now in the implementation phase. Seven states have fully upgraded their professional development, curriculum materials, and evaluation systems in preparation for the 2014-2015 school year.

Nary a word has been spoken about how to prepare teachers to implement common standards appropriately in the early childhood years. Although the emphasis on content-rich instruction that builds knowledge is an important one, standards groups have virtually ignored the early years, when these critical skills first begin to develop.

Young children are eager to learn about the sciences, arts, and the world around them. And, as many early childhood teachers recognize, we need to provide content-rich instruction that is both developmentally appropriate and highly engaging to support students' learning.

A Dark Day For Educational Measurement In The Sunshine State

Just this week, Florida announced its new district grading system. These systems have been popping up all over the nation, and given that designing one is a requirement for states applying for No Child Left Behind waivers, we are sure to see more.

I acknowledge that the designers of these schemes have the difficult job of balancing accessibility and accuracy. Moreover, the latter requirement – accuracy – cannot be directly tested, since we cannot know “true” school quality. As a result, to whatever degree it can be partially approximated using test scores, disagreements over what specific measures to include and how to include them are inevitable (see these brief analyses of Ohio and California).

As I’ve discussed before, there are two general types of test-based measures that typically make up these systems: absolute performance and growth. Each has its strengths and weaknesses. Florida’s attempt to balance these components is a near-total failure, and it shows in the results.
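
To see mechanically what “balancing” involves, here is a minimal sketch of a generic status-plus-growth composite – the weights and letter-grade cutoffs are hypothetical, not Florida's actual formula:

```python
# Generic sketch of a district grading composite that combines an
# absolute performance measure (percent proficient) with a growth
# measure (percent of students making gains). The weights and
# letter-grade cutoffs below are hypothetical, not Florida's formula.

def composite_grade(pct_proficient, pct_making_gains,
                    w_status=0.5, w_growth=0.5):
    """Weighted average of status and growth, mapped to a letter grade."""
    score = w_status * pct_proficient + w_growth * pct_making_gains
    for cutoff, letter in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= cutoff:
            return score, letter
    return score, "F"

# A high-poverty district with strong growth but low absolute scores:
print(composite_grade(pct_proficient=45, pct_making_gains=72))  # (58.5, 'F')
# An affluent district with weak growth but high absolute scores:
print(composite_grade(pct_proficient=85, pct_making_gains=55))  # (70.0, 'C')
```

Because absolute performance is strongly associated with student characteristics such as poverty, while growth measures are less so, the choice of weights in a composite like this largely determines which districts receive high grades.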