The Weighting Game

A while back, I noted that states and districts should exercise caution in assigning weights (importance) to the components of their teacher evaluation systems before they know what the other components will be. For example, most states that have mandated new evaluation systems have specified that growth model estimates count for a certain proportion (usually 40-50 percent) of teachers’ final scores (at least those in tested grades/subjects), but it’s critical to note that the actual importance of these components will depend in no small part on what else is included in the total evaluation, and how it's incorporated into the system.

In slightly technical terms, this distinction is between nominal weights (the percentage assigned) and effective weights (the percentage that actually ends up being the case). Consider an extreme hypothetical example – let’s say a district implements an evaluation system in which half the final score is value-added and half is observations. But let’s also say that every teacher gets the same observation score. In this case, even though the assigned (nominal) weight for value-added is 50 percent, the actual importance (effective weight) will be 100 percent, since every teacher receives the same observation score, and so all the variation between teachers’ final scores will be determined by the value-added component.

This issue of nominal/versus effective weights is very important, and, with exceptions, it gets almost no attention. And it’s not just important in teacher evaluations. It’s also relevant to states’ school/district grading systems. So, I think it would be useful to quickly illustrate this concept in the context of Florida’s new district grading system.

There's No One Correct Way To Rate Schools

Education Week reports on the growth of websites that attempt to provide parents with help in choosing schools, including rating schools according to testing results. The most prominent of these sites is GreatSchools.org. Its test-based school ratings could not be more simplistic – they are essentially just percentile rankings of schools’ proficiency rates as compared to all other schools in their states (the site also provides warnings about the data, along with a bunch of non-testing information).

This is the kind of indicator that I have criticized when reviewing states’ school/district “grading systems." And it is indeed a poor measure, albeit one that is widely available and easy to understand. But it’s worth quickly discussing the fact that such criticism is conditional on how the ratings are employed - there is a difference between the use of testing data to rate schools for parents versus for high-stakes accountability purposes.

In other words, the utility and proper interpretation of data vary by context, and there's no one "correct way" to rate schools. The optimal design might differ depending on the purpose for which the ratings will be used. In fact, the reasons why a measure is problematic in one context might very well be a source of strength in another.

If Your Evidence Is Changes In Proficiency Rates, You Probably Don't Have Much Evidence

Education policymaking and debates are under constant threat from an improbable assailant: Short-term changes in cross-sectional proficiency rates.

The use of rate changes is still proliferating rapidly at all levels of our education system. These measures, which play an important role in the provisions of No Child Left Behind, are already prominent components of many states’ core accountability systems (e..g, California), while several others will be using some version of them in their new, high-stakes school/district “grading systems." New York State is awarding millions in competitive grants, with almost half the criteria based on rate changes. District consultants issue reports recommending widespread school closures and reconstitutions based on these measures. And, most recently, U.S. Secretary of Education Arne Duncan used proficiency rate increases as “preliminary evidence” supporting the School Improvement Grants program.

Meanwhile, on the public discourse front, district officials and other national leaders use rate changes to “prove” that their preferred reforms are working (or are needed), while their critics argue the opposite. Similarly, entire charter school sectors are judged, up or down, by whether their raw, unadjusted rates increase or decrease.

So, what’s the problem? In short, it’s that year-to-year changes in proficiency rates are not valid evidence of school or policy effects. These measures cannot do the job we’re having them do, even on a limited basis. This really has to stop.

Guessing About NAEP Results

Every two years, the release of data from the National Assessment of Educational Progress (NAEP) generates a wave of research and commentary trying to explain short- and long-term trends. For instance, there have been a bunch of recent attempts to “explain” an increase in aggregate NAEP scores during the late 1990s and 2000s. Some analyses postulate that the accountability provisions of NCLB were responsible, while more recent arguments have focused on the “effect” (or lack thereof) of newer market-based reforms – for example, looking to NAEP data to “prove” or “disprove” the idea that changes in teacher personnel and other policies have (or have not) generated “gains” in student test scores.

The basic idea here is that, for every increase or decrease in cross-sectional NAEP scores over a given period of time (both for all students and especially for subgroups such as minority and low-income students), there must be “something” in our education system that explains it. In many (but not all) cases, these discussions consist of little more than speculation. Discernible trends in NAEP test score data are almost certainly due to a combination of factors, and it’s unlikely that one policy or set of policies is dominant enough to be identified as “the one." Now, there’s nothing necessarily wrong with speculation, so long as it is clearly identified as such, and conclusions presented accordingly. But I find it curious that some people involved with these speculative arguments seem a bit too willing to assume that schooling factors – rather than changes in cohorts’ circumstances outside of school – are the primary driver of NAEP trends.

So, let me try a little bit of illustrative speculation of my own: I might argue that changes in the economic conditions of American schoolchildren and their families are the most compelling explanation for changes in NAEP.

Fundamental Flaws In The IFF Report On D.C. Schools

A new report, commissioned by the District of Columbia Mayor Vincent Gray and conducted by the Chicago-based consulting organization IFF, was supposed to provide guidance on how the District might act and invest strategically in school improvement, including optimizing the distribution of students across schools, many of which are either over- or under-enrolled.

Needless to say, this is a monumental task. Not only does it entail the identification of high- and low-performing schools, but plans for improving them as well. Even the most rigorous efforts to achieve these goals, especially in a large city like D.C., would be to some degree speculative and error-prone.

This is not a rigorous effort. IFF’s final report is polished and attractive, with lovely maps and color-coded tables presenting a lot of summary statistics. But there’s no emperor underneath those clothes. The report's data and analysis are so deeply flawed that its (rather non-specific) recommendations should not be taken seriously.

The Perilous Conflation Of Student And School Performance

Unlike many of my colleagues and friends, I personally support the use of standardized testing results in education policy, even, with caution and in a limited role, in high-stakes decisions. That said, I also think that the focus on test scores has gone way too far and their use is being implemented unwisely, in many cases to a degree at which I believe the policies will not only fail to generate improvement, but may even risk harm.

In addition, of course, tests have a very productive low-stakes role to play on the ground – for example, when teachers and administrators use the results for diagnosis and to inform instruction.

Frankly, I would be a lot more comfortable with the role of testing data – whether in policy, on the ground, or in our public discourse – but for the relentless flow of misinterpretation from both supporters and opponents. In my experience (which I acknowledge may not be representative of reality), by far the most common mistake is the conflation of student and school performance, as measured by testing results.

Consider the following three stylized arguments, which you can hear in some form almost every week:

A Dark Day For Educational Measurement In The Sunshine State

Just this week, Florida announced its new district grading system. These systems have been popping up all over the nation, and given the fact that designing one is a requirement of states applying for No Child Left Behind waivers, we are sure to see more.

I acknowledge that the designers of these schemes have the difficult job of balancing accessibility and accuracy. Moreover, the latter requirement – accuracy – cannot be directly tested, since we cannot know “true” school quality. As a result, to whatever degree it can be partially approximated using test scores, disagreements over what specific measures to include and how to include them are inevitable (see these brief analyses of Ohio and California).

As I’ve discussed before, there are two general types of test-based measures that typically comprise these systems: absolute performance and growth. Each has its strengths and weaknesses. Florida’s attempt to balance these components is a near total failure, and it shows in the results.

Performance And Chance In New York's Competitive District Grant Program

New York State recently announced a new $75 million competitive grant program, which is part of its Race to the Top plan. In order to receive some of the money, districts must apply, and their applications receive a score between zero and 115. Almost a third of the points (35) are based on proposals for programs geared toward boosting student achievement, 10 points are based on need, and there are 20 possible points awarded for a description of how the proposal fits into districts’ budgets.

The remaining 50 points – almost half – of the application is based on “academic performance” over the prior year. Four measures are used to produce the 0-50 point score: One is the year-to-year change (between 2010 and 2011) in the district’s graduation rate, and the other three are changes in the state “performance index” in math, English Language Arts (ELA) and science. The “performance index” in these three subjects is calculated using a simple weighting formula that accounts for the proportion of students scoring at levels 2 (basic), 3 (proficient) and 4 (advanced).

The idea of using testing results as a criterion in the awarding of grants is to reward those districts that are performing well. Unfortunately, due to the choice of measures and how they are used, the 50 points will be biased and to no small extent based on chance.

Is California's "API Growth" A Good Measure Of School Performance?

California calls its “Academic Performance Index” (API) the “cornerstone” of its accountability system. The API is calculated as a weighted average of the proportions of students meeting proficiency and other cutoffs on the state exams.

It is a high-stakes measure. “Growth” in schools’ API scores determines whether they meet federal AYP requirements, and it is also important in the state’s own accountability regime. In addition, toward the middle of last month, the California Charter Schools Association called for the closing of ten charter schools based in part on their (three-year) API “growth” rates.

Putting aside the question of whether the API is a valid measure of student performance in any given year, using year-to-year changes in API scores in high-stakes decisions is highly problematic. The API is cross-sectional measure – it doesn’t follow students over time – and so one must assume that year-to-year changes in a school’s index do not reflect a shift in demographics or other characteristics of the cohorts of students taking the tests. Moreover, even if the changes in API scores do in fact reflect “real” progress, they do not account for all the factors outside of schools’ control that might affect performance, such as funding and differences in students’ backgrounds (see here and here, or this Mathematica paper, for more on these issues).

Better data are needed to test these assumptions directly, but we might get some idea of whether changes in schools’ API are good measures of school performance by testing how stable they are over time.

The Ratings Game: New York City Edition

Gotham Schools reports that the New York City Department of Education rolled out this year’s school report card grades by highlighting the grades’ stability between this year and last. That is, they argued that schools’ grades were roughly the same between years, which is supposed to serve as evidence of the system’s quality.

The city’s logic here is generally sound. As I’ve noted before, most schools don’t undergo drastic changes in their operations over the course of a year, and so fluctuations in grades among a large number of schools might serve as a warning sign that there’s something wrong with the measures being used. Conversely, it’s not unreasonable to expect from a high-quality rating system that, over a two-year period, some schools would get higher grades and some lower, but that most would stay put. That was the city’s argument this year.

The only problem is that this wasn’t really the case.