Three Important Distinctions In How We Talk About Test Scores

In education discussions and articles, people (myself included) often say “achievement” when referring to test scores, or “student learning” when talking about changes in those scores. These words reflect implicit judgments to some degree (e.g., that the test scores actually measure learning or achievement). Every once in a while, it’s useful to remind ourselves that scores from even the best student assessments are imperfect measures of learning. But this is so widely understood - certainly in the education policy world, and I would say among the public as well - that the euphemisms are generally tolerated.

And then there are a few common terms or phrases that, in my personal opinion, are not so harmless. I’d like to quickly discuss three of them (all of which I’ve talked about before). All three appear many times every day in newspapers, blogs, and regular discussions. To criticize their use may seem like semantic nitpicking to some people, but I would argue that these distinctions are substantively important and may not be so widely acknowledged, especially among people who aren’t heavily engaged in education policy (e.g., average newspaper readers).

So, here they are, in no particular order.

Herding FCATs

About a week ago, Florida officials went into crisis mode after revealing that the proficiency rate on the state’s writing test (FCAT) dropped from 81 percent to 27 percent among fourth graders, with similarly large drops in the other two grades in which the test is administered (eighth and tenth). The panic was almost immediate. For one thing, performance on the writing FCAT is counted in the state’s school and district ratings. Many schools would end up with lower grades and could therefore face punitive measures.

Understandably, a huge uproar was also heard from parents and community members. How could student performance decrease so dramatically? There was so much blame going around that it was difficult to keep track – the targets included the test itself, the phase-in of the state’s new writing standards, and test-based accountability in general.

Despite all this heated back-and-forth, many people seem to have overlooked one very important, widely applicable lesson here: proficiency rates, which are not "scores," are often extremely sensitive to where you set the bar.
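To make the point about the bar concrete, here is a minimal sketch. It assumes, purely for illustration, that student scores follow a standard normal distribution (mean 0, standard deviation 1); that is not a claim about the actual FCAT writing score distribution, and the function names are hypothetical. The underlying distribution is held fixed and only the cut score moves.

```python
from statistics import NormalDist

# Hypothetical score distribution, standardized to mean 0 and SD 1.
# This is an assumption for illustration; real score distributions
# need not be normal.
SCORES = NormalDist(mu=0.0, sigma=1.0)


def proficiency_rate(cut_score: float) -> float:
    """Share of students scoring at or above the cut score."""
    return 1.0 - SCORES.cdf(cut_score)


def cut_score_for(rate: float) -> float:
    """Cut score that would yield the given proficiency rate."""
    return SCORES.inv_cdf(1.0 - rate)


# Same students, same scores: only the bar is different.
old_bar = cut_score_for(0.81)
new_bar = cut_score_for(0.27)

print(f"bar yielding 81% proficient: {old_bar:+.2f} SD")
print(f"bar yielding 27% proficient: {new_bar:+.2f} SD")
print(f"shift in the bar: {new_bar - old_bar:.2f} SD")
```

Under this assumed distribution, the two proficiency rates describe exactly the same set of scores. The rate reports only how many students clear the line, so it changes whenever the line does.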

There's No One Correct Way To Rate Schools

Education Week reports on the growth of websites that attempt to provide parents with help in choosing schools, including rating schools according to testing results. The most prominent of these sites is GreatSchools.org. Its test-based school ratings could not be more simplistic – they are essentially just percentile rankings of schools’ proficiency rates as compared to all other schools in their states (the site also provides warnings about the data, along with a bunch of non-testing information).
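As a rough sketch of what a percentile ranking of proficiency rates involves, consider the following. The school names and rates are made up, and the ranking rule (percent of other schools with a strictly lower rate) is an assumption; it is not a description of how GreatSchools.org actually computes its ratings.

```python
def percentile_ranks(rates: dict[str, float]) -> dict[str, float]:
    """Percent of the *other* schools with a strictly lower proficiency rate."""
    n_others = len(rates) - 1
    if n_others < 1:
        raise ValueError("need at least two schools to rank")
    return {
        school: 100.0 * sum(other < rate for other in rates.values()) / n_others
        for school, rate in rates.items()
    }


# Hypothetical proficiency rates for four schools in one state.
rates = {
    "School A": 0.42,
    "School B": 0.58,
    "School C": 0.60,
    "School D": 0.91,
}

ranks = percentile_ranks(rates)
for school in rates:
    print(f"{school}: rate {rates[school]:.0%}, percentile rank {ranks[school]:.0f}")
```

In this made-up example, Schools B and C differ by two percentage points in proficiency, while Schools C and D differ by 31, yet each pair is separated by the same number of percentile points. A ranking keeps the order of the rates and discards the distances between them.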

This is the kind of indicator that I have criticized when reviewing states’ school/district “grading systems.” And it is indeed a poor measure, albeit one that is widely available and easy to understand. But it’s worth quickly discussing the fact that such criticism is conditional on how the ratings are employed - there is a difference between the use of testing data to rate schools for parents versus for high-stakes accountability purposes.

In other words, the utility and proper interpretation of data vary by context, and there's no one "correct way" to rate schools. The optimal design might differ depending on the purpose for which the ratings will be used. In fact, the reasons why a measure is problematic in one context might very well be a source of strength in another.

The Challenges Of Pre-K Assessment

In the United States, nearly 1.3 million children attend publicly funded preschool. As enrollment continues to grow, states are under pressure to prove that these programs increase school readiness. Thus, the task of figuring out how best to measure preschoolers’ learning outcomes has become a major policy focus.

First, it should be noted that researchers are almost unanimous in their caution about this subject. There are inherent difficulties in the accurate assessment of very young children’s learning in the fields of language, cognition, socio-emotional development, and even physical development. Young children’s attention spans tend to be short, and there are wide, natural variations in children’s performance in any given domain and on any given day. Thus, great care is advised for both the design and implementation of such assessments (see here, here, and here for examples). The question of whether and how to use these student assessments to determine program or staff effectiveness is even more difficult and controversial (for instance, here and here). Nevertheless, many states are already using various forms of assessment to oversee their preschool investments.

It is difficult to react to this (unsurprising) paradox. Sadly, in education, there is often a disconnect between what we know (i.e., research) and what we do (i.e., policy). But, since our general desire for accountability seems to be here to stay, a case can be made that states should, at a minimum, expand what they measure to reflect learning as accurately and broadly as possible.

So, what types of assessments are better for capturing what a four- or five-year-old knows? How might these assessments be improved?

Interpreting Achievement Gaps In New Jersey And Beyond

** Also posted here on "Valerie Strauss' Answer Sheet" in the Washington Post

A recent statement by the New Jersey Department of Education (NJDOE) attempts to provide an empirical justification for that state’s focus on the achievement gap – the difference in testing performance between subgroups, usually defined in terms of race or income.

Achievement gaps, which receive a great deal of public attention, are very useful in that they demonstrate the differences between student subgroups at any given point in time. This is significant, policy-relevant information, as it tells us something about the inequality of educational outcomes between the groups, which does not come through when looking at overall average scores.

Although paying attention to achievement gaps is an important priority, the NJDOE statement on the issue actually speaks directly to the fact, which is well-established and quite obvious, that one must exercise caution when interpreting these gaps, particularly over time, as measures of student performance.

Guessing About NAEP Results

Every two years, the release of data from the National Assessment of Educational Progress (NAEP) generates a wave of research and commentary trying to explain short- and long-term trends. For instance, there have been a bunch of recent attempts to “explain” an increase in aggregate NAEP scores during the late 1990s and 2000s. Some analyses postulate that the accountability provisions of NCLB were responsible, while more recent arguments have focused on the “effect” (or lack thereof) of newer market-based reforms – for example, looking to NAEP data to “prove” or “disprove” the idea that changes in teacher personnel and other policies have (or have not) generated “gains” in student test scores.

The basic idea here is that, for every increase or decrease in cross-sectional NAEP scores over a given period of time (both for all students and especially for subgroups such as minority and low-income students), there must be “something” in our education system that explains it. In many (but not all) cases, these discussions consist of little more than speculation. Discernible trends in NAEP test score data are almost certainly due to a combination of factors, and it’s unlikely that one policy or set of policies is dominant enough to be identified as “the one.” Now, there’s nothing necessarily wrong with speculation, so long as it is clearly identified as such, and conclusions are presented accordingly. But I find it curious that some people involved with these speculative arguments seem a bit too willing to assume that schooling factors – rather than changes in cohorts’ circumstances outside of school – are the primary driver of NAEP trends.

So, let me try a little bit of illustrative speculation of my own: I might argue that changes in the economic conditions of American schoolchildren and their families are the most compelling explanation for changes in NAEP.

A Case For Value-Added In Low-Stakes Contexts

Most of the controversy surrounding value-added and other test-based models of teacher productivity centers on the high-stakes use of these estimates. This is unfortunate – no matter what you think about these methods in the high-stakes context, they have a great deal of potential to improve instruction.

When supporters of value-added and other growth models talk about low-stakes applications, they tend to assert that the data will inspire and motivate teachers who are completely unaware that they’re not raising test scores. In other words, confronted with the value-added evidence that their performance is subpar (at least as far as tests are an indication), teachers will rethink their approach. I don’t find this very compelling. Value-added data will not help teachers – even those who believe in its utility – unless they know why their students’ performance appears to be comparatively low. It’s rather like telling a baseball player they’re not getting hits, or telling a chef that the food is bad – it’s not constructive.

Granted, a big problem is that value-added models are not actually designed to tell us why teachers get different results – i.e., whether certain instructional practices are associated with better student performance. But the data can be made useful in this context; the key is to present the information to teachers in the right way, and rely on their expertise to use it effectively.

The Perilous Conflation Of Student And School Performance

Unlike many of my colleagues and friends, I personally support the use of standardized testing results in education policy, even, with caution and in a limited role, in high-stakes decisions. That said, I also think that the focus on test scores has gone way too far and that they are being used unwisely, in many cases to a degree at which I believe the policies will not only fail to generate improvement, but may even risk harm.

In addition, of course, tests have a very productive low-stakes role to play on the ground – for example, when teachers and administrators use the results for diagnosis and to inform instruction.

Frankly, I would be a lot more comfortable with the role of testing data – whether in policy, on the ground, or in our public discourse – but for the relentless flow of misinterpretation from both supporters and opponents. In my experience (which I acknowledge may not be representative of reality), by far the most common mistake is the conflation of student and school performance, as measured by testing results.

Consider the following three stylized arguments, which you can hear in some form almost every week:

A Dark Day For Educational Measurement In The Sunshine State

Just this week, Florida announced its new district grading system. These systems have been popping up all over the nation, and given the fact that designing one is a requirement of states applying for No Child Left Behind waivers, we are sure to see more.

I acknowledge that the designers of these schemes have the difficult job of balancing accessibility and accuracy. Moreover, the latter requirement – accuracy – cannot be directly tested, since we cannot know “true” school quality. As a result, to whatever degree it can be partially approximated using test scores, disagreements over what specific measures to include and how to include them are inevitable (see these brief analyses of Ohio and California).

As I’ve discussed before, there are two general types of test-based measures that typically comprise these systems: absolute performance and growth. Each has its strengths and weaknesses. Florida’s attempt to balance these components is a near total failure, and it shows in the results.

Performance And Chance In New York's Competitive District Grant Program

New York State recently announced a new $75 million competitive grant program, which is part of its Race to the Top plan. In order to receive some of the money, districts must apply, and their applications receive a score between zero and 115. Almost a third of the points (35) are based on proposals for programs geared toward boosting student achievement, 10 points are based on need, and there are 20 possible points awarded for a description of how the proposal fits into districts’ budgets.

The remaining 50 points (almost half of the total) are based on “academic performance” over the prior year. Four measures are used to produce the 0-50 point score: One is the year-to-year change (between 2010 and 2011) in the district’s graduation rate, and the other three are changes in the state “performance index” in math, English Language Arts (ELA) and science. The “performance index” in these three subjects is calculated using a simple weighting formula that accounts for the proportion of students scoring at levels 2 (basic), 3 (proficient) and 4 (advanced).
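To show what a weighted index of this kind might look like, here is a small sketch. The weights are an assumption, not taken from the text above: level 1 counts zero, level 2 counts once, and levels 3 and 4 count twice, scaled to run from 0 to 200. The student counts are hypothetical, and the function name is mine.

```python
def performance_index(
    n_level1: int,
    n_level2: int,
    n_level3: int,
    n_level4: int,
    weights: tuple[float, float, float, float] = (0.0, 1.0, 2.0, 2.0),
) -> float:
    """Weighted share of students by performance level, times 100.

    The default weights are an illustrative assumption, not a quoted formula.
    """
    counts = (n_level1, n_level2, n_level3, n_level4)
    total = sum(counts)
    if total == 0:
        raise ValueError("no tested students")
    weighted = sum(w * c for w, c in zip(weights, counts))
    return 100.0 * weighted / total


# Hypothetical district with 100 tested students in each year.
index_2010 = performance_index(30, 40, 20, 10)
index_2011 = performance_index(25, 45, 20, 10)

print(f"2010 index: {index_2010:.1f}")
print(f"2011 index: {index_2011:.1f}")
print(f"year-to-year change: {index_2011 - index_2010:+.1f}")
```

In this sketch, the entire year-to-year change comes from five students moving from level 1 to level 2. An index built on performance levels registers movement across a threshold, not movement within a level.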

The idea of using testing results as a criterion in the awarding of grants is to reward those districts that are performing well. Unfortunately, due to the choice of measures and how they are used, the 50 points will be biased and to no small extent based on chance.
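One way chance can enter a year-to-year change is ordinary cohort-to-cohort variation; the sketch below illustrates only that mechanism and says nothing about bias. It treats each year's students as an independent random draw with the same underlying rate, which is a simplifying assumption. The 75 percent rate and the cohort sizes are hypothetical.

```python
from math import sqrt


def se_of_change(rate: float, n_students: int) -> float:
    """Standard error of the change in a rate between two independent,
    equally sized cohorts that share the same underlying rate."""
    return sqrt(2.0 * rate * (1.0 - rate) / n_students)


# Hypothetical underlying rate (e.g., graduation or proficiency).
TRUE_RATE = 0.75

for n in (50, 500, 5000):
    swing = 100.0 * se_of_change(TRUE_RATE, n)
    print(f"{n:>5} students per cohort: +/- {swing:.1f} points (1 SE)")
```

Under these assumptions, a small district can show a change of several points in either direction with no change at all in its real performance. The same-sized swing in a large district is much less likely to be noise.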