Large Political Stones, Methodological Glass Houses
Earlier this summer, the New York City Independent Budget Office (IBO) presented findings from a longitudinal analysis of NYC student performance. That is, they followed a cohort of over 45,000 students from third grade in 2005-06 through 2009-10 (though most results are 2005-06 to 2008-09, since the state changed its definition of proficiency in 2009-10).
The IBO then simply calculated the proportion of these students who improved, declined or stayed the same in terms of the state’s cutpoint-based categories (e.g., Level 1 ["below basic" in NCLB parlance], Level 2 [basic], Level 3 [proficient], Level 4 [advanced]), with additional breakdowns by subgroup and other variables.
The short version of the results is that almost two-thirds of these students remained constant in their performance level over this time period – for instance, students who scored at Level 2 (basic) in third grade in 2006 tended to stay at that level through 2009; students at the “proficient” level remained there, and so on. About 30 percent increased a category over that time (e.g., going from Level 1 to Level 2).
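For readers who want to see what this kind of tabulation looks like in practice, here is a minimal sketch in Python (with pandas). The data, column names, and cohort size are hypothetical; this is not the IBO's actual file or code, just an illustration of the calculation described above.

```python
import pandas as pd

# Hypothetical longitudinal file: one row per student, with the state
# performance level (1-4) in 2005-06 (third grade) and in 2008-09.
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6],
    "level_2006": [2, 3, 1, 2, 4, 3],
    "level_2009": [2, 3, 2, 2, 3, 4],
})

# Classify each student by the change in performance level.
change = students["level_2009"] - students["level_2006"]
students["trajectory"] = pd.cut(
    change,
    bins=[-10, -1, 0, 10],
    labels=["declined", "stayed the same", "improved"],
)

# Share of the cohort in each group (the IBO reported roughly two-thirds
# staying the same and about 30 percent improving).
print(students["trajectory"].value_counts(normalize=True))
```

In real data, of course, the levels in each year come from different tests in different grades, which is exactly the comparability caveat discussed below.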
The response from the NYC Department of Education (NYCDOE) was somewhat remarkable. It takes a minute to explain why, so bear with me.
The city was put on the political defensive by the coverage of the report’s finding that most students remained constant in their performance levels. A spokesperson fought back, saying (see here) that the analysis was “invalid," and adding, "We are surprised the IBO would issue results with this fundamental flaw."
Put simply, the "fundamental flaw" to which the city refers is that the categories (proficient, advanced, etc.) are not comparable between grades. For example, let’s say a student is proficient in third grade in one year, and then again scores in the proficient range the next year, in fourth grade. Comparing those two results, as did the IBO, means you’re equating proficiency between grades (and years), even though they are two different tests, and two different definitions of proficiency.
On this point, the NYCDOE is correct (though it's more of a strong caveat than a "fundamental flaw," and the IBO report is quite explicit about its limitations). The definition of proficiency (and other categories) is subject to much imprecision, and cannot easily be compared between grades.
There are, however, a couple of reasons why it's the city's harsh, dismissive response to the IBO report, rather than the IBO report itself, that is surprising.
The first is that NYC used this same method - grade-to-grade proficiency comparisons - in their old school grading system. Prior to the 2009-10 school year, 60 percent of schools' ratings were based on a measure described by the city as follows: “The yearly gain or loss in ELA and mathematics proficiency of the same students as they move from one grade to the next at the school."
Students who remained in the same proficiency category between years were counted as "making a year's worth of progress," and schools were judged based in part on how many did so. This measure is for all intents and purposes the same as the IBO's, and while I agree it's a limited indicator, I'm uncertain how it is "invalid" in the report but not in the school ratings.*
The second issue one might take with the city's response is that the "progress" measure upon which city officials do rely in their public statements is arguably more flawed than the IBO's, and in some respects it too implies a comparison of proficiency between grades.
As I’ve discussed many times, when NYC presents its testing results every year, they focus almost entirely on changes in overall proficiency rates in math and reading. For example, between 2011 and 2012, the city’s overall math proficiency increased 3.3 percentage points, from 54.0 percent to 57.3 percent. Properly interpreted, these data can tell you whether a larger proportion of students in one year (2012) met the grade-level proficiency standard than students the previous year (2011).
In contrast, the NYCDOE press release called the change “significant progress." Mayor Michael Bloomberg declared that “there is no question our students are moving in the right direction."
Since, however, the data are cross-sectional (unlike the IBO analysis, they do not follow students over time), you're comparing two different groups of students. Changes in proficiency rates, especially small changes, might easily be due to differences between these groups rather than to actual improvement, even in large districts. Thus, the increase in math rates between 2011 and 2012 was neither "significant" nor "progress" per se, and presenting it as such is no less flawed than anything in the IBO report.
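To make that distinction concrete, here is a hedged sketch, again in Python with entirely made-up numbers, contrasting the year-over-year rate change typically cited in press releases with what a matched-cohort (longitudinal) comparison would require.

```python
import pandas as pd

# Hypothetical cross-sectional results: every tested student in each year,
# flagged 1 if they scored proficient (Level 3 or 4), 0 otherwise.
results_2011 = pd.DataFrame({"student_id": [1, 2, 3, 4, 5],
                             "proficient": [1, 0, 1, 0, 1]})
results_2012 = pd.DataFrame({"student_id": [2, 3, 4, 6, 7],
                             "proficient": [1, 1, 1, 1, 0]})

# The figure cited in press releases: the change in the overall rate,
# computed on two different groups of test-takers.
rate_change = results_2012["proficient"].mean() - results_2011["proficient"].mean()
print(f"Cross-sectional rate change: {rate_change:+.1%}")

# A longitudinal comparison would instead follow the same students,
# which requires matching on student_id; note how the two calculations
# rest on different groups and can give different answers.
matched = results_2011.merge(results_2012, on="student_id",
                             suffixes=("_2011", "_2012"))
cohort_change = matched["proficient_2012"].mean() - matched["proficient_2011"].mean()
print(f"Matched-cohort change:       {cohort_change:+.1%}")
```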
Moreover, think about what it means to portray these rate changes as "progress" or evidence that "students are moving in the right direction." It clearly implies that a group of students improved over time - i.e., a bunch who weren't proficient in 2011 did meet this standard in 2012. This interpretation reflects an implicit comparison of the definition of “proficient” between grades, because the "improving" students are taking the test in a higher grade. That's precisely what the city objected to in the IBO report.
Now, don't get me wrong - I don't think the IBO results are by themselves evidence that city students made little progress between 2006 and 2009.
As fully acknowledged in the report itself, the results must be interpreted with caution, and not just because the definition of proficiency is difficult to compare between grades. For example, the IBO limited its sample to a single cohort of students who were enrolled in city schools in every year of data used, excluding many students (e.g., those with special testing arrangements, or those who moved in or out of the district during this time).
In addition, students can make substantial progress (or the opposite) while still remaining within the same proficiency category, and thus characterizing their performance as flat may be misleading (especially since 30 percent of students did actually move up a category).
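A quick illustration of that last point, using hypothetical cutpoints rather than New York State's actual scale (and treating scores as comparable across grades purely for the sake of the example):

```python
def performance_level(scale_score, cutpoints=(650, 700, 770)):
    """Map a scale score to Levels 1-4 using hypothetical cutpoints."""
    level = 1
    for cut in cutpoints:
        if scale_score >= cut:
            level += 1
    return level

score_grade3, score_grade6 = 652, 698   # a 46-point gain on the made-up scale
print(performance_level(score_grade3))  # 2
print(performance_level(score_grade6))  # 2 -- same category, so this student
                                        # would count as "staying the same"
```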
Instead of lashing out and perpetuating the politicization of testing results, the city might have simply acknowledged that the IBO report, even though it is limited in the degree to which it can tell us whether students made progress over this time period, is a worthwhile addition to the public reports on NYC testing data, especially given the scarcity of longitudinal analyses.
Still, it’s good to see the nation’s largest school district highlight the “fundamental flaws” of aggregate proficiency rate changes as measures of student improvement, and let’s hope their future communications about testing results are clear about these limitations.
- Matt Di Carlo
*****
* To the city’s credit, they adopted a new growth model after the 2008-09 school year (though other states, such as Florida, still incorporate between-grade proficiency comparisons in their systems).