Gender Pay Gaps And Educational Achievement Gaps

There is currently an ongoing rhetorical war of sorts over the gender wage gap. One “side” makes the common argument that women earn around 75 cents on the male dollar (see here, for example).

Others assert that the gender gap is a myth, or that it is so small as to be unimportant.

Often, these types of exchanges are enough to exasperate the casual observer, and inspire claims such as “statistics can be made to say anything.” In truth, however, the controversy over the gender gap is a good example of how descriptive statistics, by themselves, say nothing. What matters is how they’re interpreted.

Moreover, the manner in which one must interpret various statistics on the gender gap applies almost perfectly to the achievement gaps that are so often mentioned in education debates.

The Landmark Case Of Us V. Them

Patrick Riccards, CEO of the education advocacy group ConnCAN, has published a short piece on his personal blog in which he decries the “vicious and fact-free attacks” in education debates.

The post lists a bunch of “if/then” statements to illustrate how market-based reform policy positions are attacked on personal grounds, such as, “If one provides philanthropic support to improve public schools, then one must be a profiteer looking to make personal fortunes off public education.” He summarizes the situation with a shot of his own: “Yes, there are no attacks that are too vicious or too devoid of fact for the defenders of the status quo.” What of his fellow reformers? They “simply have to stand and take the attacks and the vitriol, no matter how ridiculous.”

Mr. Riccards is dead right that name-calling, ascription of base motives, and the abuse of empirical evidence are rampant in education debates. I myself have criticized the unfair attacks described in several of his “if/then” statements, including the accusations of profiteering and the equation of policy views with being “anti-teacher.”

But anyone who thinks that this behavior is concentrated on one “side” or the other must be wearing blinders.

Teachers: Pressing The Right Buttons

The majority of social science research does not explicitly dwell on how we go from situation A to situation B. Instead, most social scientists focus on associations between different outcomes. This “static” approach has advantages but also limitations. Looking at associations might reveal that teachers who experience condition A are twice as likely to leave their schools as teachers who experience condition B. But what does this knowledge tell us about how to move from condition A to condition B? In many cases, very little.

Many social science findings are not easily “actionable” for policy purposes precisely because they say nothing about processes or sequences of events and activities unfolding over time, and in context. While conventional quantitative research provides indications of what works — on average — across large samples, a look at processes reveals how factors or events (situated in time and space) are associated with each other. This kind of research provides the detail that we need, not just to understand the world, but to do so in a way that is useful and enables us to act on it constructively.

Although this kind of work is rare, every now and then a quantitative study showing “process sensitivity” sees the light of day. Such is the case with a recent paper by Morgan and colleagues (2010) examining how the events that teachers routinely experience affect their commitment to remain in the profession.

A Game Of Inches

One of the more telling episodes in education I’ve seen over the past couple of years was a little dispute over Michelle Rhee’s testing record that flared up last year. Alan Ginsburg, a retired U.S. Department of Education official, released an informal report in which he presented the NAEP cohort changes that occurred during the first two years of Michelle Rhee’s tenure (2007-2009), and compared them with those during the superintendencies of her two predecessors.

Ginsburg concluded that the increases under Chancellor Rhee, though positive, were less rapid than in previous years (2000 to 2007 in math, 2003 to 2007 in reading). Soon thereafter, Paul Peterson, director of Harvard’s Program on Educational Leadership and Governance, published an article in Education Next that disputed Ginsburg’s findings. Peterson found that increases under Rhee amounted to roughly three scale score points per year, compared with around 1-1.5 points annually between 2000 and 2007 (the actual amounts varied by subject and grade).
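
For readers who want to see the arithmetic such comparisons rest on, here is a minimal sketch of annualizing cohort changes over periods of different lengths. The scale scores below are hypothetical placeholders, not the actual figures from either report.

```python
# Hypothetical NAEP-style scale scores, used only to illustrate the annualization
# arithmetic -- these are NOT the actual figures from the Ginsburg or Peterson analyses.
periods = {
    "earlier superintendencies (2000-2007)": (2000, 2007, 220.0, 229.0),
    "Rhee's first two years (2007-2009)":    (2007, 2009, 229.0, 235.0),
}

for label, (start_year, end_year, start_score, end_score) in periods.items():
    change = end_score - start_score
    years = end_year - start_year
    print(f"{label}: {change:.0f} points over {years} years "
          f"= {change / years:.2f} points per year")
```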

Both articles were generally cautious in tone and in their conclusions about the actual causes of the testing trends. The technical details of the two reports – who’s “wrong” or “right” – are not important for this post (especially since more recent NAEP results have since been released). More interesting was how people reacted – and didn’t react – to the dueling analyses.

We Should Only Hold Schools Accountable For Outcomes They Can Control

Let’s say we were trying to evaluate a teacher’s performance for this academic year, and part of that evaluation would use students’ test scores (if you object to using test scores this way, put that aside for a moment). We checked the data and reached two conclusions. First, we found that her students made fantastic progress this year. Second, we also saw that the students’ scores were still quite a bit lower than their peers’ in the district. Which measure should we use to evaluate this teacher?

Would we consider judging her even partially based on the latter – students’ average scores? Of course not. Those students made huge progress, and the only reason their absolute performance levels are relatively low is that they were low at the beginning of the year. This teacher could not control the fact that she was assigned lower-scoring students. All she can do is make sure that they improve. That’s why no teacher evaluation system places any importance on students’ absolute performance, instead focusing on growth (and, of course, non-test measures). In fact, growth models control for absolute performance (the prior year’s test scores) precisely so that it does not bias the results.
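
To make that last point concrete, here is a minimal sketch of what “controlling for prior performance” looks like in a growth-style model. This is a toy simulation with made-up classrooms and a bare-bones regression, not any state’s actual value-added specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # students per classroom

# Toy data: classroom A starts low but makes large gains;
# classroom B starts high but makes small gains.
prior_a = rng.normal(40, 8, n)
prior_b = rng.normal(60, 8, n)
current_a = prior_a + 12 + rng.normal(0, 4, n)   # big gains
current_b = prior_b + 3 + rng.normal(0, 4, n)    # small gains

prior = np.concatenate([prior_a, prior_b])
current = np.concatenate([current_a, current_b])

# Bare-bones growth-style model: predict current scores from prior scores
# (ordinary least squares with an intercept), pooling all students.
X = np.column_stack([np.ones(prior.size), prior])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
residual = current - X @ beta   # growth above/below expectation, given starting point

print(f"Classroom A: avg score {current_a.mean():.1f}, "
      f"avg growth residual {residual[:n].mean():+.1f}")
print(f"Classroom B: avg score {current_b.mean():.1f}, "
      f"avg growth residual {residual[n:].mean():+.1f}")
# Classroom A's absolute scores are lower, but its growth residual is higher --
# which is exactly the comparison a growth model is designed to make.
```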

If we would never judge teachers based on absolute performance, why are we judging schools that way? Why does virtually every school/district rating system place some emphasis – often the primary emphasis – on absolute performance?

Three Important Distinctions In How We Talk About Test Scores

In education discussions and articles, people (myself included) often say “achievement” when referring to test scores, or “student learning” when talking about changes in those scores. These words reflect implicit judgments to some degree (e.g., that the test scores actually measure learning or achievement). Every once in a while, it’s useful to remind ourselves that scores from even the best student assessments are imperfect measures of learning. But this is so widely understood – certainly in the education policy world, and I would say among the public as well – that the euphemisms are generally tolerated.

And then there are a few common terms or phrases that, in my personal opinion, are not so harmless. I’d like to quickly discuss three of them (all of which I’ve talked about before). All three appear many times every day in newspapers, blogs, and regular discussions. To criticize their use may seem like semantic nitpicking to some people, but I would argue that these distinctions are substantively important and may not be so widely acknowledged, especially among people who aren’t heavily engaged in education policy (e.g., average newspaper readers).

So, here they are, in no particular order.

Herding FCATs

About a week ago, Florida officials went into crisis mode after revealing that the proficiency rate on the state’s writing test (FCAT) dropped from 81 percent to 27 percent among fourth graders, with similarly large drops in the other two grades in which the test is administered (eighth and tenth). The panic was almost immediate. For one thing, performance on the writing FCAT is counted in the state’s school and district ratings. Many schools would end up with lower grades and could therefore face punitive measures.

Understandably, a huge uproar was also heard from parents and community members. How could student performance decrease so dramatically? There was so much blame going around that it was difficult to keep track – the targets included the test itself, the phase-in of the state’s new writing standards, and test-based accountability in general.

Despite all this heated back-and-forth, many people seem to have overlooked one very important, widely applicable lesson here: that proficiency rates, which are not “scores,” are often extremely sensitive to where you set the bar.
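
A toy simulation makes the point (simulated scores on a 1-6 point scale, not actual FCAT data): hold the distribution of student performance fixed and move only the bar, and the “rate” swings dramatically.

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 simulated writing scores on a 1-6 point scale (hypothetical distribution,
# not actual FCAT data): most students cluster between roughly 3 and 4.5.
scores = np.clip(rng.normal(3.7, 0.8, 10_000), 1, 6)

for cut in (3.0, 3.5, 4.0):
    rate = (scores >= cut).mean()
    print(f"Cut score of {cut}: {rate:.0%} of students 'proficient'")

# The distribution of student performance is identical in all three rows; only the
# bar moves. A half-point shift in the cut changes the rate by 20+ percentage points.
```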

Quality Control In Charter School Research

There's a fairly large body of research showing that charter schools vary widely in test-based performance relative to regular public schools, both by location and by subgroup. Yet, you'll often hear people point out that the highest-quality evidence suggests otherwise (see here, here and here) – i.e., that there are a handful of studies using experimental methods (randomized controlled trials, or RCTs), and these analyses generally find stronger and more uniformly positive charter impacts.

Sometimes, this argument is used to imply that the evidence, as a whole, clearly favors charters, and, perhaps by extension, that many of the rigorous non-experimental charter studies – those using sophisticated techniques to control for differences between students – would lead to different conclusions were they RCTs.*

Though these latter assertions are based on a valid point about the power of experimental studies (the few of which we have are often ignored in the debate over charters), they are overstated, for a couple of reasons discussed below. But a new report from the (indispensable) organization Mathematica addresses the issue head on, by directly comparing estimates of charter school effects that come from an experimental analysis with those from non-experimental analyses of the same group of schools.

The researchers find that there are differences in the results, but many are not statistically significant and those that are don't usually alter the conclusions. This is an important (and somewhat rare) study, one that does not, of course, settle the issue, but does provide some additional tentative support for the use of strong non-experimental charter research in policy decisions.
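
The logic of that comparison can be sketched in a few lines. The impact estimates and standard errors below are purely hypothetical, used only to show how one might check whether an experimental and a non-experimental estimate for the same schools differ significantly; this is not Mathematica's data or method.

```python
import math

# Hypothetical impact estimates (in student-level standard deviations of test scores)
# and standard errors for the same set of charter schools -- illustrative numbers only.
experimental    = {"estimate": 0.05, "se": 0.06}   # lottery-based (RCT) analysis
nonexperimental = {"estimate": 0.09, "se": 0.05}   # matching/regression-based analysis

# Simple z-test for the difference between the two estimates. (Treating the estimates
# as independent is itself a simplification; analyses built on overlapping samples
# would need to account for the covariance between them.)
diff = nonexperimental["estimate"] - experimental["estimate"]
se_diff = math.sqrt(experimental["se"] ** 2 + nonexperimental["se"] ** 2)
z = diff / se_diff

print(f"Difference in estimated impacts: {diff:+.2f} SD (z = {z:.2f})")
# Here |z| is well below 1.96, so the difference would not be statistically
# significant at conventional levels -- the two approaches tell a similar story.
```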

Growth And Consequences In New York City's School Rating System

In a New York Times article a couple of weeks ago, reporter Michael Winerip discusses New York City’s school report card grades, with a focus on an issue that I have raised many times – the role of absolute performance measures (i.e., how highly students score) in these systems, versus that of growth measures (i.e., whether students are making progress).

Winerip uses the example of two schools – P.S. 30 and P.S. 179 – one of which (P.S. 30) received an A on this year’s report card, while the other (P.S. 179) received an F. These two schools have somewhat similar student populations, at least so far as can be determined using standard education variables, and their students are very roughly comparable in terms of absolute performance (e.g., proficiency rates). The basic reason why one received an A and the other an F is that P.S. 179 received a very low growth score, and growth is heavily weighted in the NYC grade system (representing 60 out of 100 points for elementary and middle schools).
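
To see how that weighting plays out arithmetically, here is a simplified sketch. Only the 60-point growth share comes from the system described above; treating the remaining 40 points as a single "other measures" component, and the component scores themselves, are hypothetical simplifications rather than the city's actual formula.

```python
# Simplified sketch of a 100-point composite in which growth carries 60 points.
# (Only the 60-point growth share reflects the NYC weighting described above;
#  the single "other" component and the example scores are hypothetical.)

def composite(growth_share, other_share):
    """growth_share and other_share are the fractions of available points earned (0-1)."""
    return 60 * growth_share + 40 * other_share

schools = {
    "School X (strong growth)": {"growth": 0.80, "other": 0.55},
    "School Y (weak growth)":   {"growth": 0.10, "other": 0.55},
}

for name, s in schools.items():
    print(f"{name}: {composite(s['growth'], s['other']):.0f} / 100 points")

# School X earns 70 points; School Y earns 28. They score identically on everything
# else, but the growth component alone separates them by 42 points out of 100.
```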

I have argued previously that unadjusted absolute performance measures such as proficiency rates are inappropriate for test-based assessments of schools' effectiveness, given that they tell you almost nothing about the quality of instruction schools provide. Growth measures are the better option, albeit one with issues of its own (e.g., greater instability), and they must be used responsibly. In this sense, the weighting of the NYC grading system is much more defensible than most of its counterparts across the nation, at least in my view.

But the system is also an example of how details matter – each school’s growth portion is calculated using an unconventional, somewhat questionable approach, one that is, as yet, difficult to treat with a whole lot of confidence.

The Weighting Game

A while back, I noted that states and districts should exercise caution in assigning weights (importance) to the components of their teacher evaluation systems before they know what the other components will be. For example, most states that have mandated new evaluation systems have specified that growth model estimates count for a certain proportion (usually 40-50 percent) of teachers’ final scores (at least those in tested grades/subjects), but it’s critical to note that the actual importance of these components will depend in no small part on what else is included in the total evaluation, and how it's incorporated into the system.

In slightly technical terms, this distinction is between nominal weights (the percentages assigned to each component) and effective weights (the shares of the variation in final scores that each component actually ends up determining). Consider an extreme hypothetical example – let’s say a district implements an evaluation system in which half the final score is value-added and half is observations. But let’s also say that every teacher gets the same observation score. In this case, even though the assigned (nominal) weight for value-added is 50 percent, its actual importance (effective weight) will be 100 percent: because the observation scores do not vary at all, all of the variation between teachers’ final scores will be determined by the value-added component.
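
A quick simulation shows how far apart nominal and effective weights can drift. The component scores below are made up, and the variance-share calculation is just one simple way to gauge "effective" weight under these assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_teachers = 1_000

# Toy component scores (0-100 scale). Value-added varies a lot across teachers;
# observation scores barely vary at all -- an exaggerated version of the scenario above.
value_added = rng.normal(50, 15, n_teachers)
observation = rng.normal(50, 1, n_teachers)

nominal = {"value_added": 0.5, "observation": 0.5}
final = nominal["value_added"] * value_added + nominal["observation"] * observation

# One simple gauge of "effective" weight: each weighted component's share of the
# total variance of the weighted components (they are independent in this toy example).
var_va = (nominal["value_added"] ** 2) * value_added.var()
var_obs = (nominal["observation"] ** 2) * observation.var()

print("Nominal weights:   value-added 50%, observations 50%")
print(f"Effective weights: value-added {var_va / (var_va + var_obs):.0%}, "
      f"observations {var_obs / (var_va + var_obs):.0%}")
# With almost no variation in observation scores, value-added ends up driving
# nearly all of the differences in final scores, despite its 50 percent nominal weight.
```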

This issue of nominal versus effective weights is very important, and, with exceptions, it gets almost no attention. And it’s not just important in teacher evaluations. It’s also relevant to states’ school/district grading systems. So, I think it would be useful to quickly illustrate this concept in the context of Florida’s new district grading system.