Interpreting Effect Sizes In Education Research

Interpreting “effect sizes” is one of the trickier checkpoints on the road between research and policy. Effect sizes, put simply, are statistics measuring the size of the association between two variables of interest, often controlling for other variables that may influence that relationship. For example, a research study may report that participating in a tutoring program was associated with a 0.10 standard deviation increase in math test scores, even after controlling for other factors, such as student poverty and grade level.
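As a rough illustration of what such a number represents, here is a minimal sketch (in Python, using simulated data and hypothetical group names, not the method of any particular study) of a standardized mean difference: the gap between two groups' average scores divided by the pooled standard deviation of the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated math scores (made-up numbers, for illustration only): a comparison
# group and a tutored group whose true mean is 10 points higher, on a scale
# with a standard deviation of 100.
comparison = rng.normal(loc=500, scale=100, size=2000)
tutored = rng.normal(loc=510, scale=100, size=2000)

# Pooled standard deviation of the outcome across the two groups.
pooled_sd = np.sqrt((comparison.var(ddof=1) + tutored.var(ddof=1)) / 2)

# Standardized mean difference: the gap expressed in standard deviation units.
effect_size = (tutored.mean() - comparison.mean()) / pooled_sd
print(f"Effect size: {effect_size:.2f} standard deviations")  # roughly 0.10
```

Real studies typically estimate this kind of quantity within a regression that adjusts for covariates, but the basic unit of measurement, standard deviations of the outcome, is the same.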

But what does that mean, exactly? Is 0.10 standard deviations a large effect or a small effect? This is not a simple question, even for trained researchers, and answering it inevitably entails a great deal of subjective human judgment. Matthew Kraft has an excellent little working paper that pulls together some general guidelines and a proposed framework for interpreting effect sizes in education. 

Before discussing the paper, though, we need to mention what may be one of the biggest problems with the interpretation of effect sizes in education policy debates: They are often ignored completely.

What You Need To Know About Misleading Education Graphs, In Two Graphs

There’s no reason why insisting on proper causal inference can’t be fun.

A few weeks ago, ASCD published a policy brief (thanks to Chad Aldeman for flagging it), the purpose of which is to argue that it is “grossly misleading” to make a “direct connection” between nations’ test scores and their economic strength.

On the one hand, it’s implausible to assert that better educated nations aren’t stronger economically. On the other hand, I can certainly respect the argument that test scores are an imperfect, incomplete measure, and the doomsday rhetoric can sometimes get out of control.

In any case, though, the primary piece of evidence put forth in the brief was the eye-catching graph below, which presented trends in NAEP versus those in U.S. GDP and productivity.

Teachers And Education Reform, On A Need To Know Basis

A couple of weeks ago, the website Vox.com published an article entitled, “11 facts about U.S. teachers and schools that put the education reform debate in context." The article, in the wake of the Vergara decision, is supposed to provide readers with the “basic facts” about the current education reform environment, with a particular emphasis on teachers. Most of the 11 facts are based on descriptive statistics.

Vox advertises itself as a source of accessible, essential, summary information -- what you "need to know" -- for people interested in a topic but not necessarily well-versed in it. Right off the bat, let me say that this is an extraordinarily difficult task, and in constructing lists such as this one, there’s no way to please everyone (I’ve read a couple of Vox’s education articles and they were okay).

That said, someone sent me this particular list, and it’s pretty good overall, especially since it does not reflect overt advocacy for given policy positions, as so many of these types of lists do. But I was compelled to comment on it. I would like to say that I did this to make some lofty point about the strengths and weaknesses of data and statistics packaged for consumption by the general public. It would, however, be more accurate to say that I started doing it and just couldn't stop. In any case, here’s a little supplemental discussion of each of the 11 items:

What Is A Standard Deviation?

Anyone who follows education policy debates might hear the term “standard deviation” fairly often. Most people have at least some idea of what it means, but I thought it might be useful to lay out a quick, (hopefully) clear explanation, since it’s useful for the proper interpretation of education data and research (as well as that in other fields).

Many outcomes or measures, such as height or blood pressure, assume what’s called a “normal distribution." Simply put, this means that such measures tend to cluster around the mean (or average), and taper off in both directions the further one moves away from the mean (due to its shape, this is often called a “bell curve”). In practice, and especially when samples are small, distributions are imperfect -- e.g., the bell is messy or a bit skewed to one side -- but in general, with many measures, there is clustering around the average.

Let’s use test scores as our example. Suppose we have a group of 1,000 students who take a test (scored 0-20). A simulated score distribution is presented in the figure below (called a "histogram").
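For readers who want to see where such a figure comes from, here is a minimal sketch of simulating 1,000 scores and plotting their histogram (the mean and spread below are arbitrary choices, not data from any real test).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulate 1,000 test scores on a 0-20 scale, roughly normally distributed.
scores = rng.normal(loc=10, scale=3, size=1000)
scores = np.clip(np.round(scores), 0, 20)  # keep scores on the 0-20 scale

print(f"Mean: {scores.mean():.1f}")
print(f"Standard deviation: {scores.std(ddof=1):.1f}")

# Plot the histogram: most scores cluster near the mean and taper off.
plt.hist(scores, bins=range(0, 22))
plt.xlabel("Test score")
plt.ylabel("Number of students")
plt.title("Simulated score distribution (n = 1,000)")
plt.show()
```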

Relationship Counseling

A correlation between two variables measures the strength of the linear relationship between them. Put simply, two variables are positively correlated to the extent that individuals with relatively high or low values on one measure tend to have relatively high or low values on the other, and negatively correlated to the extent that high values on one measure are associated with low values on the other.

Correlations are used frequently in the debate about teacher evaluations. For example, researchers might assess the relationship between classroom observations and value-added measures, which is one of the simpler ways to gather information about the “validity” of one or the other – i.e., whether it is telling us what we want to know. In this case, if teachers with higher observation scores also tend to get higher value-added scores, this might be interpreted as a sign that both are capturing, at least to some extent, "true" teacher performance.
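As a quick illustration (with fabricated numbers, not results from any actual evaluation system), here is how such a correlation might be computed. Two measures that share a common "true performance" signal plus independent noise will be positively, but imperfectly, correlated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data for 200 teachers: each measure reflects a shared
# "true performance" signal plus its own independent noise.
true_performance = rng.normal(size=200)
observation_score = true_performance + rng.normal(scale=1.5, size=200)
value_added_score = true_performance + rng.normal(scale=1.5, size=200)

# Pearson correlation coefficient between the two measures.
r = np.corrcoef(observation_score, value_added_score)[0, 1]
print(f"Correlation: {r:.2f}")  # positive but well below 1.0
```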

Yet there seems to be a tendency among some advocates and policy makers to get a little overeager when interpreting correlations.

Do Charter Schools Serve Fewer Special Education Students?

A new report from the U.S. Government Accountability Office (GAO) provides one of the first large-scale comparisons of special education enrollment between charter and regular public schools. The report’s primary finding, which, predictably, received a fair amount of attention, is that roughly 11 percent of students enrolled in regular public schools were on special education plans in 2009-10, compared with just 8 percent of charter school students.

The GAO report’s authors are very careful to note that their findings merely describe what you might call the “service gap” – i.e., the proportion of special education students served by charters versus regular public schools – but that they do not indicate the reasons for this disparity.

This is an important point, but I would take the warning a step further: the national- and state-level gaps themselves should be interpreted with extreme caution.

In Research, What Does A "Significant Effect" Mean?

If you follow education research – or quantitative work in any field – you’ll often hear the term “significant effect." For example, you will frequently read research papers saying that a given intervention, such as charter school attendance or participation in a tutoring program, had “significant effects," positive or negative, on achievement outcomes.

This term by itself is usually sufficient to get people who support the policy in question extremely excited, and to compel them to announce boldly that their policy “works." They’re often overinterpreting the results, but there’s a good reason for this. The problem is that “significant effect” is a statistical term, and it doesn’t always mean what it appears to mean. As most people understand the words, “significant effects” are often neither significant nor necessarily effects.

Let’s very quickly clear this up, one word at a time, working backwards.
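As a preview of the "significant" part, the sketch below (simulated data with made-up parameters) shows how a substantively tiny difference can still register as statistically significant when the sample is very large.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two very large simulated groups whose true means differ by only
# 0.02 standard deviations -- a substantively trivial gap.
control = rng.normal(loc=0.00, scale=1.0, size=100_000)
treatment = rng.normal(loc=0.02, scale=1.0, size=100_000)

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"Estimated effect: {treatment.mean() - control.mean():.3f} SD")
print(f"p-value: {p_value:.5f}")  # typically well below 0.05 at this sample size
```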

A Below Basic Understanding Of Proficiency

Given our extreme reliance on test scores as measures of educational success and failure, I'm sorry I have to make this point: proficiency rates are not test scores, and changes in proficiency rates do not necessarily tell us much about changes in test scores.

Yet, for example, the Washington Post editorial about the latest test results from the District of Columbia Public Schools refers to proficiency rates (and changes in these rates) as "scores" at no fewer than seven different points in a 450-word piece. This is only one example of many.

So, what's the problem?
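One way to see it, using entirely made-up numbers: a proficiency rate only counts students who clear a single cut score, so the same gain in average scores can produce very different changes in the rate, depending on how many students happen to sit near the cutoff.

```python
import numpy as np

rng = np.random.default_rng(3)
cut = 250  # hypothetical proficiency cutoff

def summarize(label, before, after):
    """Print the change in mean score alongside the change in proficiency rate."""
    rate_before = (before >= cut).mean() * 100
    rate_after = (after >= cut).mean() * 100
    print(f"{label}: score gain = {after.mean() - before.mean():.0f} points, "
          f"rate change = {rate_after - rate_before:+.0f} percentage points")

# Scenario A: students clustered just below the cutoff -- a small gain
# pushes many of them over the line.
near_cut = rng.normal(loc=245, scale=10, size=5000)
summarize("Scenario A", near_cut, near_cut + 5)

# Scenario B: students far below the cutoff -- the same 5-point gain moves
# almost no one across it, even though scores improved just as much.
far_below = rng.normal(loc=200, scale=10, size=5000)
summarize("Scenario B", far_below, far_below + 5)
```

Identical improvements in the underlying scores, very different stories if all you look at is the proficiency rate.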