In Research, What Does A "Significant Effect" Mean?
If you follow education research – or quantitative work in any field – you’ll often hear the term “significant effect." For example, you will frequently read research papers saying that a given intervention, such as charter school attendance or participation in a tutoring program, had “significant effects," positive or negative, on achievement outcomes.
This term by itself is usually sufficient to get people who support the policy in question extremely excited, and to compel them to announce boldly that their policy “works.” They’re often overinterpreting the results, and there’s a good reason for that: “significant effect” is a statistical term, and it doesn’t always mean what it appears to mean. As most people understand the words, “significant effects” are often neither significant nor necessarily effects.
Let’s very quickly clear this up, one word at a time, working backwards.
In education research, the term “effect” usually refers to an estimate from a model, such as a regression. For example, I might want to see how education influences income, but, in order to isolate this relationship, I need to control for other factors that also affect income, such as industry and experience. Put more simply, I want to look at the average relationship between education and income among people who have the same level of experience, work in the same industry and share other characteristics that shape income. That quantified relationship – usually controlling for a host of different variables – is often called an "effect."
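To make that concrete, here’s a minimal sketch in Python, using simulated (entirely made-up) data and the statsmodels library; the variable names and numbers are purely illustrative. In a regression like this, the coefficient on education is what would typically be reported as the “effect” of education on income, controlling for experience and industry.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated, made-up data: income as a function of education,
# experience, and industry, plus random noise.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "education": rng.integers(10, 21, n),               # years of schooling
    "experience": rng.integers(0, 31, n),               # years of experience
    "industry": rng.choice(["mfg", "retail", "tech"], n),
})
df["income"] = (
    20000
    + 2500 * df["education"]
    + 1000 * df["experience"]
    + np.where(df["industry"] == "tech", 15000, 0)
    + rng.normal(0, 10000, n)
)

# The coefficient on education is the estimated "effect": the average
# difference in income per additional year of education, holding
# experience and industry constant.
model = smf.ols("income ~ education + experience + C(industry)", data=df).fit()
print(model.params["education"], model.pvalues["education"])
```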
But we can’t randomly assign education to people the way we would a pharmaceutical drug. And there are dozens of interrelated variables that might affect income, many of which, such as ability or effort, can’t even be measured directly.
In good models using large, detailed datasets with a thorough set of control variables, a statistically significant “effect” might serve as pretty good tentative evidence that there is a causal relationship between two variables – e.g., that having more education leads to higher earnings, at least to some degree, all else being equal. Sometimes, it’s even possible for social scientists to randomly assign “treatment” (e.g., merit pay programs), or exploit this when it happens (e.g., charter school lotteries). One can be relatively confident that the results from studies using random assignment, assuming they're well-executed, are not only causal per se, but also less likely to reflect bias from unmeasured influences. Even in these cases, however, there are usually validity-related questions left open, such as whether a program’s effect in one context/location will be the same elsewhere.
So, in general, when you hear about “effects,” especially those estimated without the benefit of random assignment, it's best to think of them as relationships or associations that are often (but by no means always) causal to some extent. The estimates of those associations vary in their precision, and in the degree to which they reflect the influence of unmeasured factors.
Then there’s the term “significant." “Significant” is of course a truncated form of “statistically significant." Statistical significance means we can be confident that a given relationship is not zero. That is, the relationship or difference is probably not just random “noise." A significant effect can be either positive (we can be confident it’s greater than zero) or negative (we can be confident it’s less than zero). In other words, it is “significant” insofar as it’s not nothing. The better way to think about it is “discernible." There’s something there.
In our education/income example, a “significant positive effect” of education on income means that one can be confident that, on average, more educated people earn more than people with less education, even when we control for experience, industry and, presumably, a bunch of other variables that might be associated with income.
(Side note: One can also test for statistical significance of simpler relationships that are not properly called "effects," such as whether there is a difference between test scores in one year compared with a prior year.)
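(To illustrate that side note, here’s a quick sketch, again with simulated, made-up scores, of testing whether a year-to-year difference in average test scores is statistically distinguishable from zero. The p-value tells us how surprising the observed difference would be if the true difference were zero.)

```python
import numpy as np
from scipy import stats

# Simulated, made-up test scores for two years.
rng = np.random.default_rng(1)
scores_2010 = rng.normal(250, 35, size=2000)
scores_2011 = rng.normal(251, 35, size=2000)

# A two-sample t-test asks whether the year-to-year difference is
# statistically distinguishable from zero -- i.e., not just "noise."
t_stat, p_value = stats.ttest_ind(scores_2011, scores_2010)
print(f"difference = {scores_2011.mean() - scores_2010.mean():.2f}, p = {p_value:.3f}")
```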
Most importantly, as I mentioned in a previous post, an “effect” that is statistically significant is not necessarily educationally meaningful. Remember – significant means that the relationship is not zero, but that doesn’t mean it’s big or even moderate. Quite often, “significant” effects are so small as to be rather meaningless, especially when using big datasets. You need to check the size of the "effect," the proper interpretation of which depends on the outcome used, the type and duration of “treatment” in question and other factors.
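Here’s one more illustrative sketch, using simulated, made-up data, of how a trivially small difference becomes “statistically significant” once the sample is large enough. The numbers below are invented solely to make the point.

```python
import numpy as np
from scipy import stats

# Simulated, made-up outcomes: a "treatment" that shifts scores by only
# 0.02 standard deviations, measured on a very large sample.
rng = np.random.default_rng(2)
control = rng.normal(0.00, 1.0, size=500_000)
treated = rng.normal(0.02, 1.0, size=500_000)

t_stat, p_value = stats.ttest_ind(treated, control)
effect_size = treated.mean() - control.mean()  # roughly in standard-deviation units

# With samples this large, even a trivially small effect is "significant"
# (p far below 0.05) -- which says nothing about whether it matters.
print(f"effect size = {effect_size:.3f} SD, p = {p_value:.2e}")
```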
For example, today's NAEP results indicated a "significant increase" in fourth and eighth grade math and eighth grade reading, but in all three cases, the increase was as modest as it gets – just one scale score point, roughly a month of "learning." Certainly, this change warrants attention, but it may not square with most people's definition of "significant" (and it may also reflect differences in the students taking the test).
So, summing up, when you hear that something has a “statistically significant effect” on something else, remember that it’s not necessarily significant or an effect in the common use of those words. It’s best to think of such findings as “statistically discernible relationships.” They can be big or small, they’re not necessarily causal, and they can vary widely in terms of precision.
- Matt Di Carlo
Great post.
I think another thing to consider, unfortunately, is that interpretation of effects often requires some background knowledge of the phenomena being studied. Even other scientists, who can look for effect size, and know that when you have census data, everything is "significant," can still misinterpret things like NAEP scores, where one needs to have an understanding of the history of the test, the content, and the testing situation.
But definitely, I am on board with urging caution and humility in interpreting effects.
Thanks for explaining this, I can use this information in conversation and debates. I even have to keep reminding myself not to fall for the rhetoric of people using their ideology to mislead the public with authoritarian, "scientific facts".
And – the size of an effect shouldn't be compared to doing nothing in a school context, as an action is invariably taken instead of some other possible approach. Educational researchers ought to care about whether an effect is greater than that of the stuff a teacher/school normally does. It's unusual for a set of teachers to do something new and for it not to have an effect of some kind. The important question is: is this approach better than the other possible approaches we know about for improving boys' reading, for example? Comparison with the best known strategy is what good medical research undertakes. There really isn't any point in demonstrating that a particular strategy improves boys' reading unless the strategy is better than the best one presently known.
For a look at what makes an effect size meaningful (as opposed to significant), one may check here: http://www2.ed.gov/rschstat/eval/choice/implementation/achievementanaly…
and here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.8383&rep=r…
And even when some studies do everything right, and do find significant effects, whether that makes a big difference in reality is another question altogether. Check this one.
http://ies.ed.gov/ncee/pubs/20114001/pdf/20114002.pdf
What's your view on Professor Hattie's table of 'effect sizes'? Clearly he has gone further than merely claiming that interventions have 'significant effects' and has tried to quantify these effects. I can accept that there are limits to the type of meta-analyses he conducts, but does that mean we can't trust his research at all? How else should we make decisions on what's effective and what's not?