• The Big Story About Gender Gaps In Test Scores

    The OECD recently published a report about differences in test scores between boys and girls on the Programme for International Student Assessment (PISA), a test of 15-year-olds conducted every three years in multiple subjects. The main summary finding is that, in most nations, girls are significantly less likely than boys to score below the “proficient” threshold in all three subjects (math, reading and science). The report also includes survey items and other outcomes.

    First, it is interesting to me how discussions of these gender gaps differ from those about gaps between income or ethnicity groups. Specifically, when we talk about gender gaps, we interpret them properly – as gaps in measured performance between groups of students. Discussions of gaps between groups defined in terms of income or ethnicity, on the other hand, are almost always framed in terms of school performance.

    This is partially because schools in the U.S. are segregated by income and ethnicity, but not really by gender. It is also because some folks tend to overestimate the degree to which income- and ethnicity-based achievement gaps stem from systematic variation in schooling inputs, when in reality they are more a function of non-school factors (though, of course, schools matter, and differences in school quality reinforce the non-school-based impact). That said, returning to the findings of this report, I was slightly concerned with how, in some cases, they were reported in the media.

  • Teacher Quality - Still Plenty Of Room For Debate

    On March 3, the New York Times published one of their “Room for Debate” features, in which panelists were asked "How To Ensure and Improve Teacher Quality?" When I read through the various perspectives, my first reaction was: "Is that it?"

    It's not that I don't think there is value in many of the ideas presented -- I actually do. The problem is that there are important aspects of teacher quality that continue to be ignored in policy discussions, despite compelling evidence suggesting that they matter in the quality equation. In other words, I wasn’t disappointed with what was said but, rather, what wasn’t. Let’s take a look at the panelists’ responses after making a couple of observations on the actual question and issue at hand.

    The first thing that jumped out at me is that teacher quality is presented in a somewhat decontextualized manner. Teachers don't work in a vacuum; quality is produced in specific settings. Placing the quality question in context can help to broaden the conversation to include: 1) the role of the organization in shaping educator learning and effectiveness; and 2) the intersection between teachers and schools, and the vital issue of employee-organization "fit."

    Second, the manner in which teacher quality is typically framed -- including in the Times question -- suggests that effectiveness is a (fixed) individual attribute (i.e., human capital) that teachers carry with them across contexts (i.e., it's portable). In reality, however, it is context-dependent and can be (and indeed is) developed among individuals -- as a result of their networks, their professional interactions, and their shared norms and trust (i.e., social capital). In sum, it's not just what teachers know but who they know and where they work -- as well as the interaction of these three.

  • The Smoke And The Fire From Evaluations Of Teach For America

    A recent study by the always reliable research organization Mathematica takes a look at the characteristics and test-based effectiveness of Teach For America (TFA) teachers who were recruited as part of a $50 million federal “Investing in Innovation” grant, which is supporting a substantial scale-up of TFA’s presence in U.S. public schools.

    The results of this study pertain to a small group of recruits (and comparison non-TFA teachers) from the first two years of the program – i.e., a sample of 156 PK-5 teachers (66 TFA and 90 non-TFA) in 36 schools spread throughout 10 states. What distinguishes the analysis methodologically is that it exploits the random assignment of students to teachers in these schools, which ensures that any measured differences between TFA and comparison teachers are not due to unobserved differences in the students they are assigned to teach.

    The Mathematica researchers found, in short, that the estimated differences in the impact of TFA and comparison teachers on math and reading scores across all grades were modest in magnitude and not statistically discernible at any conventional level. There were, however, meaningful positive estimated differences in the earliest grades (PK-2), though they were only statistically significant in reading, and the coefficient in reading for grades 3-5 was negative (and not significant). Let’s take a quick look at these and other findings from this report and what they might mean.

  • How Not To Improve New Teacher Evaluation Systems

    One of the more interesting recurring education stories over the past couple of years has been the release of results from several states’ and districts’ new teacher evaluation systems, including those from New York, Indiana, Minneapolis, Michigan and Florida. In most of these instances, the primary focus has been on the distribution of teachers across ratings categories. Specifically, there seems to be a pattern emerging, in which the vast majority of teachers receive one of the higher ratings, whereas very few receive the lowest ratings.

    This has prompted some advocates, and even some high-level officials, essentially to deem the new systems failures, since their results suggest that the vast majority of teachers are “effective” or better. As I have written before, this issue cuts both ways. On the one hand, the results coming out of some states and districts seem problematic, and these systems may need adjustment. On the other hand, there is a danger here: States may respond by making rash, ill-advised changes in order to achieve “differentiation for the sake of differentiation,” and the changes may end up undermining the credibility and threatening the validity of the systems on which these states have spent so much time and money.

    Granted, whether and how to alter new evaluations are difficult decisions, and there is no tried and true playbook. That said, New York Governor Andrew Cuomo’s proposals provide a stunning example of how not to approach these changes. To see why, let’s look at some sound general principles for improving teacher evaluation systems based on the first rounds of results, and how they compare with the New York approach.*

  • The Status Fallacy: New York State Edition

    A recent New York Times story directly addresses New York Governor Andrew Cuomo’s suggestion, in his annual “State of the State” speech, that New York schools are in a state of crisis and "need dramatic reform." The article’s general conclusion is that the “data suggest otherwise.”

    There are a bunch of important points raised in the article, but most of the piece is really just discussing student rather than school performance. Simple statistics about how highly students score on tests – i.e., “status measures” – tell you virtually nothing about the effectiveness of the schools those students attend, since, among other reasons, they don’t account for the fact that many students enter the system at low levels. How much students in a school know in a given year is very different from how much they learned over the course of that year.

    I (and many others) have written about this “status fallacy” dozens of times (see our resources page), not because I enjoy repeating myself (I don’t), but rather because I am continually amazed just how insidious it is, and how much of an impact it has on education policy and debate in the U.S. And it feels like every time I see signs that things might be changing for the better, there is an incident, such as Governor Cuomo’s speech, that makes me question how much progress there really has been at the highest levels.

  • Turning Conflict Into Trust Improves Schools And Student Learning

    Our guest author today is Greg Anrig, vice president of policy and programs at The Century Foundation and author of Beyond the Education Wars: Evidence That Collaboration Builds Effective Schools.

    In recent years, a number of studies (discussed below; also see here and here) have shown that effective public schools are built on strong collaborative relationships, including those between administrators and teachers. These findings have helped to accelerate a movement toward constructing such partnerships in public schools across the U.S. However, the growing research and expanding innovations aimed at nurturing collaboration have largely been neglected by both mainstream media and the policy community.

    Studies that explore the question of what makes successful schools work never find a silver bullet, but they do consistently pinpoint commonalities in how those schools operate. The University of Chicago's Consortium on Chicago School Research produced the most compelling research of this type, published in a book called Organizing Schools for Improvement. The consortium gathered demographic and test data, and conducted extensive surveys of stakeholders, in more than 400 Chicago elementary schools from 1990 to 2005. That treasure trove of information enabled the consortium to identify with a high degree of confidence the organizational characteristics and practices associated with schools that produced above-average improvement in student outcomes.

    The most crucial finding was that the most effective schools, based on test score improvement over time after controlling for demographic factors, had developed an unusually high degree of "relational trust" among their administrators, teachers, and parents.

  • Actual Growth Measures Make A Big Difference When Measuring Growth

    As a frequent critic of how states and districts present and interpret their annual testing results, I am also obliged (and indeed quite happy) to note when there is progress.

    Recently, I happened to be browsing through New York City’s presentation of their 2014 testing results, and to my great surprise, on slide number four, I found proficiency rate changes between 2013 and 2014 among students who were in the sample in both years (which they call “matched changes”). As it turns out, last year, for the first time, New York State as a whole began publishing these "matched" year-to-year proficiency rate changes for all schools and districts. This is an excellent policy. As we’ve discussed here many times, NCLB-style proficiency rate changes, which compare overall rates of all students, many of whom are only in the tested sample in one of the years, are usually portrayed as “growth” or “progress.” They are not. They compare different groups of students, and, as we’ll see, this can have a substantial impact on the conclusions one reaches from the data. Limiting the sample to students who were tested in both years, though not perfect, at least permits one to measure actual growth, and provides a much better idea of whether students are progressing over time.
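    To make the distinction concrete, here is a toy sketch (with entirely hypothetical students and proficiency results, not data from New York) showing how the "unmatched" cross-sectional change can point in a different direction than the "matched" longitudinal change when students enter and exit the tested sample between years:

    ```python
    # Toy illustration (hypothetical data): "unmatched" proficiency rate changes
    # compare different groups of students across years, while "matched" changes
    # follow the same students tested in both years.

    # Each student: (id, proficient_2013, proficient_2014); None = not tested that year.
    students = [
        ("a", True,  True),   # tested both years, proficient both years
        ("b", False, True),   # tested both years, crossed the threshold
        ("c", False, False),  # tested both years, below threshold both years
        ("d", True,  None),   # left the school after 2013
        ("e", None,  False),  # entered in 2014 at a low level
        ("f", None,  False),  # entered in 2014 at a low level
    ]

    def rate(vals):
        """Proficiency rate among tested students only."""
        tested = [v for v in vals if v is not None]
        return sum(tested) / len(tested)

    # Unmatched (cross-sectional): all tested students in each year.
    unmatched_change = rate([s[2] for s in students]) - rate([s[1] for s in students])

    # Matched (longitudinal): only students tested in both years.
    matched = [s for s in students if s[1] is not None and s[2] is not None]
    matched_change = rate([s[2] for s in matched]) - rate([s[1] for s in matched])

    print(f"unmatched change: {unmatched_change:+.2f}")  # negative: rate "fell"
    print(f"matched change:   {matched_change:+.2f}")    # positive: students grew
    ```

    In this contrived school, the overall rate drops (low-scoring entrants replace a proficient leaver), yet every matched student held steady or improved. The cross-sectional change says "decline"; the matched change says "growth."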

    This is an encouraging sign that New York State is taking steps to improve the quality and interpretation of their testing data. And, just to prove that no good deed goes unpunished, let’s see what we can learn using the new “matched” data – specifically, by seeing how often the matched (longitudinal) and unmatched (cross-sectional) changes lead to different conclusions about student “growth” in schools.

  • Sample Size And Volatility In School Accountability Systems

    It is generally well-known that sample size has an important effect on measurement and, therefore, incentives in test-based school accountability systems.

    Within a given class or school, for example, there may be students who are sick on testing day, or get distracted by a noisy peer, or just have a bad day. Larger samples attenuate the degree to which unusual results among individual students (or classes) can influence results overall. In addition, schools draw their students from a population (e.g., a neighborhood). Even if the characteristics of the neighborhood from which the students come stay relatively stable, the pool of students entering the school (or tested sample) can vary substantially from one year to the next, particularly when that pool is small.

    Classes and schools tend to be quite small, and test scores vary far more between- than within-student (i.e., over time). As a result, testing results often exhibit a great deal of nonpersistent variation (Kane and Staiger 2002). In other words, many of the differences in test scores between schools, and over time, are fleeting, and this problem is particularly pronounced in smaller schools. One very simple, though not original, way to illustrate this relationship is to compare the results for smaller and larger schools.
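    The mechanics can be sketched with a minimal simulation (the score distribution and school sizes below are illustrative assumptions, not real data): two schools draw students from the exact same stable population, and the only difference is enrollment. The smaller school's yearly average bounces around far more, even though nothing about its "effectiveness" changes:

    ```python
    # Minimal simulation: sample size drives year-to-year volatility in school
    # mean scores, even when the underlying student population is identical.
    import random
    import statistics

    random.seed(42)  # fixed seed so the sketch is reproducible

    def yearly_means(n_students, n_years=50):
        """Mean score for a school of n_students, with a fresh draw of
        students (mean 500, SD 100) each year from the same population."""
        return [
            statistics.mean(random.gauss(500, 100) for _ in range(n_students))
            for _ in range(n_years)
        ]

    small = yearly_means(25)    # a small elementary school
    large = yearly_means(400)   # a large school drawing from the same population

    # The SD of yearly means shrinks roughly with the square root of enrollment,
    # so the small school's results are several times more volatile.
    print(f"small school, SD of yearly means: {statistics.stdev(small):.1f}")
    print(f"large school, SD of yearly means: {statistics.stdev(large):.1f}")
    ```

    Rankings or accountability ratings based on single-year results will therefore churn far more among small schools, for reasons that have nothing to do with instruction.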

  • Preparing Effective Teachers For Every Community

    Our guest authors today are Frank Hernandez, Corinne Mantle-Bromley and Benjamin Riley. Dr. Hernandez is the dean of the College of Education at the University of Texas of the Permian Basin, and previously served as a classroom teacher and school and district administrator for 12 years. Dr. Mantle-Bromley is dean of the University of Idaho’s College of Education and taught in rural Idaho prior to her work preparing teachers for diverse K-12 populations. Mr. Riley is the founder of Deans for Impact, a new organization composed of deans of colleges of education working together to transform educator preparation in the US. 

    Students of color in the U.S., and those who live in rural communities, face unique challenges in receiving a high-quality education. All too often, new teachers have been inadequately prepared for these students’ specific needs. Perhaps just as often, their teachers do not look like them, and do not understand the communities in which these students live. Lacking adequate preparation and the cultural sensitivities that come only from time and experience within a community, many of our nation’s teachers are thrust into an almost unimaginably challenging situation. We simply do not have enough well-prepared teachers of color, or teachers from rural communities, who can successfully navigate the complexities of these education ecosystems.

    Some have described the lack of teachers of color and teachers who will serve in rural communities as a crisis of social justice. We agree. And, as the leaders of two colleges of education that prepare teachers who serve in these communities, we think the solution requires elevating the expectations for every program that prepares teachers and educators in this country.

  • The Debate And Evidence On The Impact Of NCLB

    There is currently a flurry of debate focused on the question of whether “NCLB worked.” This question, which surfaces regularly in the education field, is particularly salient in recent weeks, as Congress holds hearings on reauthorizing the law.

    Any time there is a spell of “did NCLB work?” activity, one can hear and read numerous attempts to use simple NAEP changes in order to assess its impact. Individuals and organizations, including both supporters and detractors of the law, attempt to make their cases by presenting trends in scores, parsing subgroup estimates, and so on. These efforts, though typically well-intentioned, do not, of course, tell us much of anything about the law’s impact. One can use simple, unadjusted NAEP changes to prove or disprove any policy argument. And the reason is that they are not valid evidence of an intervention's effects. There’s more to policy analysis than subtraction.

    But it’s not just the inappropriate use of evidence that makes these “did NCLB work?” debates frustrating and, often, unproductive. It is also the fact that NCLB really cannot be judged in simple, binary terms. It is a complex, national policy with considerable inter-state variation in design/implementation and various types of effects, intended and unintended. This is not a situation that lends itself to clear-cut yes/no answers to the “did it work?” question.