Understanding omitted-variable bias

By Valentine

One of the most important equations in econometrics – and in economics in general – is the equation for omitted-variable bias. This simple equation is a powerful tool for reasoning about the ways in which correlations we see in the data may differ from the causal relationships we care about. In this post, we'll begin by learning exactly what the omitted-variable bias equation is and where it comes from. Then we'll learn why the equation is so important and how to use it when interpreting data.

The Omitted-Variable Bias Equation

The omitted-variable bias equation relates the coefficients from three different linear regressions. It applies no matter what the dependent and independent variables are, but to make things concrete, let's focus on the relationship between income and education.

First, let's consider estimating a simple regression of individual income on years of school completed:

$$\text{Income}_i = \alpha_0 + \alpha_1 \text{School}_i + e_i$$

This regression relates the income of person i to the number of years they were in school. We'll call this regression the "short regression" because there's only one independent variable, School_i.

Now suppose we also observe whether person i was born in a high-income ZIP Code. We could then estimate a linear regression model that controls for each person's background:

$$\text{Income}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{HighIncomeZip}_i + v_i$$

Here, HighIncomeZip_i is a binary variable that equals 1 if person i was born in a high-income ZIP Code and 0 otherwise. Because we've added another independent variable, we'll call this the "long regression."

Finally, let's consider estimating a regression of each person's background on the number of years they go to school:

$$\text{HighIncomeZip}_i = \pi_0 + \pi_1 \text{School}_i + u_i$$

This regression is called the "auxiliary regression".

We're now ready to define the omitted-variable bias equation! The equation is:

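$$\alpha_1 = \beta_1 + \beta_2 \times \pi_1$$

where α₁ is the coefficient on School_i in the short regression, β₁ and β₂ are the coefficients on School_i and HighIncomeZip_i in the long regression, and π₁ is the coefficient on School_i in the auxiliary regression.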

In words, the equation says that the coefficient on School_i from the short regression is equal to the coefficient on School_i from the long regression plus a bias term. The size and sign of the bias depend on the relationship between the omitted variable, HighIncomeZip_i, and the outcome, Income_i (holding School_i fixed), as well as the relationship between School_i and the omitted variable.

This equation is simple enough to memorize, but it's worth taking a moment to think about why the expression makes sense. From our long regression, we know that, holding schooling fixed, being born in a high-income ZIP Code is associated with an increase in income of β₂, the long-regression coefficient on HighIncomeZip_i. And from the auxiliary regression, we know that each additional year of school is associated with an increase of π₁, the auxiliary-regression coefficient on School_i, in the probability of being from a high-income ZIP Code. So if we don't control for HighIncomeZip_i, then an additional year of school is associated with two increases in income: one that is directly related to school, β₁ (the long-regression coefficient on School_i), and one of β₂ × π₁ that comes from the fact that one more year of school is associated with an increase in HighIncomeZip_i of π₁. In other words, omitting HighIncomeZip_i causes some of the relationship between a person's background and their income to be attributed to their years of school.
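
To see the equation at work, here is a minimal sketch in Python that simulates a dataset and estimates the short, long, and auxiliary regressions with NumPy. Every number in the simulation (sample size, coefficients, noise levels) is an invented assumption for illustration, not an estimate from real data, and the helper `ols` is a hypothetical name introduced here.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, ...] of y on the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 5_000

# Simulated data: all parameter values below are made up for illustration.
high_income_zip = rng.binomial(1, 0.3, size=n).astype(float)
school = 12 + 2 * high_income_zip + rng.normal(0, 2, size=n)
income = 10 + 3 * school + 15 * high_income_zip + rng.normal(0, 10, size=n)

short = ols(income, school)                    # Income on School
long_ = ols(income, school, high_income_zip)   # Income on School and HighIncomeZip
aux = ols(high_income_zip, school)             # HighIncomeZip on School

# Bias term: long-regression coefficient on HighIncomeZip
# times auxiliary-regression coefficient on School.
bias = long_[2] * aux[1]

print(f"short coefficient on School:      {short[1]:.4f}")
print(f"long coefficient on School + bias: {long_[1] + bias:.4f}")
```

The two printed numbers should agree up to floating-point error. The equation is an algebraic property of least squares, so it holds in any sample, not just on average.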

As a final step to understanding the equation, we can derive it from our three regression equations. We start by substituting the auxiliary regression into the long regression and rearranging:

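$$\begin{aligned}
\text{Income}_i &= \beta_0 + \beta_1 \text{School}_i + \beta_2 \left(\pi_0 + \pi_1 \text{School}_i + u_i\right) + v_i \\
&= \left(\beta_0 + \beta_2 \pi_0\right) + \left(\beta_1 + \beta_2 \pi_1\right) \text{School}_i + \left(\beta_2 u_i + v_i\right)
\end{aligned}$$

Here β₀, β₁, β₂, and v_i are the coefficients and residual from the long regression, and π₀, π₁, and u_i are those from the auxiliary regression.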

Notice that substituting in the auxiliary regression eliminates the variable HighIncomeZip_i, leaving us with School_i as the only independent variable, just like in the short regression. By rearranging the expression, we have three distinct sets of terms: a constant, the variable School_i multiplied by a slope coefficient, and a residual term. This mirrors the structure of the short regression equation, and the terms are in fact identical to what you would get if you simply estimated the short regression: the residuals from the long and auxiliary regressions are each uncorrelated with School_i by construction, so their combination is too.
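
The claim that all three sets of terms match the short regression can be checked numerically. The sketch below is an illustration on simulated data with invented parameter values. The helper `fit` and the variable names are assumptions introduced here. It compares the constant, the slope, and the residuals one by one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Simulated data: all parameter values are made up for illustration.
zip_hi = rng.binomial(1, 0.4, size=n).astype(float)
school = 12 + 1.5 * zip_hi + rng.normal(0, 2, size=n)
income = 5 + 2 * school + 8 * zip_hi + rng.normal(0, 5, size=n)

def fit(y, X):
    """OLS with an intercept; returns (coefficients, residuals)."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

short_coef, short_resid = fit(income, school)
long_coef, long_resid = fit(income, np.column_stack([school, zip_hi]))
aux_coef, aux_resid = fit(zip_hi, school)

# Rebuild each piece of the short regression from the long and auxiliary ones.
constant_matches = np.isclose(short_coef[0], long_coef[0] + long_coef[2] * aux_coef[0])
slope_matches = np.isclose(short_coef[1], long_coef[1] + long_coef[2] * aux_coef[1])
residual_matches = np.allclose(short_resid, long_coef[2] * aux_resid + long_resid)

print(constant_matches, slope_matches, residual_matches)
```

All three checks should come out true, which is the numerical counterpart of the substitution argument.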

Using the Omitted-Variable Bias Equation: Correlation and Causation

The omitted-variable bias equation describes a general statistical relationship between any short and long regression. But the equation is most powerful when used to think about a short regression that describes a correlation in the data and a long regression that describes a causal relationship. Let's stick with our example of education and income, specifying a short, long, and auxiliary regression, but this time let's consider a different omitted variable:

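$$\text{Income}_i = \alpha_0 + \alpha_1 \text{School}_i + e_i \quad \text{(short)}$$

$$\text{Income}_i = y_0 + y_1 \text{School}_i + y_2 \text{EarningsPotential}_i + v_i \quad \text{(long)}$$

$$\text{EarningsPotential}_i = p_0 + p_1 \text{School}_i + u_i \quad \text{(auxiliary)}$$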

Now our omitted variable is EarningsPotential_i, which measures how much money person i would earn if they only stayed in school long enough to graduate from high school. It's reasonable to think that if we could control for an individual's earnings potential, then the relationship we would observe between income and school would be causal. That is, the regression coefficient y₁ represents the average causal effect of an additional year of school on earnings!

So, what does the omitted-variable bias equation say in this case? Writing out the expression, we have:

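$$\alpha_1 = y_1 + y_2 \times p_1$$

where α₁ is the coefficient on School_i in the short regression, y₁ and y₂ are the coefficients on School_i and EarningsPotential_i in the long regression, and p₁ is the coefficient on School_i in the auxiliary regression.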

That is, the correlation between education and income that we see in the data reflects the causal impact of additional years of school on income plus a bias term. What's more, we can use this expression to reason about the direction of the bias. To do so, we need only ask ourselves what we think the signs of the coefficients y₂ and p₁ are. So, what do you think? Is earnings potential positively or negatively correlated with income? Is it positively or negatively correlated with years of schooling?

If you suspect that people with higher earnings potential make more money and attend school longer, then both y₂ and p₁ are positive, which means that the bias, y₂ × p₁, is positive. That is, the relationship we see in the data between school and income is biased upwards relative to the true causal effect, since people who stay in school longer also have higher earnings potential. On the other hand, if you think that people with higher earnings potential make more money but complete less school (maybe because they drop out to start a successful tech company), then p₁ is negative and the bias, y₂ × p₁, is also negative. So the relationship we see in the data between school and income is biased downward relative to the true causal effect, since people who stay in school longer have lower earnings potential.
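
Both scenarios can be illustrated with a small simulation. In a simulation, unlike in real data, earnings potential can be generated directly, so the true causal effect is known. The sketch below is hedged accordingly: the function `short_vs_causal`, the causal effect of 2.0, and every other number are invented assumptions, chosen only so that y₂ is positive and the sign of p₁ can be flipped.

```python
import numpy as np

def short_vs_causal(p1_sign, n=20_000, seed=0):
    """Simulate one world and return (short-regression slope, true causal effect).

    p1_sign = +1: people with higher earnings potential get more schooling.
    p1_sign = -1: people with higher earnings potential get less schooling.
    """
    rng = np.random.default_rng(seed)
    causal_effect = 2.0                        # invented "true" y1
    potential = rng.normal(30, 5, size=n)      # invented EarningsPotential_i
    school = 14 + p1_sign * 0.2 * (potential - 30) + rng.normal(0, 2, size=n)
    # Earnings potential raises income directly (y2 = 1.0, an assumption).
    income = causal_effect * school + 1.0 * potential + rng.normal(0, 3, size=n)

    # Short regression: Income on School only, as if potential were unobserved.
    X = np.column_stack([np.ones(n), school])
    short_slope = np.linalg.lstsq(X, income, rcond=None)[0][1]
    return short_slope, causal_effect

up, truth = short_vs_causal(+1)
down, _ = short_vs_causal(-1)

print(f"true causal effect:                 {truth:.2f}")
print(f"short slope when p1 is positive:    {up:.2f}")
print(f"short slope when p1 is negative:    {down:.2f}")
```

Under these assumptions the short-regression slope lands above the true effect in the first world and below it in the second, matching the sign of y₂ × p₁.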

One powerful thing about using the omitted-variable bias equation in this way is that it allows you to think about omitted variables that you don't actually observe in your data and that you would never be able to observe in real data. EarningsPotential_i is a perfect example: there is no dataset of hypothetical earnings for people if they only finished high school. But we can still reason about how this unobservable variable would be correlated with the variables that we do observe, and this helps us understand how our regression estimates differ from the causal effects we're interested in.
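
One way to make this kind of reasoning concrete is to rearrange the equation so that the causal coefficient y₁ equals the short-regression coefficient minus y₂ × p₁, and then try out different guesses for the unobservable pieces. The sketch below is only an illustration: the function name `implied_causal_effect`, the short-regression slope of 3.0, and the candidate values of y₂ and p₁ are all hypothetical, not estimates from any dataset.

```python
def implied_causal_effect(short_slope, y2, p1):
    """Rearranged omitted-variable bias equation: y1 = short slope - y2 * p1."""
    return short_slope - y2 * p1

observed_short_slope = 3.0   # hypothetical value, not from real data

# Try a few guesses about the unobserved relationships.
for y2 in (0.5, 1.0):
    for p1 in (-0.5, 0.5, 1.0):
        y1 = implied_causal_effect(observed_short_slope, y2, p1)
        print(f"y2 = {y2:+.1f}, p1 = {p1:+.1f}  ->  implied y1 = {y1:.2f}")
```

Scanning the output shows how strongly a conclusion about the causal effect depends on what you believe about the omitted variable.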

Practice

For economists, applying the omitted-variable bias equation to think about how a regression estimate might be biased is second nature, but it comes from years of practice. Consider the following patterns in the data and try using the omitted-variable bias equation to reason about whether these patterns represent causal relationships or not. If not, do you think the true causal effect is larger or smaller than the observed correlations?

People in the hospital on average have worse health than people who aren't in the hospital.
Drinking green smoothies is associated with better heart health.
Students who go to a professor's office hours get better grades on average.
States with higher minimum wages have lower average unemployment.
Neighborhood housing costs tend to go up when new luxury apartments are built.
