Understanding omitted-variable bias

One of the most important equations in econometrics – and in economics in general – is the equation for omitted-variable bias. This simple equation is a powerful tool for reasoning about the ways in which correlations we see in the data may differ from the causal relationships we care about. In this post, we'll begin by learning exactly what the omitted-variable bias equation is and where it comes from. Then we'll learn why the equation is so important and how to use it when interpreting data.

The Omitted-Variable Bias Equation

The omitted-variable bias equation relates the coefficients from three different linear regressions. It applies no matter what the dependent and independent variables are, but to make things concrete, let's focus on the relationship between income and education.

First, let's consider estimating a simple regression of individual income on years of school completed:

$$\text{Income}_i = \beta_0 + \beta_1 \text{School}_i + e_i$$

This regression relates the income of person i to the number of years they were in school. We'll call this regression the "short regression" because there's only one independent variable, $\text{School}_i$.

Now suppose we also observe data on whether person i was born in a high-income ZIP Code. We could then estimate a linear regression model that controls for each person's background:

$$\text{Income}_i = \gamma_0 + \gamma_1 \text{School}_i + \gamma_2 \text{HighIncomeZip}_i + \varepsilon_i$$

Here, $\text{HighIncomeZip}_i$ is a binary variable that equals 1 if person i was born in a high-income ZIP Code. Because we've added another independent variable, we'll call this the "long regression."

Finally, let's consider estimating a regression that relates each person's background to the number of years they go to school:

$$\text{HighIncomeZip}_i = \pi_0 + \pi_1 \text{School}_i + u_i$$

This regression is called the "auxiliary regression."

We're now ready to define the omitted-variable bias equation! The equation is:

$$\beta_1 = \gamma_1 + \gamma_2 \pi_1$$

In words, the equation says that the coefficient on $\text{School}_i$ from the short regression is equal to the coefficient on $\text{School}_i$ from the long regression plus a bias term. The size and sign of the bias depend on the relationship between the omitted variable, $\text{HighIncomeZip}_i$, and the outcome, $\text{Income}_i$, as well as the relationship between $\text{School}_i$ and the omitted variable.

This equation is simple enough to memorize, but it's worth taking a moment to think about why the expression makes sense. From our long regression, we know that being born in a high-income ZIP Code is associated with an increase in income of $\gamma_2$ (the coefficient on $\text{HighIncomeZip}_i$), holding years of school fixed. And from the auxiliary regression, we know that each additional year of school is associated with an increase of $\pi_1$ (the auxiliary slope) in the probability of being from a high-income ZIP Code. So if we don't control for $\text{HighIncomeZip}_i$, then an additional year of school is associated with an increase in income that is directly related to school, $\gamma_1$, plus an increase in income of $\gamma_2 \pi_1$ that comes from the fact that one more year of school is associated with an increase in $\text{HighIncomeZip}_i$ of $\pi_1$. In other words, omitting $\text{HighIncomeZip}_i$ causes some of the relationship between a person's background and their income to be attributed to their years of school.
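To make the intuition concrete, here is a worked example with made-up numbers (none of them come from real data). Write $\gamma_1$ for the long-regression coefficient on years of school, $\gamma_2$ for the long-regression coefficient on the high-income ZIP Code indicator, $\pi_1$ for the slope of the auxiliary regression, and $\beta_1$ for the short-regression coefficient on years of school. Suppose income is measured in dollars per year:

```latex
\begin{aligned}
\gamma_1 &= 2{,}000, \quad \gamma_2 = 10{,}000, \quad \pi_1 = 0.05 \\
\beta_1 &= \gamma_1 + \gamma_2 \pi_1 = 2{,}000 + 10{,}000 \times 0.05 = 2{,}500
\end{aligned}
```

Under these assumed values, the short regression would attribute 2,500 dollars to each year of school, of which 500 dollars is really the background effect showing up through schooling.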

As a final step to understanding the equation, we can derive it from our three regression equations. We start by substituting the auxiliary regression into the long regression and rearranging:

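$$\begin{aligned}
\text{Income}_i &= \gamma_0 + \gamma_1 \text{School}_i + \gamma_2 \left(\pi_0 + \pi_1 \text{School}_i + u_i\right) + \varepsilon_i \\
&= \left(\gamma_0 + \gamma_2 \pi_0\right) + \left(\gamma_1 + \gamma_2 \pi_1\right) \text{School}_i + \left(\varepsilon_i + \gamma_2 u_i\right)
\end{aligned}$$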

Notice that substituting in the auxiliary regression eliminates the variable $\text{HighIncomeZip}_i$, leaving us with $\text{School}_i$ as the only independent variable, just like in the short regression. By rearranging the expression, we have three distinct sets of terms: a constant, the variable $\text{School}_i$ multiplied by a slope coefficient, and a residual term. This mirrors the structure of the short regression equation, and the terms are in fact identical to what you would get if you simply estimated the short regression, because the residuals from the long and auxiliary regressions are each uncorrelated with $\text{School}_i$ by construction. Matching the slope coefficients gives the omitted-variable bias equation.
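One way to convince yourself that the equation is an exact property of least-squares estimates is to check it numerically. The sketch below simulates data (every number in the data-generating process is made up for illustration), estimates the short, long, and auxiliary regressions with NumPy, and compares the short-regression slope with the long-regression slope plus the bias term. The variable names and the helper `ols` are hypothetical choices for this example.

```python
import numpy as np


def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, ...] of y on the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
n = 5_000

# Simulated data: all parameter values below are invented for illustration.
high_income_zip = rng.binomial(1, 0.3, size=n).astype(float)
school = 12 + 2 * high_income_zip + rng.normal(0, 2, size=n)
income = 10 + 3 * school + 15 * high_income_zip + rng.normal(0, 10, size=n)

beta = ols(income, school)                    # short regression
gamma = ols(income, school, high_income_zip)  # long regression
pi = ols(high_income_zip, school)             # auxiliary regression

short_slope = beta[1]
implied_slope = gamma[1] + gamma[2] * pi[1]   # long slope + bias term

print("short-regression slope:     ", short_slope)
print("long slope + gamma_2 * pi_1:", implied_slope)
```

The two printed numbers should agree up to floating-point error, whatever seed or sample size you choose, because the relationship is an algebraic identity among the three sets of estimates rather than an approximation.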

Using the Omitted-Variable Bias Equation: Correlation and Causation

The omitted-variable bias equation describes a general statistical relationship between any short and long regression. But the equation is most powerful when used to think about a short regression that describes a correlation in the data and a long regression that describes a causal relationship. Let's stick with our example of education and income, specifying a short, long, and auxiliary regression, but this time let's consider a different omitted variable:

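$$\begin{aligned}
\text{Short:} \quad & \text{Income}_i = \beta_0 + \beta_1 \text{School}_i + e_i \\
\text{Long:} \quad & \text{Income}_i = \gamma_0 + \gamma_1 \text{School}_i + \gamma_2 \text{EarningsPotential}_i + \varepsilon_i \\
\text{Auxiliary:} \quad & \text{EarningsPotential}_i = \pi_0 + \pi_1 \text{School}_i + u_i
\end{aligned}$$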

Now our omitted variable is $\text{EarningsPotential}_i$, which measures how much money person i would earn if they only stayed in school long enough to graduate from high school. It's reasonable to think that if we could control for an individual's earnings potential, then the relationship we would observe between income and school would be causal. That is, the regression coefficient $\gamma_1$ (the coefficient on $\text{School}_i$ in the long regression) represents the average causal effect of an additional year of school on earnings!

So, what does the omitted-variable bias equation say in this case? Writing out the expression, we have:

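$$\underbrace{\beta_1}_{\text{correlation in the data}} = \underbrace{\gamma_1}_{\text{causal effect}} + \underbrace{\gamma_2 \times \pi_1}_{\text{bias}}$$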

That is, the correlation between education and income that we see in the data represents the causal impact of additional years of school on income plus a bias term. What's more, we can use this expression to reason about the direction of the bias. To do so, we need only ask ourselves what we think the signs of the coefficients $\gamma_2$ and $\pi_1$ are. So, what do you think? Is earnings potential positively or negatively correlated with income? Is it positively or negatively correlated with years of schooling?

If you suspect that people with higher earnings potential make more money and attend school longer, then both $\gamma_2$ and $\pi_1$ are positive, which means that the bias, $\gamma_2 \times \pi_1$, is positive. That is, the relationship we see in the data between school and income is biased upward relative to the true causal effect, since people who stay in school longer also have higher earnings potential. On the other hand, if you think that people with higher earnings potential make more money but complete less school (maybe because they drop out to start a successful tech company), then $\gamma_2$ is still positive but $\pi_1$ is negative, so the bias, $\gamma_2 \times \pi_1$, is negative. In that case the relationship we see in the data between school and income is biased downward relative to the true causal effect, since people who stay in school longer have lower earnings potential.
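The two scenarios above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the true causal effect of a year of school is set to 3, earnings potential raises income by 5 per unit, and the only thing that changes between the two runs is whether earnings potential is positively or negatively related to schooling. The function name `short_regression_slope` and all numeric values are hypothetical.

```python
import numpy as np

TRUE_EFFECT = 3.0       # assumed causal effect of one more year of school
POTENTIAL_EFFECT = 5.0  # assumed effect of earnings potential on income (positive)


def short_regression_slope(pi_sign, n=20_000, seed=0):
    """Slope from regressing income on school alone, omitting earnings potential.

    pi_sign=+1: people with higher potential stay in school longer.
    pi_sign=-1: people with higher potential leave school earlier.
    """
    rng = np.random.default_rng(seed)
    potential = rng.normal(0.0, 1.0, size=n)
    school = 14.0 + pi_sign * potential + rng.normal(0.0, 1.0, size=n)
    income = (20.0 + TRUE_EFFECT * school
              + POTENTIAL_EFFECT * potential
              + rng.normal(0.0, 1.0, size=n))
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, _intercept = np.polyfit(school, income, deg=1)
    return slope


upward = short_regression_slope(pi_sign=+1)
downward = short_regression_slope(pi_sign=-1)

print("true causal effect:           ", TRUE_EFFECT)
print("short slope, positive relation:", upward)
print("short slope, negative relation:", downward)
```

With this setup the auxiliary slope is about 0.5 in magnitude, so the bias is roughly 5 × 0.5 = 2.5 in either direction: the short-regression slope lands near 5.5 when the relationship is positive and near 0.5 when it is negative, on opposite sides of the true effect of 3.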

One powerful thing about using the omitted-variable bias equation in this way is that it allows you to think about omitted variables that you don't actually observe in your data and that you never would be able to observe in real data. $\text{EarningsPotential}_i$ is a perfect example of this: there is no dataset of hypothetical earnings for people if they only finished high school. But we can still reason about how this unobservable variable would be correlated with the variables that we do observe, and this helps us understand how our regression estimates differ from the causal effects we're interested in.
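One hedged way to put this reasoning to work is to rearrange the omitted-variable bias equation and ask what causal effect would be implied by an observed short-regression slope under different guesses about the unobserved variable. The sketch below does that for a made-up observed slope of 5.0 and a few made-up guesses for the two unobservable coefficients (the effect of the omitted variable on income and the auxiliary slope). The function name and every number are hypothetical.

```python
def implied_causal_effect(short_slope, omitted_effect, auxiliary_slope):
    """Rearranged omitted-variable bias equation.

    causal effect = short-regression slope - (omitted_effect * auxiliary_slope)
    """
    return short_slope - omitted_effect * auxiliary_slope


observed_short_slope = 5.0  # hypothetical estimate from the data

# Hypothetical guesses about coefficients we cannot estimate directly.
scenarios = [
    (omitted_effect, auxiliary_slope)
    for omitted_effect in (1.0, 2.0)
    for auxiliary_slope in (-0.5, 0.5, 1.0)
]

for omitted_effect, auxiliary_slope in scenarios:
    effect = implied_causal_effect(
        observed_short_slope, omitted_effect, auxiliary_slope
    )
    print(
        f"omitted effect={omitted_effect:4.1f}, "
        f"auxiliary slope={auxiliary_slope:5.1f} -> "
        f"implied causal effect={effect:5.2f}"
    )
```

A table like this does not tell you which scenario is true. It only shows how much your conclusion depends on what you believe about the variable you cannot see.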

Practice

For economists, applying the omitted-variable bias equation to think about how a regression estimate might be biased is second nature, but it comes from years of practice. Consider the following patterns in the data and try using the omitted-variable bias equation to reason about whether these patterns represent causal relationships or not. If not, do you think the true causal effect is larger or smaller than the observed correlations?

  • People in the hospital on average have worse health than people who aren't in the hospital.
  • Drinking green smoothies is associated with better heart health.
  • Students who go to a professor's office hours get better grades on average.
  • States with higher minimum wages have lower average unemployment.
  • Neighborhood housing costs tend to go up when new luxury apartments are built.

Valentine graduated from the University of Rochester with a BA in Economics. After a 2-year stint as a Peace Corps Volunteer, he is currently pursuing his PhD in Public Policy at Harvard University.
