Understanding the central limit theorem


The central limit theorem (CLT for short) is an enormously powerful tool that makes much of what we do in statistics possible. But if you just read the actual definition, which you can find below, it’s pretty hard to understand why this theorem is so important. This blog post will help you understand both what the CLT is and why it is important for statistical inference. 

“The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.”
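In symbols, the definition above can be sketched roughly as follows. The notation here is an assumption added for illustration and is not part of the quoted definition: n is the size of each sample, X_1 through X_n are the individual observations in a sample, and X̄_n is that sample's mean.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\bar{X}_n \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right)
\quad \text{for sufficiently large } n.
```

Read this as: the sample mean is approximately normally distributed, centered on the population mean μ, with a standard deviation of σ/√n.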

Taking a step back

First, let’s take a step back and imagine a variable that is normally distributed in the population of interest. (If you need a refresher on what this means, check out our blog post on the normal distribution.) Let’s use the example of height (a variable) for men in the United States (the population of interest).

[Figure: distribution of heights for men in the U.S., with a mean of 70.9 inches]

We can see here that the mean height for men in the U.S. is 70.9 inches. In principle, this number would be calculated by collecting the height of every single man in the U.S. and then taking the mean. This value is called the population mean.

But you can imagine that it’s really hard, probably pretty much impossible, to collect the heights for every single man in the U.S. This means that it’s impossible for us to know the true population mean (AKA the true mean of the heights of all the men in the U.S.). So what do we do?

Taking samples

Because it’s impossible to know the population mean, what we do in statistics is try to gather enough information to let us make good guesses about what the population mean is. In other words, we try to approximate the population mean as best we can with the information we have. 

To gather this information, we need to take random samples that are representative of the population. This means we can find a random group of men who are representative of men in the U.S. and find out what their heights are, and then calculate the mean. This value is called the sample mean. 

But what would happen if we take a sample, calculate the sample mean, and then just use that as our approximation for the population mean? The problem with this approach is that because the samples we take are random, if you take 100 different samples, you will probably get many different sample means. This means there’s no way to know whether the one sample you took is actually a good approximation of the population mean.

For example, imagine you take a sample of 100 random men in the U.S. and find that the sample mean of their heights is 65 inches. But then imagine you take another sample of 100 different random men in the U.S. and find a sample mean of 75 inches. These sample-to-sample differences are called sampling fluctuations, and they are the reason you can’t simply treat a single sample mean as the population mean.
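To see sampling fluctuations for yourself, here is a minimal simulation sketch. It assumes heights are normally distributed with the mean from this post (70.9 inches) and a standard deviation of 3 inches; the standard deviation is a made-up value for illustration, as are the names `POP_MEAN`, `POP_SD`, and `sample_mean`.

```python
import random
import statistics

# Assumed population parameters, for illustration only: the mean comes from
# the post, while the standard deviation of 3 inches is hypothetical.
POP_MEAN = 70.9
POP_SD = 3.0


def sample_mean(rng, n):
    """Draw n heights from the assumed normal population and return their mean."""
    heights = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    return statistics.fmean(heights)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Two independent samples of 100 men each
first = sample_mean(rng, 100)
second = sample_mean(rng, 100)

print(f"sample 1 mean: {first:.2f} inches")
print(f"sample 2 mean: {second:.2f} inches")
print(f"difference:    {abs(first - second):.2f} inches")
```

Under these assumptions the two means will usually differ by only a fraction of an inch, far less than the 65 versus 75 in the example above, which is exaggerated to make the point. The gap is still not zero, and that is the fluctuation we have to deal with.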

Enter: the central limit theorem

Now imagine that instead of taking one sample of 100 random men in the U.S., you take 100 such samples. Each of these samples will have its own sample mean, which means we will have 100 total sample means. 

From there, imagine if you plotted a histogram with each of these sample means. Remember that each sample mean is just a number, so you can plot it just like you would with the heights of individuals. The distribution of sample means is called the sampling distribution, and our 100 sample means give us an empirical picture of it. The question is, what would this distribution look like?

This is where the central limit theorem finally kicks in. According to the definition we saw above, the central limit theorem states that the distribution of these sample means (AKA the sampling distribution) will be approximately normal. This means that if we plot the 100 sample means we calculated in a histogram, the distribution of the sample means will be approximately normal.
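Here is a rough sketch of that experiment: 100 samples of 100 men each, one mean per sample, and a crude text histogram of the results. As before, the population standard deviation of 3 inches is a hypothetical value chosen for illustration, and the variable names are my own.

```python
import random
import statistics

POP_MEAN = 70.9  # from the post
POP_SD = 3.0     # hypothetical, for illustration only

rng = random.Random(1)  # fixed seed so the run is repeatable

# 100 samples of 100 men each; keep one mean per sample
sample_means = [
    statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(100))
    for _ in range(100)
]

center = statistics.fmean(sample_means)  # mean of the sample means
spread = statistics.stdev(sample_means)  # standard deviation of the sample means

# Crude text histogram: bucket each sample mean to the nearest 0.2 inch
bins = {}
for m in sample_means:
    key = round(round(m / 0.2) * 0.2, 1)
    bins[key] = bins.get(key, 0) + 1
for key in sorted(bins):
    print(f"{key:5.1f} | {'#' * bins[key]}")

# For a roughly normal shape, about two thirds of the values should fall
# within one standard deviation of the center.
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / len(sample_means)
print(f"share of means within one SD of the center: {within_one_sd:.0%}")
```

The histogram should look roughly bell-shaped and sit close to 70.9. With only 100 sample means it will be a little lumpy, which is why the normality is described as approximate.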

Putting it all together

To understand why all of this is important, remember the problem I laid out at the start: because we can’t measure the population mean directly, we need a way to approximate it. But we also can’t just take one sample, calculate the mean, and assume that is the population mean. Now the central limit theorem tells us that if we take multiple samples, the means of those samples will be distributed approximately normally. 

The crucial consequence of this is that the sampling distribution is centered on the population mean, so the mean of our sample means approximates the population mean. On top of that, as the size of each sample increases (i.e. if we took multiple samples of 1,000 men instead of just 100), the sample means cluster more and more tightly around the population mean, because the spread of the sampling distribution shrinks in proportion to 1/√n, where n is the size of each sample. Here is a step-by-step run-through of the logic:

  1. Take multiple random samples of your population of interest (a common rule of thumb is that each sample should contain at least 30 observations, so in this case you would take multiple samples of the heights of at least 30 men each)

  2. Calculate the mean for each sample and plot them to get the sampling distribution (which will be approximately normally distributed)

  3. Calculate the mean of the sampling distribution

  4. Then, the CLT tells us that the mean of the sampling distribution is a good approximation of the population mean, and that the individual sample means land closer to it as the size of each of your samples increases
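The four steps above can be sketched in a few lines of code. This is a simulation under stated assumptions, not a recipe for real data: the population is taken to be normal with the post's mean of 70.9 inches and a hypothetical standard deviation of 3 inches, and the function name `sampling_distribution` and the choice of 1,000 samples per sample size are mine.

```python
import math
import random
import statistics

POP_MEAN = 70.9  # from the post
POP_SD = 3.0     # hypothetical, for illustration only


def sampling_distribution(rng, sample_size, num_samples=1000):
    """Steps 1 and 2: draw num_samples samples and return the list of their means."""
    return [
        statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


rng = random.Random(2)  # fixed seed so the run is repeatable
results = {}
for n in (30, 100, 1000):
    means = sampling_distribution(rng, n)
    center = statistics.fmean(means)  # step 3: mean of the sampling distribution
    spread = statistics.stdev(means)  # how tightly the sample means cluster
    results[n] = (center, spread)
    # Step 4: compare against the population mean and the predicted spread
    print(
        f"n={n:5d}  center={center:.3f}  spread={spread:.3f}  "
        f"sigma/sqrt(n)={POP_SD / math.sqrt(n):.3f}"
    )
```

In each row the center should sit very close to 70.9, while the spread should shrink as the sample size grows and track σ/√n. That shrinking spread is what makes larger samples give more reliable estimates.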

 

In this way, the CLT helps us approximate the population mean even when we can’t directly measure it by allowing us to make conclusions based on the samples we can directly measure. That is why the CLT is crucial for many of the inferences that underlie the practice of statistics. 

Sources

https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/BS704_Probability12.html

http://www.math.iup.edu/~clamb/class/math217/3_1-normal-distribution/

Katie graduated magna cum laude from Columbia University with a double major in Political Science and Philosophy. Now, she is pursuing a Master of Public Policy degree at the Harvard Kennedy School, a program which focuses on empirical skills like economics and statistics for the purpose of applying them to creating effective public policy.
