Part 4: The Central Limit Theorem
In Part 3, we derived the normal distribution and the 68-95-99.7 rule. But we left a mystery unsolved: why does the normal distribution appear everywhere, even in data that isn't normal?
The answer is the Central Limit Theorem. Let's see why it works.
The phenomenon
Before we prove anything, let's observe what happens when we repeatedly add random things together.
Dice rolling experiment
A single die gives 6 equally likely outcomes (1-6). This is a uniform distribution - completely non-normal, completely flat.
But what happens when we add dice together?
Experiment 1: Start with 1 die. The distribution is flat - each outcome equally likely.
Experiment 2: Select 2 dice. Notice the triangular shape - 7 is most common because it has the most combinations (1+6, 2+5, 3+4, 4+3, 5+2, 6+1).
Experiment 3: Go to 6 or 10 dice. The dashed red line shows the theoretical normal curve. The histogram matches almost perfectly.
Uniform + Uniform + Uniform + ... = Normal. Always.
This isn't special to dice. It happens with any distribution. Add uniform random numbers, you get normal. Add exponential random numbers, normal. Add bimodal random numbers, still normal. Add anything, normal.
When you average many independent random things together, the result is approximately normally distributed - regardless of what the original things looked like.
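If you want to check this without the interactive demo, here is a minimal NumPy sketch of the dice experiment (the seed, the 100,000 trials, and the choice of 1, 2, and 10 dice are all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Roll n fair dice 100,000 times and look at the distribution of the sums.
for n in (1, 2, 10):
    sums = rng.integers(1, 7, size=(100_000, n)).sum(axis=1)
    # The CLT predicts the sums approach a normal with mean 3.5*n
    # and variance (35/12)*n; compare the simulated moments to those values.
    print(n, sums.mean(), sums.std(), 3.5 * n, np.sqrt(35 / 12 * n))
```

Plotting a histogram of the sums for n = 10 against the matching normal curve reproduces the near-perfect overlap described above.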
The mathematical statement
Now let's formalize this observation into a theorem.
Setup and assumptions
Conditions for CLT:
Let X₁, X₂, ..., Xₙ be random variables that are:
- Independent: Knowing one value tells you nothing about others
- Identically distributed: All drawn from the same distribution
- Finite variance: The source distribution has variance σ² < ∞
Then the sample mean X̄ₙ = (X₁ + ... + Xₙ)/n follows a specific pattern as n → ∞.
The theorem statement
Walk through the derivation: Use Previous/Next to step through the formal statement. Each step builds on the previous.
Key steps to notice:
- Step 3: Expected value of sample mean equals population mean
- Step 4: Variance shrinks as 1/n
- Step 6: The remarkable conclusion - convergence to normal
The click: Step 6 is the surprising one. Steps 1-5 are just algebra. Step 6 is deep mathematics.
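For reference, the algebra behind steps 3 and 4 is short. Using linearity of expectation and the independence of the Xᵢ:

E[X̄ₙ] = (1/n)(E[X₁] + ... + E[Xₙ]) = (1/n)(nμ) = μ

Var(X̄ₙ) = (1/n²)(Var(X₁) + ... + Var(Xₙ)) = (1/n²)(nσ²) = σ²/n

So the sample mean is centered at μ with standard deviation σ/√n - exactly the scaling that appears in the formal statement below.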
The Central Limit Theorem (Formal)
Let X₁, X₂, ..., Xₙ be i.i.d. random variables with mean μ and finite variance σ².
Define the standardized sample mean:
Zₙ = (X̄ₙ - μ) / (σ/√n)
Then as n → ∞:
Zₙ →ᵈ N(0, 1)
In English: The standardized sample mean converges in distribution to the standard normal.
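Here is one way to check the statement numerically - a rough sketch using an exponential source (so μ = σ = 1); the sample size n = 50, the 100,000 repetitions, and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Source distribution: exponential with mean mu = 1 and sigma = 1 (far from normal).
mu, sigma, n = 1.0, 1.0, 50

# Z_n = (sample mean - mu) / (sigma / sqrt(n)) for many independent samples.
samples = rng.exponential(scale=1.0, size=(100_000, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# If the CLT holds, Z_n should look like N(0, 1).
print(z.mean(), z.std())          # close to 0 and 1
print(np.mean(np.abs(z) < 1.96))  # close to 0.95
```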
Why does this happen?
The CLT isn't magic - there's a mathematical reason why adding things creates normality. Let's understand why.
Convolution
When you add two independent random variables, their probability distributions convolve. Mathematically:
If Z = X + Y, then f_Z(z) = ∫ f_X(x) × f_Y(z-x) dx
This convolution has a smoothing effect. Each convolution fills in gaps, reduces peaks, and creates a more symmetric, bell-shaped result.
The dice example in detail
Consider adding two dice:
Sum = 2: Only 1+1 = 1 way
Sum = 3: 1+2, 2+1 = 2 ways
Sum = 4: 1+3, 2+2, 3+1 = 3 ways
...
Sum = 7: 1+6, 2+5, 3+4, 4+3, 5+2, 6+1 = 6 ways
...
Sum = 12: Only 6+6 = 1 way
The middle values have more combinations than the extremes. This is the convolution effect creating a triangular shape.
With more dice, this effect compounds. The middle gets more and more combinations relative to the extremes, creating the characteristic bell shape.
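You can watch this compounding directly by convolving the die's probability mass function with itself once per extra die - a small NumPy sketch (stopping at 10 dice is an arbitrary choice):

```python
import numpy as np

# Probability mass function of a single fair die (faces 1 through 6).
die = np.ones(6) / 6

# Each convolution adds one die to the sum.
pmf = die.copy()
for _ in range(9):              # build up to the sum of 10 dice
    pmf = np.convolve(pmf, die)

# pmf[k] is P(sum of 10 dice == k + 10); the shape is already very bell-like.
print(len(pmf))                 # 51 possible sums: 10 through 60
print(pmf.argmax() + 10)        # the mode: 35 = 10 * 3.5
```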
Characteristic functions (the deep reason)
The formal proof uses characteristic functions. The characteristic function of a sum equals the product of characteristic functions:
φ_{X+Y}(t) = φ_X(t) × φ_Y(t)
For the standardized sum, as n → ∞, the characteristic function converges to:
φ(t) = exp(-t²/2)
This is the characteristic function of the standard normal distribution. QED.
Adding random variables multiplies their characteristic functions. When you multiply many of them together and take a limit, the result depends only on the first few terms of their Taylor expansions. Once standardized to mean 0 and variance 1, every distribution with finite variance has the same first two Taylor terms - so they all converge to the same limit: the normal distribution.
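Sketching that limit for the standardized mean: let φ be the characteristic function of a single standardized term (X - μ)/σ, which has mean 0 and variance 1. Independence gives

φ_Zₙ(t) = [φ(t/√n)]ⁿ

and keeping only the first two Taylor terms gives φ(s) ≈ 1 - s²/2 for small s, so

φ_Zₙ(t) ≈ (1 - t²/(2n))ⁿ → exp(-t²/2) as n → ∞

which is exactly the normal characteristic function above.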
Seeing CLT in action
Let's verify the CLT empirically with different source distributions.
Experiment 1: Start with Exponential (heavily right-skewed). With n=5, the sample means are still skewed. Increase to n=30 - normality emerges.
Experiment 2: Try Bimodal (two peaks). Even this extreme shape produces normal sample means by n=30.
Experiment 3: Use Discrete (dice). Sample means become continuous and normal despite starting with just 6 discrete values.
The insight: The source distribution's shape doesn't matter. Only n matters.
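As a quick numerical check of that insight (a sketch; the source distributions, sample sizes, and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two very non-normal sources: heavily skewed and purely discrete.
sources = {
    "exponential": lambda size: rng.exponential(size=size),
    "dice":        lambda size: rng.integers(1, 7, size=size),
}

for name, draw in sources.items():
    for n in (5, 30):
        means = draw(size=(50_000, n)).mean(axis=1)
        # The skewness of the sampling distribution should move toward 0 as n grows.
        print(name, n, round(float(stats.skew(means)), 3))
```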
The n ≥ 30 rule
You've probably heard "sample size should be at least 30." But why 30? Is it magic?
The Berry-Esseen bound
The Berry-Esseen theorem quantifies how fast the CLT converges:
|P(Zₙ ≤ z) - Φ(z)| ≤ C × ρ / (σ³ × √n)
Where:
- ρ = E[|X - μ|³] (related to skewness)
- C ≈ 0.4748 (a constant)
- Φ(z) is the standard normal CDF
Key insight: The error bound shrinks as 1/√n. To halve the error, quadruple n.
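To see the scaling, here is a sketch that evaluates the bound for an exponential(1) source (μ = σ = 1, so only ρ needs computing). Keep in mind that Berry-Esseen is a worst-case guarantee over every z; the error you actually observe at a typical z is usually much smaller than the bound:

```python
import numpy as np
from scipy import integrate

C = 0.4748  # the constant from the bound above

# rho = E[|X - mu|^3] for the exponential(1) distribution, by numerical integration.
rho, _ = integrate.quad(lambda x: abs(x - 1) ** 3 * np.exp(-x), 0, np.inf)

for n in (10, 30, 100, 1000):
    bound = C * rho / np.sqrt(n)   # sigma**3 = 1 here
    print(n, round(bound, 3))      # quadrupling n halves the bound
```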
Why n = 30 works (usually)
For most "reasonable" distributions, by n = 30 the approximation error is typically less than 5%, and the normal approximation is accurate enough for practical purposes. But this threshold depends on skewness:
Click each distribution type to see the recommended minimum n. Notice how more skewed distributions need larger samples.
The key takeaway: n = 30 is a rule of thumb, not a law of nature. For heavily skewed data, you may need n = 50 or more.
When is n = 30 not enough? Highly skewed distributions (exponential, log-normal), distributions with heavy tails (such as Pareto), data with extreme outliers, and situations where you need high precision rather than a rough approximation. The Cauchy distribution is a more extreme case: its variance is infinite, so the CLT does not apply to it at all. For these cases, either use a larger n or consider non-parametric methods.
Why CLT changes everything
The CLT is the foundation of modern statistics. Here's what it enables.
Confidence intervals
We can construct confidence intervals for ANY population, because sample means are always approximately normal:
95% CI: x̄ ± 1.96 × (σ/√n)
This works whether the population is normal, skewed, bimodal, or anything else (given sufficient n).
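A minimal sketch of that interval, assuming σ is known (in practice you would usually substitute the sample standard deviation, or use a t-interval for small n); the helper name ci_95 and the example data are illustrative:

```python
import numpy as np

def ci_95(sample, sigma):
    """95% confidence interval for the mean, using the CLT normal approximation."""
    x_bar = np.mean(sample)
    half_width = 1.96 * sigma / np.sqrt(len(sample))
    return x_bar - half_width, x_bar + half_width

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=100)  # skewed population with true mean 2
print(ci_95(data, sigma=2.0))                # should usually cover 2
```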
Hypothesis testing
We can calculate p-values by comparing observed sample means to the normal distribution:
z = (x̄ - μ₀) / (σ/√n)
If |z| > 1.96, the result is "statistically significant" at p < 0.05.
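A matching sketch of the two-sided z-test (again with σ assumed known; the helper name z_test and the numbers are illustrative):

```python
import numpy as np
from scipy import stats

def z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean equals mu0, with sigma known."""
    z = (np.mean(sample) - mu0) / (sigma / np.sqrt(len(sample)))
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, p_value

rng = np.random.default_rng(4)
sample = rng.normal(loc=10.4, scale=2.0, size=50)
print(z_test(sample, mu0=10.0, sigma=2.0))   # |z| > 1.96 <=> p < 0.05
```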
Standard errors
The standard error σ/√n tells us exactly how much sample means vary. This precision comes from the CLT.
Polling and surveys
A poll of 1,000 people can predict the behavior of 330 million because the CLT guarantees the sample proportion is approximately normal. For n = 1,000 and p near 0.5, the standard error √(p(1-p)/n) is about 0.016, so the 95% CI is the observed proportion ± roughly 3 percentage points.
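The arithmetic behind that margin fits in a few lines (taking p = 0.5, the worst case, which is why pollsters quote roughly ±3 points for n = 1,000):

```python
import numpy as np

n, p = 1000, 0.5
se = np.sqrt(p * (1 - p) / n)             # standard error of the sample proportion, ~0.016
print(round(se, 4), round(1.96 * se, 4))  # margin of error, roughly 0.03 (3 points)
```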
The CLT bridges the unknown and the known. We don't know the true population distribution. But we do know that sample means follow a normal distribution. This transforms the unknowable (arbitrary distributions) into the well-understood (the normal distribution).
Common misconceptions
"CLT says all data is normally distributed." No - CLT says sample means are normally distributed. Individual data points keep their original distribution.
"I need a huge sample for CLT to work." Not really - n = 30 is usually sufficient. You don't need thousands.
"CLT means I can ignore the original distribution." Wrong - you should still understand your data's shape. CLT applies to means, but medians, maximums, and other statistics don't necessarily follow normal distributions.
"CLT works for any sample size." Not quite - CLT is an asymptotic result (n → ∞). For small n, the approximation may be poor, especially for skewed distributions.
What we derived
From first principles, we established the CLT phenomenon - that adding or averaging random things produces a normal distribution. We gave the formal statement: (X̄ - μ)/(σ/√n) →ᵈ N(0,1) as n → ∞. We explained why it happens: convolution smooths distributions, and characteristic functions converge. We showed the Berry-Esseen bound, where error is proportional to 1/√n and depends on skewness. And we explained the n ≥ 30 rule as a practical threshold that works for most distributions.
The CLT explains why normal distributions appear everywhere, why polling works, and why we can make confident statements from limited samples.
In Part 5, we'll put everything together to understand hypothesis testing - and the dangers of p-hacking and misinterpreting statistical significance.