
Part 4: The Central Limit Theorem

In Part 3, we derived the normal distribution and the 68-95-99.7 rule. But we left a mystery unsolved: why does the normal distribution appear everywhere, even in data that isn't normal?

The answer is the Central Limit Theorem. Let's see why it works.


The phenomenon

Before we prove anything, let's observe what happens when we repeatedly add random things together.

Dice rolling experiment

A single die gives 6 equally likely outcomes (1-6). This is a uniform distribution - completely non-normal, completely flat.

But what happens when we add dice together?

Why Sums Become Normal: The Convolution Effect

(Interactive demo: choose how many dice to sum. It shows the distribution of the sum along with its expected mean, n × 3.5, its standard deviation, √(n × 35/12), and its shape. For a single die: mean 3.5, standard deviation ≈ 1.71, shape flat.)

What's Happening Mathematically

n = 1: Uniform Distribution

A single die has 6 equally likely outcomes (1-6). This is a discrete uniform distribution - nothing remotely normal.

Try It Out

Experiment 1: Start with 1 die. The distribution is flat - each outcome equally likely.

Experiment 2: Select 2 dice. Notice the triangular shape - 7 is most common because it has the most combinations (1+6, 2+5, 3+4, 4+3, 5+2, 6+1).

Experiment 3: Go to 6 or 10 dice. The dashed red line shows the theoretical normal curve. The histogram matches almost perfectly.

Uniform + Uniform + Uniform + ... = Normal. Always.

This isn't special to dice. It happens with essentially any distribution (anything with finite variance, as we'll formalize below). Add uniform random numbers, you get normal. Add exponential random numbers, normal. Add bimodal random numbers, still normal. Add almost anything, normal.
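
If you want to check this without the interactive demo, here is a minimal simulation sketch (it assumes NumPy is available; the distributions, sample sizes, and trial count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000

def skewness(x):
    # Third standardized moment: 0 for a symmetric distribution.
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

def excess_kurtosis(x):
    # Fourth standardized moment minus 3: 0 for a normal distribution.
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

for n in (1, 2, 6, 30):
    dice = rng.integers(1, 7, size=(trials, n)).sum(axis=1)    # flat (uniform) source
    expo = rng.exponential(1.0, size=(trials, n)).sum(axis=1)  # right-skewed source
    print(f"n={n:3d}  dice excess kurtosis={excess_kurtosis(dice):+.3f}  "
          f"exponential skewness={skewness(expo):+.3f}")
```

Both columns drift toward 0, the normal distribution's values for skewness and excess kurtosis, as n grows.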

Key Insight

When you average many independent random things together, the result is approximately normally distributed - regardless of what the original things looked like.


The mathematical statement

Now let's formalize this observation into a theorem.

Setup and assumptions

Conditions for CLT:

Let X₁, X₂, ..., Xₙ be random variables that are:

  1. Independent: Knowing one value tells you nothing about others
  2. Identically distributed: All drawn from the same distribution
  3. Finite variance: The source distribution has variance σ² < ∞

Then the sample mean X̄ₙ = (X₁ + ... + Xₙ)/n follows a specific pattern as n → ∞.

The theorem statement

The Central Limit Theorem: Formal Statement

(Interactive walkthrough in 7 steps.)

Step 1: The Setup

Let X₁, X₂, ..., Xₙ be independent, identically distributed random variables with mean μ and variance σ²:

E[Xᵢ] = μ, Var(Xᵢ) = σ²
Try It Out

Walk through the derivation: Use Previous/Next to step through the formal statement. Each step builds on the previous.

Key steps to notice:

  • Step 3: Expected value of sample mean equals population mean
  • Step 4: Variance shrinks as 1/n
  • Step 6: The remarkable conclusion - convergence to normal

The click: Step 6 is the surprising one. Steps 1-5 are just algebra. Step 6 is deep mathematics.

The Central Limit Theorem (Formal)

Let X₁, X₂, ..., Xₙ be i.i.d. random variables with mean μ and finite variance σ².

Define the standardized sample mean:

Zₙ = (X̄ₙ - μ) / (σ/√n)

Then as n → ∞:

Zₙ →ᵈ N(0, 1)

In English: The standardized sample mean converges in distribution to the standard normal.
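
Here is a small empirical check of that statement, comparing the distribution of Zₙ to the standard normal CDF Φ. This is a sketch under assumptions of mine: a standard exponential source (mean 1, variance 1), n = 30, and NumPy available.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 1.0                 # exponential(1): mean 1, variance 1
n, trials = 30, 200_000

samples = rng.exponential(1.0, size=(trials, n))
z_n = (samples.mean(axis=1) - mu) / (sigma / math.sqrt(n))   # standardized sample mean

def Phi(t):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2)))

for t in (-1.0, 0.0, 1.0, 1.96):
    print(f"P(Zn <= {t:+.2f})  empirical: {np.mean(z_n <= t):.4f}   N(0,1): {Phi(t):.4f}")
```

Even at n = 30, the empirical probabilities come out close to the normal values, though not exact, since the source is quite skewed.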


Why does this happen?

The CLT isn't magic - there's a mathematical reason why adding things creates normality. Let's understand why.

Convolution

When you add two independent random variables, their probability distributions convolve. Mathematically:

If Z = X + Y, then f_Z(z) = ∫ f_X(x) × f_Y(z-x) dx

This convolution has a smoothing effect. Each convolution fills in gaps, reduces peaks, and creates a more symmetric, bell-shaped result.

The dice example in detail

Consider adding two dice:

Sum = 2: Only 1+1 = 1 way
Sum = 3: 1+2, 2+1 = 2 ways
Sum = 4: 1+3, 2+2, 3+1 = 3 ways
...
Sum = 7: 1+6, 2+5, 3+4, 4+3, 5+2, 6+1 = 6 ways
...
Sum = 12: Only 6+6 = 1 way

The middle values have more combinations than the extremes. This is the convolution effect creating a triangular shape.

With more dice, this effect compounds. The middle gets more and more combinations relative to the extremes, creating the characteristic bell shape.
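
You can carry out this repeated convolution directly. Here is a sketch using NumPy's convolve (the tool choice is mine, not the post's):

```python
import numpy as np

die = np.ones(6) / 6.0                 # PMF of one fair die on {1, ..., 6}
pmf = die.copy()

for k in range(2, 6):                  # build the PMF of the sum of k dice
    pmf = np.convolve(pmf, die)        # each convolution smooths the shape further
    sums = np.arange(k, 6 * k + 1)     # possible sums run from k to 6k
    peak = sums[np.argmax(pmf)]
    print(f"{k} dice: most likely sum = {peak}, probability = {pmf.max():.4f}")
```

Each pass through np.convolve is the discrete form of the integral above, and the shrinking peak probabilities show the mass spreading into a bell shape.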

Characteristic functions (the deep reason)

The formal proof uses characteristic functions. The characteristic function of a sum equals the product of characteristic functions:

φ_{X+Y}(t) = φ_X(t) × φ_Y(t)

For the standardized sum, as n → ∞, the characteristic function converges to:

φ(t) = exp(-t²/2)

This is the characteristic function of the standard normal distribution. QED.

Key Insight

Adding random variables multiplies their characteristic functions. When you multiply many things together and take a limit, the result depends only on the first few terms of their Taylor expansions. All distributions with finite variance have the same first two terms - so they all converge to the same limit: the normal distribution.
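
In symbols, the sketch behind that insight looks like this. Write Yᵢ = (Xᵢ - μ)/σ for the standardized variables, so that Zₙ is their sum divided by √n; the middle step is the second-order Taylor expansion of φ_Y, using E[Y] = 0 and E[Y²] = 1:

```latex
\varphi_{Z_n}(t)
  \;=\; \Bigl[\varphi_Y\!\bigl(t/\sqrt{n}\bigr)\Bigr]^{n}
  \;=\; \Bigl[\,1 - \frac{t^2}{2n} + o\!\bigl(\tfrac{1}{n}\bigr)\Bigr]^{n}
  \;\xrightarrow{\;n\to\infty\;}\; e^{-t^2/2}.
```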


Seeing CLT in action

Let's verify the CLT empirically with different source distributions.

CLT in Action: Any Distribution → Normal Sample Means

(Interactive demo: pick a source distribution, such as a right-skewed exponential, bimodal, or discrete one, and a sample size of n = 5, 30, or 100. It plots the distribution of 500 sample means, each computed from n observations, and reports the mean of the sample means and the standard error SE = σ/√n.)

CLT in effect! With n ≥ 30, the sample means form a nearly normal distribution, even though the source (a right-skewed exponential) is far from normal.

Try It Out

Experiment 1: Start with Exponential (heavily right-skewed). With n=5, the sample means are still skewed. Increase to n=30 - normality emerges.

Experiment 2: Try Bimodal (two peaks). Even this extreme shape produces normal sample means by n=30.

Experiment 3: Use Discrete (dice). Sample means become continuous and normal despite starting with just 6 discrete values.

The insight: The source distribution's shape doesn't matter. Only n matters.
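
A minimal version of this experiment in code (assuming NumPy; 500 sample means of n = 30 exponential observations, mirroring the demo above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_means = 30, 500
mu, sigma = 1.0, 1.0     # exponential(1): mean 1, standard deviation 1

# 500 sample means, each computed from n = 30 right-skewed observations.
sample_means = rng.exponential(1.0, size=(n_means, n)).mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f}   (population mean {mu})")
print(f"SD of sample means:   {sample_means.std():.3f}   (SE = sigma/sqrt(n) = {sigma / np.sqrt(n):.3f})")
```

The sample means cluster around μ with spread close to σ/√n ≈ 0.183, and a histogram of them looks roughly normal despite the skewed source.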


The n ≥ 30 rule

You've probably heard "sample size should be at least 30." But why 30? Is it magic?

The Berry-Esseen bound

The Berry-Esseen theorem quantifies how fast the CLT converges:

|P(Zₙ ≤ z) - Φ(z)| ≤ C × ρ / (σ³ × √n)

Where:

  • ρ = E[|X - μ|³] (related to skewness)
  • C ≈ 0.4748 (a constant)
  • Φ(z) is the standard normal CDF

Key insight: The error bound shrinks as 1/√n. To halve the error, quadruple n.
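
A quick sketch of that scaling; ρ = 2 here is just a hypothetical value chosen for illustration, not taken from any particular distribution:

```python
import math

C     = 0.4748   # Berry-Esseen constant (an upper bound from the literature)
rho   = 2.0      # E[|X - mu|^3], a hypothetical value for illustration
sigma = 1.0

for n in (30, 120, 480):
    bound = C * rho / (sigma**3 * math.sqrt(n))
    print(f"n = {n:4d}   worst-case CDF error <= {bound:.3f}")
# Quadrupling n halves the bound. Note that this is a conservative worst-case
# bound; the actual approximation error is usually far smaller.
```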

Why n = 30 works (usually)

For most "reasonable" distributions, by n = 30 the approximation error is typically less than 5%, and the normal approximation is accurate enough for practical purposes. But this threshold depends on skewness:

The n ≥ 30 Rule: When Does CLT Kick In?
  • Symmetric (roughly normal): n ≥ 15. Already close to normal; little work for the CLT to do.
  • Moderately skewed: n ≥ 30. The conventional threshold; works for most distributions.
  • Heavily skewed (exponential, Pareto): n ≥ 50+. Long tails take more averaging to tame.

Why These Thresholds?

The Berry-Esseen Theorem

The rate of convergence to normality depends on the third moment (skewness) of the source distribution:

|P(Zₙ ≤ z) - Φ(z)| ≤ C × ρ / (σ³ × √n)

where ρ = E[|X - μ|³]. Higher skewness (larger ρ) → slower convergence.

Practical Rule of Thumb

n = 30 works because for most "reasonable" distributions, the approximation error drops below 5% by this point. For very skewed distributions, you need larger n to achieve the same accuracy.

The Key Insight

n = 30 isn't magic - it's a practical threshold that works for most common distributions. The more your source distribution deviates from normal (especially with heavy skew or outliers), the larger n needs to be. When in doubt, simulate!

Try It Out

Click each distribution type to see the recommended minimum n. Notice how more skewed distributions need larger samples.

The key takeaway: n = 30 is a rule of thumb, not a law of nature. For heavily skewed data, you may need n = 50 or more.

Warning

When is n = 30 not enough? Highly skewed distributions (exponential, log-normal), heavy-tailed distributions (a Pareto with a small tail index or a Cauchy has infinite variance, so the CLT doesn't apply to it at all), data with extreme outliers, and situations where you need high precision rather than a rough approximation. For these cases, either use a larger n or consider non-parametric methods.


Why CLT changes everything

The CLT is the foundation of modern statistics. Here's what it enables.

Confidence intervals

We can construct confidence intervals for ANY population, because sample means are always approximately normal:

95% CI: x̄ ± 1.96 × (σ/√n)

This works whether the population is normal, skewed, bimodal, or anything else (given sufficient n).
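
As a sketch (hypothetical data; the sample standard deviation stands in for the usually unknown σ, a standard substitution at larger n):

```python
import math
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(2.0, size=50)      # hypothetical skewed sample

n      = len(data)
x_bar  = data.mean()
s      = data.std(ddof=1)                 # sample SD as an estimate of sigma
margin = 1.96 * s / math.sqrt(n)

print(f"95% CI for the population mean: [{x_bar - margin:.3f}, {x_bar + margin:.3f}]")
```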

Hypothesis testing

We can calculate p-values by comparing observed sample means to the normal distribution:

z = (x̄ - μ₀) / (σ/√n)

If |z| > 1.96, the result is "statistically significant" at p < 0.05.
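
For example (the numbers are made up for illustration):

```python
import math

# Hypothetical test of H0: mu = 100, with known sigma = 12 and a sample of n = 64.
x_bar, mu0, sigma, n = 103.2, 100.0, 12.0, 64

z = (x_bar - mu0) / (sigma / math.sqrt(n))

def Phi(t):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2)))

p_two_sided = 2.0 * (1.0 - Phi(abs(z)))
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.3f}")   # z ≈ 2.13, p ≈ 0.033
```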

Standard errors

The standard error σ/√n tells us exactly how much sample means vary. This precision comes from the CLT.

Polling and surveys

A poll of 1,000 people can predict the behavior of 330 million because the CLT guarantees the sample proportion is approximately normal: the standard error is √(p(1-p)/n) ≈ 0.016 for n = 1,000, so the 95% CI is the observed proportion ± about 3.1 percentage points (1.96 × SE).
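
The arithmetic, with a hypothetical 52% observed in a poll of 1,000:

```python
import math

p_hat, n = 0.52, 1000                      # hypothetical poll result
se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error of a sample proportion
margin = 1.96 * se                         # half-width of the 95% confidence interval

print(f"SE = {se:.4f}, 95% CI = {p_hat:.0%} ± {margin:.1%}")   # SE ≈ 0.0158, ± ≈ 3.1%
```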

Key Insight

The CLT bridges the unknown and the known. We don't know the true population distribution. But we do know that sample means follow a normal distribution. This transforms the unknowable (arbitrary distributions) into the well-understood (the normal distribution).


Common misconceptions

Warning

"CLT says all data is normally distributed." No - CLT says sample means are normally distributed. Individual data points keep their original distribution.

"I need a huge sample for CLT to work." Not really - n = 30 is usually sufficient. You don't need thousands.

"CLT means I can ignore the original distribution." Wrong - you should still understand your data's shape. CLT applies to means, but medians, maximums, and other statistics don't necessarily follow normal distributions.

"CLT works for any sample size." Not quite - CLT is an asymptotic result (n → ∞). For small n, the approximation may be poor, especially for skewed distributions.


What we derived

From first principles, we established the CLT phenomenon - that adding or averaging random things produces a normal distribution. We gave the formal statement: (X̄ - μ)/(σ/√n) →ᵈ N(0,1) as n → ∞. We explained why it happens: convolution smooths distributions, and characteristic functions converge. We showed the Berry-Esseen bound, where error is proportional to 1/√n and depends on skewness. And we explained the n ≥ 30 rule as a practical threshold that works for most distributions.

The CLT explains why normal distributions appear everywhere, why polling works, and why we can make confident statements from limited samples.

In Part 5, we'll put everything together to understand hypothesis testing - and the dangers of p-hacking and misinterpreting statistical significance.