Part 1: The Sampling Problem

Your A/B test shows a 12% lift in conversions. The data scientist is confident. The PM is ready to ship.

But should you?

That 12% could be signal, a real improvement your team created. Or it could be noise dressed up in a suit: random variation that happened to look impressive this time.

Telling these two scenarios apart is what statistics is for. But unlike most introductions that hand you formulas to memorize, we're going to derive them from scratch. By the end of this post, you won't just know that sample size matters - you'll understand exactly why, down to the mathematics.


What is a sample?

Before we can talk about "sample size" or "margin of error," we need precise definitions. These aren't vocabulary for vocabulary's sake - everything else builds on them.

The population (N) is the complete set of things you care about: all 100,000 customers, all 150 million voters, all widgets ever produced. The sample (n) is the subset you actually measure - the 200 customers you surveyed, the 1,200 voters polled, the 50 widgets inspected.

A parameter (μ) is a fixed but unknown property of the population - the true average customer satisfaction, the actual percentage who prefer candidate A. A statistic (x̄) is the corresponding property calculated from your sample - the average satisfaction in your survey, the percentage in your poll.

Here's what makes statistics necessary:

The Problem

μ is fixed but unknown. x̄ is known but random.

We can calculate x̄ precisely from our data. But x̄ changes every time we draw a new sample. The parameter μ never changes - we just can't see it directly.

[Interactive: Population vs Sample. Draw samples from a population of N = 500 (true μ = 50.1) and watch how the sample means vary. Each dot is one population member; blue dots are in the current sample.]

Key Insight

The population parameter μ = 50.1 is fixed (we'll never know it in real life). The sample statistic x̄ is random - it changes with each sample. Statistics is the art of using the random x̄ to estimate the fixed μ.

Try It Out

Experiment 1: Click "Draw Sample" several times with n=30. Watch how different population members get selected each time (blue dots). The sample mean x̄ changes with each draw - but the true μ (shown in green) never moves.

Experiment 2: Draw 10 samples and watch the histogram of sample means build up. Notice how x̄ values cluster around μ, but rarely equal it exactly.

Experiment 3: Try both small (n=10) and large (n=100) samples. With larger samples, x̄ values cluster more tightly around μ.
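
If you'd rather poke at this in code than in the demo, here's a minimal sketch of the same experiments. The population values, the seed, and the sample sizes are all assumptions chosen to mirror the demo, not anything canonical:

```python
# A sketch of Experiments 1-3: the population (and μ) stays fixed,
# while x̄ changes with every sample and clusters more tightly as n grows.
import numpy as np

rng = np.random.default_rng(0)                        # assumed seed
population = rng.normal(loc=50, scale=15, size=500)   # hypothetical population
mu = population.mean()                                # the fixed parameter μ

for n in (10, 100):
    # Draw 10 samples of size n and compute each sample mean x̄
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(10)]
    print(f"n={n:3d}  true μ={mu:.2f}  sample means:", np.round(means, 2))
```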

The fundamental question

We're using x̄ (which we know) to estimate μ (which we don't). The question isn't whether x̄ will equal μ - it almost never will. The question is: how close is x̄ likely to be?

This leads to two properties we need from any good estimator. First, unbiasedness: on average, does x̄ hit the target? Second, low variance: how much does x̄ jump around between samples?

Let's derive both mathematically.


Why sample means vary

Sample means are unbiased

First, we'll prove that x̄ is unbiased - meaning on average, it equals μ. This isn't obvious! Why should averaging random samples give you the truth?

The Problem

Theorem: E[x̄] = μ

The expected value of the sample mean equals the population mean.

Proof:

Let x₁, x₂, ..., xₙ be independent draws from a population with mean μ.

The sample mean is defined as:

x̄ = (x₁ + x₂ + ... + xₙ) / n = (1/n) × Σxᵢ

Taking the expected value of both sides:

E[x̄] = E[(1/n) × Σxᵢ]

Constants factor out of expectations:

     = (1/n) × E[Σxᵢ]

Expectation of a sum equals sum of expectations:

     = (1/n) × Σ E[xᵢ]

Each xᵢ is drawn from the same population, so E[xᵢ] = μ:

     = (1/n) × Σμ = (1/n) × nμ = μ  ∎
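
The proof is exact, but it's also easy to check numerically. Here's a small sketch, assuming the same μ = 50, σ = 15, n = 30 used in the demo below: the average of many simulated x̄'s lands essentially on μ.

```python
# Empirical check of E[x̄] = μ: average many sample means and compare to μ.
import numpy as np

rng = np.random.default_rng(1)          # assumed seed
mu, sigma, n = 50, 15, 30               # population values from the demo

# 5,000 samples of size n, one x̄ per row
sample_means = rng.normal(mu, sigma, size=(5_000, n)).mean(axis=1)
print("mean of the x̄'s:", round(sample_means.mean(), 3))   # ≈ 50 = μ
```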

[Interactive: Proving E[x̄] = μ. Draw repeated samples (n = 30 per draw) from a population with μ = 50 and σ = 15, and watch the running mean of the x̄'s converge to the true μ.]

Try It Out

Watch convergence happen: Click Start and let samples accumulate. The running mean (orange) wobbles initially but steadily approaches μ (green). After a couple hundred samples, it typically sits within a few tenths of μ.

The math box shows the proof - each step in the derivation corresponds to what you're seeing visually. The key insight: averaging n copies of μ still gives μ.

Key Insight

Unbiasedness means x̄ doesn't systematically miss the target. Over many samples, you'll sometimes overshoot μ and sometimes undershoot, but these errors balance out. This is good news - but it doesn't tell us how far individual samples might be from μ.

The standard error derivation

Now for the centerpiece: where does σ/√n come from?

You've probably seen the formula SE = σ/√n. But why √n? Why not n, or n², or something else entirely?

The answer comes from the properties of variance under addition and scaling.

The Problem

Theorem: SE(x̄) = σ/√n

The standard error of the sample mean equals the population standard deviation divided by the square root of sample size.

Full Derivation in 6 Steps:

Step 1: Start with the definition of x̄

Var(x̄) = Var((x₁ + x₂ + ... + xₙ) / n)

Step 2: Factor out the constant

When you multiply a random variable by a constant, the variance is multiplied by the constant squared:

Var(aX) = a² × Var(X)

So dividing by n means:

       = (1/n²) × Var(x₁ + x₂ + ... + xₙ)

Step 3: Use independence

For independent random variables, variance of a sum equals sum of variances:

Var(X + Y) = Var(X) + Var(Y)   (if independent)

Therefore:

       = (1/n²) × [Var(x₁) + Var(x₂) + ... + Var(xₙ)]

Step 4: Same distribution

Each xᵢ comes from the same population with variance σ²:

       = (1/n²) × [σ² + σ² + ... + σ²] = (1/n²) × nσ²

Step 5: Simplify

       = nσ² / n² = σ² / n

Step 6: Take square root

Standard Error is the standard deviation of x̄:

SE = √Var(x̄) = √(σ²/n) = σ/√n
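
To see the result rather than just prove it, here's a short simulation sketch (σ = 15 and the seed are assumed values matching the demo below): for each n, the standard deviation of the simulated x̄'s should track σ/√n.

```python
# Compare the observed spread of x̄ across many samples with σ/√n.
import numpy as np

rng = np.random.default_rng(2)          # assumed seed
mu, sigma = 50, 15                      # assumed population parameters

for n in (4, 16, 64, 256):
    sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    observed_se = sample_means.std(ddof=1)     # empirical SD of x̄
    theoretical_se = sigma / np.sqrt(n)        # σ/√n
    print(f"n={n:3d}  observed SE={observed_se:.3f}  σ/√n={theoretical_se:.3f}")
```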

[Interactive: The Standard Error Derivation. Step through the proof and compare the theoretical SE = σ/√n (with σ = 15) to the observed spread of sample means for n = 4, 16, 64, and 256.]

Try It Out

Step through the derivation: Use the Previous/Next buttons to walk through each mathematical step. On the right, watch how the histogram of sample means changes as you select different sample sizes.

See the √n in action: Click n=4, n=16, n=64, n=256. Notice:

  • n=4 → SE = 15/2 = 7.5
  • n=16 → SE = 15/4 = 3.75
  • n=64 → SE = 15/8 = 1.875

Each 4× increase in n cuts SE in half. That's the √n rule!

Key Insight

The √n appears because of two competing effects:

  1. Adding n samples multiplies variance by n
  2. Dividing by n (to get the mean) divides variance by n²

Net effect: n/n² = 1/n. Square root gives 1/√n.

This is why quadrupling your sample size only halves your error. The mathematical structure of averaging creates diminishing returns.


The diminishing returns of sample size

Now we can prove the claim that often gets stated without explanation: to cut your error in half, you must quadruple your sample size.

Proof:

We want SE₂ = SE₁/2. Using SE = σ/√n:

σ/√n₂ = (1/2) × σ/√n₁

Solving for n₂:

√n₂ = 2√n₁
n₂ = 4n₁

To halve error, quadruple sample size. QED.
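
The same algebra, run forward: given a target standard error, solve SE = σ/√n for n. A tiny sketch (σ = 15 is just an example value):

```python
# Smallest n that achieves a target standard error, from n >= (σ / target)².
import math

sigma = 15  # assumed population standard deviation

def required_n(target_se: float) -> int:
    return math.ceil((sigma / target_se) ** 2)

print(required_n(1.50))   # 100
print(required_n(0.75))   # 400 -> halving the error quadruples n
```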

The economic reality

This mathematical result has real-world consequences:

Sample Size    SE        Relative Precision    Cost (at $10/sample)
100            1.50      1×                    $1,000
400            0.75      2×                    $4,000
1,600          0.375     4×                    $16,000
6,400          0.1875    8×                    $64,000

[Interactive: The Cost of Precision. Adjust the cost per sample and your budget to see how precision (which grows with √n) compares to cost (which grows linearly with n).]

The Diminishing Returns Rule
To halve your error, you must quadruple your sample size.

This isn't inefficiency - it's the mathematics of SE = σ/√n.
Pollsters stop at n ≈ 1,000 because going to n = 10,000 would cost 10× more but only improve precision by √10 ≈ 3.16×.

Try It Out

Adjust cost and budget: Move the sliders to see how sample size and precision change. Notice that the precision curve (green) rises slowly while the cost curve (red) rises linearly. At some point, additional precision isn't worth the cost.

Find the sweet spot: For most surveys, the optimal sample size falls between 400 and 2,500. Beyond that, each further halving of the error costs four times as much, for only marginally better precision.

This explains why pollsters survey about 1,000 people rather than 100,000, why A/B tests need thousands of users (not millions) to detect meaningful effects, and why clinical trials are expensive but not infinitely large.

Key Insight

Precision grows with √n, but cost grows with n. This fundamental mismatch means there's always a point where getting more precise isn't worth it. Smart statistics is knowing where that point is.
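
A quick sketch of that mismatch, assuming σ = 15 and the hypothetical $10-per-sample cost from the table above: each row halves the SE while the cost quadruples.

```python
# Precision grows with √n; cost grows with n.
sigma, cost_per_sample = 15, 10   # assumed values

for n in (100, 400, 1_600, 6_400):
    se = sigma / n ** 0.5
    print(f"n={n:5d}  SE={se:6.4f}  cost=${n * cost_per_sample:,}")
```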


Confidence intervals

We've established that x̄ varies around μ with standard deviation σ/√n. But how do we communicate this uncertainty? That's what confidence intervals are for.

Building a 95% CI

From the Normal distribution (which sample means approximately follow thanks to the Central Limit Theorem), we know that 95% of values fall within 1.96 standard deviations of the mean.

Therefore, 95% of sample means fall within:

μ ± 1.96 × SE = μ ± 1.96 × σ/√n

Inverting this, we construct the 95% confidence interval:

x̄ ± 1.96 × σ/√n
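
In code, the interval is one line once you have x̄ and SE. A minimal sketch, assuming σ is known (15 here) and a simulated sample of n = 100:

```python
# Build a 95% confidence interval x̄ ± 1.96·σ/√n for one simulated sample.
import numpy as np

rng = np.random.default_rng(3)          # assumed seed
mu, sigma, n = 50, 15, 100              # assumed population and sample size

sample = rng.normal(mu, sigma, size=n)
x_bar = sample.mean()
se = sigma / np.sqrt(n)

low, high = x_bar - 1.96 * se, x_bar + 1.96 * se
print(f"x̄ = {x_bar:.2f},  95% CI = ({low:.2f}, {high:.2f})")
```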

[Interactive: What Does 95% Confidence Mean? Generate batches of confidence intervals (μ = 50, σ = 15, n = 100, SE = 1.50) and watch roughly 95% of them capture the true μ.]

Try It Out

Generate many intervals: Click "+50 Intervals" repeatedly. Watch as approximately 95% of intervals (green) contain the true μ, while about 5% (red) miss it.

Count the misses: After 100 intervals, you should see roughly 5 red lines. Each red interval represents a "bad sample" - one where x̄ happened to be unusually far from μ.

Warning

What "95% confident" does NOT mean:

"There's a 95% probability that μ is in this interval."

What it DOES mean:

"If I repeat this procedure many times, about 95% of the intervals I create will contain μ."

Each individual interval either contains μ (probability = 1) or doesn't (probability = 0). We just don't know which! The 95% refers to the procedure, not any single interval.
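
The coverage claim is easy to simulate. Here's a sketch of the same experiment as the demo (μ = 50, σ = 15, n = 100, and the seed are assumed values): build many intervals and count how often they contain μ.

```python
# Coverage check: what fraction of 95% intervals actually contain μ?
import numpy as np

rng = np.random.default_rng(4)              # assumed seed
mu, sigma, n, trials = 50, 15, 100, 10_000  # assumed parameters
se = sigma / np.sqrt(n)

x_bars = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
covered = (x_bars - 1.96 * se <= mu) & (mu <= x_bars + 1.96 * se)
print("coverage rate:", covered.mean())     # ≈ 0.95
```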


When math can't save you

Everything we've derived assumes one thing: your sample is random.

Warning

All the beautiful mathematics above - E[x̄] = μ, SE = σ/√n, confidence intervals - only works if each population member has an equal chance of being sampled.

Bias breaks everything.

Types of bias

Selection bias is when you survey visitors who stayed on your website, but you're trying to understand all potential customers - including those who bounced before the survey loaded.

Survivorship bias is what caught WWII military planners. They studied bullet holes on returning planes to decide where to add armor. Abraham Wald pointed out: the planes that got hit in other places never returned. They were only seeing the survivors.

Non-response bias happens when you email 10,000 customers a survey and 500 respond. Those 500 are probably the most satisfied (who want to praise you) or most frustrated (who want to complain). The silent 9,500 are likely in the middle.

Bias vs. variance

                   Random Sample       Biased Sample
E[x̄]               μ                   Something else
Larger n helps?    Yes (reduces SE)    No (just gives more precision on the wrong answer)
Key Insight

Sample size only addresses random variation. It does nothing for systematic bias.

A biased sample of 10,000 tells you less about the population than a truly random sample of 100. The formula SE = σ/√n is useless if your sample isn't representative.
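
To make that concrete, here's a sketch with an invented selection rule (only population members above 45 ever respond): the biased sample's mean stays wrong no matter how large n gets, while a small random sample lands near μ.

```python
# Bias vs. variance: a huge biased sample still misses μ; a small random one doesn't.
import numpy as np

rng = np.random.default_rng(5)                       # assumed seed
population = rng.normal(50, 15, size=100_000)        # hypothetical population
mu = population.mean()

random_sample = rng.choice(population, size=100, replace=False)
biased_pool = population[population > 45]            # invented selection rule
biased_sample = rng.choice(biased_pool, size=10_000, replace=False)

print(f"true μ                 -> {mu:.2f}")
print(f"random sample, n=100   -> {random_sample.mean():.2f}")   # close to μ
print(f"biased sample, n=10000 -> {biased_sample.mean():.2f}")   # systematically high
```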


What we derived

Starting from basic probability theory, we proved that E[x̄] = μ (sample means are unbiased), that Var(x̄) = σ²/n (variance shrinks with sample size), and that SE = σ/√n (the famous square-root law). We showed that to halve error you must quadruple n - diminishing returns are mathematically guaranteed. And we derived the 95% confidence interval: x̄ ± 1.96×SE.

These aren't arbitrary formulas. They follow from the linearity of expectation, the independence of samples, and the properties of variance under scaling.

That's the foundation. But once you have your sample, you need to summarize it. And the way you summarize it matters more than you'd think.

In Part 2, we'll look at mean vs. median, and why the choice between them can completely change the story your data tells.