Part 1: The Sampling Problem
Your A/B test shows a 12% lift in conversions. The data scientist is confident. The PM is ready to ship.
But should you?
That 12% could be signal - a real improvement your team created. Or it could be noise dressed up in a suit: random variation that happened to look impressive this time.
The difference between these scenarios is statistics. But unlike most introductions that hand you formulas to memorize, we're going to derive them from scratch. By the end of this post, you won't just know that sample size matters - you'll understand exactly why, down to the mathematics.
What is a sample?
Before we can talk about "sample size" or "margin of error," we need precise definitions. This isn't vocabulary for vocabulary's sake - everything else builds on these definitions.
The population (N) is the complete set of things you care about: all 100,000 customers, all 150 million voters, all widgets ever produced. The sample (n) is the subset you actually measure - the 200 customers you surveyed, the 1,200 voters polled, the 50 widgets inspected.
A parameter (μ) is a fixed but unknown property of the population - the true average customer satisfaction, the actual percentage who prefer candidate A. A statistic (x̄) is the corresponding property calculated from your sample - the average satisfaction in your survey, the percentage in your poll.
Here's what makes statistics necessary:
μ is fixed but unknown. x̄ is known but random.
We can calculate x̄ precisely from our data. But x̄ changes every time we draw a new sample. The parameter μ never changes - we just can't see it directly.
Experiment 1: Click "Draw Sample" several times with n=30. Watch how different population members get selected each time (blue dots). The sample mean x̄ changes with each draw - but the true μ (shown in green) never moves.
Experiment 2: Draw 10 samples and watch the histogram of sample means build up. Notice how x̄ values cluster around μ, but rarely equal it exactly.
Experiment 3: Try both small (n=10) and large (n=100) samples. With larger samples, x̄ values cluster more tightly around μ.
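If you'd rather reproduce these experiments in code than click through the widget, here's a minimal sketch using NumPy. The population, its size, and its parameters are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A made-up population: 100,000 "customers" with a true mean we normally can't see.
population = rng.normal(loc=50, scale=15, size=100_000)
mu = population.mean()  # the parameter: fixed, but unknown in practice

for n in (10, 30, 100):
    # Draw several samples of size n and compute x-bar for each.
    sample_means = [rng.choice(population, size=n, replace=False).mean()
                    for _ in range(10)]
    spread = max(sample_means) - min(sample_means)
    print(f"n={n:>3}  x-bar values span {spread:.2f} units around mu={mu:.2f}")
```

With n=10 the sample means scatter widely; with n=100 they cluster much more tightly around μ, just as in Experiment 3.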
The fundamental question
We're using x̄ (which we know) to estimate μ (which we don't). The question isn't whether x̄ will equal μ - it almost never will. The question is: how close is x̄ likely to be?
This leads to two properties we need from any good estimator. First, unbiasedness: on average, does x̄ hit the target? Second, low variance: how much does x̄ jump around between samples?
Let's derive both mathematically.
Why sample means vary
Sample means are unbiased
First, we'll prove that x̄ is unbiased - meaning on average, it equals μ. This isn't obvious! Why should averaging random samples give you the truth?
Theorem: E[x̄] = μ
The expected value of the sample mean equals the population mean.
Proof:
Let x₁, x₂, ..., xₙ be independent draws from a population with mean μ.
The sample mean is defined as:
x̄ = (x₁ + x₂ + ... + xₙ) / n = (1/n) × Σxᵢ
Taking the expected value of both sides:
E[x̄] = E[(1/n) × Σxᵢ]
Constants factor out of expectations:
= (1/n) × E[Σxᵢ]
Expectation of a sum equals sum of expectations:
= (1/n) × Σ E[xᵢ]
Each xᵢ is drawn from the same population, so E[xᵢ] = μ:
= (1/n) × Σμ = (1/n) × nμ = μ ∎
Watch convergence happen: Click Start and let samples accumulate. The running mean (orange) wobbles initially but steadily approaches μ (green). After 200+ samples, the running mean typically stays within 0.1 of μ.
The math box shows the proof - each step in the derivation corresponds to what you're seeing visually. The key insight: averaging n copies of μ still gives μ.
Unbiasedness means x̄ doesn't systematically miss the target. Over many samples, you'll sometimes overshoot μ and sometimes undershoot, but these errors balance out. This is good news - but it doesn't tell us how far individual samples might be from μ.
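Here's a quick way to watch the same convergence numerically - a running mean over accumulating samples, with assumed values for μ and σ standing in for the widget's population:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 100.0, 15.0   # assumed population parameters for this demo

draws = rng.normal(mu, sigma, size=1000)              # 1,000 independent samples
running_mean = np.cumsum(draws) / np.arange(1, 1001)  # mean after each new sample

for k in (10, 50, 200, 1000):
    print(f"after {k:>4} samples: running mean = {running_mean[k-1]:.3f}  (mu = {mu})")
# The running mean wobbles early on, then settles near mu.
```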
The standard error derivation
Now for the centerpiece: where does σ/√n come from?
You've probably seen the formula SE = σ/√n. But why √n? Why not n, or n², or something else entirely?
The answer comes from the properties of variance under addition and scaling.
Theorem: SE(x̄) = σ/√n
The standard error of the sample mean equals the population standard deviation divided by the square root of sample size.
Full Derivation in 6 Steps:
Step 1: Start with the definition of x̄
Var(x̄) = Var((x₁ + x₂ + ... + xₙ) / n)
Step 2: Factor out the constant
When you multiply a random variable by a constant, the variance is multiplied by the constant squared:
Var(aX) = a² × Var(X)
So dividing by n means:
= (1/n²) × Var(x₁ + x₂ + ... + xₙ)
Step 3: Use independence
For independent random variables, variance of a sum equals sum of variances:
Var(X + Y) = Var(X) + Var(Y) (if independent)
Therefore:
= (1/n²) × [Var(x₁) + Var(x₂) + ... + Var(xₙ)]
Step 4: Same distribution
Each xᵢ comes from the same population with variance σ²:
= (1/n²) × [σ² + σ² + ... + σ²] = (1/n²) × nσ²
Step 5: Simplify
= nσ² / n² = σ² / n
Step 6: Take square root
Standard Error is the standard deviation of x̄:
SE = √Var(x̄) = √(σ²/n) = σ/√n
Step through the derivation: Use the Previous/Next buttons to walk through each mathematical step. On the right, watch how the histogram of sample means changes as you select different sample sizes.
See the √n in action: Click n=4, n=16, n=64, n=256. Notice:
- n=4 → SE = 15/2 = 7.5
- n=16 → SE = 15/4 = 3.75
- n=64 → SE = 15/8 = 1.875
Each 4× increase in n cuts SE in half. That's the √n rule!
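You can check the √n rule empirically. The sketch below assumes σ = 15 (the value implied by the SE numbers above) and compares the simulated spread of x̄ against σ/√n:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 100.0, 15.0   # assumed population parameters

for n in (4, 16, 64, 256):
    # Simulate 20,000 sample means, each from a sample of size n.
    sample_means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    empirical_se = sample_means.std(ddof=1)
    print(f"n={n:>3}  empirical SE = {empirical_se:.3f}   sigma/sqrt(n) = {sigma / np.sqrt(n):.3f}")
```

The empirical standard deviation of the sample means lands within a few percent of σ/√n at every n.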
The √n appears because of two competing effects:
- Adding n samples multiplies variance by n
- Dividing by n (to get the mean) divides variance by n²
Net effect: n/n² = 1/n. Square root gives 1/√n.
This is why quadrupling your sample size only halves your error. The mathematical structure of averaging creates diminishing returns.
The diminishing returns of sample size
Now we can prove the claim that often gets stated without explanation: to cut your error in half, you must quadruple your sample size.
Proof:
We want SE₂ = SE₁/2. Using SE = σ/√n:
σ/√n₂ = (1/2) × σ/√n₁
Solving for n₂:
√n₂ = 2√n₁
n₂ = 4n₁
To halve error, quadruple sample size. QED.
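Turned around, the same algebra gives a back-of-the-envelope sample-size calculator: pick a target SE and solve n = (σ/SE)². A minimal sketch, assuming you already have an estimate of σ:

```python
import math

def required_n(sigma: float, target_se: float) -> int:
    """Smallest n such that sigma / sqrt(n) <= target_se."""
    return math.ceil((sigma / target_se) ** 2)

sigma = 15.0                      # assumed population standard deviation
print(required_n(sigma, 1.5))     # 100
print(required_n(sigma, 0.75))    # 400 -> halving the error quadruples n
```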
The economic reality
This mathematical result has real-world consequences:
| Sample Size | SE (σ = 15) | Relative Precision | Cost (at $10/sample) |
|---|---|---|---|
| 100 | 1.50 | 1× | $1,000 |
| 400 | 0.75 | 2× | $4,000 |
| 1,600 | 0.375 | 4× | $16,000 |
| 6,400 | 0.1875 | 8× | $64,000 |
Adjust cost and budget: Move the sliders to see how sample size and precision change. Notice that the precision curve (green) rises slowly while the cost curve (red) rises linearly. At some point, additional precision isn't worth the cost.
Find the sweet spot: For most surveys, the optimal sample size is between 400 and 2,500. Beyond that, you're spending much more for only marginally better precision.
This explains why pollsters survey about 1,000 people rather than 100,000, why A/B tests need thousands of users (not millions) to detect meaningful effects, and why clinical trials are expensive but not infinitely large.
Precision grows with √n, but cost grows with n. This fundamental mismatch means there's always a point where getting more precise isn't worth it. Smart statistics is knowing where that point is.
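The table above is easy to regenerate: cost scales linearly with n while precision scales with √n. A sketch, treating the $10-per-sample price and σ = 15 from the table as assumptions:

```python
import math

sigma, cost_per_sample = 15.0, 10.0   # assumptions matching the table above

for n in (100, 400, 1_600, 6_400):
    se = sigma / math.sqrt(n)
    precision = (sigma / math.sqrt(100)) / se   # relative to the n=100 baseline
    print(f"n={n:>5}  SE={se:.4f}  precision={precision:.0f}x  cost=${n * cost_per_sample:,.0f}")
```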
Confidence intervals
We've established that x̄ varies around μ with standard deviation σ/√n. But how do we communicate this uncertainty? That's what confidence intervals are for.
Building a 95% CI
From the Normal distribution (which sample means approximately follow thanks to the Central Limit Theorem), we know that 95% of values fall within 1.96 standard deviations of the mean.
Therefore, 95% of sample means fall within:
μ ± 1.96 × SE = μ ± 1.96 × σ/√n
Inverting this, we construct the 95% confidence interval:
x̄ ± 1.96 × σ/√n
Generate many intervals: Click "+50 Intervals" repeatedly. Watch as approximately 95% of intervals (green) contain the true μ, while about 5% (red) miss it.
Count the misses: After 100 intervals, you should see roughly 5 red lines. Each red interval represents a "bad sample" - one where x̄ happened to be unusually far from μ.
What "95% confident" does NOT mean:
"There's a 95% probability that μ is in this interval."
What it DOES mean:
"If I repeat this procedure many times, about 95% of the intervals I create will contain μ."
Each individual interval either contains μ (probability = 1) or doesn't (probability = 0). We just don't know which! The 95% refers to the procedure, not any single interval.
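The "procedure, not interval" interpretation is easy to verify by simulation: build many intervals and count how often they cover μ. A sketch with assumed μ, σ, and n, using the known-σ interval from above:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n = 100.0, 15.0, 50     # assumed population parameters and sample size

trials = 10_000
covered = 0
for _ in range(trials):
    sample = rng.normal(mu, sigma, size=n)
    x_bar = sample.mean()
    half_width = 1.96 * sigma / np.sqrt(n)
    if x_bar - half_width <= mu <= x_bar + half_width:
        covered += 1

print(f"coverage: {covered / trials:.1%}")   # typically very close to 95%
```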
When math can't save you
Everything we've derived assumes one thing: your sample is random.
All the beautiful mathematics above - E[x̄] = μ, SE = σ/√n, confidence intervals - only work if each population member has an equal chance of being sampled.
Bias breaks everything.
Types of bias
Selection bias is when you survey visitors who stayed on your website, but you're trying to understand all potential customers - including those who bounced before the survey loaded.
Survivorship bias is what caught WWII military planners. They studied bullet holes on returning planes to decide where to add armor. Abraham Wald pointed out: the planes that got hit in other places never returned. They were only seeing the survivors.
Non-response bias happens when you email 10,000 customers a survey and 500 respond. Those 500 are probably the most satisfied (who want to praise you) or most frustrated (who want to complain). The silent 9,500 are likely in the middle.
Bias vs. variance
| | Random Sample | Biased Sample |
|---|---|---|
| E[x̄] | μ | Something else |
| Larger n helps? | Yes (reduces SE) | No (just gives more precision on the wrong answer) |
Sample size only addresses random variation. It does nothing for systematic bias.
A biased sample of 10,000 tells you less about the population than a truly random sample of 100. The formula SE = σ/√n is useless if your sample isn't representative.
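A simulation makes the point concrete. Below, a biased sampling rule that over-represents high values is compared with a small random sample; the population values and the bias mechanism are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(50, 15, size=100_000)
mu = population.mean()

# Biased sample of 10,000: only members above the 40th percentile ever respond.
respondents = population[population > np.percentile(population, 40)]
biased_mean = rng.choice(respondents, size=10_000, replace=False).mean()

# Truly random sample of 100.
random_mean = rng.choice(population, size=100, replace=False).mean()

print(f"true mu       = {mu:.2f}")
print(f"biased n=10k  = {biased_mean:.2f}   # precise, but wrong")
print(f"random n=100  = {random_mean:.2f}   # noisy, but centered on mu")
```

No amount of extra data from the biased rule pulls its estimate back toward μ; the small random sample, for all its noise, is at least aimed at the right target.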
What we derived
Starting from basic probability theory, we proved that E[x̄] = μ (sample means are unbiased), that Var(x̄) = σ²/n (variance shrinks with sample size), and that SE = σ/√n (the famous square-root law). We showed that to halve error you must quadruple n - diminishing returns are mathematically guaranteed. And we derived the 95% confidence interval: x̄ ± 1.96×SE.
These aren't arbitrary formulas. They follow from the linearity of expectation, the independence of samples, and the properties of variance under scaling.
That's the foundation. But once you have your sample, you need to summarize it. And the way you summarize it matters more than you'd think.
In Part 2, we'll look at mean vs. median, and why the choice between them can completely change the story your data tells.