Part 5: Hypothesis Testing & P-Hacking
In Part 4, we derived the Central Limit Theorem. Now let's put everything together to understand hypothesis testing - how it works, why it fails, and how to avoid being fooled.
Deriving hypothesis testing
Before we can critique p-values, we need to understand where they come from. Let's derive hypothesis testing from first principles.
The core problem
You observe data and want to answer: Is this a real effect, or just noise?
Example: Your A/B test shows 4.2% conversion (B) vs 3.8% (A). That's an 11% relative improvement. But random samples vary. Maybe this difference is just luck?
The logic of testing
Walk through the derivation: Use Previous/Next to step through the logic of hypothesis testing.
Key insight at Step 4: The p-value is calculated by asking "how extreme is our result?" under the assumption that there's no effect.
Common trap at Step 7: p < 0.05 does NOT mean "95% probability the effect is real." It means "data this extreme would occur less than 5% of the time if nothing were happening."
The Hypothesis Testing Recipe
- State null hypothesis H₀: "There is no effect"
- Calculate test statistic: z = (observed - expected) / SE
- Find p-value: P(|Z| ≥ |z|) under H₀
- If p < α (usually 0.05), reject H₀
What this means: "If there were no effect, results this extreme would happen less than 5% of the time. So probably something real is happening."
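To make the recipe concrete, here is a minimal sketch of that calculation applied to the A/B example above. The conversion rates come from the example, but the visitor counts (10,000 per variant) are an assumption added purely for illustration; the article does not specify them.

```python
from scipy.stats import norm  # standard normal CDF / survival function

# A/B test from the example: 3.8% (A) vs 4.2% (B) conversion.
# The visitor counts are NOT given in the text; 10,000 per variant
# is an assumed number for illustration only.
n_a, n_b = 10_000, 10_000
p_a, p_b = 0.038, 0.042

# Step 1: H0 says both variants share the same conversion rate.
# Step 2: test statistic z = (observed difference - 0) / SE,
# using the pooled rate to estimate the standard error under H0.
p_pool = (p_a * n_a + p_b * n_b) / (n_a + n_b)
se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
z = (p_b - p_a) / se

# Step 3: two-sided p-value, P(|Z| >= |z|) under H0.
p_value = 2 * norm.sf(abs(z))

# Step 4: compare against alpha = 0.05.
print(f"z = {z:.2f}, p = {p_value:.3f}")   # roughly z ≈ 1.44, p ≈ 0.15
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

With these assumed counts, the 11% relative lift does not clear the 0.05 threshold; it could plausibly be noise.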
What p-values actually mean
Let's be mathematically precise:
p-value = P(data at least this extreme | H₀ is true)
This is not P(H₀ is true | data), not P(effect is real), and not P(result is a fluke).
The p-value is about the data given H₀, not H₀ given the data. This confusion is called the "prosecutor's fallacy" and it's pervasive.
The p-value answers the wrong question. You want to know: "Is this effect real?" (P(H₁ | data)). The p-value tells you: "How surprising is this data if the effect isn't real?" (P(data | H₀)). These are not the same thing. Bayes' theorem relates them, but the relationship depends on prior probabilities that p-values ignore.
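To see how much the two questions can differ, here is a small sketch that converts P(data | H₀) into P(H₁ | significant result) via Bayes' theorem. The prior (10% of tested hypotheses are true) and the power (80%) are illustrative assumptions, not values from the text.

```python
# Bayes' theorem: P(H1 | significant) depends on the prior and the power,
# not just on alpha. All three inputs below are illustrative assumptions.
alpha = 0.05    # false positive rate under H0
power = 0.80    # P(significant | H1 true)
prior = 0.10    # assumed fraction of tested hypotheses that are actually true

# P(significant) = P(sig | H1) P(H1) + P(sig | H0) P(H0)
p_significant = power * prior + alpha * (1 - prior)

# Posterior probability the effect is real, given a significant result
posterior = power * prior / p_significant
print(f"P(H1 | p < 0.05) = {posterior:.2f}")   # ≈ 0.64, not 0.95
```

Under these assumptions, a significant result means the effect is real with probability about 64%, far from the 95% the common misreading suggests.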
Type I and Type II errors
Every hypothesis test can make two kinds of mistakes.
The error matrix
| | H₀ True (No Effect) | H₁ True (Effect Exists) |
|---|---|---|
| Don't Reject H₀ | Correct ✓ | Type II Error (β) - Missed Effect |
| Reject H₀ | Type I Error (α) - False Positive | Correct ✓ (Power) |
A Type I error (α) is claiming an effect when there isn't one. This is what the α = 0.05 threshold controls - when H₀ is true, we accept a 5% chance of a false positive.
A Type II error (β) is missing a real effect. This is related to statistical power = 1 - β.
Experiment 1: Set a small true effect (0.2) with small sample size (30). Notice the low power - you'll miss real effects often.
Experiment 2: Increase sample size to 150. Watch power increase above 80% (the conventional target).
Experiment 3: Set sample size to 50 and increase effect size to 0.8. Large effects are easier to detect.
The trade-off: You can't reduce both errors simultaneously without increasing sample size.
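A quick simulation makes the two error rates tangible. This sketch assumes a two-sided one-sample z-test with known σ = 1, which is a simplification of whatever the interactive demo runs; the effect and sample sizes mirror Experiments 1 and 3 above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def rejection_rate(true_effect, n, alpha=0.05, trials=20_000):
    """Fraction of simulated studies that reject H0 (z-test, sigma = 1)."""
    samples = rng.normal(loc=true_effect, scale=1.0, size=(trials, n))
    z = samples.mean(axis=1) / (1 / np.sqrt(n))        # (x̄ - 0) / SE
    return np.mean(np.abs(z) > norm.ppf(1 - alpha / 2))

print("Type I error rate (no effect), n = 30:", rejection_rate(0.0, n=30))  # ≈ 0.05
print("Power, small effect (0.2), n = 30:    ", rejection_rate(0.2, n=30))  # low
print("Power, large effect (0.8), n = 50:    ", rejection_rate(0.8, n=50))  # near 1
```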
The power formula
Statistical power depends on three things:
Power = P(reject H₀ | H₁ is true)
Power = Φ(|effect|/SE - z_α/2) + Φ(-|effect|/SE - z_α/2)
Where:
- effect = true difference between groups
- SE = σ/√n (standard error)
- z_α/2 = 1.96 for α = 0.05
- Φ = the standard normal CDF
Key insight: Power increases with larger effect, larger n, or larger α.
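The formula translates directly into code. This is a minimal sketch assuming a two-sided one-sample z-test with known σ (so SE = σ/√n), matching the formula above; the numbers plugged in are illustrative.

```python
from math import sqrt
from scipy.stats import norm

def power(effect, sigma, n, alpha=0.05):
    """Power = Φ(|effect|/SE - z_α/2) + Φ(-|effect|/SE - z_α/2)."""
    se = sigma / sqrt(n)                # standard error of the mean
    z_crit = norm.ppf(1 - alpha / 2)    # 1.96 for alpha = 0.05
    shift = abs(effect) / se
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Illustrative values: an effect of 0.5 (with sigma = 1) needs roughly
# n = 32 for ~80% power in this one-sample setting.
print(f"{power(effect=0.5, sigma=1.0, n=32):.2f}")   # ≈ 0.81
```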
This is why sample size calculations matter. Before running a study, you should:
- Estimate expected effect size
- Choose desired power (typically 80%)
- Calculate required n
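Inverting the same formula (and dropping the negligible second Φ term) gives the standard sample-size calculation. A minimal sketch, again assuming a one-sample z-test with known σ; swap in your own effect size estimate.

```python
from math import ceil
from scipy.stats import norm

def required_n(effect, sigma=1.0, alpha=0.05, power=0.80):
    """Smallest n with at least the requested power (z-test approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    return ceil(((z_alpha + z_beta) * sigma / abs(effect)) ** 2)

# Detecting a small effect takes far more data than a medium one:
print(required_n(effect=0.2))   # ≈ 197
print(required_n(effect=0.5))   # ≈ 32
```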
Many studies are underpowered - they have too few participants to detect realistic effects. An underpowered study that finds "no significant effect" doesn't mean there's no effect - it means the study couldn't detect it.
"Not significant" does not mean "no effect." A study with n = 20 and power = 30% will miss real effects 70% of the time. Finding p = 0.15 in such a study tells you almost nothing - you simply didn't have enough data.
The multiple testing problem
Here's where hypothesis testing breaks down in practice.
The mathematical problem
With a 5% false positive rate per test, what happens when you run multiple tests?
Experiment 1: Set tests = 1 and α = 0.05. Probability of a false positive is exactly 5%.
Experiment 2: Increase to 20 tests. The probability of AT LEAST ONE false positive jumps to 64%!
Experiment 3: Look at the Bonferroni-corrected α. To maintain 5% overall error with 20 tests, each test must use α = 0.0025.
The math: P(≥1 FP) = 1 - (1-α)^n grows rapidly with n.
The Multiple Testing Formula
If you run n independent tests at significance level α:
P(at least one false positive) = 1 - (1 - α)^n
For n = 20 and α = 0.05:
P(≥1 FP) = 1 - (0.95)^20 = 1 - 0.358 = 64.2%
Correction methods:
- Bonferroni: Use α/n for each test
- Benjamini-Hochberg: Control false discovery rate (FDR)
- Pre-registration: Specify tests before seeing data
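The formula and the Bonferroni correction are both one-liners; this sketch also simulates the familywise error rate with 20 independent z-tests on pure noise, so every rejection it counts is a false positive.

```python
import numpy as np
from scipy.stats import norm

alpha, n_tests = 0.05, 20

# Analytic familywise error rate and Bonferroni-corrected threshold
fwer = 1 - (1 - alpha) ** n_tests
print(f"P(>=1 false positive) = {fwer:.3f}")          # ≈ 0.642
print(f"Bonferroni alpha      = {alpha / n_tests}")   # 0.0025

# Simulation: 20 independent z-tests under H0, repeated 10,000 times
rng = np.random.default_rng(1)
z = rng.standard_normal((10_000, n_tests))            # test statistics under H0
p = 2 * norm.sf(np.abs(z))                            # two-sided p-values
print("simulated FWER, uncorrected:", np.mean((p < alpha).any(axis=1)))
print("simulated FWER, Bonferroni: ", np.mean((p < alpha / n_tests).any(axis=1)))
```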
P-hacking in action
P-hacking is running multiple analyses until you find p < 0.05, then reporting only that result. Let's see it in action:
Experiment 1: Test 5 variables. You might get lucky, or you might not.
Experiment 2: Test 20 variables. You'll almost always find something "significant" - even though NOTHING is real.
Experiment 3: When you find a "significant" result, read the fake headline. It sounds convincing! This is how p-hacking produces publishable-looking research about nothing.
Remember: There is NO real effect in this simulation. Every "discovery" is a false positive.
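A sketch of the same idea: test 20 pure-noise "variables" against a pure-noise outcome and report whatever clears p < 0.05. This is not the demo's exact implementation, just the mechanism it illustrates; any hit it prints is, by construction, a false positive.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_people, n_variables = 100, 20

outcome = rng.standard_normal(n_people)                   # pure noise
variables = rng.standard_normal((n_variables, n_people))  # also pure noise

# "Test everything, report whatever clears p < 0.05"
for i, x in enumerate(variables):
    r, p = pearsonr(x, outcome)
    if p < 0.05:
        print(f"variable {i}: r = {r:+.2f}, p = {p:.3f}  <- false positive")
```

Run it a few times with different seeds: roughly two times in three, at least one "discovery" appears, exactly as the 1 - (0.95)^20 calculation predicts.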
Real p-hacking techniques include:
- Testing multiple outcome variables and reporting only the significant ones
- Removing "outliers" selectively until p < 0.05
- Collecting data until p < 0.05, then stopping
- Trying different subgroups (by age, gender, education) until one works
- Including or excluding covariates to fish for significance
None of these are fraud in the traditional sense - researchers often don't realize they're doing it. But the result is the same: false positives masquerading as discoveries.
Effect size vs. statistical significance
A result can be statistically significant but practically meaningless. Or practically important but statistically non-significant.
The problem with pure p-values
Experiment 1: Set effect size = 0.05 (tiny) and sample size = 10,000. Watch the p-value become highly significant despite the trivial effect.
Experiment 2: Set effect size = 0.6 (medium-large) and sample size = 20. The effect isn't statistically significant because n is too small.
The insight: Statistical significance tells you whether an effect is distinguishable from zero. It says nothing about whether the effect is large enough to matter.
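Both experiments follow directly from the z-statistic for a two-group comparison, z ≈ d·√(n/2) with σ = 1. A minimal sketch, assuming equal group sizes and a z-approximation rather than whatever test the demo uses:

```python
from math import sqrt
from scipy.stats import norm

def two_group_p(d, n_per_group):
    """Two-sided p-value for a standardized mean difference d (z-approximation)."""
    z = d * sqrt(n_per_group / 2)       # d / (sigma * sqrt(2/n)), sigma = 1
    return 2 * norm.sf(abs(z))

# Tiny effect, huge sample: "highly significant" but practically trivial
print(f"d = 0.05, n = 10,000 per group: p = {two_group_p(0.05, 10_000):.4f}")
# Medium-large effect, small sample: real-looking effect, not significant
print(f"d = 0.60, n = 20 per group:     p = {two_group_p(0.60, 20):.3f}")
```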
Effect size benchmarks
Cohen's d is a standardized effect size:
| d | Interpretation |
|---|---|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |
For context: the height difference between 15 and 16-year-old girls is about d ≈ 0.2 (small), the IQ difference between PhD holders and college students is about d ≈ 0.5 (medium), and the height difference between 13 and 18-year-old girls is about d ≈ 0.8 (large).
Always report effect size alongside p-values. A drug that lowers blood pressure by 0.5 mmHg with p = 0.001 is useless. A drug that lowers blood pressure by 10 mmHg with p = 0.07 might save lives. The p-value tells you if the effect is likely real. The effect size tells you if it matters.
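Reporting both is a two-line habit. A minimal sketch computing Cohen's d (pooled standard deviation) alongside a t-test p-value, on two hypothetical samples generated here purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
a = rng.normal(loc=120.0, scale=10.0, size=200)   # hypothetical control blood pressures
b = rng.normal(loc=118.0, scale=10.0, size=200)   # hypothetical treated group, ~2 mmHg lower

t, p = ttest_ind(a, b)

# Cohen's d with the pooled standard deviation (equal group sizes)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd

print(f"p = {p:.4f}, Cohen's d = {d:.2f}")   # report both, not just p
```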
Your statistical BS detector
You now have the mathematical foundation to evaluate statistical claims critically.
The seven questions
What was the sample size? (Part 1) Small n means noisy estimates, low power, unreliable results.
Is that mean or median? (Part 2) For skewed data, mean can be misleading. Ask which was used and why.
What's the spread? (Part 2) An average without standard deviation is half the picture.
What's the shape of the data? (Part 3) Symmetric, skewed, bimodal? Shape determines which statistics make sense.
What does "significant" actually mean here? (Part 5) p < 0.05 means "unlikely if nothing is happening," not "95% likely to be real."
How many things did they test? (Part 5) Testing 20 variables and finding 1 significant is expected noise. Ask about pre-registration.
What's the effect size? (Part 5) A tiny effect with p = 0.001 might be statistically significant but practically worthless.
Red flags
| Red Flag | What It Might Mean |
|---|---|
| "The average..." without specifying mean/median | Cherry-picked to sound better |
| Only p-value, no effect size | Effect might be real but trivially small |
| One significant finding among many | Classic p-hacking |
| Very small sample size | Too noisy to trust |
| "Nearly significant" (p = 0.06) | Trying to spin a null result |
| No confidence interval | Hiding uncertainty |
| "The data proves..." | Statistics never proves; it gives probability |
The replication crisis
In 2015, researchers tried to replicate 100 psychology studies published in top journals. Only 36% replicated successfully.
Why so many false positives?
The math we've derived explains this. Publication bias means journals prefer p < 0.05, so only "significant" results get published. P-hacking means researchers (often unknowingly) try multiple analyses until something works. Low power means many studies are too small to detect realistic effects. And even legitimate studies have a 5% false positive rate.
The result: published literature is enriched with false positives.
The solution
How to fix the replication crisis
- Pre-registration: specify hypotheses and analysis plan before collecting data.
- Larger samples: power > 80% for expected effects.
- Replications: value replication studies as much as novel findings.
- Effect sizes: report and evaluate effect sizes, not just p-values.
- Bayesian methods: consider alternatives to null hypothesis testing.
A single study with p < 0.05 is weak evidence. The threshold isn't "proof" - it's a starting point. Extraordinary claims require extraordinary evidence: multiple replications, large samples, mechanistic understanding. When someone says "studies show..." ask: How many studies? What were the sample sizes? Did they replicate? What's the effect size?
What we derived
From first principles across this 5-part series, we established the foundations of statistical inference.
In Part 1 (Sampling), we proved E[x̄] = μ (sample means are unbiased), SE = σ/√n (the square root law), and that to halve error you must quadruple n.
In Part 2 (Central Tendency & Spread), we derived breakdown points of 0% for mean and 50% for median, and showed where σ = √[Σ(x-μ)²/n] comes from.
In Part 3 (Distributions), we derived the normal PDF: f(x) = (1/(σ√(2π))) × exp(-(x-μ)²/(2σ²)), and showed that the 68-95-99.7 rule emerges from integration.
In Part 4 (CLT), we proved that sample means approach a normal distribution regardless of the shape of the source distribution (provided it has finite variance), with a convergence rate that depends on skewness.
In Part 5 (Hypothesis Testing), we derived p-value = P(data this extreme | H₀ true), showed that P(≥1 FP in n tests) = 1 - (1-α)^n, and emphasized that effect size matters more than p-values.
Statistics is not about certainty. It's about quantifying uncertainty and making good decisions despite incomplete information. When someone presents a statistical claim with certainty, be skeptical. The math we've derived shows that uncertainty is inherent. The best we can do is measure how wrong we might be - and make good decisions anyway.