Part 5: Hypothesis Testing & P-Hacking
In Part 4, we derived the Central Limit Theorem. Now let's put everything together to understand hypothesis testing - how it works, why it fails, and how to avoid being fooled.
Deriving hypothesis testing
Before we can critique p-values, we need to understand where they come from. Let's derive hypothesis testing from first principles.
The core problem
You observe data and want to answer: Is this a real effect, or just noise?
Example: Your A/B test shows 4.2% conversion (B) vs 3.8% (A). That's an 11% relative improvement. But random samples vary. Maybe this difference is just luck?
The logic of testing
Walk through the derivation: Use Previous/Next to step through the logic of hypothesis testing.
Key insight at Step 4: The p-value is calculated by asking "how extreme is our result?" under the assumption that there's no effect.
Common trap at Step 7: p < 0.05 does NOT mean "95% probability the effect is real." It means "data this extreme would occur less than 5% of the time if nothing were happening."
The Hypothesis Testing Recipe
- State null hypothesis H₀: "There is no effect"
- Calculate test statistic: z = (observed - expected) / SE
- Find p-value: P(|Z| ≥ |z|) under H₀
- If p < α (usually 0.05), reject H₀
What this means: "If there were no effect, results this extreme would happen less than 5% of the time. So probably something real is happening."
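To make the recipe concrete, here is a minimal sketch of that calculation applied to the A/B example above. The conversion rates come from the example, but the visitor counts (10,000 per variant) are an assumption added purely for illustration; the article does not specify them.

```python
from scipy.stats import norm  # standard normal CDF / survival function

# A/B test from the example: 3.8% (A) vs 4.2% (B) conversion.
# The visitor counts are NOT given in the text; 10,000 per variant
# is an assumed number for illustration only.
n_a, n_b = 10_000, 10_000
p_a, p_b = 0.038, 0.042

# Step 1: H0 says both variants share the same conversion rate.
# Step 2: test statistic z = (observed difference - 0) / SE,
# using the pooled rate to estimate the standard error under H0.
p_pool = (p_a * n_a + p_b * n_b) / (n_a + n_b)
se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
z = (p_b - p_a) / se

# Step 3: two-sided p-value, P(|Z| >= |z|) under H0.
p_value = 2 * norm.sf(abs(z))

# Step 4: compare against alpha = 0.05.
print(f"z = {z:.2f}, p = {p_value:.3f}")   # roughly z ≈ 1.44, p ≈ 0.15
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

With these assumed counts, the 11% relative lift does not clear the 0.05 threshold; it could plausibly be noise.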
What p-values actually mean
Let's be mathematically precise:
p-value = P(data at least this extreme | H₀ is true)
This is not P(H₀ is true | data), not P(effect is real), and not P(result is a fluke).
The p-value is about the data given H₀, not H₀ given the data. This confusion is called the "prosecutor's fallacy" and it's pervasive.
The p-value answers the wrong question. You want to know: "Is this effect real?" (P(H₁ | data)). The p-value tells you: "How surprising is this data if the effect isn't real?" (P(data | H₀)). These are not the same thing. Bayes' theorem relates them, but the relationship depends on prior probabilities that p-values ignore.
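To see how much the two questions can differ, here is a small sketch that converts P(data | H₀) into P(H₁ | significant result) via Bayes' theorem. The prior (10% of tested hypotheses are true) and the power (80%) are illustrative assumptions, not values from the text.

```python
# Bayes' theorem: P(H1 | significant) depends on the prior and the power,
# not just on alpha. All three inputs below are illustrative assumptions.
alpha = 0.05    # false positive rate under H0
power = 0.80    # P(significant | H1 true)
prior = 0.10    # assumed fraction of tested hypotheses that are actually true

# P(significant) = P(sig | H1) P(H1) + P(sig | H0) P(H0)
p_significant = power * prior + alpha * (1 - prior)

# Posterior probability the effect is real, given a significant result
posterior = power * prior / p_significant
print(f"P(H1 | p < 0.05) = {posterior:.2f}")   # ≈ 0.64, not 0.95
```

Under these assumptions, a significant result means the effect is real with probability about 64%, far from the 95% the common misreading suggests.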
Type I and Type II errors
Every hypothesis test can make two kinds of mistakes.
The error matrix
| | H₀ True (No Effect) | H₁ True (Effect Exists) |
|---|---|---|
| Don't Reject H₀ | Correct ✓ | Type II Error (β) - Missed Effect |
| Reject H₀ | Type I Error (α) - False Positive | Correct ✓ (Power) |
A Type I error (α) is claiming an effect when there isn't one. This is what the α = 0.05 threshold controls - when H₀ is true, we accept a 5% chance of a false positive.
A Type II error (β) is missing a real effect. This is related to statistical power = 1 - β.
Experiment 1: Set a small true effect (0.2) with small sample size (30). Notice the low power - you'll miss real effects often.
Experiment 2: Increase sample size to 150. Watch power increase above 80% (the conventional target).
Experiment 3: Set sample size to 50 and increase effect size to 0.8. Large effects are easier to detect.
The trade-off: You can't reduce both errors simultaneously without increasing sample size.
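A quick simulation makes the two error rates tangible. This sketch assumes a two-sided one-sample z-test with known σ = 1, which is a simplification of whatever the interactive demo runs; the effect and sample sizes mirror Experiments 1 and 3 above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def rejection_rate(true_effect, n, alpha=0.05, trials=20_000):
    """Fraction of simulated studies that reject H0 (z-test, sigma = 1)."""
    samples = rng.normal(loc=true_effect, scale=1.0, size=(trials, n))
    z = samples.mean(axis=1) / (1 / np.sqrt(n))        # (x̄ - 0) / SE
    return np.mean(np.abs(z) > norm.ppf(1 - alpha / 2))

print("Type I error rate (no effect), n = 30:", rejection_rate(0.0, n=30))  # ≈ 0.05
print("Power, small effect (0.2), n = 30:    ", rejection_rate(0.2, n=30))  # low
print("Power, large effect (0.8), n = 50:    ", rejection_rate(0.8, n=50))  # near 1
```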
The power formula
Statistical power depends on three things:
Power = P(reject H₀ | H₁ is true)
Power = Φ(|effect|/SE - z_α/2) + Φ(-|effect|/SE - z_α/2)
Where:
- effect = true difference between groups
- SE = σ/√n (standard error)
- z_α/2 = 1.96 for α = 0.05
- Φ = the standard normal CDF
Key insight: Power increases with larger effect, larger n, or larger α.
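The formula translates directly into code. This is a minimal sketch assuming a two-sided one-sample z-test with known σ (so SE = σ/√n), matching the formula above; the numbers plugged in are illustrative.

```python
from math import sqrt
from scipy.stats import norm

def power(effect, sigma, n, alpha=0.05):
    """Power = Φ(|effect|/SE - z_α/2) + Φ(-|effect|/SE - z_α/2)."""
    se = sigma / sqrt(n)                # standard error of the mean
    z_crit = norm.ppf(1 - alpha / 2)    # 1.96 for alpha = 0.05
    shift = abs(effect) / se
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Illustrative values: an effect of 0.5 (with sigma = 1) needs roughly
# n = 32 for ~80% power in this one-sample setting.
print(f"{power(effect=0.5, sigma=1.0, n=32):.2f}")   # ≈ 0.81
```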
This is why sample size calculations matter. Before running a study, you should:
- Estimate expected effect size
- Choose desired power (typically 80%)
- Calculate required n
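Inverting the same formula (and dropping the negligible second Φ term) gives the standard sample-size calculation. A minimal sketch, again assuming a one-sample z-test with known σ; swap in your own effect size estimate.

```python
from math import ceil
from scipy.stats import norm

def required_n(effect, sigma=1.0, alpha=0.05, power=0.80):
    """Smallest n with at least the requested power (z-test approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    return ceil(((z_alpha + z_beta) * sigma / abs(effect)) ** 2)

# Detecting a small effect takes far more data than a medium one:
print(required_n(effect=0.2))   # ≈ 197
print(required_n(effect=0.5))   # ≈ 32
```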
Many studies are underpowered - they have too few participants to detect realistic effects. An underpowered study that finds "no significant effect" doesn't mean there's no effect - it means the study couldn't detect it.
"Not significant" does not mean "no effect." A study with n = 20 and power = 30% will miss real effects 70% of the time. Finding p = 0.15 in such a study tells you almost nothing - you simply didn't have enough data.
The multiple testing problem
Here's where hypothesis testing breaks down in practice.
The mathematical problem
With a 5% false positive rate per test, what happens when you run multiple tests?
Experiment 1: Set tests = 1 and α = 0.05. Probability of a false positive is exactly 5%.
Experiment 2: Increase to 20 tests. The probability of AT LEAST ONE false positive jumps to 64%!
Experiment 3: Look at the Bonferroni-corrected α. To maintain 5% overall error with 20 tests, each test must use α = 0.0025.
The math: P(≥1 FP) = 1 - (1-α)^n grows rapidly with n.
The Multiple Testing Formula
If you run n independent tests at significance level α:
P(at least one false positive) = 1 - (1 - α)^n
For n = 20 and α = 0.05:
P(≥1 FP) = 1 - (0.95)^20 = 1 - 0.358 = 64.2%
Correction methods:
- Bonferroni: Use α/n for each test
- Benjamini-Hochberg: Control false discovery rate (FDR)
- Pre-registration: Specify tests before seeing data
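The formula and the Bonferroni correction are both one-liners; this sketch also simulates the familywise error rate with 20 independent z-tests on pure noise, so every rejection it counts is a false positive.

```python
import numpy as np
from scipy.stats import norm

alpha, n_tests = 0.05, 20

# Analytic familywise error rate and Bonferroni-corrected threshold
fwer = 1 - (1 - alpha) ** n_tests
print(f"P(>=1 false positive) = {fwer:.3f}")          # ≈ 0.642
print(f"Bonferroni alpha      = {alpha / n_tests}")   # 0.0025

# Simulation: 20 independent z-tests under H0, repeated 10,000 times
rng = np.random.default_rng(1)
z = rng.standard_normal((10_000, n_tests))            # test statistics under H0
p = 2 * norm.sf(np.abs(z))                            # two-sided p-values
print("simulated FWER, uncorrected:", np.mean((p < alpha).any(axis=1)))
print("simulated FWER, Bonferroni: ", np.mean((p < alpha / n_tests).any(axis=1)))
```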
P-hacking in action
P-hacking is running multiple analyses until you find p < 0.05, then reporting only that result. Let's see it in action:
Experiment 1: Test 5 variables. You might get lucky, or you might not.
Experiment 2: Test 20 variables. You'll almost always find something "significant" - even though NOTHING is real.
Experiment 3: When you find a "significant" result, read the fake headline. It sounds convincing! This is how p-hacking produces publishable-looking research about nothing.
Remember: There is NO real effect in this simulation. Every "discovery" is a false positive.
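A sketch of the same idea: test 20 pure-noise "variables" against a pure-noise outcome and report whatever clears p < 0.05. This is not the demo's exact implementation, just the mechanism it illustrates; any hit it prints is, by construction, a false positive.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_people, n_variables = 100, 20

outcome = rng.standard_normal(n_people)                   # pure noise
variables = rng.standard_normal((n_variables, n_people))  # also pure noise

# "Test everything, report whatever clears p < 0.05"
for i, x in enumerate(variables):
    r, p = pearsonr(x, outcome)
    if p < 0.05:
        print(f"variable {i}: r = {r:+.2f}, p = {p:.3f}  <- false positive")
```

Run it a few times with different seeds: roughly two times in three, at least one "discovery" appears, exactly as the 1 - (0.95)^20 calculation predicts.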
Real p-hacking techniques include:
- Testing multiple outcome variables and reporting only the significant ones
- Removing "outliers" selectively until p < 0.05
- Collecting data until p < 0.05, then stopping
- Trying different subgroups (by age, gender, education) until one works
- Including or excluding covariates to fish for significance
None of these are fraud in the traditional sense - researchers often don't realize they're doing it. But the result is the same: false positives masquerading as discoveries.
Effect size vs. statistical significance
A result can be statistically significant but practically meaningless. Or practically important but statistically non-significant.
The problem with pure p-values
Experiment 1: Set effect size = 0.05 (tiny) and sample size = 10,000. Watch the p-value become highly significant despite the trivial effect.
Experiment 2: Set effect size = 0.6 (medium-large) and sample size = 20. The effect isn't statistically significant because n is too small.
The insight: Statistical significance tells you whether an effect is distinguishable from zero. It says nothing about whether the effect is large enough to matter.
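Both experiments follow directly from the z-statistic for a two-group comparison, z ≈ d·√(n/2) with σ = 1. A minimal sketch, assuming equal group sizes and a z-approximation rather than whatever test the demo uses:

```python
from math import sqrt
from scipy.stats import norm

def two_group_p(d, n_per_group):
    """Two-sided p-value for a standardized mean difference d (z-approximation)."""
    z = d * sqrt(n_per_group / 2)       # d / (sigma * sqrt(2/n)), sigma = 1
    return 2 * norm.sf(abs(z))

# Tiny effect, huge sample: "highly significant" but practically trivial
print(f"d = 0.05, n = 10,000 per group: p = {two_group_p(0.05, 10_000):.4f}")
# Medium-large effect, small sample: real-looking effect, not significant
print(f"d = 0.60, n = 20 per group:     p = {two_group_p(0.60, 20):.3f}")
```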
Effect size benchmarks
Cohen's d is a standardized effect size:
| d | Interpretation |
|---|---|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |
For context: the height difference between 15 and 16-year-old girls is about d ≈ 0.2 (small), the IQ difference between PhD holders and college students is about d ≈ 0.5 (medium), and the height difference between 13 and 18-year-old girls is about d ≈ 0.8 (large).
Always report effect size alongside p-values. A drug that lowers blood pressure by 0.5 mmHg with p = 0.001 is useless. A drug that lowers blood pressure by 10 mmHg with p = 0.07 might save lives. The p-value tells you if the effect is likely real. The effect size tells you if it matters.
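Reporting both is a two-line habit. A minimal sketch computing Cohen's d (pooled standard deviation) alongside a t-test p-value, on two hypothetical samples generated here purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
a = rng.normal(loc=120.0, scale=10.0, size=200)   # hypothetical control blood pressures
b = rng.normal(loc=118.0, scale=10.0, size=200)   # hypothetical treated group, ~2 mmHg lower

t, p = ttest_ind(a, b)

# Cohen's d with the pooled standard deviation (equal group sizes)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd

print(f"p = {p:.4f}, Cohen's d = {d:.2f}")   # report both, not just p
```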
Your statistical BS detector
You now have the mathematical foundation to evaluate statistical claims critically.
The seven questions
What was the sample size? (Part 1) Small n means noisy estimates, low power, unreliable results.
Is that mean or median? (Part 2) For skewed data, mean can be misleading. Ask which was used and why.
What's the spread? (Part 2) An average without standard deviation is half the picture.
What's the shape of the data? (Part 3) Symmetric, skewed, bimodal? Shape determines which statistics make sense.
What does "significant" actually mean here? (Part 5) p < 0.05 means "unlikely if nothing is happening," not "95% likely to be real."
How many things did they test? (Part 5) Testing 20 variables and finding 1 significant is expected noise. Ask about pre-registration.
What's the effect size? (Part 5) A tiny effect with p = 0.001 might be statistically significant but practically worthless.
Red flags
| Red Flag | What It Might Mean |
|---|---|
| "The average..." without specifying mean/median | Cherry-picked to sound better |
| Only p-value, no effect size | Effect might be real but trivially small |
| One significant finding among many | Classic p-hacking |
| Very small sample size | Too noisy to trust |
| "Nearly significant" (p = 0.06) | Trying to spin a null result |
| No confidence interval | Hiding uncertainty |
| "The data proves..." | Statistics never proves; it gives probability |
The replication crisis
In 2015, researchers tried to replicate 100 psychology studies published in top journals. Only 36% replicated successfully.
Why so many false positives?
The math we've derived explains this. Publication bias means journals prefer p < 0.05, so only "significant" results get published. P-hacking means researchers (often unknowingly) try multiple analyses until something works. Low power means many studies are too small to detect realistic effects. And even legitimate studies have a 5% false positive rate.
The result: published literature is enriched with false positives.
The solution
How to fix the replication crisis
- Pre-registration: specify hypotheses and analysis plan before collecting data.
- Larger samples: power > 80% for expected effects.
- Replications: value replication studies as much as novel findings.
- Effect sizes: report and evaluate effect sizes, not just p-values.
- Bayesian methods: consider alternatives to null hypothesis testing.
A single study with p < 0.05 is weak evidence. The threshold isn't "proof" - it's a starting point. Extraordinary claims require extraordinary evidence: multiple replications, large samples, mechanistic understanding. When someone says "studies show..." ask: How many studies? What were the sample sizes? Did they replicate? What's the effect size?
What we derived
From first principles across this 5-part series, we established the foundations of statistical inference.
In Part 1 (Sampling), we proved E[x̄] = μ (sample means are unbiased), SE = σ/√n (the square root law), and that to halve error you must quadruple n.
In Part 2 (Central Tendency & Spread), we derived breakdown points of 0% for mean and 50% for median, and showed where σ = √[Σ(x-μ)²/n] comes from.
In Part 3 (Distributions), we derived the normal PDF: f(x) = (1/(σ√(2π))) × exp(-(x-μ)²/(2σ²)), and showed that the 68-95-99.7 rule emerges from integration.
In Part 4 (CLT), we proved that sample means approach a normal distribution regardless of the shape of the source distribution (provided it has finite variance), with a convergence rate that depends on skewness.
In Part 5 (Hypothesis Testing), we derived p-value = P(data this extreme | H₀ true), showed that P(≥1 FP in n tests) = 1 - (1-α)^n, and emphasized that effect size matters more than p-values.
Statistics is not about certainty. It's about quantifying uncertainty and making good decisions despite incomplete information. When someone presents a statistical claim with certainty, be skeptical. The math we've derived shows that uncertainty is inherent. The best we can do is measure how wrong we might be - and make good decisions anyway.