Vibing with Statistics
Your A/B test shows a 12% lift in conversions. The data scientist is confident. The PM is ready to ship.
But should you?
That 12% could be signal: a real improvement your team created. Or it could be noise dressed up in a suit: random variation that happened to look impressive this time.
Telling those two stories apart is what statistics is for. By the end of this guide, you'll know what questions to ask when someone presents data, and you'll catch mistakes that would have sailed right past you before.
Part 1: The Sampling Problem
You almost never have data on everyone. You have data on some people, a sample. And you're trying to figure out what's true for the whole population.
Want to know your customers' average satisfaction score? You could survey all 100,000 of them. Or you could survey 200 and extrapolate.
The question is: how close is your sample average to the true average?
Question: How big of a sample do you need to trust the results? When is N=10 enough vs N=100?
Try this: Set sample size to 10 and draw 10 samples. See how scattered they are. Now try sample size 100.
Small samples are noisy. Each one gives wildly different answers. Large samples cluster tightly around the truth.
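If you'd rather poke at this in code than with the sliders, here's a minimal sketch. The population is invented (100,000 satisfaction scores on a 1-to-10 scale); the point is how much the sample means wobble at n=10 versus n=100.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented population: 100,000 satisfaction scores on a 1-10 scale.
population = rng.normal(loc=7.2, scale=1.5, size=100_000).clip(1, 10)
print(f"true mean: {population.mean():.2f}")

for n in (10, 100):
    # Draw 10 independent samples of size n and compare their means.
    means = [rng.choice(population, size=n, replace=False).mean() for _ in range(10)]
    print(f"n={n:>3}: sample means range from {min(means):.2f} to {max(means):.2f}")
```

The n=10 means scatter widely; the n=100 means hug the true value much more tightly.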
This is the fundamental tension: samples vary, so every sample gives a somewhat different answer. The question is always the same: how much should I trust this particular sample?
Part 2: Averages Can Lie
When someone gives you a summary statistic, it's usually one of two things:
Mean: Add up all values, divide by count. What most people call "the average."
Median: The middle value when you line everything up. Half above, half below.
For symmetric data, they're nearly identical. But add some outliers and watch what happens.
Question: If someone quotes "the average," which one did they choose and why?
Try this: Add a high outlier (180). Watch the mean jump while the median stays steady.
A single extreme value can drag the mean far from where most data lives. The median barely moves.
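The same effect in a few lines of Python (the baseline values are made up; 180 is the outlier from the demo above):

```python
import statistics

values = [62, 65, 68, 70, 71, 73, 75]   # made-up, roughly symmetric data

print(statistics.mean(values), statistics.median(values))   # ~69.1 and 70

values.append(180)                       # add one extreme outlier

print(statistics.mean(values), statistics.median(values))   # mean leaps to 83, median nudges to 70.5
```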
This is why we report "median household income" instead of "mean household income." A few billionaires would make the "average" household look much richer than it is.
When someone quotes an average, ask: is this mean or median? And more importantly: why did they choose that one?
Part 3: Same Average, Different Reality
Two products both have a 4.2 star rating. Same average. Should you care which one you buy?
One has reviews clustered tightly around 4 stars. The other has a mix of 1-star and 5-star reviews. The average tells you nothing about this.
You need to know the spread, measured by standard deviation: roughly, the average distance from the mean.
Question: Two products both have 4.2 stars. Does that tell you everything you need to know?
Try this: Adjust the spread sliders. Watch how the same average (3.0 stars) can mean very different things.
Same mean. Completely different stories. A small standard deviation means most values cluster near the center; a large one means they're scattered all over the place.
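Here's the two-products example as a quick sketch (both sets of reviews are invented so that they average exactly 4.2 stars):

```python
import statistics

product_a = [4, 4, 4, 4, 4, 4, 4, 4, 5, 5]   # tightly clustered reviews
product_b = [1, 1, 5, 5, 5, 5, 5, 5, 5, 5]   # love-it-or-hate-it reviews

for name, reviews in (("A", product_a), ("B", product_b)):
    print(name, statistics.mean(reviews), round(statistics.stdev(reviews), 2))
# Both average 4.2 stars, but the spreads (~0.42 vs ~1.69) describe very different products.
```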
An average without a spread is only half the story.
Part 4: Shape Matters
Not all data follows the same pattern. Some clusters in the middle (symmetric). Some has a long tail to the right (right-skewed). Some has two peaks (bimodal).
Why does this matter? Because the shape determines which statistics make sense.
Question: When is it okay to use mean vs. when should you insist on median?
Try this: Toggle through each shape. Notice where mean and median diverge.
For symmetric data, mean and median work equally well. For right-skewed data (income, home prices, website visits), the mean gets pulled toward the tail. Use median. For bimodal data, neither tells the real story. You're probably looking at two distinct groups mashed together.
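To see the mean-versus-median gap on skewed data, here's a small sketch using a lognormal distribution as a stand-in for home prices (the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Right-skewed stand-in for home prices: mostly modest, a few mansions.
prices = rng.lognormal(mean=12.5, sigma=0.6, size=10_000)

print(f"mean price:   ${prices.mean():,.0f}")      # dragged up by the long right tail
print(f"median price: ${np.median(prices):,.0f}")  # closer to what a typical home costs
```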
Before trusting any summary statistic, ask: what shape is this data?
Part 5: The Bell Curve Everyone Talks About
The normal distribution shows up everywhere: heights, test scores, measurement errors, manufacturing tolerances. It's described completely by two numbers: the mean (center) and the standard deviation (spread).
Question: Why do we use "2 standard deviations" as the threshold for "unusual"?
Try this: Toggle between ±1σ, ±2σ, and ±3σ. See what percentage of data falls within each range.
The famous "68-95-99.7 rule":
- 68% of data falls within 1 standard deviation of the mean
- 95% within 2 standard deviations
- 99.7% within 3 standard deviations
This is why "2 standard deviations from the mean" is a common threshold for "unusual." Only about 5% of values are that extreme.
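You can verify the rule empirically in a few lines (the mean of 100 and standard deviation of 15 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=100, scale=15, size=100_000)

for k in (1, 2, 3):
    within = np.mean(np.abs(data - 100) <= k * 15)
    print(f"within ±{k} standard deviations: {within:.1%}")   # ~68%, ~95%, ~99.7%
```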
Part 6: The Magic Theorem
Here's something that still feels like magic.
Take any weird, non-normal distribution. Exponential. Bimodal. Something ugly from production logs. Now take samples from it. Not individual values, but groups of 30. Calculate the mean of each group.
Plot those sample means. They'll form an approximately normal distribution. Almost every time.
This is the Central Limit Theorem, and it's why sampling works at all.
Question: Why can pollsters survey 1,000 people and predict behavior of 330 million?
Try this: Pick "Exponential" (very non-normal). Take 100 samples. Watch the sample means form a bell curve anyway.
This is why pollsters can survey 1,000 people and make predictions about 330 million. The sample means follow predictable patterns even when individual data points don't.
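Here's the same experiment as a sketch, assuming an exponential source distribution and groups of 30:

```python
import numpy as np

rng = np.random.default_rng(3)

# A decidedly non-normal source: exponential, heavily right-skewed.
draws = rng.exponential(scale=10, size=(1000, 30))

# Take the mean of each group of 30.
sample_means = draws.mean(axis=1)

# The source is skewed, but the means cluster symmetrically around 10,
# with a spread of roughly scale / sqrt(30) ≈ 1.8. Histogram them and you'll see a bell.
print(round(sample_means.mean(), 2), round(sample_means.std(), 2))
```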
Part 7: The p-value Trap
Your A/B test shows version B converted at 4.2% versus version A at 3.8%. Ship it?
Not so fast. What if this is just random noise?
This is where hypothesis testing comes in. The standard approach:
- Assume there's no real difference (the "null hypothesis")
- Calculate how likely you'd be to see a result this extreme if there really were no difference
- If that probability is below 5%, call it "statistically significant"
That probability is the p-value.
Here's the part people mess up: a p-value of 0.03 does NOT mean "97% chance the result is real." It means "if there's no real effect, you'd see a result this extreme 3% of the time by chance."
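To make that concrete, here's one way to estimate a p-value for the A/B numbers above by simulating the null hypothesis. The sample size is an assumption for illustration; the article doesn't specify one, and the answer depends heavily on it.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 10_000                        # ASSUMED visitors per version, not given in the text
observed_diff = 0.042 - 0.038     # B converted at 4.2%, A at 3.8%
pooled_rate = (0.042 + 0.038) / 2

# Null hypothesis: both versions share the same true conversion rate.
# Simulate 100,000 pairs of experiments where that's true.
sim_diffs = (rng.binomial(n, pooled_rate, size=100_000) -
             rng.binomial(n, pooled_rate, size=100_000)) / n

# Two-sided p-value: how often pure chance produces a gap at least this big.
p_value = np.mean(np.abs(sim_diffs) >= observed_diff)
print(f"p ≈ {p_value:.2f}")
```

With these assumed sample sizes, a 0.4-point gap shows up by chance more often than not, nowhere near significance; you'd need much more traffic before a difference this small means anything.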
Question: If there's NO real effect, how often will you still get "significant" results (p < 0.05)?
Try this: Set true effect to 0 and run 20 tests. Even with NO real difference, how many show p < 0.05?
Set the true effect to zero. Run 20 tests. Even with no real difference, you'll find "significant" results about 5% of the time.
That's the false positive rate, and it's exactly what the p < 0.05 threshold buys you: being fooled about 5% of the time when nothing is going on.
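Here's that experiment as a minimal sketch, using scipy's two-sample t-test as the stand-in test (any well-calibrated test behaves the same way):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

trials = 1000
false_positives = 0
for _ in range(trials):
    # Both groups come from the SAME distribution: there is no real effect.
    a = rng.normal(loc=0, scale=1, size=200)
    b = rng.normal(loc=0, scale=1, size=200)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(f"'significant' results with no real effect: {false_positives / trials:.1%}")   # ≈ 5%
```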
Part 8: How to Lie with Statistics
Imagine you're a researcher. You've tested one hypothesis and it's not significant. Your boss is unhappy. So you try analyzing by gender. Not significant. By age group. Not significant. By income bracket. Still nothing.
Then you try by education level. p = 0.04. Significant!
You publish: "Education Level Predicts Outcome (p < 0.05)."
This is p-hacking, and it's disturbingly common.
The problem: with a 5% false positive rate, testing 20 independent things gives you about a 64% chance of finding at least one "significant" result by pure chance (1 - 0.95^20 ≈ 0.64).
Question: If someone found ONE significant variable out of many tested, should you believe it?
Try this: Test just 5 variables first. How often do you find "significance"? Now test all 20.
There's NO real effect in this simulation. Every variable is pure noise. Click through and watch "significant" results appear anyway.
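A sketch of the same trap: the outcome and all 20 candidate predictors below are pure noise, yet roughly two runs in three will surface at least one "significant" correlation. (Scipy is assumed; correlation tests stand in for whatever analysis the widget runs.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng()

outcome = rng.normal(size=100)             # the thing we're "explaining"
predictors = rng.normal(size=(20, 100))    # 20 candidate variables, all pure noise

for i, variable in enumerate(predictors, start=1):
    r, p = stats.pearsonr(variable, outcome)
    if p < 0.05:
        print(f"variable {i}: r = {r:.2f}, p = {p:.3f}   <- looks significant, isn't real")

# On average you expect about one spurious hit per run (20 tests x 5%).
```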
A 2011 paper titled "False-Positive Psychology" showed that by selectively analyzing data, researchers could demonstrate that listening to "When I'm Sixty-Four" by the Beatles literally made people younger (p < 0.05).
Your Statistical BS Detector
Here's your checklist for the next time someone presents data:
1. How big was the sample? Small samples are noisy. Is it big enough to be reliable?
2. Mean or median? For skewed data, median is usually more honest. Someone quoting mean income might be hiding something.
3. What's the spread? An average without a standard deviation is only half the picture.
4. What's the shape? Symmetric, skewed, bimodal? The shape determines which statistics make sense.
5. What does "significant" mean here? A p-value of 0.04 means a 4% chance of seeing a result at least this extreme if nothing real is happening. Not a 96% chance that the result is real.
6. How many things did they test? If they tested 20 variables and one was significant, that's expected noise. Ask about the analysis plan.
7. Would this change my decision? A 0.1% improvement that's p < 0.01 is still a 0.1% improvement. Is that worth caring about?
The Takeaway
Statistics is fundamentally about uncertainty. We rarely know the truth. We have clues: samples, measurements, experiments. And we're trying to make good decisions despite incomplete information.
When someone says "the data shows X," you now know to ask: What data? How was it collected? What else did you test? What's the uncertainty?
Those questions don't require you to run analyses yourself. They just require knowing that uncertainty exists, and pushing back when someone pretends it doesn't.
Statistics isn't about having the right answer. It's about knowing how wrong you might be.