Part 3: Shape & The Normal Distribution
In Part 2, we derived why mean and median differ when data has outliers, and proved the standard deviation formula from first principles.
But we glossed over something important: shape. The shape of your data determines which statistics are meaningful. In this part, we'll derive why shape matters mathematically, and where the famous normal distribution comes from.
The mathematics of skewness
When someone says data is "right-skewed" or "left-skewed," what exactly do they mean mathematically? Let's derive the relationship between shape and statistics.
The mean as a balance point
Recall from Part 2 that the mean is defined as:
μ = (1/n) × Σxᵢ
This can be reinterpreted as a balance point. Imagine your data as weights on a seesaw. The mean is the position where the seesaw balances.
Key Insight: The mean minimizes the sum of squared distances:
μ = argmin over c of Σ(xᵢ - c)²
Taking the derivative with respect to c and setting it to zero gives -2 Σ(xᵢ - c) = 0, which solves to c = (1/n) Σxᵢ = μ.
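As a quick numerical sanity check, here's a minimal NumPy sketch (the sample values are made up for illustration) that scans candidate centers c and confirms the sum of squared distances is smallest at the mean:

```python
import numpy as np

# Hypothetical sample, purely for illustration
x = np.array([2.0, 3.0, 5.0, 8.0, 13.0])

# Evaluate the sum of squared distances for a grid of candidate centers c
candidates = np.linspace(x.min(), x.max(), 1001)
sse = [np.sum((x - c) ** 2) for c in candidates]

best_c = candidates[np.argmin(sse)]
print(f"mean   = {x.mean():.3f}")
print(f"best c = {best_c:.3f}")  # matches the mean up to grid resolution
```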
Why skewness causes mean ≠ median
In a symmetric distribution, the mean and median coincide. But in skewed distributions, they diverge. Here's the mathematical reason.
For right-skewed data:
The distribution has a "floor" (often zero) that limits how far left values can go, but no ceiling on the right. This asymmetry means:
Σ(xᵢ - M) summed over xᵢ > M   >>   |Σ(xᵢ - M)| summed over xᵢ < M
where M is the median. Since the mean must balance these forces, it gets pulled toward the longer tail.
Experiment 1: Start with the Symmetric distribution. Notice that the mean and median nearly coincide - both are valid measures of center.
Experiment 2: Switch to Right-Skewed. Watch the mean (red line) jump to the right of the median (green line). Click "Show Mathematical Proof" to see why.
Experiment 3: Try Left-Skewed and observe the opposite effect: mean < median.
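If you want to reproduce these experiments outside the interactive demo, here's a minimal NumPy sketch; a normal sample stands in for the symmetric case and a lognormal sample for the right-skewed one (both choices are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

symmetric = rng.normal(loc=50, scale=10, size=100_000)            # symmetric
right_skewed = rng.lognormal(mean=3.0, sigma=0.8, size=100_000)   # long right tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(f"{name:>13}: mean = {np.mean(data):8.2f}, median = {np.median(data):8.2f}")

# Expected pattern: mean ≈ median for the symmetric sample,
# mean > median for the right-skewed sample (the mean is pulled toward the tail).
```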
The key: Skewness quantifies this asymmetry mathematically. For single-peaked distributions like these, skewness > 0 goes hand in hand with mean > median.
The skewness formula
We can quantify asymmetry with the skewness coefficient:
Definition: Skewness
γ = E[(X - μ)³] / σ³
Where:
- The numerator captures asymmetry (cubing preserves sign)
- The denominator normalizes by spread
Interpretation:
- γ > 0: Right-skewed (long right tail)
- γ < 0: Left-skewed (long left tail)
- γ ≈ 0: Symmetric
Why cube? Squaring (as in variance) makes all deviations positive, hiding asymmetry. Cubing preserves sign: negative deviations stay negative, positive stay positive. If the right tail extends further, positive cubed deviations dominate, giving positive skewness.
The skewness formula explains everything. For symmetric data, E[(X - μ)³] = 0 because positive and negative cubed deviations cancel. For right-skewed data, large positive values cubed dominate, giving γ > 0. For left-skewed data, large negative values cubed dominate, giving γ < 0.
This is why mean > median for right-skewed data: the same large values that create positive skewness also pull the mean rightward.
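Here's a small sketch of the sample version of this formula - the average of the cubed deviations divided by σ³ - evaluated on a symmetric and a right-skewed sample (scipy.stats.skew computes the same quantity if you prefer a library call):

```python
import numpy as np

def sample_skewness(x):
    """Sample skewness: mean of cubed deviations divided by sigma cubed."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()  # population-style sigma, matching the formula above
    return np.mean((x - mu) ** 3) / sigma ** 3

rng = np.random.default_rng(1)
symmetric = rng.normal(size=50_000)
right_skewed = rng.lognormal(sigma=0.8, size=50_000)

print(f"symmetric    skewness ≈ {sample_skewness(symmetric):+.3f}")    # ≈ 0
print(f"right-skewed skewness ≈ {sample_skewness(right_skewed):+.3f}") # > 0
```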
The bimodal trap
Before we move to the normal distribution, there's a warning about bimodal data.
When neither mean nor median works
With bimodal data (two distinct peaks), something strange happens: both mean and median fall in the valley between peaks - exactly where no data actually exists.
Experiment 1: Increase the peak separation slider. Watch as mean and median both stay in the middle - the empty valley.
Experiment 2: Click "Show The Solution" to see why splitting into subgroups is necessary.
The insight: For bimodal data, a single summary statistic is meaningless. You must split the data into groups first.
Real-world bimodal examples include customer behavior (casual vs. power users), test scores (prepared vs. unprepared students), and wait times (with and without system issues). In each case, reporting a single "average" hides the true story. Always visualize your data before computing statistics.
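A quick simulation makes the trap visible; here the two peaks are placed at 20 and 80 (arbitrary values chosen for illustration), and both summary statistics land in the valley between them:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated peaks, e.g. casual users vs. power users
casual = rng.normal(loc=20, scale=5, size=5_000)
power = rng.normal(loc=80, scale=5, size=5_000)
bimodal = np.concatenate([casual, power])

print(f"mean   = {np.mean(bimodal):.1f}")    # lands in the empty valley
print(f"median = {np.median(bimodal):.1f}")  # also lands between the peaks

# How much data actually sits near the "center" the statistics report?
near_center = np.mean((bimodal > 40) & (bimodal < 60))
print(f"fraction of data within 40-60: {near_center:.1%}")  # close to 0%
```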
Deriving the normal distribution
The normal distribution (Gaussian distribution, bell curve) appears everywhere in nature. But where does its formula come from? Let's derive it from first principles.
What we want
We need a probability density function f(x) that:
- has a peak at the center μ (the most likely value is the mean)
- decreases with distance from μ
- is symmetric (positive and negative deviations are equally likely)
- integrates to 1 (total probability must be 100%)
Building the formula step by step
Step through the derivation: Use Previous/Next to walk through each logical step. Watch how we build up from requirements to the final formula.
Adjust the parameters: Move μ and σ sliders to see how they affect the curve:
- μ shifts the entire curve left/right
- σ makes the curve wider/narrower
The Normal PDF (Probability Density Function)
f(x) = (1 / (σ√(2π))) × exp(-(x - μ)² / (2σ²))
Breaking it down:
- exp(-(x - μ)²/(2σ²)): The exponential decay away from the center
- 1/(σ√(2π)): The normalizing constant (makes the integral = 1)
- μ: The center (mean)
- σ: The spread (standard deviation)
The entire distribution is completely determined by just μ and σ.
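Here's a minimal implementation of the PDF taken straight from the formula, plus a numerical check that it integrates to 1 (scipy.integrate.quad is used only for the check):

```python
import numpy as np
from scipy.integrate import quad

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal PDF: (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Total probability should be 1 regardless of mu and sigma
total, _ = quad(normal_pdf, -np.inf, np.inf, args=(10.0, 2.5))
print(f"integral of f(x) over all x: {total:.6f}")  # ≈ 1.000000
```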
Why this shape appears everywhere
The normal distribution emerges naturally in several situations. First, when many small independent effects combine - your height is affected by thousands of genes, nutrition factors, environmental effects, each contributing a tiny random amount, and their sum converges to normal (the Central Limit Theorem, which we'll cover in Part 4). Second, given only that you know the mean and variance, the normal distribution is the "most uncertain" distribution - it assumes nothing extra (this is called maximum entropy). Third, if X and Y are both normal, then X + Y is also normal - no other continuous distribution has this stability property.
The normal distribution isn't arbitrary - it's inevitable. Any time you have many small independent random effects adding together, the result will be approximately normal. This is why it appears in heights (many genetic factors), measurement errors (many small perturbations), test scores (many independent question difficulties), and manufacturing tolerances (many production variations).
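A tiny simulation illustrates the first mechanism; each "observation" below is the sum of 500 small independent effects (uniform increments, used purely as a stand-in), and the resulting sums look normal even though the individual effects are nothing like a bell curve:

```python
import numpy as np

rng = np.random.default_rng(3)

# 10,000 observations, each the sum of 500 small independent effects
effects = rng.uniform(-0.5, 0.5, size=(10_000, 500))
sums = effects.sum(axis=1)

# A crude check of the resulting shape: compare to the 68-95-99.7 pattern
mu, sigma = sums.mean(), sums.std()
for k in (1, 2, 3):
    frac = np.mean(np.abs(sums - mu) < k * sigma)
    print(f"within ±{k}σ: {frac:.1%}")  # ≈ 68%, 95%, 99.7%
```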
The 68-95-99.7 rule derived
You've probably memorized that 68% of data falls within ±1σ, 95% within ±2σ, and 99.7% within ±3σ. But where do these numbers come from? They're not arbitrary - they're determined by the integral of the normal PDF.
The calculation
To find the probability within ±kσ, we integrate:
P(μ - kσ < X < μ + kσ) = ∫ f(x) dx
where the integral runs from μ - kσ to μ + kσ.
Click each sigma range: Select ±1σ, ±2σ, or ±3σ to see the shaded area change. The percentage shown is the result of integrating the normal PDF over that range.
Read the derivation: The step-by-step integration shows how we go from the integral to the final percentage.
Note the 95%: The ±2σ range covers 95.45% of the distribution. This is why "p < 0.05" is the standard for statistical significance - only ~5% of values are more than 2σ from the mean by chance.
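The integral can also be evaluated numerically; a minimal sketch with scipy.integrate.quad on the standard normal PDF reproduces the percentages directly:

```python
from scipy.integrate import quad
from scipy.stats import norm

# Probability within ±k sigma of the mean, by direct integration of the PDF
for k in (1, 2, 3):
    prob, _ = quad(norm.pdf, -k, k)  # standard normal: mu = 0, sigma = 1
    print(f"P(|X - μ| < {k}σ) = {prob:.4f}")
# 0.6827, 0.9545, 0.9973
```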
The error function connection
The integral of the normal PDF doesn't have a closed form in terms of elementary functions. Instead, we define the error function (erf):
erf(z) = (2/√π) × ∫₀ᶻ exp(-t²) dt
The probability within ±kσ is then:
P = erf(k/√2)
Results:
| k | erf(k/√2) | Percentage |
|---|---|---|
| 1 | 0.6827 | 68.27% |
| 2 | 0.9545 | 95.45% |
| 3 | 0.9973 | 99.73% |
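The table can be reproduced with nothing more than the standard library's math.erf:

```python
import math

for k in (1, 2, 3):
    p = math.erf(k / math.sqrt(2))
    print(f"k = {k}: erf(k/√2) = {p:.4f}  →  {p:.2%}")
```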
The 68-95-99.7 rule emerges from integration, not convention. When we integrate the normal PDF from μ-2σ to μ+2σ, we get approximately 95.45%. This mathematical fact - not an arbitrary choice - is why statisticians use p < 0.05 as the significance threshold. Values more than 2σ away happen less than 5% of the time by chance alone.
When to assume normality
The normal distribution is powerful, but assuming it when it doesn't apply leads to wrong conclusions.
When normality is reasonable
Normality is reasonable for:
- Measurements with random error: scientific instruments have many small random perturbations that add together
- Aggregates of independent effects: heights, blood pressure, IQ scores
- Data explicitly designed to be normal: standardized tests like the SAT and GRE are normed to produce bell curves
When normality is not reasonable
Never assume normality for:
- Income and prices: right-skewed with a floor at zero (a house costs $300K, not -$50K)
- Time-based metrics like response times and load times: almost always right-skewed (a page usually loads in 100ms, but occasionally takes 10,000ms)
- Counts like website visits and purchases: floor at zero, potentially unbounded above
- Anything with a hard floor or ceiling: data piles up against the limit, creating asymmetry
Quick normality checks
Before assuming normality:
| Check | What to Look For |
|---|---|
| Histogram | Roughly bell-shaped? Single peak? |
| Mean vs Median | Differ by > 10-20%? Probably skewed |
| Floor/Ceiling | Hard limits cause asymmetry |
| Data Type | Money, time, counts → almost never normal |
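These checks are easy to script; here's a minimal sketch of the mean-vs-median check, run on a synthetic right-skewed sample (lognormal "response times", made up for illustration). The 10% threshold mirrors the table and is a rule of thumb, not a hard cutoff:

```python
import numpy as np

def quick_normality_report(data):
    """Rough, rule-of-thumb check - not a substitute for plotting the data."""
    data = np.asarray(data, dtype=float)
    mean, median = data.mean(), np.median(data)

    # How far apart are mean and median, relative to the median?
    rel_gap = abs(mean - median) / abs(median) if median != 0 else float("inf")

    print(f"mean = {mean:.2f}, median = {median:.2f}, relative gap = {rel_gap:.1%}")
    if rel_gap > 0.10:
        print("Mean and median differ by >10% - the data is probably skewed.")
    else:
        print("Mean and median are close - symmetry is plausible (still plot it!).")

# Example: response times are a classic right-skewed case
rng = np.random.default_rng(4)
response_times_ms = rng.lognormal(mean=5.0, sigma=0.7, size=10_000)
quick_normality_report(response_times_ms)
```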
What we derived
From first principles, we established the skewness formula (γ = E[(X - μ)³] / σ³) to quantify asymmetry, and showed why skewness > 0 implies mean > median. We identified the bimodal trap - neither mean nor median works when data has two peaks. We derived the normal PDF: f(x) = (1/(σ√(2π))) × exp(-(x-μ)²/(2σ²)). And we showed that the 68-95-99.7 rule emerges from integrating this PDF.
The normal distribution isn't magical - it's the natural result of many independent random effects combining. And the 95% threshold for significance isn't arbitrary - it's approximately the probability of falling within ±2σ of the mean.
In Part 4, we'll prove the Central Limit Theorem - the reason normal distributions appear even when individual data points are far from normal.