
Part 3: Shape & The Normal Distribution

In Part 2, we derived why mean and median differ when data has outliers, and proved the standard deviation formula from first principles.

But we glossed over something important: shape. The shape of your data determines which statistics are meaningful. In this part, we'll derive why shape matters mathematically, and where the famous normal distribution comes from.


The mathematics of skewness

When someone says data is "right-skewed" or "left-skewed," what exactly do they mean mathematically? Let's derive the relationship between shape and statistics.

The mean as a balance point

Recall from Part 2 that the mean is defined as:

μ = (1/n) × Σxᵢ

This can be reinterpreted as a balance point. Imagine your data as weights on a seesaw. The mean is the position where the seesaw balances.


Key Insight: The mean minimizes the sum of squared distances:

μ = argmin Σ(xᵢ - c)²

Taking the derivative with respect to c gives d/dc Σ(xᵢ - c)² = -2 Σ(xᵢ - c). Setting this to zero gives Σxᵢ = n·c, so c = μ: the mean is exactly the point that minimizes total squared distance.
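A quick numerical sanity check (a minimal sketch with made-up data): scan candidate centers c and confirm that the sum of squared distances bottoms out at the mean.

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 9.0, 21.0])  # made-up data with an outlier

def sum_sq_dist(c):
    """Sum of squared distances from a candidate center c."""
    return np.sum((x - c) ** 2)

# Scan candidate centers and pick the one minimizing the sum of squares
candidates = np.linspace(x.min(), x.max(), 10_001)
best = candidates[np.argmin([sum_sq_dist(c) for c in candidates])]

print(f"mean     = {x.mean():.3f}")  # 8.000
print(f"argmin c = {best:.3f}")      # ~8.000, matching the mean
```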

Why skewness causes mean ≠ median

In a symmetric distribution, the mean and median coincide. But in skewed distributions, they diverge. Here's the mathematical reason.

For right-skewed data:

The distribution has a "floor" (often zero) that limits how far left values can go, but no ceiling on the right. This asymmetry means:

Σ(xᵢ - M) for xᵢ > M  >>  |Σ(xᵢ - M)| for xᵢ < M

where M is the median. Since the mean must balance these forces, it gets pulled toward the longer tail.
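A tiny illustration (a sketch with made-up numbers): four modest incomes and one large one. The single large value drags the mean well to the right of the median.

```python
import numpy as np

incomes = np.array([30, 35, 40, 45, 200])  # hypothetical incomes in $K: one long right-tail value

print(f"median = {np.median(incomes):.1f}")  # 40.0 - the middle value, unmoved by the outlier
print(f"mean   = {np.mean(incomes):.1f}")    # 70.0 - pulled toward the long right tail
```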

Why Skewness Causes Mean ≠ Median

Interactive demo - example readout: mean 50.5, median 49.5, difference 1.0, skewness 0.19.
Try It Out

Experiment 1: Start with Symmetric distribution. Notice mean and median nearly coincide - both are valid measures of center.

Experiment 2: Switch to Right-Skewed. Watch the mean (red line) jump to the right of the median (green line). Click "Show Mathematical Proof" to see why.

Experiment 3: Try Left-Skewed and observe the opposite effect: mean < median.

The key: Skewness quantifies this asymmetry mathematically. Skewness > 0 generally goes with mean > median.

The skewness formula

We can quantify asymmetry with the skewness coefficient:


Definition: Skewness

γ = E[(X - μ)³] / σ³

Where:

  • The numerator captures asymmetry (cubing preserves sign)
  • The denominator normalizes by spread

Interpretation:

  • γ > 0: Right-skewed (long right tail)
  • γ < 0: Left-skewed (long left tail)
  • γ ≈ 0: Symmetric

Why cube? Squaring (as in variance) makes all deviations positive, hiding asymmetry. Cubing preserves sign: negative deviations stay negative, positive stay positive. If the right tail extends further, positive cubed deviations dominate, giving positive skewness.

Key Insight

The skewness formula explains everything. For symmetric data, E[(X - μ)³] = 0 because positive and negative cubed deviations cancel. For right-skewed data, large positive values cubed dominate, giving γ > 0. For left-skewed data, large negative values cubed dominate, giving γ < 0.

This is why mean > median for right-skewed data: the same large values that create positive skewness also pull the mean rightward.
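As a concrete check, here's a minimal sketch that computes γ directly from the definition on a right-skewed exponential sample (the scale of 10 is arbitrary), then compares against scipy.stats.skew, which performs the same calculation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=100_000)  # right-skewed: floor at 0, long right tail

mu = x.mean()
sigma = x.std()
gamma = np.mean((x - mu) ** 3) / sigma ** 3  # E[(X - mu)^3] / sigma^3

print(f"mean     = {mu:.2f}")             # larger than the median: pulled toward the right tail
print(f"median   = {np.median(x):.2f}")
print(f"skewness = {gamma:.2f}")          # close to 2, the exponential's true skewness
print(f"scipy    = {stats.skew(x):.2f}")  # same calculation via scipy
```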


The bimodal trap

Before we move to the normal distribution, there's a warning about bimodal data.

When neither mean nor median works

With bimodal data (two distinct peaks), something strange happens: both mean and median fall in the valley between peaks - exactly where no data actually exists.

The Bimodal Trap: When Neither Mean Nor Median Works

Interactive demo - example readout: mean 53.3, median 60.2.

The Problem

Both mean (53.3) and median (60.2) fall in the valley between the two peaks - a place where almost no data actually exists! Neither represents a "typical" value.

Try It Out

Experiment 1: Increase the peak separation slider. Watch as mean and median both stay in the middle - the empty valley.

Experiment 2: Click "Show The Solution" to see why splitting into subgroups is necessary.

The insight: For bimodal data, a single summary statistic is meaningless. You must split the data into groups first.

Warning

Real-world bimodal examples include customer behavior (casual vs. power users), test scores (prepared vs. unprepared students), and wait times (with and without system issues). In each case, reporting a single "average" hides the true story. Always visualize your data before computing statistics.
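Here's a minimal sketch of the trap using a made-up mixture of two groups (the casual vs. power users example above, with hypothetical session counts): both summary statistics land in the empty valley, while per-group summaries are meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)
casual = rng.normal(loc=20, scale=5, size=500)  # hypothetical casual users (~20 sessions)
power  = rng.normal(loc=80, scale=5, size=500)  # hypothetical power users (~80 sessions)
combined = np.concatenate([casual, power])

# Both summary statistics land in the empty valley between the two peaks
print(f"overall mean   = {combined.mean():.1f}")      # ~50
print(f"overall median = {np.median(combined):.1f}")  # ~50

# The fix: split into subgroups and summarize each separately
print(f"casual mean = {casual.mean():.1f}, power mean = {power.mean():.1f}")  # ~20 and ~80
```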


Deriving the normal distribution

The normal distribution (Gaussian distribution, bell curve) appears everywhere in nature. But where does its formula come from? Let's derive it from first principles.

What we want

We need a probability density function f(x) that has a peak at the center (μ) because the most likely value is the mean, decreases with distance from μ, is symmetric (positive and negative deviations are equally likely), and integrates to 1 (total probability must be 100%).

Building the formula step by step

The Normal Distribution: Derived from First Principles

Interactive walkthrough - six steps, from the requirements below to the final formula.

The Problem: Modeling Randomness

Many natural phenomena cluster around a central value with decreasing frequency as we move away. How do we describe this mathematically?

We need f(x) such that f(μ) is the maximum and f(x) → 0 as |x - μ| → ∞.
Try It Out

Step through the derivation: Use Previous/Next to walk through each logical step. Watch how we build up from requirements to the final formula.

Adjust the parameters: Move μ and σ sliders to see how they affect the curve:

  • μ shifts the entire curve left/right
  • σ makes the curve wider/narrower

The Normal PDF (Probability Density Function)

f(x) = (1 / (σ√(2π))) × exp(-(x - μ)² / (2σ²))

Breaking it down:

  • exp(-(x - μ)²/(2σ²)): The exponential decay from center
  • (1/(σ√(2π))): The normalizing constant (makes integral = 1)
  • μ: The center (mean)
  • σ: The spread (standard deviation)

The entire distribution is completely determined by just μ and σ.
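As a sanity check, here's a small sketch that codes up the PDF exactly as written above and confirms numerically that the total area under the curve is 1 (the values μ = 50 and σ = 10 are arbitrary).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2))"""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 50.0, 10.0
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 100_001)
dx = x[1] - x[0]

print(f"peak value f(mu)    = {normal_pdf(mu, mu, sigma):.4f}")              # maximum of the curve
print(f"total area (approx) = {np.sum(normal_pdf(x, mu, sigma)) * dx:.4f}")  # ~1.0000
```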

Why this shape appears everywhere

The normal distribution emerges naturally in several situations:

  • Many small independent effects combine. Your height is shaped by thousands of genes, nutrition factors, and environmental effects, each contributing a tiny random amount; their sum converges to normal (the Central Limit Theorem, which we'll cover in Part 4).
  • Maximum entropy. If all you know is the mean and variance, the normal distribution is the "most uncertain" distribution consistent with that knowledge - it assumes nothing extra.
  • Stability. If X and Y are independent and normally distributed, X + Y is normal too, just shifted and rescaled - the normal is the only finite-variance distribution that keeps its shape under addition this way.

Key Insight

The normal distribution isn't arbitrary - it's inevitable. Any time you have many small independent random effects adding together, the result will be approximately normal. This is why it appears in heights (many genetic factors), measurement errors (many small perturbations), test scores (many independent question difficulties), and manufacturing tolerances (many production variations).
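A quick simulation sketch of that first mechanism: each simulated observation is the sum of 200 small independent uniform effects. No individual effect is normally distributed, but the sums behave like a bell curve.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Each simulated "individual" is the sum of 200 small independent effects,
# none of which is normally distributed on its own (they're uniform).
effects = rng.uniform(low=-0.5, high=0.5, size=(20_000, 200))
totals = effects.sum(axis=1)

print(f"skewness of the sums         = {stats.skew(totals):.3f}")  # ~0: symmetric
within_1sd = np.mean(np.abs(totals - totals.mean()) < totals.std())
print(f"fraction within 1 sd of mean = {within_1sd:.3f}")          # ~0.68, as for a normal
```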


The 68-95-99.7 rule derived

You've probably memorized that 68% of data falls within ±1σ, 95% within ±2σ, and 99.7% within ±3σ. But where do these numbers come from? They're not arbitrary - they're determined by the integral of the normal PDF.

The calculation

To find the probability within ±kσ, we integrate:

P(μ - kσ < X < μ + kσ) = ∫ f(x) dx

where the integral runs from μ - kσ to μ + kσ.

The 68-95-99.7 Rule: Why These Numbers?

Interactive demo - the shaded region from μ - 1σ to μ + 1σ covers 68.27% of the area.

Deriving the 68.27% from Integration

Step 1: Set up the integral
P(μ - 1σ < X < μ + 1σ) = ∫ f(x) dx
from x = μ - 1σ to x = μ + 1σ
Step 2: Substitute the normal PDF
= ∫ (1/(σ√(2π))) × exp(-(x-μ)²/(2σ²)) dx
Step 3: Change variables (z = (x-μ)/σ)
= ∫ (1/√(2π)) × exp(-z²/2) dz
from z = -1 to z = 1
Step 4: Use the error function
= erf(1/√2) ≈ 0.6827
= 68.27%
Range    Percentage Inside    Percentage Outside    Interpretation
±1σ      68.27%               31.73%                Normal variation
±2σ      95.45%               4.55%                 Unusual (p < 0.05)
±3σ      99.73%               0.27%                 Very rare
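These percentages fall straight out of numerically integrating the PDF over each range; here's a minimal sketch using scipy.integrate.quad on the standard normal (μ = 0, σ = 1).

```python
import numpy as np
from scipy import integrate

def normal_pdf(x, mu=0.0, sigma=1.0):
    """The normal PDF; defaults give the standard normal."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

for k in (1, 2, 3):
    # P(mu - k*sigma < X < mu + k*sigma) for the standard normal
    prob, _ = integrate.quad(normal_pdf, -k, k)
    print(f"within ±{k}σ: {prob * 100:.2f}%")  # 68.27%, 95.45%, 99.73%
```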

Why These Specific Numbers?

The 68-95-99.7 values aren't chosen arbitrarily - they emerge from integrating the normal distribution. The ±2σ threshold (95.45%) is why we use p < 0.05 for statistical significance: if only ~5% of values are that extreme by chance, observing one suggests something other than chance.

Try It Out

Click each sigma range: Select ±1σ, ±2σ, or ±3σ to see the shaded area change. The percentage shown is the result of integrating the normal PDF over that range.

Read the derivation: The step-by-step integration shows how we go from the integral to the final percentage.

Note the 95%: The ±2σ range covers 95.45% of the distribution. This is why "p < 0.05" is the standard for statistical significance - only ~5% of values are more than 2σ from the mean by chance.

The error function connection

The integral of the normal PDF doesn't have a closed form in terms of elementary functions. Instead, we define the error function (erf):

erf(z) = (2/√π) × ∫₀ᶻ exp(-t²) dt

The probability within ±kσ is then:

P = erf(k/√2)

Results:

k    erf(k/√2)    Percentage
1    0.6827       68.27%
2    0.9545       95.45%
3    0.9973       99.73%
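The same table can be reproduced with Python's built-in math.erf; a quick sketch:

```python
import math

for k in (1, 2, 3):
    p = math.erf(k / math.sqrt(2))  # probability of falling within ±k sigma
    print(f"k = {k}: erf(k/√2) = {p:.4f} -> {p * 100:.2f}%")
```
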
Key Insight

The 68-95-99.7 rule emerges from integration, not convention. When we integrate the normal PDF from μ-2σ to μ+2σ, we get exactly 95.45%. This mathematical fact - not arbitrary choice - is why statisticians use p < 0.05 as the significance threshold. Values more than 2σ away happen less than 5% of the time by chance alone.


When to assume normality

The normal distribution is powerful, but assuming it when it doesn't apply leads to wrong conclusions.

When normality is reasonable

Normality is reasonable for measurements with random error, where scientific instruments pick up many small random perturbations that add together. It also applies to aggregates of independent effects like heights, blood pressure, and IQ scores. And some data is explicitly designed to be normal - standardized tests like the SAT and GRE are normed to produce bell curves.

When normality is not reasonable

Warning

Never assume normality for:

  • Income and prices: right-skewed with a floor at zero - a house costs $300K, not -$50K.
  • Time-based metrics like response times and load times: always right-skewed - a page usually loads in 100ms, but sometimes 10,000ms.
  • Counts like website visits and purchases: floor at zero, potentially unbounded above.
  • Anything with a hard floor or ceiling: data piles up against the limit, creating asymmetry.

Quick normality checks

Before assuming normality:

Check            What to Look For
Histogram        Roughly bell-shaped? Single peak?
Mean vs Median   Differ by > 10-20%? Probably skewed
Floor/Ceiling    Hard limits cause asymmetry
Data Type        Money, time, counts → almost never normal
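Here's a sketch of these checks in code (the thresholds and the helper name are just illustrative, not a standard API): compare mean vs. median, compute skewness, and eyeball the histogram counts before trusting a normal model.

```python
import numpy as np
from scipy import stats

def quick_normality_report(x):
    """Rough, rule-of-thumb checks before assuming a sample is normal."""
    x = np.asarray(x, dtype=float)
    mean, median = x.mean(), np.median(x)
    gap = abs(mean - median) / abs(median)
    print(f"mean = {mean:.2f}, median = {median:.2f} (relative gap {gap:.1%})")
    print(f"skewness = {stats.skew(x):.2f} (roughly symmetric if close to 0)")
    counts, _ = np.histogram(x, bins=10)
    print(f"histogram bin counts: {counts.tolist()}\n")  # eyeball: single peak? bell-ish?

rng = np.random.default_rng(3)
quick_normality_report(rng.normal(100, 15, size=5_000))    # plausibly normal
quick_normality_report(rng.lognormal(3, 0.8, size=5_000))  # right-skewed: large gap, skew > 0
```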

What we derived

From first principles, we established the skewness formula (γ = E[(X - μ)³] / σ³) to quantify asymmetry, and showed why positive skewness generally goes with mean > median. We identified the bimodal trap - neither mean nor median works when data has two peaks. We derived the normal PDF: f(x) = (1/(σ√(2π))) × exp(-(x-μ)²/(2σ²)). And we showed that the 68-95-99.7 rule emerges from integrating this PDF.

The normal distribution isn't magical - it's the natural result of many independent random effects combining. And the 95% threshold for significance isn't arbitrary - it's the probability of being within ±2σ.

In Part 4, we'll prove the Central Limit Theorem - the reason normal distributions appear even when individual data points are far from normal.