
Part 2: Central Tendency & Spread

In Part 1, we learned that larger samples give more reliable estimates. But once you have your sample, what do you do with it? Usually, you summarize it with a single number - an "average."

But which average? And why does this choice matter mathematically?


Two definitions of "center"

When someone says "average," they usually mean one of two things.

The mean (μ) is the arithmetic average:

μ = (x₁ + x₂ + ... + xₙ) / n = (1/n) × Σxᵢ

The median is the middle value when sorted:

Median = x₍ₖ₎ with k = (n+1)/2, for odd n
       = (x₍ₖ₎ + x₍ₖ₊₁₎)/2 with k = n/2, for even n

Here x₍ₖ₎ denotes the k-th smallest value.

Both measure "center," but they define it differently. The mean is a balance point - the value where the data "balances" like a seesaw. The median is a position - literally the middle element.
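
Both definitions are short enough to write out by hand. The sketch below is a minimal illustration, not a reference implementation; the helper names `mean` and `median` are chosen for this example, and Python's standard `statistics` module offers equivalents.

```python
def mean(xs):
    """Arithmetic average: sum of the values divided by their count."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the sorted data; average of the two middle values for even n."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                   # odd n: the single middle element
    return (s[mid - 1] + s[mid]) / 2    # even n: average the two middle elements

data = [18, 10, 14, 12, 16]             # unsorted on purpose
print(mean(data), median(data))         # 14.0 14
```

Note that the mean never needs the data sorted, while the median depends only on order, not on how far away the extreme values are.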

The Problem

Key Question: If both measure the center, when does it matter which one you use?

Answer: When your data has outliers. Let's prove why mathematically.


Why outliers break the mean

The mathematical vulnerability

Consider a dataset [10, 12, 14, 16, 18] with mean = 14 and median = 14.

Now add one outlier: 100.

New mean:

μ' = (10 + 12 + 14 + 16 + 18 + 100) / 6
   = 170 / 6
   = 28.33

New median:

Sorted: [10, 12, 14, 16, 18, 100]
Middle values: 14 and 16
Median = (14 + 16) / 2 = 15

The mean jumped from 14 to 28.33 (+102%), while the median only moved from 14 to 15 (+7%).
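
The same arithmetic can be reproduced in a few lines of Python. This is a small sketch using the standard `statistics` module; `pct_change` is a helper introduced here for illustration.

```python
from statistics import mean, median

clean = [10, 12, 14, 16, 18]
with_outlier = clean + [100]

def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return 100 * (after - before) / before

mean_before, mean_after = mean(clean), mean(with_outlier)
median_before, median_after = median(clean), median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.2f} "
      f"({pct_change(mean_before, mean_after):+.0f}%)")
print(f"median: {median_before} -> {median_after} "
      f"({pct_change(median_before, median_after):+.0f}%)")
```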

[Interactive demo: Mean Sensitivity. The original dataset [10, 12, 14, 16, 18] has mean 14.0; adding the outlier 100 moves the mean to 28.3 (+14.3). Larger outlier values such as 150 or 200 move it further still.]

The key insight: The mean's formula includes every value directly, so extreme values have outsized influence.

The breakdown point

How do we quantify this vulnerability? We use the breakdown point - the minimum fraction of contaminated data points needed to make an estimator arbitrarily wrong.


Definition: Breakdown Point

The breakdown point of an estimator is the smallest proportion of observations that must be replaced by arbitrary values to make the estimator give arbitrarily large or small values.

Mean's breakdown point: 0%

Proof: Take any dataset. Replace just ONE value with M. As M → ∞:

Mean' = (Original Sum - replaced value + M) / n
      → ∞ as M → ∞

Even 1/n contamination can make the mean arbitrarily large. As n → ∞, this proportion → 0.

Median's breakdown point: 50%

Proof: The median is determined by the middle position, so it can only be driven to an arbitrary value once a contaminated point occupies the middle position. That requires replacing at least half of the data points.

- Replace 49% of points with ∞: the median stays finite (the middle is still an original data value, though not necessarily the same one)
- Replace 51% of points with ∞: median = ∞

Therefore, breakdown point = 50%.
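
Both breakdown points can be seen numerically. The sketch below is an illustration under stated assumptions: the dataset 1..11 is made up for the example, `contaminate` is a hypothetical helper, and `M = 1e9` stands in for an "arbitrarily large" value.

```python
from statistics import mean, median

def contaminate(data, k, bad_value):
    """Return a copy of data with its last k entries replaced by bad_value."""
    return data[: len(data) - k] + [bad_value] * k

clean = list(range(1, 12))   # 11 hypothetical clean values: 1..11
M = 1e9                      # stand-in for an arbitrarily large value

for k in range(7):
    dirty = contaminate(clean, k, M)
    print(f"k={k} ({k / len(clean):.0%} replaced): "
          f"mean={mean(dirty):.3g}, median={median(dirty):.3g}")
```

With 11 points, the mean explodes at k = 1, while the median stays at an original data value through k = 5 (about 45%) and jumps to M only at k = 6, once the bad values hold the middle position.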

Breakdown Point: Mean vs Median

[Interactive demo. At 20% contamination (2 of 10 points), the mean moves from 59.00 to 75.60 (shift 16.60) and is flagged as broken, while the median stays at 60.00 (shift 0.00).]

The Math

Breakdown point: minimum fraction of contamination needed to make an estimator arbitrarily bad.

Mean: adding k outliers of value M to n points with mean μ shifts the mean by k × (M - μ) / (n + k). Even one outlier (k = 1) makes this shift unbounded as M → ∞, so the breakdown point is 0%.

Median: the middle value after sorting. It becomes arbitrary only when an outlier reaches the middle position, which requires about 50% contamination, so the breakdown point is 50%.

In the demo, the median holds at 20% and at 45% contamination and breaks once contamination reaches 50% or higher.

The math in action: You can contaminate nearly half your data and the median will still give you a valid answer. The mean breaks with a single bad point.

Key Insight

The breakdown point explains the difference. The mean (0% breakdown) works well when data is clean, but a single sufficiently extreme value can pull it arbitrarily far. The median (50% breakdown) stays bounded even when nearly half your data is garbage.

This is why income statistics usually report the median (a few billionaires drag the mean upward) while physics experiments typically use the mean (carefully controlled data).


Deriving the standard deviation

We've established that the mean and median tell us about center. But two datasets can have the same center yet be completely different. Dataset A: [49, 50, 51] has mean = 50 and is very tight. Dataset B: [0, 50, 100] also has mean = 50 but is very spread out.

We need a measure of spread. Let's derive one from first principles.

Attempt 1: average deviation (fails!)

Intuition: measure how far each point is from the mean, then average.

Average Deviation = Σ(xᵢ - μ) / n

Problem: positive and negative deviations cancel!

For [49, 50, 51] with μ = 50:

Deviations: -1, 0, +1
Sum: -1 + 0 + 1 = 0
Average deviation = 0

This is useless - it always equals zero (which we can prove: Σ(xᵢ - μ) = Σxᵢ - nμ = nμ - nμ = 0).
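
The cancellation is easy to confirm numerically. In this small sketch, `average_deviation` is a name introduced for the example; the result is zero up to floating-point rounding for any dataset.

```python
def average_deviation(xs):
    """Average signed deviation from the mean (always zero in exact arithmetic)."""
    mu = sum(xs) / len(xs)
    return sum(x - mu for x in xs) / len(xs)

for data in ([49, 50, 51], [0, 50, 100], [10, 12, 14, 16, 18, 100]):
    print(data, average_deviation(data))
```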

Attempt 2: average absolute deviation

Fix the cancellation by taking absolute values:

MAD = Σ|xᵢ - μ| / n

This works! For [49, 50, 51]: MAD = (1 + 0 + 1) / 3 = 0.67

But absolute values are mathematically inconvenient (not differentiable at zero, harder to work with algebraically).
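
As a quick sketch, the mean absolute deviation takes one line beyond the mean itself. The function name `mean_absolute_deviation` is chosen for this example.

```python
def mean_absolute_deviation(xs):
    """Average distance of each value from the mean, ignoring sign."""
    mu = sum(xs) / len(xs)
    return sum(abs(x - mu) for x in xs) / len(xs)

print(round(mean_absolute_deviation([49, 50, 51]), 2))   # 0.67
print(round(mean_absolute_deviation([0, 50, 100]), 2))   # 33.33
```

Unlike the signed average, this separates the tight dataset from the spread-out one.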

The standard deviation

Instead of absolute values, square the deviations (makes them all positive), then take the square root at the end:


Derivation of Standard Deviation

Step 1: Calculate squared deviations

(xᵢ - μ)²  for each i

Step 2: Average them (this is the variance)

σ² = Σ(xᵢ - μ)² / n

Step 3: Take square root (to return to original units)

σ = √[Σ(xᵢ - μ)² / n]

Why square then square root? Squaring makes all terms positive (solves cancellation) and penalizes larger deviations more (4² = 16 vs 2² = 4). Taking the square root returns to original units - if data is in dollars, σ is in dollars.
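
The three steps translate directly into code. This is a minimal sketch of the population formula (dividing by n, as in the derivation here); `population_std` is a name introduced for the example, and the standard library's `statistics.pstdev` computes the same quantity.

```python
import math
import statistics

def population_std(xs):
    mu = sum(xs) / len(xs)
    squared_devs = [(x - mu) ** 2 for x in xs]   # Step 1: squared deviations
    variance = sum(squared_devs) / len(xs)       # Step 2: average them (variance)
    return math.sqrt(variance)                   # Step 3: back to original units

a = [49, 50, 51]
b = [0, 50, 100]
print(round(population_std(a), 3))   # 0.816
print(round(population_std(b), 3))   # 40.825

# Cross-check against the standard library's population standard deviation.
print(math.isclose(population_std(b), statistics.pstdev(b)))
```

Both datasets have mean 50, but σ differs by a factor of 50.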

[Interactive walkthrough: Standard Deviation, why √[Σ(x-μ)²/n]? A 7-step derivation on the dataset 10, 12, 14, 16, 18 (μ = 14.0). It starts from the goal of finding a single number that captures "spread"; step 3 shows the squared deviations, where the point furthest from the mean contributes the most to σ; step 6 takes the square root to return to the original units.]

Key Insight

Why use σ instead of MAD? First, variance (σ²) has nice algebraic properties - Var(X + Y) = Var(X) + Var(Y) for independent variables, which doesn't hold for MAD. Second, for normal data, about 68% falls within ±1σ and 95% within ±2σ. Third, the whole machinery of statistics is built on variance - remember Standard Error = σ/√n from Part 1.
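
The additivity of variance can be checked with a quick simulation. This is a rough sketch: the distributions, the sample size, and the seed are arbitrary choices for the example, and the two sides agree only approximately because the samples are finite.

```python
import random
from statistics import pvariance

random.seed(0)
n = 20_000

# Two independent samples from deliberately different distributions.
x = [random.gauss(0, 2) for _ in range(n)]     # theoretical variance 2^2 = 4
y = [random.uniform(0, 6) for _ in range(n)]   # theoretical variance 6^2 / 12 = 3
s = [xi + yi for xi, yi in zip(x, y)]

var_sum_of_parts = pvariance(x) + pvariance(y)
var_of_sum = pvariance(s)
print(f"Var(X) + Var(Y) = {var_sum_of_parts:.3f}")
print(f"Var(X + Y)      = {var_of_sum:.3f}")
```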


Same mean, different reality

Here's where all this matters in practice.

Product reviews

Two products both have a mean rating of 4.0 stars.

Product A has σ = 0.3 stars. Most reviews cluster within about 0.3 stars of the mean, roughly 3.7 to 4.3 - a consistent experience, low-risk purchase.

Product B has σ = 1.5 stars. Reviews typically range from 2.5 all the way up to 5 - people love it or hate it. High-risk purchase.

The mean tells you nothing about this difference. Standard deviation reveals it.
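
To make this concrete, here is a sketch with two made-up sets of star ratings. The lists are hypothetical, constructed only so that both average exactly 4.0 with spreads close to the figures quoted here.

```python
from statistics import mean, pstdev

# Hypothetical ratings, invented for illustration.
product_a = [4] * 18 + [3, 5]        # almost everyone gives 4 stars
product_b = [5] * 7 + [1, 2, 2]      # mostly 5s, plus a few very unhappy buyers

for name, ratings in (("A", product_a), ("B", product_b)):
    print(f"Product {name}: mean = {mean(ratings):.1f}, sigma = {pstdev(ratings):.2f}")
```

The two means are identical, so only the spread shows that Product B is the gamble.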

[Interactive demo: Median Robustness, the 50% Breakdown Point. Starting from [10, 12, 14, 16, 18] (median 14, mean 14.0), outliers are added one at a time. With 4 outliers out of 9 total points the median barely moves while the mean shifts dramatically; the median breaks only once outliers make up about half of the points.]

When to use what

Situation                                       Use
Symmetric, clean data                           Mean
Known outliers exist                            Median
Reporting income/prices                         Median
Physical measurements (carefully controlled)    Either
You need to do further calculations             Depends

Spread Measure               When to Use
Standard Deviation (σ)       Clean, symmetric data; when you'll use it for further statistics
IQR (Interquartile Range)    Data with outliers; robust alternative
Range                        Quick sanity check only; very sensitive to outliers

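
The three spread measures react very differently to a single bad value. The sketch below uses a made-up dataset (1..11, with the last value replaced by 1000); `iqr` and `value_range` are helpers introduced for the example, and the quartiles come from `statistics.quantiles` with its default method, so other quartile conventions may give slightly different numbers.

```python
from statistics import pstdev, quantiles

def iqr(xs):
    """Interquartile range: distance between the first and third quartiles."""
    q1, _, q3 = quantiles(xs, n=4)
    return q3 - q1

def value_range(xs):
    """Distance between the largest and smallest value."""
    return max(xs) - min(xs)

clean = list(range(1, 12))     # hypothetical clean data: 1..11
dirty = clean[:-1] + [1000]    # same data with one value replaced by an outlier

for name, data in (("clean", clean), ("dirty", dirty)):
    print(f"{name}: sigma = {pstdev(data):.2f}, "
          f"IQR = {iqr(data):.1f}, range = {value_range(data)}")
```

In this example the IQR is identical for both datasets, while σ and the range are dominated by the one outlier.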
Warning

When someone reports just one summary statistic, they're hiding information. "Average salary: $100,000" - is that mean or median? What's the spread? "Average rating: 4.5 stars" - standard deviation of 0.2 or 1.5? "Average return: 8%" - what was the worst year?

Always ask for both center and spread. Better yet, ask to see the distribution.


What we derived

Starting from the question "how do we summarize data?", we derived the mean (Σxᵢ/n) as a balance point with 0% breakdown point, and the median as a position with 50% breakdown point. We showed why outliers break the mean - the formula includes every value directly. And we derived the standard deviation (√[Σ(xᵢ-μ)²/n]) from the need to measure spread without cancellation.

The choice between mean and median isn't arbitrary - it's determined by your data's contamination level and your tolerance for risk.

In Part 3, we'll look at shape - and derive the Central Limit Theorem, which explains why sample means are approximately normally distributed for large samples, largely regardless of the population shape.