What is Simpson's paradox, in simple terms?

Simpson's paradox is when a trend that appears in every subgroup of data reverses or disappears once the subgroups are combined into one total. It happens because the pooled figure is a weighted average, and an uneven mix of group sizes can pull the overall number against the direction of every group inside it. Nothing is miscounted — the group numbers and the combined number are all true; they simply answer different questions.

Simpson's Paradox — unspurious: a compendium of statistical illusions

01 · What just happened

One scatter plot, two honest answers

Neither view of Figure 1 is a trick. The upward line is the genuine least-squares fit to all 54 patients. The three downward lines are the genuine fits within each age band. Both are arithmetically correct — and they recommend opposite things.

The reversal happens because a third variable, age, is tangled up with both axes. Older patients in this clinic exercise more (their doctors insist on it) and also have higher cholesterol (age does that). When you pool everyone, you aren't measuring the effect of exercise any more — you're mostly measuring the effect of being old enough to be told to exercise.

A pooled trend is a weighted blend of two things: the trends within the groups and the trend between the group averages. When the between-group drift is strong enough, it overwhelms everything inside the groups.
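A minimal sketch of that blend, using synthetic data rather than the 54 patients behind Figure 1. It assumes three hypothetical age bands of 18 patients each, a built-in within-band slope of -2, and band averages that climb with age. The helper names `slope` and `make_band` and every number are invented for illustration.

```python
# Synthetic illustration only: these are NOT the article's patients.

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def make_band(band):
    """18 hypothetical patients in one age band (0 = youngest)."""
    xs = [2 + 3 * band + 0.2 * i for i in range(18)]  # weekly exercise hours
    centre = sum(xs) / len(xs)
    # Within a band, each extra hour lowers cholesterol by 2 units;
    # each older band sits 30 units higher on average.
    ys = [180 + 30 * band - 2 * (x - centre) for x in xs]
    return xs, ys

bands = [make_band(b) for b in range(3)]
within_slopes = [slope(xs, ys) for xs, ys in bands]

# Pool all 54 points and fit a single line.
all_x = [x for xs, _ in bands for x in xs]
all_y = [y for _, ys in bands for y in ys]
pooled_slope = slope(all_x, all_y)

# Slope through the three band averages: the between-group trend.
mean_x = [sum(xs) / len(xs) for xs, _ in bands]
mean_y = [sum(ys) / len(ys) for _, ys in bands]
between_slope = slope(mean_x, mean_y)

print("within each band:", [round(s, 2) for s in within_slopes])
print("between bands:   ", round(between_slope, 2))
print("pooled:          ", round(pooled_slope, 2))
```

With these made-up numbers every band slopes downward, yet the pooled line slopes upward at roughly +8, because most of the spread in exercise hours lies between the bands rather than inside them.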

02 · The classic case

Berkeley, 1973: the bias that pointed both ways

The most famous Simpson reversal is real. In autumn 1973, UC Berkeley's graduate school admitted about 44% of male applicants and 35% of female applicants — a gap large enough to prompt fears of a lawsuit. But when statisticians went department by department, the picture flipped: in most departments women were admitted at an equal or higher rate than men.

Admission rate by department, autumn 1973. Six largest departments, anonymised A–F. Series: men, women.

Fig. 2 — The flip in the wild. Overall, men were admitted more often. Department by department, women matched or beat them in four of six. Both facts are true at once.

The resolution is in where people applied. Men flooded departments A and B, which admitted over 60% of everyone. Women overwhelmingly applied to departments C through F, which were far more competitive, with admission rates as low as 6%. The aggregate gap wasn't measuring how departments treated women; it was measuring which doors women were queueing at.

Where each group's applications went. Width = share of that group's applications · Darker = harder to get into

Fig. 3 — The hidden weighting. Men's applications piled into the easy departments (A, B); women's into the hard ones (C–F). The pooled admission rate is a weighted average — and the weights were doing all the talking.
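The weighting can be sketched with a two-department toy case. The counts below are invented for illustration and are not the Berkeley figures; the names `applications`, `department_rate` and `pooled_rate` are likewise assumptions. The point is only that a group's pooled rate equals the sum of (share of its applications) × (department rate).

```python
# Hypothetical counts for illustration; NOT the Berkeley figures.
# applications[group][department] = (applicants, admitted)
applications = {
    "men":   {"easy": (800, 480), "hard": (200, 20)},
    "women": {"easy": (100, 65),  "hard": (900, 108)},
}

def department_rate(group, dept):
    """Admission rate for one group in one department."""
    applicants, admitted = applications[group][dept]
    return admitted / applicants

def pooled_rate(group):
    """Overall rate, written as sum(application share * department rate)."""
    total = sum(applicants for applicants, _ in applications[group].values())
    return sum(
        (applicants / total) * (admitted / applicants)
        for applicants, admitted in applications[group].values()
    )

for dept in ("easy", "hard"):
    print(dept, {g: round(department_rate(g, dept), 3) for g in applications})
print("pooled", {g: round(pooled_rate(g), 3) for g in applications})
```

In this toy case women are admitted at a higher rate in both departments (65% vs 60%, 12% vs 10%), yet the pooled rates are 50% for men and 17.3% for women, purely because 90% of the women's applications went to the hard department.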

03 · Which number should you trust?

The data can't tell you. The causal story can.

Here is the uncomfortable part: there is no general rule that the split-up view is "the real one". It depends on what the third variable is doing.

In the cholesterol example, age is a confounder — it causes both the exercise habits and the cholesterol. To estimate what exercise does, you must compare like with like: condition on age, trust the within-group lines.

But imagine the third variable sits on the causal path between treatment and outcome: say, blood pressure, for a drug that works by lowering it. Splitting patients by their post-treatment blood pressure would slice away the very effect you're trying to measure. There, the pooled answer is the honest one.

That's the deepest lesson of Simpson's paradox: which average to believe is not a statistical question. It's a question about how the world works, and the spreadsheet alone cannot answer it.

The confounding triangle

Fig. 4 — Why age must be held still. Age pushes on both ends of the relationship we care about. Ignore it, and its influence masquerades as the exercise→cholesterol arrow.

04 · Field notes

Where this paradox bites in practice

Medicine. In a famous kidney-stone study, the less invasive treatment looked better overall, yet open surgery beat it for small stones and again for large stones. Surgery had simply been used more often on the hardest cases, the large stones, which dragged its pooled numbers down.

Public health. In mid-2021 Israeli data, vaccinated people made up a majority of hospitalised COVID cases in some counts, which looked alarming until the figures were stratified by age. Within every age band the vaccine sharply reduced severe disease; the pooled figure reflected the fact that nearly all elderly people, the group most likely to be hospitalised, were vaccinated.

Sport. The cleanest specimen on record was popularised by mathematician Ken Ross in A Mathematician at the Ballpark. David Justice out-hit Derek Jeter in 1995, and out-hit him again in 1996. Combine the two seasons, and Jeter wins — comfortably.

Ken Ross's ballpark reversal: Jeter vs. Justice, 1995–96. Bar height = batting average · Bar width = at-bats. Series: Derek Jeter, David Justice.

Fig. 5 — Width decides the winner. Justice's average is higher in 1995 (.253 vs .250) and again in 1996 (.321 vs .314). But a combined average weights each season by its at-bats — the width of each bar. Jeter's brilliant season is 582 at-bats wide and his poor one just 48; Justice's proportions run the other way. Pool them, and Jeter's .310 beats Justice's .270. Example as told by Ken Ross, A Mathematician at the Ballpark (2004).
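The arithmetic behind the caption can be checked in a few lines. Jeter's at-bats (48 and 582) come from the caption above. The hit counts and Justice's at-bats are the commonly cited figures for this example and are an assumption here; they do reproduce every average quoted. The names `seasons`, `average` and `combined_average` are illustrative.

```python
# (hits, at_bats) per season.
# Jeter's at-bats are from the caption; the hit counts and Justice's
# at-bats are commonly cited figures, treated here as an assumption.
seasons = {
    "Jeter":   {1995: (12, 48),   1996: (183, 582)},
    "Justice": {1995: (104, 411), 1996: (45, 140)},
}

def average(hits, at_bats):
    """Batting average for a single season."""
    return hits / at_bats

def combined_average(player):
    """Total hits over total at-bats, i.e. an at-bat-weighted average."""
    hits = sum(h for h, _ in seasons[player].values())
    at_bats = sum(ab for _, ab in seasons[player].values())
    return hits / at_bats

for player, by_year in seasons.items():
    per_season = {year: f"{average(*rec):.3f}" for year, rec in by_year.items()}
    print(player, per_season, "combined:", f"{combined_average(player):.3f}")
```

Justice leads in each season taken alone, but 582 of Jeter's 630 at-bats fall in his strong year while 411 of Justice's 551 fall in his weak one, so the combined averages come out at .310 and .270.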

The pattern to watch for is always the same: a comparison between groups whose composition differs in a way that matters. Whenever someone shows you a single pooled number — a national average, an overall success rate, a combined trend — the first question is not "is it correct?" but "what is it averaging over?"

Every group says one thing. The total says the opposite.