01 · What just happened
One scatter plot, two honest answers
Neither view of Figure 1 is a trick. The upward line is the genuine least-squares fit to all 54 patients. The three downward lines are the genuine fits within each age band. Both are arithmetically correct — and they recommend opposite things.
The reversal happens because a third variable, age, is tangled up with both axes. Older patients in this clinic exercise more (their doctors insist on it) and also have higher cholesterol (age does that). When you pool everyone, you aren't measuring the effect of exercise any more — you're mostly measuring the effect of being old enough to be told to exercise.
A pooled trend is a weighted blend of two things: the trends within the groups, and the trend between the group averages. When the between-group drift is strong enough and points the other way, it overwhelms everything happening inside the groups.
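To make that blend concrete, here is a minimal sketch in Python. It assumes made-up data: 54 synthetic patients in three age bands of 18, with hypothetical band centres and a hypothetical within-band slope. These are not the numbers behind Figure 1. It fits the pooled line and the per-band lines, then checks that the pooled slope equals a weighted average of one within-group slope and one between-group slope.

```python
from statistics import mean

def sums(xs, ys):
    """Return (Sxx, Sxy): centred sum of squares and sum of cross-products."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, sxy

def slope(xs, ys):
    """Least-squares slope of y on x."""
    sxx, sxy = sums(xs, ys)
    return sxy / sxx

# ASSUMPTION: hypothetical band centres (exercise hours/week, cholesterol mg/dL).
BANDS = {"younger": (2.0, 180.0), "middle": (5.0, 215.0), "older": (8.0, 250.0)}

groups = {}
for name, (cx, cy) in BANDS.items():
    xs = [cx + (i - 8.5) * 0.2 for i in range(18)]
    # Within a band, more exercise goes with lower cholesterol, plus a small wiggle.
    ys = [cy - 5.0 * (x - cx) + (3.0 if i % 2 == 0 else -3.0)
          for i, x in enumerate(xs)]
    groups[name] = (xs, ys)

all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]

group_slopes = {name: slope(xs, ys) for name, (xs, ys) in groups.items()}
pooled_slope = slope(all_x, all_y)

# Within part: deviations of each patient from their own band's means.
sxx_w = sum(sums(xs, ys)[0] for xs, ys in groups.values())
sxy_w = sum(sums(xs, ys)[1] for xs, ys in groups.values())

# Between part: replace each patient by their band's means.
mean_x = [mean(xs) for xs, _ in groups.values() for _ in xs]
mean_y = [mean(ys) for xs, ys in groups.values() for _ in xs]
sxx_b, sxy_b = sums(mean_x, mean_y)

within_slope = sxy_w / sxx_w      # common slope inside the bands
between_slope = sxy_b / sxx_b     # slope through the band averages
weight_within = sxx_w / (sxx_w + sxx_b)
blend = weight_within * within_slope + (1 - weight_within) * between_slope

print("per-band slopes:", {k: round(v, 2) for k, v in group_slopes.items()})
print("pooled slope:   ", round(pooled_slope, 2))
print("blend:          ", round(blend, 2), "with within-weight", round(weight_within, 2))
```

The weight on the within-group slope is the share of the spread in exercise that sits inside the bands. In this synthetic clinic most of the spread is between bands, so the between-group slope dominates and the pooled line tilts upward even though every band slopes down.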
02 · The classic case
Berkeley, 1973: the bias that pointed both ways
The most famous Simpson reversal is real. In autumn 1973, UC Berkeley's graduate school admitted about 44% of male applicants and 35% of female applicants — a gap large enough to prompt fears of a lawsuit. But when statisticians went department by department, the picture flipped: in most departments women were admitted at an equal or higher rate than men.
The resolution is in where people applied. Men flooded departments A and B, which admitted over 60% of everyone. Women overwhelmingly applied to departments like C through F: more crowded, more competitive, with admission rates as low as 6%. The aggregate gap wasn't measuring how departments treated women; it was measuring which doors women were queueing at.
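The arithmetic can be replayed in a few lines of Python. This sketch assumes the commonly reproduced counts for the six largest departments only (the table distributed as R's UCBAdmissions dataset), so its aggregate rates, roughly 45% for men and 30% for women, differ from the school-wide 44% and 35% quoted above. Check the counts against the original study before relying on them.

```python
# ASSUMPTION: (admitted, applied) for the six largest departments, as in the
# widely circulated UCBAdmissions table; not the full graduate school.
ADMISSIONS = {
    "A": {"men": (512, 825), "women": (89, 108)},
    "B": {"men": (353, 560), "women": (17, 25)},
    "C": {"men": (120, 325), "women": (202, 593)},
    "D": {"men": (138, 417), "women": (131, 375)},
    "E": {"men": (53, 191), "women": (94, 393)},
    "F": {"men": (22, 373), "women": (24, 341)},
}

def rate(admitted, applied):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applied

def totals(sex, departments=None):
    """Sum (admitted, applied) for one sex over the chosen departments."""
    chosen = departments or list(ADMISSIONS)
    admitted = sum(ADMISSIONS[d][sex][0] for d in chosen)
    applied = sum(ADMISSIONS[d][sex][1] for d in chosen)
    return admitted, applied

# The pooled view: one rate per sex across all six departments.
overall = {sex: rate(*totals(sex)) for sex in ("men", "women")}

# The split view: departments where women's rate exceeds men's.
women_ahead = [d for d, row in ADMISSIONS.items()
               if rate(*row["women"]) > rate(*row["men"])]

# Where people applied: share of each sex's applications sent to A and B.
share_in_ab = {sex: totals(sex, ["A", "B"])[1] / totals(sex)[1]
               for sex in ("men", "women")}

for sex in ("men", "women"):
    print(f"{sex}: overall {overall[sex]:.1%}, "
          f"applications to A or B {share_in_ab[sex]:.1%}")
print("departments where women did better:", women_ahead)
```

The last two quantities carry the whole story: the per-department comparison mostly favours women, while the application shares show how differently the two groups were spread across easy and hard departments.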
03 · Which number should you trust?
The data can't tell you. The causal story can.
Here is the uncomfortable part: there is no general rule that the split-up view is "the real one". It depends on what the third variable is doing.
In the cholesterol example, age is a confounder — it causes both the exercise habits and the cholesterol. To estimate what exercise does, you must compare like with like: condition on age, trust the within-group lines.
But imagine the third variable sits on the causal path instead. Say a drug works by lowering blood pressure, and the third variable is the patient's blood pressure after treatment. Splitting patients by that post-treatment blood pressure would slice away the very effect you're trying to measure. There, the pooled answer is the honest one.
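A toy table shows how stratifying on a variable that sits on the causal path erases a real effect. The counts below are invented for illustration: a hypothetical trial of 100 patients per arm in which the risk of an event depends only on post-treatment blood pressure, and the drug's sole action is to move patients into the low-pressure group.

```python
from fractions import Fraction

# ASSUMPTION: made-up counts, (patients, events) per (arm, post-treatment BP).
COUNTS = {
    ("drug", "low"): (80, 8),
    ("drug", "high"): (20, 6),
    ("placebo", "low"): (20, 2),
    ("placebo", "high"): (80, 24),
}

def risk(arm, bp=None):
    """Event risk for one arm, optionally restricted to one blood-pressure stratum."""
    cells = [cell for (a, b), cell in COUNTS.items()
             if a == arm and (bp is None or b == bp)]
    patients = sum(n for n, _ in cells)
    events = sum(e for _, e in cells)
    return Fraction(events, patients)

pooled = {arm: risk(arm) for arm in ("drug", "placebo")}
stratified = {bp: {arm: risk(arm, bp) for arm in ("drug", "placebo")}
              for bp in ("low", "high")}

print("pooled:", {arm: f"{float(r):.0%}" for arm, r in pooled.items()})
for bp, risks in stratified.items():
    print(f"{bp} BP:", {arm: f"{float(r):.0%}" for arm, r in risks.items()})
```

With these counts the pooled risks are 14% on the drug and 26% on placebo, while inside each blood-pressure stratum the two arms are identical. The stratified view is not wrong arithmetic; it just answers a different question, namely what the drug does beyond its effect on blood pressure.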
That's the deepest lesson of Simpson's paradox: which average to believe is not a statistical question. It's a question about how the world works, and the spreadsheet alone cannot answer it.
04 · Field notes
Where this paradox bites in practice
Medicine. In a famous kidney-stone study, the less invasive treatment looked better overall, yet surgery beat it for small stones and for large stones alike. Surgery had simply been given the hardest cases, the large stones, which dragged its pooled numbers down.
Public health. In mid-2021 Israeli data, vaccinated people made up a majority of hospitalised COVID cases in some tallies, which looked alarming until the numbers were stratified by age. Within every age band the vaccine sharply reduced severe disease; the pooled figure reflected the fact that nearly all elderly people were vaccinated.
Sport. The cleanest specimen on record was popularised by mathematician Ken Ross in A Mathematician at the Ballpark. David Justice out-hit Derek Jeter in 1995, and out-hit him again in 1996. Combine the two seasons, and Jeter wins — comfortably.
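Two of these cases are small enough to check by hand, or with a short Python sketch. The counts are the figures usually quoted for them (hits and at-bats for Jeter and Justice, successes and patients from the 1986 kidney-stone study) and are reproduced here from memory of those standard tables, so treat them as an assumption to verify. The helper `winners` is a hypothetical name introduced for this illustration.

```python
# ASSUMPTION: commonly quoted (successes, attempts) per stratum.
BATTING = {
    "Jeter": {"1995": (12, 48), "1996": (183, 582)},
    "Justice": {"1995": (104, 411), "1996": (45, 140)},
}
KIDNEY_STONES = {
    "open surgery": {"small": (81, 87), "large": (192, 263)},
    "percutaneous": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, attempts):
    """Success rate as a fraction between 0 and 1."""
    return successes / attempts

def pooled_rate(strata):
    """Rate after adding up successes and attempts across all strata."""
    successes = sum(s for s, _ in strata.values())
    attempts = sum(a for _, a in strata.values())
    return rate(successes, attempts)

def winners(data):
    """Name the contender with the higher rate in each stratum and when pooled."""
    names = list(data)
    strata = list(next(iter(data.values())))
    result = {}
    for stratum in strata:
        result[stratum] = max(names, key=lambda n: rate(*data[n][stratum]))
    result["pooled"] = max(names, key=lambda n: pooled_rate(data[n]))
    return result

print(winners(BATTING))
print(winners(KIDNEY_STONES))
for name, strata in BATTING.items():
    print(name, {k: round(rate(*v), 3) for k, v in strata.items()},
          "pooled", round(pooled_rate(strata), 3))
```

In both tables the attempts are spread very unevenly across strata. Jeter's at-bats sit almost entirely in the season when both players hit well, and surgery's patients sit mostly in the large-stone group where both treatments do worse, which is what lets the pooled winner differ from the winner in every stratum.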
The pattern to watch for is always the same: a comparison between groups whose composition differs in a way that matters. Whenever someone shows you a single pooled number — a national average, an overall success rate, a combined trend — the first question is not "is it correct?" but "what is it averaging over?"