Simpson's paradox in the wild: three examples that fooled the experts
Berkeley admissions, COVID vaccine tables and a famous baseball season — three true reversals, told with the figures.
The fastest way to understand Simpson's paradox is not through its definition but through true stories in which a trend cleanly reversed the moment data were combined. This post walks through three of them in detail — a university, a pandemic and a baseball season — because each one fooled intelligent people who were looking directly at correct numbers, and each one reveals a different face of the same machine.
A one-line refresher before we start: Simpson's paradox is the phenomenon in which a relationship that holds inside every subgroup of a dataset weakens, vanishes, or flips when the subgroups are pooled. Nothing is miscounted. The pooled number and the group numbers are all true; they simply answer different questions.
Berkeley, 1973: the bias that pointed both ways
In the autumn of 1973, the University of California, Berkeley admitted roughly 44% of the men who applied to its graduate programmes and roughly 35% of the women — a nine-point gap, large enough that the university feared litigation and asked its own statisticians to investigate. The resulting analysis, published by Peter Bickel, Eugene Hammel and J. W. O'Connell in Science in 1975, became one of the most famous papers in applied statistics, because what it found was not a smoking gun but a reversal.
Department by department, the apparent bias dissolved. In four of the six largest departments women were admitted at an equal or higher rate than men; in department A, the most striking case, dramatically higher (82% against 62%). Pooled, the numbers accused the university; disaggregated, they pointed mildly the other way.
The resolution is in where people applied. Men's applications flooded departments A and B, which admitted well over half of everyone who knocked. Women's applications concentrated in departments C through F: oversubscribed programmes admitting roughly a third, a quarter and, in one case, six per cent of applicants. The pooled gap wasn't measuring how departments treated women. It was measuring which queues women were standing in.
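The queue effect can be checked by hand. What follows is a minimal sketch, assuming the six-department table as it is commonly reproduced from the 1975 paper (applicant counts plus rounded admission rates). Because it covers only those six departments and works from rounded percentages, its pooled figures will not match the campus-wide 44% and 35% exactly. The names DEPARTMENTS and pooled_rate are illustrative.

```python
# Six largest departments, as commonly tabulated from Bickel, Hammel and
# O'Connell (1975): (men applied, men's admit rate, women applied, women's rate).
# Rates are rounded percentages, so every derived figure is approximate.
DEPARTMENTS = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (373, 0.06, 341, 0.07),
}


def pooled_rate(applicants_and_rates):
    """Admission rate after pooling: total admits divided by total applicants."""
    pairs = list(applicants_and_rates)
    admitted = sum(n * rate for n, rate in pairs)
    applied = sum(n for n, _ in pairs)
    return admitted / applied


men_pooled = pooled_rate((m_n, m_r) for m_n, m_r, _, _ in DEPARTMENTS.values())
women_pooled = pooled_rate((w_n, w_r) for _, _, w_n, w_r in DEPARTMENTS.values())

# Departments in which women's admission rate is at least as high as men's.
women_ahead = [d for d, (_, m_r, _, w_r) in DEPARTMENTS.items() if w_r >= m_r]

# Where each sex queued: share of applications sent to the two easiest departments.
men_total = sum(m_n for m_n, _, _, _ in DEPARTMENTS.values())
women_total = sum(w_n for _, _, w_n, _ in DEPARTMENTS.values())
men_share_ab = sum(DEPARTMENTS[d][0] for d in "AB") / men_total
women_share_ab = sum(DEPARTMENTS[d][2] for d in "AB") / women_total

print(f"pooled admit rate: men {men_pooled:.1%}, women {women_pooled:.1%}")
print("women's rate >= men's in departments:", women_ahead)
print(f"applications sent to A and B: men {men_share_ab:.0%}, women {women_share_ab:.0%}")
```

The per-department comparison and the pooled comparison come from the same twelve pairs of numbers; the only thing that changes between them is how heavily each department is weighted.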
Two cautions before moving on. First, the reversal does not prove the absence of discrimination — it relocates the question. Why were the programmes women favoured so under-resourced and oversubscribed? That is a real and serious question; it is simply a different one from “do admissions committees reject women?”. Second, the case shows why no mechanical rule (“always trust the disaggregated numbers”) can work: deciding which view answers your question requires knowing what causally drives the third variable. We will come back to that.
August 2021: the vaccine data that “proved” the wrong thing
In the summer of 2021, screenshots of Israeli Ministry of Health dashboards circulated with an alarming claim attached: a majority of the patients hospitalised with severe COVID were fully vaccinated. The underlying number was real. In one widely analysed mid-August snapshot, about 301 of 515 severe cases — roughly 58% — were vaccinated people. To many readers, the conclusion wrote itself: the vaccines had failed.
The biostatistician Jeffrey Morris produced the now-standard dissection of that snapshot, and it is a perfect Simpson specimen. Israel had vaccinated its elderly almost completely, and the elderly are the very group whose risk of severe disease is tens of times higher than that of the young. When nearly everyone in the high-risk group is vaccinated, simple arithmetic can put most severe cases among the vaccinated even if the vaccine works extremely well. The raw count compares a huge, old, vaccinated population against a small, young, unvaccinated one.
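That arithmetic is easy to reproduce. The sketch below uses entirely hypothetical inputs: the population sizes, coverage levels, baseline risks and the 90% efficacy are assumptions picked to illustrate the mechanism, not Israeli Ministry of Health figures and not Morris's numbers. The names GROUPS, EFFICACY and severe_cases are likewise illustrative.

```python
# HYPOTHETICAL inputs, chosen only to illustrate the mechanism.
GROUPS = {
    # name: (population, share vaccinated, severe cases per 100k unvaccinated)
    "older": (1_000_000, 0.95, 200.0),
    "younger": (3_000_000, 0.60, 5.0),
}
EFFICACY = 0.90  # assumed protection against severe disease, same in both groups


def severe_cases(groups, efficacy):
    """Return {group: (cases among vaccinated, cases among unvaccinated)}."""
    out = {}
    for name, (population, coverage, risk_per_100k) in groups.items():
        vaccinated = population * coverage
        unvaccinated = population - vaccinated
        risk = risk_per_100k / 100_000
        out[name] = (vaccinated * risk * (1 - efficacy), unvaccinated * risk)
    return out


cases = severe_cases(GROUPS, EFFICACY)
vax_cases = sum(v for v, _ in cases.values())
unvax_cases = sum(u for _, u in cases.values())
vax_share = vax_cases / (vax_cases + unvax_cases)

# Efficacy as it would appear if age were ignored and the rates were pooled.
vax_people = sum(pop * cov for pop, cov, _ in GROUPS.values())
unvax_people = sum(pop * (1 - cov) for pop, cov, _ in GROUPS.values())
crude_efficacy = 1 - (vax_cases / vax_people) / (unvax_cases / unvax_people)

print(f"vaccinated share of severe cases: {vax_share:.0%}")
print(f"efficacy inside each age group:   {EFFICACY:.0%}")
print(f"efficacy from the pooled rates:   {crude_efficacy:.0%}")
```

Under these assumed inputs the vaccinated make up a majority of severe cases, and the pooled efficacy comes out far below the 90% that was built into both age groups. Lowering the older group's coverage shrinks the effect, which is a quick way to see that the mix is doing the work.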
The third variable here is age, and it plays exactly the role that department played at Berkeley: it determines both who is vaccinated and who lands in hospital, so the pooled figure blends two populations that should never be directly compared. Note also the target of the rhetoric. At Berkeley the pooled number manufactured an accusation against an admissions office; here it manufactured one against a vaccine. Simpson's paradox has no political loyalty. It amplifies whichever wrong conclusion the mix of groups happens to favour.
1995–96: two summers of baseball
The cleanest specimen on record needs no public-health stakes at all. In 1995, David Justice batted .253 to Derek Jeter's .250: Justice wins. In 1996, Justice batted .321 to Jeter's .314: Justice wins again. Combine the two seasons and Jeter bats .310 to Justice's .270, a thumping victory for the man who lost both years. The example was popularised by the mathematician Ken Ross in A Mathematician at the Ballpark, and it endures because there is nowhere for intuition to hide: the reversal is pure arithmetic, visible in six batting averages.
The third variable is playing time. Jeter's weaker 1995 barely counts in his two-season average (48 at-bats); his excellent 1996 counts enormously (582). Justice's seasons were weighted the other way round: 411 at-bats in his weaker year, 140 in his better one. The combined average is honest about the question it answers, “across all these at-bats, who hit more often?”, and silent about the question most people hear, “who was the better hitter each year?”.
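The whole reversal fits in a few lines of arithmetic. This sketch takes the hit and at-bat totals from the public record for the two seasons (the post itself quotes only the averages and Jeter's at-bats); SEASONS, average and combined are illustrative names.

```python
# (hits, at-bats) per player and season. A batting average is hits / at-bats.
SEASONS = {
    "Jeter": {1995: (12, 48), 1996: (183, 582)},
    "Justice": {1995: (104, 411), 1996: (45, 140)},
}


def average(hits, at_bats):
    """Batting average for one (hits, at-bats) pair."""
    return hits / at_bats


def combined(player):
    """Average over both seasons: pool the hits and the at-bats, then divide."""
    hits = sum(h for h, _ in SEASONS[player].values())
    at_bats = sum(ab for _, ab in SEASONS[player].values())
    return average(hits, at_bats)


for year in (1995, 1996):
    jeter = average(*SEASONS["Jeter"][year])
    justice = average(*SEASONS["Justice"][year])
    print(f"{year}: Jeter {jeter:.3f}  Justice {justice:.3f}")

print(f"both: Jeter {combined('Jeter'):.3f}  Justice {combined('Justice'):.3f}")
```

Note that combined() does not average the two season averages; it pools the at-bats, so each season counts in proportion to how much the player actually batted.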
The shape all three share
Strip away the details and the same triangle sits under every case. There is the comparison you care about: sex and admission, vaccination and severe disease, player and batting success. And there is a third variable (department, age, at-bats) with an arrow into each side of that comparison: it influences which group you're in and it influences the outcome. Pool the data and the third variable's influence masquerades as the relationship you wanted to measure.
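One way to watch the third variable do the work is to give both sides of the comparison the same mix of groups and see what is left. The sketch below uses direct standardisation, a common epidemiological technique brought in here purely for illustration, on an entirely hypothetical two-stratum table. DATA, pooled and standardised are illustrative names, and reweighting like this is only sensible when the third variable really is a nuisance to be held fixed.

```python
# HYPOTHETICAL table: (people, events) for each arm and each stratum.
DATA = {
    "exposed": {"low_risk": (200, 4), "high_risk": (800, 160)},
    "unexposed": {"low_risk": (800, 24), "high_risk": (200, 60)},
}
STRATA = ("low_risk", "high_risk")


def rate(people, events):
    """Event rate within one cell of the table."""
    return events / people


def pooled(arm):
    """Rate after pooling strata: each stratum weighted by the arm's own mix."""
    people = sum(p for p, _ in DATA[arm].values())
    events = sum(e for _, e in DATA[arm].values())
    return rate(people, events)


def standardised(arm, mix):
    """Rate with the arm's stratum rates weighted by a shared mix instead."""
    return sum(mix[s] * rate(*DATA[arm][s]) for s in mix)


# Shared mix: each stratum's share of everyone in the table, both arms together.
total = sum(p for arm in DATA.values() for p, _ in arm.values())
mix = {s: sum(DATA[arm][s][0] for arm in DATA) / total for s in STRATA}

for arm in DATA:
    print(f"{arm:>9}: pooled {pooled(arm):.1%}, standardised {standardised(arm, mix):.1%}")
```

In this made-up table the exposed arm has the lower rate inside each stratum and the higher rate once pooled; with a shared mix the ordering matches the within-stratum one again. Both pairs of numbers are computed correctly from the same table.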
This is also why “which number is true?” has no statistical answer. Both are true. The honest question is which one answers what you're asking — and that depends on whether the third variable is a nuisance to be held fixed (age, in the vaccine case) or part of the very thing under study. The data cannot tell you which; only an argument about how the world works can. The full entry develops that point, and its siblings — Lord's paradox, where two adjustments of the same data give opposite verdicts, and Berkson's paradox, where the triangle reappears with its arrows reversed — complete the family.
So when the next pooled statistic crosses your feed — an overall rate, a combined trend, a national average moving against your lived experience — reach for the one question that defuses all three cases above: what is this number averaging over, and could the mix be doing the work?