12 June 2026 · 10 min read

Simpson's paradox in the wild: three examples that fooled the experts

Berkeley admissions, COVID vaccine tables and a famous baseball season — three true reversals, told with the figures.

The fastest way to understand Simpson's paradox is not through its definition but through true stories in which a trend cleanly reversed the moment data were combined. This post walks through three of them in detail — a university, a pandemic and a baseball season — because each one fooled intelligent people who were looking directly at correct numbers, and each one reveals a different face of the same machine.

A one-line refresher before we start: Simpson's paradox is the phenomenon in which a relationship that holds inside every subgroup of a dataset weakens, vanishes, or flips when the subgroups are pooled. Nothing is miscounted. The pooled number and the group numbers are all true; they simply answer different questions.

Berkeley, 1973: the bias that pointed both ways

In the autumn of 1973, the University of California, Berkeley admitted roughly 44% of the men who applied to its graduate programmes and roughly 35% of the women — a nine-point gap, large enough that the university feared litigation and asked its own statisticians to investigate. The resulting analysis, published by Peter Bickel, Eugene Hammel and J. W. O'Connell in Science in 1975, became one of the most famous papers in applied statistics, because what it found was not a smoking gun but a reversal.

Department by department, the apparent bias dissolved. In four of the six largest departments women were admitted at an equal or higher rate than men, and in one case a dramatically higher one. Pooled, the numbers accused the university; disaggregated, they pointed mildly the other way.

[Figure: Admission rate by department, autumn 1973. UC Berkeley graduate admissions, six largest departments, anonymised A–F; series: men, women. Source: Bickel, Hammel & O'Connell (1975).]

The flip in the wild. Left of the dashed line, the pooled rates that triggered the inquiry: men 44%, women 35%. Right of it, the same applicants split by department — women match or beat men in four of six.

The resolution is in where people applied. Men's applications flooded departments A and B, which admitted well over half of everyone who knocked. Women's applications concentrated in departments C through F: oversubscribed programmes admitting a third, a quarter, in one case six per cent of applicants. The pooled gap wasn't measuring how departments treated women. It was measuring which queues women were standing in.

[Figure: Where each group's applications went. Width = share of that group's applications; darker = easier to get into.]

The hidden weighting. Men's applications piled into the departments with open doors; women's into the departments with narrow ones. A pooled admission rate is a weighted average, and here the weights were doing all the talking.
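The weighting can be checked by hand. The sketch below assumes the six-department counts as they are commonly tabulated (for example in R's built-in UCBAdmissions dataset); those counts are not quoted in this post, and the variable names are illustrative. Because it covers only the six largest departments, the pooled rate for women comes out lower than the campus-wide 35%.

```python
# dept: (men_admitted, men_applied, women_admitted, women_applied)
# Assumption: counts as commonly tabulated (e.g. R's UCBAdmissions dataset);
# they are not quoted in this post.
BERKELEY = {
    "A": (512, 825, 89, 108),
    "B": (353, 560, 17, 25),
    "C": (120, 325, 202, 593),
    "D": (138, 417, 131, 375),
    "E": (53, 191, 94, 393),
    "F": (22, 373, 24, 341),
}


def rate(admitted, applied):
    return admitted / applied


men_admitted = sum(row[0] for row in BERKELEY.values())
men_applied = sum(row[1] for row in BERKELEY.values())
women_admitted = sum(row[2] for row in BERKELEY.values())
women_applied = sum(row[3] for row in BERKELEY.values())

# Pooled rates: total admits over total applicants, ignoring department.
men_pooled = rate(men_admitted, men_applied)
women_pooled = rate(women_admitted, women_applied)

# Departments where women were admitted at an equal or higher rate than men.
women_ahead = [
    dept
    for dept, (ma, mn, wa, wn) in BERKELEY.items()
    if rate(wa, wn) >= rate(ma, mn)
]

# Share of each group's applications sent to the two easiest departments.
men_share_ab = (BERKELEY["A"][1] + BERKELEY["B"][1]) / men_applied
women_share_ab = (BERKELEY["A"][3] + BERKELEY["B"][3]) / women_applied

print(f"pooled: men {men_pooled:.1%}, women {women_pooled:.1%}")
print("women match or beat men in:", women_ahead)
print(f"applications to A or B: men {men_share_ab:.0%}, women {women_share_ab:.0%}")
```

On these counts, about half of men's applications went to A or B against well under a tenth of women's, which is the whole of the pooled gap in miniature.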

Two cautions before moving on. First, the reversal does not prove the absence of discrimination — it relocates the question. Why were the programmes women favoured so under-resourced and oversubscribed? That is a real and serious question; it is simply a different one from “do admissions committees reject women?”. Second, the case shows why no mechanical rule (“always trust the disaggregated numbers”) can work: deciding which view answers your question requires knowing what causally drives the third variable. We will come back to that.

August 2021: the vaccine data that “proved” the wrong thing

In the summer of 2021, screenshots of Israeli Ministry of Health dashboards circulated with an alarming claim attached: a majority of the patients hospitalised with severe COVID were fully vaccinated. The underlying number was real. In one widely analysed mid-August snapshot, about 301 of 515 severe cases — roughly 58% — were vaccinated people. To many readers, the conclusion wrote itself: the vaccines had failed.

The biostatistician Jeffrey Morris produced the now-standard dissection of that snapshot, and it is a perfect Simpson specimen. Israel had vaccinated its elderly almost completely — the very group whose risk of severe disease is tens of times higher. When nearly everyone in the high-risk group is vaccinated, simple arithmetic guarantees that most severe cases will be vaccinated even if the vaccine works extremely well. The raw count compares a huge, old, vaccinated population against a small, young, unvaccinated one.
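That arithmetic is easy to reproduce with made-up numbers. The sketch below assumes a single hypothetical high-risk group in which 95% of people are vaccinated and a vaccine that cuts the risk of severe disease by 90%; both figures and the function name are illustrative, not Israeli data.

```python
def vaccinated_share_of_cases(coverage, efficacy):
    """Fraction of severe cases that fall on vaccinated people, for one
    group whose members all share the same unvaccinated baseline risk."""
    # Cases are expressed relative to the baseline risk, which cancels out.
    vaccinated_cases = coverage * (1 - efficacy)
    unvaccinated_cases = 1 - coverage
    return vaccinated_cases / (vaccinated_cases + unvaccinated_cases)


# Hypothetical inputs: 95% coverage, 90% reduction in risk.
share = vaccinated_share_of_cases(coverage=0.95, efficacy=0.90)
print(f"vaccinated share of severe cases: {share:.1%}")
```

Under those assumptions roughly two thirds of severe cases are vaccinated, even though each vaccinated person's risk is a tenth of what it would otherwise be.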

[Figure: One snapshot, two readings. Israeli Ministry of Health data, 15 August 2021; series: unvaccinated, vaccinated. Analysis after J. Morris (covid-datascience.com).]

Counts accuse; rates acquit. Left: the viral headline — most severe cases were vaccinated. Right: the same people expressed as rates per 100,000 within age bands. Among under-50s, severe disease ran at 0.3 per 100k vaccinated against 3.9 unvaccinated; among over-50s, 13.6 against 91.9 — protection of roughly 85–92% within every band, hidden entirely by the pooled count.
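The protection figures follow directly from the per-100,000 rates, taking protection as one minus the ratio of the vaccinated rate to the unvaccinated rate. A minimal sketch using only the numbers quoted above; the names are illustrative.

```python
# Severe cases per 100,000 people, as quoted in the caption above.
RATES = {
    "under 50": {"vaccinated": 0.3, "unvaccinated": 3.9},
    "over 50": {"vaccinated": 13.6, "unvaccinated": 91.9},
}


def protection(vaccinated_rate, unvaccinated_rate):
    """Relative reduction in the rate of severe disease."""
    return 1 - vaccinated_rate / unvaccinated_rate


by_band = {
    band: protection(r["vaccinated"], r["unvaccinated"])
    for band, r in RATES.items()
}

# The viral headline: vaccinated people as a share of all severe cases.
vaccinated_share = 301 / 515

for band, value in by_band.items():
    print(f"{band}: protection {value:.0%}")
print(f"vaccinated share of severe cases: {vaccinated_share:.0%}")
```

The two calculations use the same snapshot. One divides by the number of people at risk in each age band; the other does not divide by anything at all.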

The third variable here is age, and it plays exactly the role that department played at Berkeley: it determines both who is vaccinated and who lands in hospital, so the pooled figure blends two populations that should never be directly compared. Note also the direction of the rhetoric. At Berkeley the pooled number manufactured a charge of bias against a university; here it manufactured a charge of failure against a vaccine. Simpson's paradox has no political loyalty. It amplifies whichever wrong conclusion the mix of groups happens to favour.

1995–96: two summers of baseball

The cleanest specimen on record needs no public-health stakes at all. In 1995, David Justice batted .253 to Derek Jeter's .250 — Justice wins. In 1996, Justice batted .321 to Jeter's .314 — Justice wins again. Combine the two seasons and Jeter bats .310 to Justice's .270, a thumping victory for the man who lost both years. The example was popularised by the mathematician Ken Ross in A Mathematician at the Ballpark, and it endures because there is nowhere for intuition to hide: the reversal is pure arithmetic, visible in a single picture.

[Figure: Ken Ross's ballpark reversal: Jeter vs. Justice, 1995–96. Bar height = batting average; bar width = at-bats; series: Derek Jeter, David Justice.]

Width decides the winner. A combined average weights each season by its at-bats — the width of each bar. Jeter's brilliant season is 582 at-bats wide and his poor one a sliver of 48; Justice's proportions run the other way. Pool them and the order flips.

The third variable is playing time. Jeter's weaker 1995 barely counts in his combined average (48 at-bats); his excellent 1996 counts enormously (582). Justice's seasons were weighted the other way round. The combined average is honest about the question it answers, “across all these at-bats, who hit more often?”, and silent about the question most people hear, “who was the better hitter each year?”.
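The reversal can be replayed in a few lines. Jeter's at-bats are taken from the text above; the hit counts and Justice's at-bats are the commonly cited figures for these seasons and should be read as assumptions, as should the helper names.

```python
# player -> season -> (hits, at_bats)
# Assumption: hit counts and Justice's at-bats are the commonly cited
# figures; only Jeter's at-bats (48 and 582) appear in this post.
SEASONS = {
    "Jeter": {1995: (12, 48), 1996: (183, 582)},
    "Justice": {1995: (104, 411), 1996: (45, 140)},
}


def season_average(player, year):
    hits, at_bats = SEASONS[player][year]
    return hits / at_bats


def combined_average(player):
    """Total hits over total at-bats, i.e. an at-bat-weighted mean."""
    hits = sum(h for h, _ in SEASONS[player].values())
    at_bats = sum(ab for _, ab in SEASONS[player].values())
    return hits / at_bats


for year in (1995, 1996):
    winner = max(SEASONS, key=lambda p: season_average(p, year))
    print(year, {p: f"{season_average(p, year):.3f}" for p in SEASONS}, "->", winner)

overall = max(SEASONS, key=combined_average)
print("combined", {p: f"{combined_average(p):.3f}" for p in SEASONS}, "->", overall)
```

Nothing in `combined_average` knows about seasons: it simply lets each at-bat count once, which is why the season with 582 of them dominates the season with 48.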

The shape all three share

Strip away the details and the same triangle sits under every case. There is the comparison you care about: sex and admission, vaccination and severe disease, player and batting success. And there is a third variable (department, age, at-bats) with arrows into both sides of that comparison: it influences which group you're in and it influences the outcome. Pool the data and the third variable's influence masquerades as the relationship you wanted to measure.

[Figure: The confounding triangle. The structure beneath all three cases.]

One diagram, three stories. Substitute department, age or at-bats for the third variable and the same machine produces all three reversals. The pooled data measures the claret arrows; the headline claims the dashed one.

This is also why “which number is true?” has no statistical answer. Both are true. The honest question is which one answers what you're asking — and that depends on whether the third variable is a nuisance to be held fixed (age, in the vaccine case) or part of the very thing under study. The data cannot tell you which; only an argument about how the world works can. The full entry develops that point, and its siblings — Lord's paradox, where two adjustments of the same data give opposite verdicts, and Berkson's paradox, where the triangle reappears with its arrows reversed — complete the family.
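When the argument about the world says the third variable is a nuisance to be held fixed, one common remedy is direct standardisation: apply each group's within-stratum rates to the same reference mix, so that the mix can no longer do the work. The sketch below reuses the severe-disease rates quoted earlier; the function name and the 60/40 reference mix are hypothetical choices made for illustration only.

```python
def standardised_rate(stratum_rates, reference_mix):
    """Weighted average of per-stratum rates under a shared reference mix."""
    if abs(sum(reference_mix.values()) - 1.0) > 1e-9:
        raise ValueError("reference mix must sum to 1")
    return sum(stratum_rates[s] * weight for s, weight in reference_mix.items())


# Severe cases per 100,000, by age band, as quoted earlier in the post.
vaccinated = {"under 50": 0.3, "over 50": 13.6}
unvaccinated = {"under 50": 3.9, "over 50": 91.9}

# Hypothetical reference population: 60% under 50, 40% over 50.
reference_mix = {"under 50": 0.6, "over 50": 0.4}

vaccinated_std = standardised_rate(vaccinated, reference_mix)
unvaccinated_std = standardised_rate(unvaccinated, reference_mix)

print(f"standardised rate, vaccinated: {vaccinated_std:.2f} per 100k")
print(f"standardised rate, unvaccinated: {unvaccinated_std:.2f} per 100k")
```

Because both groups are now averaged over the same age mix, any remaining gap cannot be an artefact of who happened to be in which band. Whether that is the right comparison is still a judgement about the question being asked, not something the code can decide.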

The paradox never lies about the numbers. It lies about which question the numbers were answering.

So when the next pooled statistic crosses your feed — an overall rate, a combined trend, a national average moving against your lived experience — reach for the one question that defuses all three cases above: what is this number averaging over, and could the mix be doing the work?

