unspurious.

The aggregation illusions · Lord's paradox

Two analysts. One dataset. Opposite verdicts.

Lord's paradox: whether a treatment "made a difference" can hinge entirely on whether you study the change — or the destination, given the start.

Student weights, September vs. June: 84 students, one dining hall. Frederic Lord's 1967 thought experiment. Same points in both views.
Fig. 1 — Two honest readings. Both group averages sit exactly on the "June = September" diagonal: on average, nobody gained or lost an ounce. Yet fit a line within each group and compare students who started at the same weight, and the boys finish heavier every time. (Illustrative data after Lord, 1967.)
The short answer

What is Lord's paradox?

Lord's paradox arises when two reasonable statisticians analyse the same before-and-after data, one comparing raw change and the other adjusting for the starting value, and reach opposite conclusions about whether the groups differ. Both computations are correct; they answer different causal questions. Which one fits the question at hand depends on assumptions about the world that the data alone cannot settle.

The fast check: “Is the controlled variable a nuisance, or part of the story?”

01 · The setup

Lord's dining hall

In 1967 the psychometrician Frederic Lord posed a deceptively small puzzle. A university weighs every student in September and again in June, and wants to know what the dining-hall diet did — and in particular whether it treated the sexes differently.

The first statistician does the obvious thing: compute each student's gain, then average by group. Both averages come out to zero. Verdict: the diet changed nothing, for anyone, and there is no difference between the sexes to explain.

The second statistician runs an analysis of covariance — comparing June weights among students who started at the same weight. At any given starting weight, the boys finish around a dozen pounds heavier. Verdict: a substantial sex difference in how the diet landed.

Both computations are flawless. They simply refuse to agree.
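
The two computations can be reproduced on simulated data. The sketch below is illustrative only: the group means (130 lb and 155 lb), the spread, the September-to-June correlation of 0.5, and the sample size are all assumptions chosen for the example, not figures from Lord's paper, and the function names are hypothetical. It uses a larger sample than the 84 students in Fig. 1 so that sampling noise does not blur the comparison.

```python
import random
from statistics import mean

# Assumed, illustrative group means in pounds (not from Lord, 1967).
GROUP_MEANS = {"girls": 130.0, "boys": 155.0}


def simulate(n_per_group=2000, rho=0.5, sd=12.0, seed=0):
    """Return (group, sept, june) rows where each group's mean never changes."""
    rng = random.Random(seed)
    noise_sd = sd * (1 - rho ** 2) ** 0.5  # keeps the June spread equal to sd
    rows = []
    for group, mu in GROUP_MEANS.items():
        for _ in range(n_per_group):
            sept = rng.gauss(mu, sd)
            # June regresses toward the student's own group mean.
            june = mu + rho * (sept - mu) + rng.gauss(0.0, noise_sd)
            rows.append((group, sept, june))
    return rows


def gain_score_difference(rows):
    """Statistician 1: mean gain of boys minus mean gain of girls."""
    gains = {
        g: mean(june - sept for grp, sept, june in rows if grp == g)
        for g in GROUP_MEANS
    }
    return gains["boys"] - gains["girls"]


def ancova_difference(rows):
    """Statistician 2: boys minus girls in June at equal September weight.

    Uses the pooled within-group slope, which is what a common-slope
    ANCOVA (June ~ September + group) estimates.
    """
    sxx = sxy = 0.0
    centre = {}
    for g in GROUP_MEANS:
        sept = [s for grp, s, _ in rows if grp == g]
        june = [j for grp, _, j in rows if grp == g]
        s_bar, j_bar = mean(sept), mean(june)
        centre[g] = (s_bar, j_bar)
        sxx += sum((s - s_bar) ** 2 for s in sept)
        sxy += sum((s - s_bar) * (j - j_bar) for s, j in zip(sept, june))
    slope = sxy / sxx
    d_sept = centre["boys"][0] - centre["girls"][0]
    d_june = centre["boys"][1] - centre["girls"][1]
    return d_june - slope * d_sept


if __name__ == "__main__":
    rows = simulate()
    print("gain-score difference:", round(gain_score_difference(rows), 2))
    print("ANCOVA difference:   ", round(ancova_difference(rows), 2))
```

Under these assumptions the gain-score difference should land near zero while the adjusted difference lands near a dozen pounds, from the same rows.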

Lord's own conclusion was the unsettling part: when groups differ before the treatment ever begins, he argued, no statistical adjustment can be relied upon to make "proper allowance" for those pre-existing differences. The disagreement is not an error to be fixed.

02 · The engine

Regression to the mean, wearing a disguise

Why does adjusting for the start manufacture a difference where the gain scores see none? Because September and June weights are correlated, but not perfectly. Anyone far from their own group's average in September tends to drift back toward it by June — heavy-for-their-group students drift down, light-for-their-group students drift up.

Now look at the only place the two groups can be compared "at the same starting weight": the overlap zone. A 142-pound student is a heavy girl but a light boy. She is expected to drift down toward the girls' average; he is expected to drift up toward the boys'. Hold the starting weight fixed, and the groups pull apart — with no diet required.

The overlap zone: students starting near 142 lb. Each arrow points from a student's starting weight to their expected June weight. Legend: boys, girls.
Fig. 2 — Same start, different gravity. Within the shaded band, every boy is light for a boy and every girl is heavy for a girl. Each is pulled toward their own group's average — in opposite directions. "Comparing like with like" here compares an unusual member of one group with an unusual member of the other.
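
The pull in opposite directions can be put into numbers. This is a minimal sketch, assuming each group's average stays put from September to June, equal spread in both months, and a September-to-June correlation of 0.5; the group means of 130 lb and 155 lb are hypothetical values picked so that 142 lb falls in the overlap zone.

```python
def expected_june(sept, group_mean, rho):
    """Expected June weight under simple regression to the group mean.

    Assumes the group mean and spread are the same in both months, so the
    regression slope equals the September-to-June correlation rho.
    """
    return group_mean + rho * (sept - group_mean)


# Hypothetical values for illustration only.
GIRLS_MEAN = 130.0
BOYS_MEAN = 155.0
RHO = 0.5

start = 142.0
girl_june = expected_june(start, GIRLS_MEAN, RHO)  # heavy for a girl: drifts down
boy_june = expected_june(start, BOYS_MEAN, RHO)    # light for a boy: drifts up

print(girl_june, boy_june, boy_june - girl_june)
```

With these inputs the girl is expected at 136 lb and the boy at 148.5 lb: a 12.5 lb gap between two students who started at the same weight, produced by regression to the mean alone.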

03 · Which statistician is right?

They're answering different questions

The modern resolution, sharpened by causal-inference work from Paul Holland, Donald Rubin and Judea Pearl, is that the two analyses estimate different quantities — and the data alone cannot referee between them.

Statistician 1's gain score tracks the total association between group and weight change. Statistician 2's adjustment blocks the path running through September weight, leaving only the direct link between group and June weight.

Whether blocking that path is legitimate depends on what September weight is, causally. If it's a confounding nuisance, adjust away. But here the groups' starting weights are themselves a consequence of the thing being compared — boys weigh more because they're boys. Adjusting for a downstream consequence quietly removes part of the very difference under study.

As with Simpson's paradox, the moral is austere: choosing the analysis is choosing the question. Make that choice out loud, before the data can make it for you.

Total effect vs. direct effect
Fig. 3 — One triangle, two readings. Statistician 1 reads everything flowing out of GROUP. Statistician 2 pins SEPT WEIGHT in place and reads only the direct arrow — discarding the route that runs through the starting weights.
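
The two readings of the triangle can be written out as path arithmetic. The sketch below assumes a simple linear model with three hypothetical coefficients: `a` for GROUP to SEPT WEIGHT, `b` for SEPT WEIGHT to JUNE WEIGHT, and `c` for the direct arrow from GROUP to JUNE WEIGHT. The numbers are illustrative, not estimates from any dataset.

```python
def effects(a, b, c):
    """Return (total effect of group on gain, direct effect of group on June).

    a: group -> September weight
    b: September weight -> June weight
    c: group -> June weight, holding September weight fixed
    Linear, no-interaction model assumed throughout.
    """
    total_on_june = c + a * b          # direct arrow plus the route via September
    total_on_gain = total_on_june - a  # gain = June - September
    direct_on_june = c                 # what adjusting for September leaves
    return total_on_gain, direct_on_june


# Hypothetical coefficients: boys start 25 lb heavier, half of any
# September deviation persists, and the direct arrow is worth 12.5 lb.
total, direct = effects(a=25.0, b=0.5, c=12.5)
print("total effect on gain:", total)
print("direct effect on June:", direct)
```

Here the total effect on gain is zero and the direct effect is 12.5 lb, so both statisticians report a correct number for the same model. Because September weight is held fixed in the second reading, the direct effect on gain is the same 12.5 lb.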

04 · Field notes

Where the two statisticians still argue

School league tables. Judge schools by raw results and you reward privileged intakes; judge by progress "adjusted for intake" and Lord's paradox walks in the door — the adjustment itself can manufacture or erase differences between schools serving different populations.

Medicine without randomisation. Pre/post studies comparing treatment groups that differed at baseline are the paradox's natural habitat. Change scores and baseline-adjusted models can give opposite answers about the same drug, and journals have hosted decades of argument over which to report.

Pay gaps. A raw gap and a gap "adjusted for role, hours and seniority" are both real numbers answering different questions — and if the adjusting variables are themselves shaped by the thing being studied, the adjusted figure quietly changes the subject.

The tell is always a comparison of groups that differed before anything happened, plus the phrase "after controlling for…". At that moment, ask the only question that matters: is the controlled variable a nuisance to be removed, or part of the story being measured?
