01 · The setup
Lord's dining hall
In 1967 the psychometrician Frederic Lord posed a deceptively small puzzle. A university weighs every student in September and again in June, and wants to know what the dining-hall diet did — and in particular whether it treated the sexes differently.
The first statistician does the obvious thing: compute each student's gain, then average by group. Both averages come out to zero. Verdict: the diet changed nothing, for anyone, and there is no difference between the sexes to explain.
The second statistician runs an analysis of covariance — comparing June weights among students who started at the same weight. At any given starting weight, the boys finish around a dozen pounds heavier. Verdict: a substantial sex difference in how the diet landed.
Both computations are flawless. They simply refuse to agree.
Lord's own conclusion was the unsettling part: when groups differ before the treatment ever begins, he argued, no statistical adjustment can be relied upon to make "proper allowance" for those pre-existing differences. The disagreement is not an error to be fixed.
02 · The engine
Regression to the mean, wearing a disguise
Why does adjusting for the start manufacture a difference where the gain scores see none? Because September and June weights are correlated, but not perfectly. Anyone far from their own group's average in September tends to drift back toward it by June — heavy-for-their-group students drift down, light-for-their-group students drift up.
Now look at the only place the two groups can be compared "at the same starting weight": the overlap zone. A 142-pound student is a heavy girl but a light boy. She is expected to drift down toward the girls' average; he is expected to drift up toward the boys'. Hold the starting weight fixed, and the groups pull apart — with no diet required.
03 · Which statistician is right?
They're answering different questions
The modern resolution, sharpened by causal-inference work from Paul Holland, Donald Rubin and Judea Pearl, is that the two analyses estimate different quantities — and the data alone cannot referee between them.
Statistician 1's gain score tracks the total association between group and weight change. Statistician 2's adjustment blocks the path running through September weight, leaving only the direct link between group and June weight.
Whether blocking that path is legitimate depends on what September weight is, causally. If it's a confounding nuisance, adjust away. But here the groups' starting weights are themselves a consequence of the thing being compared — boys weigh more because they're boys. Adjusting for a downstream consequence quietly removes part of the very difference under study.
As with Simpson's paradox, the moral is austere: choosing the analysis is choosing the question. Make that choice out loud, before the data can make it for you.
04 · Field notes
Where the two statisticians still argue
School league tables. Judge schools by raw results and you reward privileged intakes; judge by progress "adjusted for intake" and Lord's paradox walks in the door — the adjustment itself can manufacture or erase differences between schools serving different populations.
Medicine without randomisation. Pre/post studies comparing treatment groups that differed at baseline are the paradox's natural habitat. Change scores and baseline-adjusted models can give opposite answers about the same drug, and journals have hosted decades of argument over which to report.
Pay gaps. A raw gap and a gap "adjusted for role, hours and seniority" are both real numbers answering different questions — and if the adjusting variables are themselves shaped by the thing being studied, the adjusted figure quietly changes the subject.
The tell is always a comparison of groups that differed before anything happened, plus the phrase "after controlling for…". At that moment, ask the only question that matters: is the controlled variable a nuisance to be removed, or part of the story being measured?