unspurious.

The inference illusions · The Texas sharpshooter

Fire enough shots, and a bullseye paints itself.

The Texas sharpshooter fallacy: test enough hypotheses and pure chance guarantees a "discovery" — then the target gets drawn around wherever the bullets happened to land.

A drug with zero effect, tested in twenty subgroups · Both arms drawn from identical 30% response rates · 150 patients per arm per subgroup
Fig. 1 — The discovery machine. Every dot is a subgroup comparison in which the drug truly does nothing — both arms are sampled from the same coin. Yet dots regularly drift past the p = .05 line, because at a 5% false-positive rate per test, twenty tests yield at least one "hit" about 64% of the time (1 − 0.95²⁰). Keep firing and watch the drug "work" for a different subgroup each run.
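The arithmetic in the caption can be checked directly. This is a minimal sketch, assuming the tests are independent and all use the same threshold; the function name `familywise_error_rate` is illustrative and not part of the original.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# How fast the chance of a spurious "hit" grows with the number of shots fired.
for n_tests in (1, 5, 20, 100):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>3} tests: {rate:.1%} chance of at least one false positive")
```

For twenty tests this gives the 64% quoted in the caption. Correlated tests, such as overlapping subgroups, would give a somewhat lower figure.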
The short answer

What is the Texas sharpshooter fallacy?

The Texas sharpshooter fallacy is finding a pattern first and then treating it as if it were predicted — firing at a barn, then painting the target around the tightest cluster of holes. In statistics it appears when many analyses, subgroups or outcomes are tested and the one that happens to look significant is reported as a discovery. Run enough comparisons on pure noise and chance alone will hand you a “significant” result.

The fast check: “How many shots were fired before the target was painted?”

01 · The sharpshooter's method

First shoot, then draw the target

The fallacy is named for a Texan marksman who sprays a barn wall with bullets, finds the tightest cluster of holes, and paints a target around it. Inspected after the fact, any spray of randomness contains clusters — the deceit is in pretending the bullseye came first.

A p-value below .05 is a promise about a single, pre-aimed shot: if there were nothing here, results at least this extreme would arise by luck no more than one time in twenty. Fire twenty shots (twenty subgroups, twenty outcome measures, twenty foods, twenty trading rules) and "one time in twenty" stops being reassuring and starts being a schedule. The figure above is nothing but that schedule being kept.
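The schedule can be simulated. The sketch below borrows the setup of Fig. 1 (twenty subgroups, 150 patients per arm, a 30% response rate in both arms) and counts how often at least one subgroup crosses p < .05. The pooled two-proportion z-test and the helper names are assumptions chosen for illustration; the test behind the figure is not stated in the text.

```python
import math
import random


def two_proportion_p_value(responders_a: int, responders_b: int, n_per_arm: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test with equal arm sizes."""
    pooled = (responders_a + responders_b) / (2 * n_per_arm)
    standard_error = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    if standard_error == 0:
        return 1.0  # both arms identical and degenerate: no evidence of a difference
    z = (responders_a - responders_b) / n_per_arm / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


def run_null_trial(rng, n_subgroups=20, n_per_arm=150, response_rate=0.30):
    """Return one p-value per subgroup; the drug does nothing in any of them."""
    p_values = []
    for _ in range(n_subgroups):
        drug = sum(rng.random() < response_rate for _ in range(n_per_arm))
        placebo = sum(rng.random() < response_rate for _ in range(n_per_arm))
        p_values.append(two_proportion_p_value(drug, placebo, n_per_arm))
    return p_values


rng = random.Random(0)  # fixed seed so the run is repeatable
n_trials = 1000
trials_with_hit = sum(min(run_null_trial(rng)) < 0.05 for _ in range(n_trials))
share_with_hit = trials_with_hit / n_trials
print(f"Null trials with at least one 'significant' subgroup: {share_with_hit:.0%}")
```

With independent subgroups the share should land near the 64% given by 1 − 0.95²⁰, although the discreteness of responder counts can nudge it slightly either way.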

The trap rarely announces itself, because nobody publishes the nineteen misses. A paper, press release or pitch deck shows you the painted target — left-handed smokers respond to the drug, p = .03 — and the wall full of stray holes stays in a drawer.

A p-value answers "how surprising is this shot?" It cannot answer "how many shots were fired?" — and the second question is the one that matters.

02 · The most instructive corpse in neuroscience

The salmon that passed the test

In 2009 the neuroscientist Craig Bennett and colleagues placed a whole Atlantic salmon — purchased at a market, definitively dead — into an fMRI scanner, showed it photographs of people in social situations, and "asked" it to judge their emotions. Then they analysed the scan the way much of the field then did: roughly 130,000 tiny brain regions, each tested separately for task-related activity, with no correction for the number of tests.

A small cluster of voxels in the dead fish's brain cavity lit up as statistically significant. The salmon, by the conventional threshold, was processing human emotion.

The study, which earned an Ig Nobel Prize in 2012, was a deliberate piece of statistical theatre. When each test carries its own uncorrected false-positive rate, 130,000 tests are all but certain to produce a scatter of spurious hits, and adjacent spurious voxels will sometimes form convincing-looking clusters. Its serious legacy is that multiple-comparison corrections, once skipped in a sizeable fraction of imaging papers, became impossible to omit politely.

One subject, ~130,000 hypothesis tests · After Bennett et al., 2009 · schematic
Fig. 2 — Post-mortem cognition, p < .001 (uncorrected). Test every voxel separately and chance alone decorates the scan with "activity" — occasionally in tidy clusters. The fish is not thinking about the photographs. The threshold is.
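To put rough numbers on the scan, the sketch below takes the ~130,000 tests from the text and the uncorrected p < .001 threshold from the Fig. 2 caption. It assumes every voxel is null, which is reasonable for a dead fish, and it is back-of-envelope arithmetic rather than a reconstruction of the original analysis. The variable names are illustrative.

```python
n_voxels = 130_000        # approximate number of separate tests (from the text)
alpha_per_voxel = 0.001   # uncorrected per-voxel threshold (from the Fig. 2 caption)

# Expected count of voxels that cross the threshold by luck alone.
expected_false_positives = n_voxels * alpha_per_voxel

# Per-voxel threshold that would hold the scan-wide error rate at 5% (Bonferroni).
bonferroni_threshold = 0.05 / n_voxels

print(f"Expected spurious voxels: about {expected_false_positives:.0f}")
print(f"Bonferroni per-voxel threshold: {bonferroni_threshold:.1e}")
```

That is roughly 130 spurious voxels expected from noise alone, against a corrected threshold near 3.8 × 10⁻⁷ per voxel.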

03 · The honest researcher's version

The garden of forking paths

The sharpshooter needn't be cynical. As Andrew Gelman and Eric Loken argued, a researcher who runs one analysis can still be implicitly choosing it from a garden of forking paths: exclude that outlier or keep it, adjust for age or don't, take the mean or the median, report endpoint A or endpoint B. Each choice is defensible; together they multiply into dozens of analyses that could have been run — and the one that reaches print is, naturally, one that "worked". No single dishonest step is taken, and the wall still ends up painted.

The known remedies all amount to fixing the target before firing. Pre-register the analysis. Correct the threshold for the number of comparisons — divide it Bonferroni-style, or control the false-discovery rate. Hold out data the choices never touched. And the cheapest remedy of all: treat any surprising, slice-specific finding as a hypothesis for the next study, not a conclusion from this one.
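Two of these corrections are short enough to sketch. The code below is a minimal illustration of a Bonferroni threshold and the Benjamini-Hochberg step-up procedure for controlling the false-discovery rate. The function names and both sets of p-values are hypothetical, invented for the example.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: find the largest rank k whose sorted p-value
    is at most (k / m) * alpha, then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_passing_rank = rank
    reject = [False] * m
    for index in order[:largest_passing_rank]:
        reject[index] = True
    return reject


# Hypothetical subgroup trawl: one p = .03 "hit" among twenty unremarkable results.
trawl = [0.03] + [0.20 + 0.04 * i for i in range(19)]
print("Bonferroni hits:", sum(bonferroni_reject(trawl)))
print("Benjamini-Hochberg hits:", sum(benjamini_hochberg_reject(trawl)))

# Hypothetical study with some strong signals, to show where the two methods differ.
strong = [0.0001, 0.004, 0.03, 0.5]
print("Bonferroni:", bonferroni_reject(strong))
print("Benjamini-Hochberg:", benjamini_hochberg_reject(strong))
```

In the first example neither method lets the lone p = .03 through, because twenty tests push the Bonferroni threshold down to .0025. In the second, Bonferroni keeps two results and Benjamini-Hochberg keeps three, which reflects the trade: the false-discovery approach tolerates a controlled share of false hits in exchange for fewer missed ones.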

Sixteen ways to analyse one dataset · Four defensible decisions · 2⁴ = 16 possible papers · the null is true throughout
Fig. 3 — One path gets published. Every leaf is the p-value of a legitimate-looking analysis of the same null data. A researcher wandering the garden — trying branches until something looks promising — will find the claret leaf without ever feeling dishonest. The other fifteen paths are never mentioned, and the reader cannot know they existed.
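The garden can be walked in simulation. The sketch below builds null datasets in which treatment plays no part, then runs all sixteen combinations of four binary decisions and records whether any path reaches p < .05. The four decisions (which outcome, an over-50 subgroup, dropping outliers, dropping the first ten patients per arm as a "pilot") are hypothetical stand-ins, not the ones behind Fig. 3, and the test uses a normal approximation that is slightly generous in the smaller subgroups.

```python
import itertools
import math
import random
import statistics


def p_value_mean_difference(a, b):
    """Two-sided p-value for a difference in means (Welch statistic, normal approximation)."""
    standard_error = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


def make_null_patient(rng):
    # Outcomes ignore treatment entirely, so the null is true by construction.
    return {"age": rng.uniform(20, 80), "outcome_a": rng.gauss(0, 1), "outcome_b": rng.gauss(0, 1)}


def analyse(treated, control, outcome, over_50_only, drop_outliers, drop_pilot):
    """Run one path through the garden and return its p-value."""
    def select(arm):
        rows = arm[10:] if drop_pilot else arm
        if over_50_only:
            rows = [row for row in rows if row["age"] > 50]
        values = [row[outcome] for row in rows]
        if drop_outliers:
            values = [value for value in values if abs(value) <= 2.0]
        return values
    return p_value_mean_difference(select(treated), select(control))


def smallest_p_over_all_paths(rng, n_per_arm=80):
    treated = [make_null_patient(rng) for _ in range(n_per_arm)]
    control = [make_null_patient(rng) for _ in range(n_per_arm)]
    forks = itertools.product(("outcome_a", "outcome_b"), (False, True), (False, True), (False, True))
    return min(analyse(treated, control, *fork) for fork in forks)


rng = random.Random(1)  # fixed seed so the run is repeatable
n_datasets = 500
share_with_a_significant_path = sum(
    smallest_p_over_all_paths(rng) < 0.05 for _ in range(n_datasets)
) / n_datasets
print(f"Null datasets with at least one significant path: {share_with_a_significant_path:.0%}")
```

Because the sixteen paths reuse the same data they are correlated, so the share should fall below the 1 − 0.95¹⁶ ≈ 56% that sixteen independent tests would give, while still sitting well above the nominal 5%.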

04 · Field notes

Painted targets in the wild

Clinical trials taught the lesson on purpose. The landmark ISIS-2 heart-attack trial (1988) demonstrated convincingly that aspirin saves lives — and its authors, to dramatise the perils of subgroup slicing, also reported that the benefit vanished for patients born under Gemini or Libra. Peter Austin's group later ran the joke in reverse, trawling Ontario hospital records by star sign and duly "finding" zodiac-linked diagnoses. Both were warnings, dressed as findings, about how easily slicing manufactures significance.

Finance backtests. Try a thousand trading rules on the same price history and dozens will be wildly "profitable" in-sample, for exactly the reason the dead salmon thought about photographs. The strategies that get sold are the painted targets; the out-of-sample future is the part of the wall nobody shot yet.
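A toy backtest shows the mechanism. In the sketch below the "price history" is a series of zero-drift random returns and each "trading rule" is just a random sequence of long and short positions, so no rule can have real skill. Both are deliberate simplifications assumed for illustration, as are the t-statistic cutoff of 2 and all the names.

```python
import math
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable
n_days = 500            # days in each half of the history
n_rules = 1000

# Zero-drift daily returns: there is nothing to predict.
returns = [rng.gauss(0.0, 0.01) for _ in range(2 * n_days)]


def t_stat(pnl):
    """Mean daily profit divided by its standard error."""
    return statistics.mean(pnl) / (statistics.stdev(pnl) / math.sqrt(len(pnl)))


# For each rule that looks "profitable" in-sample, keep its out-of-sample t-statistic.
out_of_sample_t_of_winners = []
for _ in range(n_rules):
    positions = [rng.choice((-1, 1)) for _ in range(2 * n_days)]
    pnl = [position * daily_return for position, daily_return in zip(positions, returns)]
    if t_stat(pnl[:n_days]) > 2.0:
        out_of_sample_t_of_winners.append(t_stat(pnl[n_days:]))

n_winners = len(out_of_sample_t_of_winners)
still_winning = sum(t > 2.0 for t in out_of_sample_t_of_winners)
print(f"Rules that looked profitable in-sample: {n_winners} of {n_rules}")
print(f"Of those, still 'profitable' out-of-sample: {still_winning}")
```

A cutoff of 2 corresponds to a one-sided false-positive rate of a little over 2%, so a couple of dozen of the thousand rules should pass in-sample, and almost none of them should repeat the feat on the second half.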

Cancer clusters and headline nutrition. The term "Texas sharpshooter" comes from epidemiology, where apparent disease clusters around some landmark must be weighed against the thousands of neighbourhoods in which nobody went looking. Food-and-health headlines run the same engine: survey hundreds of foods against dozens of outcomes and chance alone keeps the front pages stocked.

The protective question never changes: how many shots were fired before this target was painted? Ask how many subgroups, endpoints, foods or strategies were tested; whether the hypothesis predates the data; whether the threshold was corrected; whether it replicates on a wall the shooter hasn't seen. A genuine bullseye survives all four questions. A painted one rarely survives the first.
