unspurious.

The inference illusions · An interactive tool

Find a discovery in pure noise.

Below is a trial of a drug that does nothing. Your job: torture the data, legally, until it confesses. Then we'll count the shots you fired.

"NeuroCalm" vs placebo: 240 volunteers, five outcomes measured. The treatment has zero real effect. Every choice below is one a real researcher might defend.
Controls: primary outcome · which people to look at · unusual responders · adjust for age · test direction
Headline this analysis would print: p-value 0.50
Specifications tried: 1 / 200 · best p found: 0.50 · significant ones you've hit: 0

The wall, with the bullets shown

Every analysis you could have run, at once

The data: 0 true effect of the drug. It was random numbers the whole time.
Paths that "worked": 13 of 200 possible analyses crossed p < .05, purely by chance.
Expected by luck: ~10. At a 1-in-20 threshold, 200 shots are bound to land a few.
The specification curve. Each dot is one of the 200 defensible analyses of this exact dataset, sorted from most to least significant. The claret dots dipped below the famous .05 line — and there are about as many as random chance predicts. The lower line is where the threshold should sit once you admit how many analyses were on the table (a Bonferroni correction, .05 ÷ 200). Nothing comes close to it. The drug never did anything; the garden of forking paths did all the work.
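The arithmetic behind those two lines can be checked in a few lines of Python. This is a rough sketch, not the tool's own code: the variable names are made up here, and the last figure assumes the 200 analyses are independent, which they are not (they reuse the same volunteers), so treat it as a ballpark.

```python
alpha = 0.05        # the conventional 1-in-20 threshold
n_analyses = 200    # defensible analyses on the table

# False positives expected from noise alone.
expected_by_luck = alpha * n_analyses
# Bonferroni: split the threshold evenly across every analysis.
bonferroni_threshold = alpha / n_analyses
# Chance of at least one "hit" IF the analyses were independent (an assumption).
p_at_least_one = 1 - (1 - alpha) ** n_analyses

print(f"expected by luck: {expected_by_luck:.0f}")
print(f"Bonferroni threshold: {bonferroni_threshold:.5f}")
print(f"P(at least one p < .05): {p_at_least_one:.5f}")
```

Under that independence assumption, coming away with no significant result at all would be the real surprise.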
How to play. Start from the honest default — everyone, the primary outcome, no tricks — and it reads a thoroughly boring p = 0.50. Now go hunting. Switch outcomes, slice to a subgroup, drop the "outliers", adjust for age, flip to a one-tailed test. Each move is individually defensible, and somewhere in the combinations a p < .05 is almost certainly waiting. When you find one, hit reveal.

01 · You didn't cheat. That's the point.

No single dishonest step

Notice what you didn't do. You didn't alter a number, delete an inconvenient person, or run the same test twice. Every choice on the panel — which outcome to feature, whether to focus on the subgroup where the drug "obviously" helps, whether to trim implausible responders, whether to adjust for an imbalance in age — is a choice a careful, honest researcher makes every day, and can defend in a methods section.

The trouble is only visible from above: those defensible choices multiply. Five outcomes times five subgroups times two outlier rules times two adjustments times two test directions is two hundred analyses, and you reported the one that worked. The statisticians Andrew Gelman and Eric Loken called this the garden of forking paths — you needn't consciously try all two hundred for the logic to bite. If you'd have stopped hunting the moment any result turned up, the effective number of shots is still two hundred.
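To see how quickly the choices multiply, the grid can be enumerated directly. A minimal sketch: the option labels below are illustrative placeholders (the tool's actual outcome and subgroup names are not assumed), and only the counts of five, five, two, two and two come from the text.

```python
from itertools import product

# Placeholder labels; only the number of options per dial matters here.
outcomes = [f"outcome {i}" for i in range(1, 6)]
subgroups = ["everyone"] + [f"subgroup {c}" for c in "ABCD"]
outlier_rules = ["keep all", "drop unusual responders"]
age_adjustment = ["unadjusted", "adjusted for age"]
test_direction = ["two-tailed", "one-tailed"]

# Every combination of one setting per dial is one defensible analysis.
specifications = list(
    product(outcomes, subgroups, outlier_rules, age_adjustment, test_direction)
)

honest_default = specifications[0]  # first option on every dial
print(len(specifications))  # 200
print(honest_default)
```

Adding one more two-way dial would double the count to 400, which is why the number of paths grows much faster than the number of decisions.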

A p-value answers "how surprising is this one result?" It cannot see the other 199 analyses you'd have accepted. Only you know how many shots were really fired — and the published paper never shows the wall.
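The whole wall can be simulated with nothing but the standard library. The sketch below is an illustration under stated assumptions, not the tool's implementation: the five subgroup definitions, the 2-standard-deviation trimming rule, the age adjustment by simple residualising, and the normal approximation used for the p-value are all hypothetical choices made for this example.

```python
import itertools
import random
from statistics import NormalDist, fmean, stdev

random.seed(0)  # arbitrary seed so the run is repeatable
N_OUTCOMES = 5

# 240 hypothetical volunteers; every outcome is pure noise, unrelated to the arm.
people = [
    {
        "treated": i % 2 == 0,
        "age": random.randint(18, 80),
        "female": random.random() < 0.5,
        "outcomes": [random.gauss(0, 1) for _ in range(N_OUTCOMES)],
    }
    for i in range(240)
]

# Hypothetical subgroups (five, to match the 5 x 5 x 2 x 2 x 2 grid).
SUBGROUPS = {
    "everyone": lambda p: True,
    "women": lambda p: p["female"],
    "men": lambda p: not p["female"],
    "under 50": lambda p: p["age"] < 50,
    "50 and over": lambda p: p["age"] >= 50,
}


def p_value(outcome, subgroup, trim, adjust, one_tailed):
    rows = [p for p in people if SUBGROUPS[subgroup](p)]
    ages = [p["age"] for p in rows]
    ys = [p["outcomes"][outcome] for p in rows]
    if adjust:
        # Residualise the outcome on age with a one-predictor least-squares fit.
        ma, my = fmean(ages), fmean(ys)
        sxy = sum((a - ma) * (y - my) for a, y in zip(ages, ys))
        sxx = sum((a - ma) ** 2 for a in ages)
        ys = [y - (sxy / sxx) * (a - ma) for a, y in zip(ages, ys)]
    if trim:
        # Drop "unusual responders": more than 2 SDs from the subgroup mean.
        m, s = fmean(ys), stdev(ys)
        keep = [abs(y - m) <= 2 * s for y in ys]
    else:
        keep = [True] * len(ys)
    drug = [y for y, p, k in zip(ys, rows, keep) if k and p["treated"]]
    plac = [y for y, p, k in zip(ys, rows, keep) if k and not p["treated"]]
    # Unequal-variance z statistic; normal approximation in place of a t-test.
    se = (stdev(drug) ** 2 / len(drug) + stdev(plac) ** 2 / len(plac)) ** 0.5
    z = (fmean(drug) - fmean(plac)) / se
    upper = 1 - NormalDist().cdf(z)  # one-tailed: "the drug group scored higher"
    return upper if one_tailed else 2 * min(upper, 1 - upper)


specs = list(
    itertools.product(
        range(N_OUTCOMES), SUBGROUPS, [False, True], [False, True], [False, True]
    )
)
pvals = [p_value(*spec) for spec in specs]
hits = sum(p < 0.05 for p in pvals)
print(f"{hits} of {len(pvals)} analyses reached p < .05; smallest p = {min(pvals):.4f}")
```

The exact count depends on the seed, and because the 200 analyses share the same volunteers their p-values are correlated, so a single run can land above or below the ten that the 1-in-20 rate suggests.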

02 · The honest fixes

Paint the target first

Every cure amounts to fixing the target before firing. Pre-registration commits you to one outcome, one subgroup and one test before you see the data, so there is only ever one shot. Correcting for multiplicity (dividing the threshold Bonferroni-style, or controlling the false-discovery rate) tightens the significance threshold to match the number of analyses, which is why the reveal's second line sits so far down. A held-out sample lets the hunt generate a hypothesis on one half and test it honestly on the other. And the cheapest discipline of all: treat any subgroup-specific, outlier-trimmed, one-tailed surprise as a question for the next study, never a conclusion from this one.
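The two multiplicity corrections named above can be sketched in a few lines. This is a minimal illustration, assuming the p-values are already in hand: the function names and the ten example p-values are made up for the demonstration, and the false-discovery-rate method shown is the Benjamini-Hochberg step-up procedure, one common way of controlling that rate.

```python
def bonferroni(pvals, alpha=0.05):
    """Flag p-values that clear alpha divided by the number of tests."""
    threshold = alpha / len(pvals)
    return [p <= threshold for p in pvals]


def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure at false-discovery rate q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value sits under its own line.
        if pvals[i] <= rank / m * q:
            cutoff_rank = rank
    flagged = [False] * m
    for i in order[:cutoff_rank]:
        flagged[i] = True
    return flagged


# Made-up p-values, for illustration only.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = sum(p < 0.05 for p in example)
print("uncorrected:", naive)
print("Benjamini-Hochberg:", sum(benjamini_hochberg(example)))
print("Bonferroni:", sum(bonferroni(example)))
```

On these made-up values the uncorrected rule flags five results, Benjamini-Hochberg two and Bonferroni one. Bonferroni is the stricter of the two; which one suits a study depends on whether a single false positive or a modest share of them is the thing to guard against.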

The specification curve in the reveal is itself a fix. Plotting all the analyses, rather than one, is the modern multiverse approach: if a finding is real, it tends to survive most reasonable choices; if it vanishes the moment you nudge a defensible dial, it was probably a painted target.

03 · Read more

The salmon, the star signs, and the rest

This sandbox is the playable companion to the compendium's entry on the Texas sharpshooter fallacy, home to the dead salmon that "passed" an fMRI test, the heart-attack trial whose subgroup analysis found aspirin useless for patients born under Gemini or Libra, and the finance backtests that manufacture profitable strategies from noise. If torturing this dataset was satisfying, the entry is where the same trick gets caught in the wild.
