Why is correlation not causation?

Because two things can move together for reasons that have nothing to do with one causing the other — most often because both are driven by a third factor, or because both simply trend over time and happen to drift in step. Independent random data routinely produces strong correlations, especially when the series are trending, so a high correlation coefficient on its own is not evidence of a real relationship. Establishing causation needs a plausible mechanism, controls for confounders, and ideally an experiment — not just a correlation.

The Spurious Correlation Machine — watch random data correlate

01 · What you're seeing

A high correlation is the easiest thing in the world to find

The two series the machine draws share no cause, no link, no contact of any kind — they are separate calls to a random number generator. And still, press after press, their correlation climbs into territory that in a research paper would read as a strong, publishable finding. The lesson is uncomfortable: a large correlation coefficient, on its own, is almost worthless as evidence that two things are related. It is a starting point for investigation, not a conclusion.

The trick the machine exploits is that we instinctively read a correlation as if it were rare and meaningful — as if two lines tracking each other must be connected. They needn't be. Given enough wiggle room, unrelated numbers cohere all the time, and our pattern-hungry eyes supply the story for free.

Switch the machine to flat “coin-flip noise” and the spurious correlations grow much rarer. That single toggle is the whole insight: it is trends, not randomness in general, that make fake correlations so easy.

02 · Why trends are the trap

Anything that drifts over time will seem to move together

The default setting uses random walks — series where each step nudges up or down from the last, so the line wanders and drifts the way real-world time series do: populations, prices, temperatures, your follower count. Two independent random walks are notorious for looking correlated, because each one tends to spend long stretches drifting in a single direction. When both happen to drift upward over the same window — as trending data so often does — the correlation coefficient lights up, despite there being no link whatsoever.

Trends manufacture correlation; flat noise doesn't. Simulated, sequences of 24 points.

Fig. 2 — The danger is the drift. Run the experiment thousands of times. Two unrelated random walks clear a correlation of 0.8 a remarkable share of the time, while two flat, trendless noise series almost never do. This is the heart of what economists call spurious regression: correlate any two quantities that both trend over time — ice-cream sales and drowning deaths, a stock and the weather — and you are likely to get a strong relationship built from nothing but shared drift.
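The experiment behind the figure is easy to approximate at home. The sketch below is not the machine's own code: it assumes Gaussian steps for the random walk and Gaussian draws for the flat noise, and the names `pearson`, `random_walk`, `flat_noise` and `share_above` are made up for illustration. It counts how often two independent 24-point series clear |r| > 0.8 under each setting.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def flat_noise(n, rng):
    """Trendless noise: every point is an independent draw (assumed Gaussian)."""
    return [rng.gauss(0, 1) for _ in range(n)]


def random_walk(n, rng):
    """Drifting series: each point is the previous point plus a random step."""
    level, series = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        series.append(level)
    return series


def share_above(make_series, threshold=0.8, n_points=24, trials=5000, seed=0):
    """Fraction of independent pairs whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r = pearson(make_series(n_points, rng), make_series(n_points, rng))
        if abs(r) > threshold:
            hits += 1
    return hits / trials


walk_share = share_above(random_walk)
noise_share = share_above(flat_noise)
print(f"random walks with |r| > 0.8: {walk_share:.1%}")
print(f"flat noise with |r| > 0.8:   {noise_share:.1%}")
```

The exact shares depend on the seed and on the step distribution assumed here, but the gap between the two settings should be hard to miss.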

This is why correlating two time series is one of the most treacherous things you can do with data, and why careful analysts detrend first or model the change rather than the level. The famous galleries of absurd correlations — cheese consumption against bedsheet fatalities, a film star's releases against drowning rates — are funny precisely because every series in them is trending, which all but guarantees a few will line up.
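"Model the change rather than the level" can be shown in a few lines. The sketch below is an illustration under assumptions, not a recipe from the machine: it builds independent Gaussian random walks (the helper names are hypothetical), then compares how often the raw levels clear |r| > 0.8 against how often their step-to-step changes do.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def random_walk(n, rng):
    """Each point is the previous point plus an assumed Gaussian step."""
    level, series = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        series.append(level)
    return series


def first_differences(series):
    """Replace levels with changes: x[t] - x[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(7)
trials = 3000
level_hits = 0
change_hits = 0
for _ in range(trials):
    a = random_walk(24, rng)
    b = random_walk(24, rng)
    if abs(pearson(a, b)) > 0.8:
        level_hits += 1
    if abs(pearson(first_differences(a), first_differences(b))) > 0.8:
        change_hits += 1

level_share = level_hits / trials
change_share = change_hits / trials
print(f"levels with |r| > 0.8:  {level_share:.1%}")
print(f"changes with |r| > 0.8: {change_share:.1%}")
```

Differencing strips out the accumulated drift, so what remains is the trendless step noise, and the strong correlations should all but vanish. Real series may need other treatments (removing a fitted trend, for instance); first differences are simply the easiest to demonstrate.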

03 · The hunt makes it worse

Search enough pairs and a stunner is guaranteed

Press the machine's Hunt button and it stops drawing one pair at a time and instead rifles through hundreds, keeping the most extreme correlation it stumbles on. Within a second or two it will hand you something near-perfect. That isn't luck — it's arithmetic. Each pair has some modest chance of looking strongly correlated; check enough of them and finding at least one becomes a near-certainty.
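A hunt of this kind takes only a short loop. The version below is a guess at the idea rather than the machine's actual implementation: `hunt` is a hypothetical name, the walks use assumed Gaussian steps, and 300 pairs stands in for "hundreds".

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def random_walk(n, rng):
    """Each point is the previous point plus an assumed Gaussian step."""
    level, series = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        series.append(level)
    return series


def hunt(n_pairs=300, n_points=24, seed=1):
    """Draw n_pairs independent pairs and keep the most extreme correlation."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_pairs):
        r = pearson(random_walk(n_points, rng), random_walk(n_points, rng))
        if abs(r) > abs(best):
            best = r
    return best


single = hunt(n_pairs=1)
hunted = hunt(n_pairs=300)
print(f"first pair drawn:  r = {single:+.3f}")
print(f"best of 300 pairs: r = {hunted:+.3f}")
```

Because both calls share a seed, the first pair is the same in each run; everything the hunt gains comes from being allowed to keep looking.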

Fishing turns a slim chance into a sure thing. Probability of finding at least one |r| > 0.9.

Fig. 3 — The multiple-comparisons multiplier. A single pair of trending series rarely clears 0.9. But the chance of finding at least one such pair climbs fast as you check more, reaching near-certainty within a few dozen. This is the same engine behind the Texas sharpshooter and p-hacking: when you get to pick the winner after the fact, randomness will always supply one.
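The arithmetic is a one-liner. If a single pair clears the bar with probability p, and the pairs are independent, the chance that at least one of n pairs clears it is 1 - (1 - p)^n. The value of p used below (0.05) is a placeholder chosen purely for illustration, not a figure measured from the machine, and `chance_of_at_least_one` is a hypothetical helper name.

```python
def chance_of_at_least_one(p_single, n_pairs):
    """Probability that at least one of n independent pairs clears the bar."""
    return 1.0 - (1.0 - p_single) ** n_pairs


P_SINGLE = 0.05  # assumed per-pair chance, for illustration only

for n in (1, 10, 25, 50, 100):
    print(f"{n:>3} pairs checked: {chance_of_at_least_one(P_SINGLE, n):.1%}")
```

Whatever the true per-pair chance is, the shape is the same: each extra pair checked multiplies the odds of coming up empty by (1 - p), so the probability of finding nothing shrinks geometrically.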

Anyone with a large enough pile of variables can therefore “discover” a jaw-dropping correlation and present just that one, with the hundreds of failures politely omitted. The correlation is real in the data; the impression it gives — that something meaningful was found — is the illusion.

04 · How to use it

Three questions for any correlation you meet

The machine is a vaccine. Once you have watched random numbers produce a dozen gorgeous correlations, the next “study finds X linked to Y” headline lands differently. Three questions defuse most of them:

Is there a plausible mechanism? Correlation is only the beginning of a causal claim; without a story for how one thing could move the other — one that was proposed before the data was dredged — a strong r is just a coincidence with good lighting.

Are both things trending over time? If so, treat the correlation as guilty until proven innocent. Shared drift is the single most common source of spurious relationships, and it explains a startling fraction of “surprising link” stories.

How many things were compared? One correlation reported from a search of thousands is not a finding; it is the survivor of a lottery. Ask what else was tested and quietly set aside.

A correlation answers “do these move together?” It never answers “why?” — and the machine proves the first question can be a resounding yes when the honest answer to the second is “no reason at all.”

That is the whole of it. Correlation is a question, not an answer; trends make the question lie; and a long enough search will always turn up a beauty. The rest of the compendium is built on the same habit — distrust the pattern until you understand the process that made it.

Two random number generators, walking hand in hand.