Why can a 99% accurate test still be usually wrong?

Because accuracy describes how the test behaves on known cases, not how likely a positive result is to be correct, and for a rare condition the healthy majority produces far more false positives than the few true cases. If 1% of people have a condition and the test catches 99% of the sick while falsely flagging 5% of the healthy, about five in six positive results still come from healthy people. This is the base-rate fallacy: a positive result can only be interpreted in light of how common the condition was before testing.

The Base-Rate Fallacy — unspurious: a compendium of statistical illusions

01 · What just happened

Two percentages pointing in opposite directions

"99% accurate" describes the arrow from disease to test: given that you are sick, the test almost certainly says so. But a patient holding a positive result needs the arrow running the other way: given that the test said so, am I actually sick? Those are different numbers, and when the condition is rare they are not even close.

The reason is brute arithmetic. At a 1% base rate, the healthy outnumber the sick ninety-nine to one. Even a small 5% false-positive rate applied to that enormous majority produces a heap of false positives large enough to swamp the handful of true ones. In the figure above, out of 1,000 people tested, roughly fifty healthy people get flagged alongside ten genuinely sick ones, so a positive result is right only about one time in six.
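The arithmetic is short enough to check by hand or in a few lines of code. The sketch below assumes a population of 1,000 (a convenient round number, not something the test itself dictates) and uses hypothetical names such as `screen`; the rates are the ones quoted above.

```python
def screen(population, base_rate, sensitivity, false_positive_rate):
    """Return the expected (true_positives, false_positives) for one screening round."""
    sick = population * base_rate
    healthy = population - sick
    true_positives = sick * sensitivity              # sick people correctly flagged
    false_positives = healthy * false_positive_rate  # healthy people wrongly flagged
    return true_positives, false_positives


# Assumed population of 1,000; rates taken from the text.
tp, fp = screen(population=1_000, base_rate=0.01,
                sensitivity=0.99, false_positive_rate=0.05)
ppv = tp / (tp + fp)  # share of all positives that are genuine

print(f"true positives:  {tp:.1f}")
print(f"false positives: {fp:.1f}")
print(f"share of positives that are real: {ppv:.1%}")
```

The last figure is the one a patient holding a positive result actually cares about, and it comes out at about one in six.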

Nothing about the test is wrong. What's wrong is reading a statement about the sick as if it were a statement about the flagged.

Every accuracy claim is a fraction, and a fraction has a direction. Before being impressed by one, ask which way it points — and how rare the thing being hunted is.
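For readers who like to see the two directions side by side, the reversal can be written out with Bayes' theorem. The labels below (sick, healthy, and + for a positive result) are notation chosen here for illustration.

```latex
P(\text{sick} \mid +) =
  \frac{P(+ \mid \text{sick})\, P(\text{sick})}
       {P(+ \mid \text{sick})\, P(\text{sick}) + P(+ \mid \text{healthy})\, P(\text{healthy})}
```

The "99%" on the box is P(+ | sick), on the right-hand side. The number the patient wants is on the left, and the base rate P(sick) is what connects the two.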

02 · The doctors got it wrong too

Eddy's mammography problem

In 1982 the physician and decision scientist David Eddy put a version of this question to doctors: a woman's routine mammogram comes back positive; the cancer base rate is about 1%, the test catches roughly 80% of cancers and falsely flags about 10% of healthy women. What's the chance she has cancer? Most physicians answered around 75%. The right answer is under 10% — and study after study since, notably by Gerd Gigerenzer's group, has found the same wild overestimate among medical professionals.

Gigerenzer also found the antidote: stop talking in percentages and start counting people. Phrased as "natural frequencies" (out of 1,000 women, 10 have cancer, 8 of them test positive, and so do about 95 of the 990 without it), the answer practically computes itself: 8 true positives out of roughly 103 flags, or just under 8%. The count of 95 corresponds to a false-positive rate of 9.6%; a flat 10% would give 99 and barely move the answer. The tree below is the whole argument.

The mammography problem as a counting tree. Eddy's numbers: 1% prevalence · 80% sensitivity · about 90% specificity

Fig. 2 — Count people, not percentages. Follow 1,000 women through the screening. The positives (circled) collect 8 women with cancer and about 95 without — so a positive mammogram means cancer roughly 8% of the time, not 75%. Same information as the percentages; radically harder to fool yourself with.
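The head-count in the tree can be reproduced as a small sketch. The function name `counting_tree` is hypothetical, and the false-positive rate is set to 9.6%, an assumption chosen because it gives the "about 95" count used here (a flat 10% would give 99). Counts are rounded to whole people, since that is the point of the method.

```python
def counting_tree(population, prevalence, sensitivity, false_positive_rate):
    """Split a population into whole-person counts, as a natural-frequency tree does."""
    with_condition = round(population * prevalence)
    without_condition = population - with_condition
    true_positives = round(with_condition * sensitivity)
    false_positives = round(without_condition * false_positive_rate)
    return with_condition, without_condition, true_positives, false_positives


# 9.6% is an assumed false-positive rate, picked to match the "about 95" count.
sick, healthy, hits, false_alarms = counting_tree(1_000, 0.01, 0.80, 0.096)
flagged = hits + false_alarms

print(f"{sick} with cancer -> {hits} test positive")
print(f"{healthy} without cancer -> {false_alarms} test positive")
print(f"{hits} of {flagged} positives are real: {hits / flagged:.1%}")
```

Nothing here goes beyond the percentage version; the code simply keeps every intermediate quantity as a number of people, which is what makes the final ratio hard to misread.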

03 · The same test, everywhere on this curve

A positive result is only as strong as its base rate

Hold the test fixed — 99% sensitivity, 95% specificity — and let only the base rate move. The probability that a positive is real crawls along the curve below: a coin-flip's worth of evidence near 5% prevalence, near-certainty among high-risk patients, and almost worthless in mass screening of the healthy.

This is why the same test can be excellent in a specialist clinic and misleading as a population dragnet, and why doctors retest, combine tests, or test only the symptomatic. The instrument doesn't change between those settings. The base rate does.
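Retesting works because the first positive raises the base rate that the second test starts from. The sketch below illustrates that idea under a strong assumption, flagged loudly: the two tests err independently given the patient's true status. Real repeat tests often share error sources, so treat the second number as an upper-end illustration. The name `update` is hypothetical.

```python
def update(prior, sensitivity, specificity):
    """Probability of the condition after one positive result (Bayes' rule)."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


prior = 0.01  # base rate before any testing
after_first = update(prior, sensitivity=0.99, specificity=0.95)

# ASSUMPTION: the second test's errors are independent of the first's.
after_second = update(after_first, sensitivity=0.99, specificity=0.95)

print(f"after one positive:  {after_first:.1%}")
print(f"after two positives: {after_second:.1%}")
```

The instrument is identical both times. Only the starting probability changed, from 1% to roughly one in six, and that alone is enough to turn a weak result into a fairly strong one.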

Probability a positive result is true, by base rate. Sensitivity 99% · specificity 95%, held fixed throughout.

Fig. 3 — The curve nobody prints on the box. Below roughly 5% prevalence, this "99% accurate" test is wrong about most of the people it flags. The dashed line marks the coin-flip threshold.
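A few points on the curve, and the base rate at which it crosses the coin-flip line, can be computed directly. This is a minimal sketch: the function name `ppv` (positive predictive value) and the sample base rates are choices made here for illustration, while the sensitivity and specificity are the fixed values from the figure.

```python
SENSITIVITY = 0.99
SPECIFICITY = 0.95


def ppv(base_rate, sensitivity=SENSITIVITY, specificity=SPECIFICITY):
    """Probability that a positive result is a true positive."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Sample base rates chosen for illustration.
for base_rate in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(f"base rate {base_rate:>6.1%}  ->  positive is real {ppv(base_rate):.1%}")

# Solve sensitivity * p = (1 - specificity) * (1 - p) for p:
# the base rate at which true and false positives are equally numerous.
threshold = (1 - SPECIFICITY) / (SENSITIVITY + (1 - SPECIFICITY))
print(f"coin-flip threshold: {threshold:.2%}")
```

The threshold lands a little under 5%, which is where the curve meets the dashed line.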

04 · When the fallacy reaches a courtroom

The prosecutor's fallacy

The deadliest version swaps the conditionals in front of a jury. "The probability of this evidence arising by innocent coincidence is one in millions" gets heard as "the probability the defendant is innocent is one in millions". Those are different claims — the first ignores the base rate of guilt entirely, and when the pool of innocent explanations or innocent people is large, the gap between them is enormous.

The most cited British example is the case of Sally Clark, convicted in 1999 after two of her infant sons died suddenly. An expert witness told the jury that the chance of two natural cot deaths in such a family was one in 73 million. That figure was obtained by squaring the odds of a single cot death, which ignored the known clustering of such deaths in families, and, more fundamentally, it was never the number the jury needed. The relevant comparison was between two rare explanations: double cot death against double murder, the latter rarer still. The Royal Statistical Society protested publicly, and the conviction was quashed in 2003.
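A rough way to write down the comparison the jury needed, under the simplifying assumption (made here for illustration, not taken from the trial record) that double natural death and double murder are the only two explanations for two infant deaths in one family:

```latex
P(\text{natural} \mid \text{two deaths}) =
  \frac{P(\text{two natural deaths})}
       {P(\text{two natural deaths}) + P(\text{two murders})}
```

If the murder term in the denominator is the smaller of the two, the innocent explanation is the more probable one, however tiny both numbers are on their own.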

The same structure haunts DNA database trawls: a one-in-a-million match probability sounds damning, until a database of several million people is searched and a handful of innocent matches are to be expected by chance alone.
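The database effect is easy to put numbers on. The sketch below assumes a database of 5 million profiles (one reading of "several million"), treats every profile as an independent innocent person, and uses the hypothetical name `innocent_matches`. These are illustrative assumptions, not figures from any real case.

```python
def innocent_matches(database_size, match_probability):
    """Expected coincidental matches, and the chance of at least one.

    Assumes every profile belongs to an innocent person and that
    coincidental matches occur independently.
    """
    expected = database_size * match_probability
    p_at_least_one = 1 - (1 - match_probability) ** database_size
    return expected, p_at_least_one


# Assumed database size: 5 million profiles.
expected, p_any = innocent_matches(database_size=5_000_000, match_probability=1e-6)

print(f"expected innocent matches: {expected:.1f}")
print(f"chance of at least one:    {p_any:.1%}")
```

Under these assumptions a coincidental match is close to certain before any suspect is considered, which is why a trawl hit needs corroborating evidence to carry weight.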

05 · Field notes

Rare things, flagged by the million

Security screening. Hunt for something with a base rate near zero — a terrorist among air passengers, say — and even a fantastically accurate detector will flag almost exclusively innocent people. The arithmetic of the curve above guarantees it before a single sensor is built.

Workplace drug tests and spam filters. Both live or die by the base rate of what they're screening. A filter that's 99% accurate, pointed at a clean inbox where spam is well under 1% of messages, mostly flags real mail; the same filter on a flooded inbox performs beautifully.

Mass medical screening. Screening millions of symptomless people for a rare disease produces false positives by the thousand — which is why a positive screen is the start of a diagnostic process, never the end, and why screening programmes weigh the anxiety and harm of those false alarms in the balance.

The habit that protects you costs one sentence: before reading any test result, ask how common the thing was before the test. Then count people, the way Figure 2 does. The fallacy survives on percentages; it rarely survives a head-count.

The test is 99% accurate. Your positive result probably isn't.