lainlog

part 6 of 9 · evals series

When 84% beats 81% (and when it doesn't)

Statistics for eval engineers, in three formulas.

12 min read

1 · what this is about
What this article is, in one paragraph.

A pass rate is a coin-flip estimate. The error bar around it is the range of pass rates compatible with what you actually measured. This article covers the small subset of statistics that decides whether your eval improvement is real: error bars on a pass rate, paired comparison between two model versions, and how big your eval set needs to be before the numbers can tell you anything at all.

Most teams ship “improvements” by comparing two pass rates and squinting. The new version got 84%, the old one got 81%, the deck says +3 points, the deploy ships. Six weeks later someone runs the eval again and the new version scores 80%. Nobody knows whether the regression is the model, the prompt, the evaluator, or the noise floor — because nobody put error bars on the chart in the first place. Three points is either signal or noise, and the headline number doesn't say which.

This is for engineers shipping evals on a real product — you've read the primer, you have a spreadsheet, you run it on every prompt change. By the end you'll have three formulas, two intuitions, and a Python recipe that turn a pass-rate dashboard from an opinion into a measurement.

2 · error bars
What an error bar on a pass rate actually says.

A pass rate is a sample mean of binary outcomes. Each prompt either passed or failed, and you took the average. The standard error of that average — how much it would jitter if you re-drew the same number of prompts from the same distribution — has a one-line formula:

SE = sqrt( p̂ · (1 − p̂) / n )

The standard error of a pass rate scales with one over the square root of n. Quadruple the eval set, halve the noise. A 95% confidence interval is the rate plus or minus 1.96 standard errors — close enough to two for everyday use.

Plug in the numbers. At a pass rate of 81% on 100 prompts, the standard error is sqrt(0.81 · 0.19 / 100) ≈ 0.039, and the 95% interval runs from about 73% to 89%. At 81% on 1000 prompts, the standard error is 0.0124 and the interval narrows to roughly 78.5% to 83.5%. The number on the dashboard is the same. The interval around it is more than three times tighter.
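In code this is one line plus a square root. A minimal sketch, assuming scipy is installed; the helper name pass_rate_ci is ours, and the exact Clopper–Pearson interval from scipy.stats.binomtest is the one the small-N caveat later falls back to:

```python
import math
from scipy.stats import binomtest

def pass_rate_ci(passes: int, n: int, z: float = 1.96):
    """Normal-approximation interval for a pass rate: p-hat +/- z * SE."""
    p = passes / n
    se = math.sqrt(p * (1 - p) / n)          # SE = sqrt(p(1-p)/n)
    return p, se, (p - z * se, p + z * se)

# 81 passes out of 100 prompts
p, se, (lo, hi) = pass_rate_ci(81, 100)
print(f"pass rate {p:.2f}, SE {se:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# pass rate 0.81, SE 0.039, 95% CI [0.733, 0.887]

# Exact (Clopper-Pearson) interval -- wider, and the honest choice below ~100 prompts
exact = binomtest(81, 100).proportion_ci(confidence_level=0.95)
print(f"exact 95% CI [{exact.low:.3f}, {exact.high:.3f}]")
```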

This is the first thing to internalise. The pass rate is the centre of an interval, not a point. When the new version scores 84% on the same 100 prompts, its interval runs from roughly 76% to 91%. The two intervals overlap by more than ten percentage points. There's no honest way to say which is better from these numbers alone.

The half-width 1.96 · SE is what the working eval engineer actually wants. At n = 100 it sits between roughly ±6 and ±10 percentage points across the realistic pass-rate range. Any improvement smaller than that is inside the noise, and you cannot tell the post-change rate apart from the pre-change rate. Two pass rates aren't comparable until you've drawn the bars.

3 · signal or noise
Watch the bars overlap and separate.

Below: model A is fixed at 81%; model B is whatever you drag it to. Slide N from a hand-graded 20 up to a serious 2000. Toggle between unpaired (two independent runs) and paired (the same prompts ran on both models). The whiskers are 95% intervals. The verdict line says whether the comparison would survive a statistician.

[interactive: overlapping bars · paired vs unpaired. At the defaults, Δ = 3.0 pp · ±10.5 pp. Verdict: overlapping — the difference is inside the noise floor. Two unpaired runs — every prompt's difficulty enters the variance twice.]

At small N, even a five-point gap is inside the bars. At large N, a one-point gap separates clearly. The toggle tells the same story a different way: pair the runs and the bars on the difference shrink, often enough to flip the verdict from overlap to separated without your touching N at all. Pairing is the cheapest eval improvement most teams haven't made yet.

4 · pair the runs
Run both models on the same prompts.

A paired comparison runs both versions on the same eval set, then takes the difference per prompt. Some prompts are easy and both models pass them; some prompts are hard and both fail; neither tells you anything about which model is better. The information lives in the prompts where the two models disagree.

The arithmetic is the same shape as before — a sample mean of binary outcomes — but the binary is now did A win on this prompt, not did A pass this prompt. The variance of the per-prompt difference is what determines your error bars; and that variance is reliably smaller than the variance of two independent rates, because question difficulty cancels out.

Evan Miller's 2024 paper Adding Error Bars to Evals calls this a “free” reduction in estimator variance. With realistic correlations between two models' per-prompt scores — Anthropic's blog cites 0.3 to 0.7 for frontier models — the variance of the difference drops by a third to a half. Half the variance is half the required eval-set size. Pairing is exactly as cheap as not pairing: same prompts, same runs, same outputs. You just score the difference, not two means.
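Here is a sketch of what "score the difference" means in practice, on synthetic per-prompt outcomes. The arrays, the copy probability, and the 0.885 constant are made up purely to induce a realistic correlation (~0.6) between the two models' scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# synthetic 0/1 outcomes: B copies A on 60% of prompts, so the scores are correlated
score_a = rng.binomial(1, 0.81, size=n)
copy = rng.random(n) < 0.6
score_b = np.where(copy, score_a, rng.binomial(1, 0.885, size=n))  # B's marginal rate lands near 84%

diff = score_b - score_a                              # per-prompt difference in {-1, 0, +1}

# unpaired: the variance of each rate enters separately
se_unpaired = np.sqrt(score_a.var(ddof=1) / n + score_b.var(ddof=1) / n)
# paired: only the variance of the per-prompt difference matters; shared difficulty cancels
se_paired = diff.std(ddof=1) / np.sqrt(n)

print(f"Δ = {diff.mean():+.3f}  unpaired SE {se_unpaired:.4f}  paired SE {se_paired:.4f}")
# the paired SE comes out roughly a third to a half smaller, as the correlation predicts
```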

For binary outcomes specifically, the canonical paired test is McNemar's. Build a 2×2 table:

a = both pass · b = A pass, B fail · c = A fail, B pass · d = both fail

Cells a and d — the prompts both models got the same way — carry no information about which is better. The whole comparison reduces to: of the prompts where they disagreed, did A win more often than B? That is a binomial test on b + c trials with null p = 0.5. Two lines of Python.
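Those two lines, with hypothetical disagreement counts consistent with the 84%-versus-81% example on 100 prompts (b and c are made up; only c − b = 3 is pinned down by the two pass rates):

```python
from scipy.stats import binomtest

# hypothetical split of the 100-prompt run: A 81%, B 84%, 21 disagreements
b, c = 9, 12                                    # b = A pass & B fail, c = A fail & B pass
print(binomtest(c, n=b + c, p=0.5).pvalue)      # ~0.66 -- the 3-point gap could easily be luck
```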

One last note. McNemar discards the prompts both models agreed on. If b + c is small — say, fewer than 25 — the asymptotic chi-square form is unreliable; use the exact binomial form above. That's why scipy.stats.binomtest is the honest default here, not chi2_contingency. The cheapest improvement to your eval is running both models on the same prompts.

5 · sample size
How big does the eval set need to be?

The third formula is the one you run before you build the eval set, not after. To detect a real δ between two models, with significance α and power 1 − β, the eval set you need is roughly:

n ≈ (zα/2 + zβ)² · σ² / δ²

The standard 95%-confidence, 80%-power numbers are zα/2 = 1.96 and zβ = 0.84. The variance σ² depends on whether you pair — for binary scores around an 80% pass rate, σ² is roughly 0.32 unpaired and 0.16 paired. The shape of the formula is what matters: halve the effect you want to detect, and N quadruples. There is no other way around it.
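As a sketch, the same formula as a small helper; the function name required_n and its defaults are ours, not from any library:

```python
from scipy.stats import norm

def required_n(delta: float, sigma_sq: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Prompts needed to detect a pass-rate difference of `delta` (on the 0-1 scale)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for 95% confidence
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    return (z_alpha + z_beta) ** 2 * sigma_sq / delta ** 2

print(round(required_n(0.03, sigma_sq=0.32)))   # unpaired, ~2,800 prompts
print(round(required_n(0.03, sigma_sq=0.16)))   # paired,   ~1,400 prompts
```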

[interactive: sample-size calculator · how many prompts do you need? Defaults: Δ = 3 pp · 95% confidence · 80% power. Pairing cuts the required N by about 50% for these settings. Halve the effect you want to detect, and N quadruples.]

The defaults in the calculator answer the question every eval shipper actually has: how many prompts do I need to tell a 3-point change apart from noise? At 95% confidence and 80% power the unpaired answer is around 2,800 prompts; paired, it's around 1,400. Miller's recommendation that “new evals should contain at least 1,000 questions in order to have good signaling ability” is roughly the floor for 5-point unpaired comparisons; pair the runs and you halve it. A 2-point effect, paired, lands you closer to 3,000.

Two practical exits, when the formula tells you a number you can't afford. First: stop trying to detect a 2-point change. Either the change is bigger than that and you'll see it at 1,000 prompts, or it isn't and your time is better spent elsewhere. Second: pair everything you can. The same 1,000 prompts run on both versions teach you twice as much as 1,000 prompts scored as two independent runs.

And the small-N caveat. Below roughly 100 prompts, CLT-based intervals systematically understate uncertainty. For a 30-prompt human-graded set, drop the formula and use the exact binomial CI from scipy.stats.binomtest shown above. The interval will be wider; that's the point.

6 · the recipe
The recipe — five lines on top of every eval dashboard.

You don't need a stats team. You need five lines of discipline applied to a number you're already computing:

put a 95% interval on every pass rate
test A-vs-B on the disagreements with McNemar's exact binomial
switch to the exact binomial interval below roughly 100 prompts
run both versions on the same prompts and score the per-prompt difference
size the eval set with the power formula before you build it

Three of these five are arithmetic — they cost nothing but the discipline of writing them down. The other two (pairing, sample-size planning) cost a meeting with whoever owns the eval set. None of them cost a model retrain, a vendor licence, or a quarterly roadmap.
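A rough sketch of what those lines look like stacked on top of an existing dashboard, assuming you already have per-prompt 0/1 outcomes for both versions on the same prompts; the function name compare and its output format are ours:

```python
import numpy as np
from scipy.stats import binomtest

def compare(old: np.ndarray, new: np.ndarray) -> None:
    """old, new: 0/1 outcomes for the SAME prompts, in the same order."""
    n = len(old)
    for name, scores in (("old", old), ("new", new)):
        ci = binomtest(int(scores.sum()), n).proportion_ci(confidence_level=0.95)
        print(f"{name}: {scores.mean():.1%}  95% CI [{ci.low:.1%}, {ci.high:.1%}]")

    wins_old = int(((old == 1) & (new == 0)).sum())   # b cell: old pass, new fail
    wins_new = int(((old == 0) & (new == 1)).sum())   # c cell: old fail, new pass
    disagreements = wins_old + wins_new
    p = binomtest(wins_new, disagreements, 0.5).pvalue if disagreements else 1.0
    print(f"Δ = {(new.mean() - old.mean()) * 100:+.1f} pp "
          f"on {disagreements} disagreements, exact McNemar p = {p:.3f}")
```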

One trap to avoid: clustered prompts.

If your eval set has groups of related prompts — five questions about the same contract, four turns of the same troubleshooting flow — the questions within a cluster are correlated. The naive standard-error formula assumes independent draws and understates uncertainty by up to 3×. The fix is clustered standard errors — a topic large enough to skip here. The cheap version: when you sample for an eval set, sample the cluster, not the individual question. One contract, one ticket-resolution flow, one test row.

7 · the point
The point.

84% beats 81% when the bars don't overlap, the runs were paired on the same prompts, and N is large enough to resolve the gap. It doesn't when any of those three fails — and the headline number doesn't tell you which. A pass rate without an error bar is an opinion. A pass rate with one is a measurement. The difference is whether you can argue about the rubric, or whether you're still arguing about the noise.

Put the bars on the chart.