lainlog

part 9 of 9 · evals series

The benchmarks every dev sees, and what they actually measure

LLM benchmarks, honestly read.

9 min read

1 · what this is about
What this article is, in one paragraph.

The benchmark table is the chart your boss screenshots from a model release post. Eight names show up in almost every chart in 2026: MMLU, SWE-bench, GPQA, HumanEval, BFCL, TAU-Bench, OSWorld, and HELM. They are not your eval suite. They were not built to tell you whether a model will help your product, and they mostly don't.

Most teams treat the table as decision data. They shouldn't. The numbers are optimised for headlines, computed on a fixed set of public questions that have leaked into the training corpus over the past three years. A 92% on MMLU is not a 92% on your task. A four-point gain on SWE-bench Verified is not a four-point gain on your repo. The benchmark table is the lab's marketing surface, not your eval suite.

This is for developers and product folks shipping LLM-powered features who keep getting forwarded benchmark charts. By the end you'll have a working mental model of what the canonical public benchmarks actually measure, why their numbers don't track your product's quality, and the small handful of cases where a public score genuinely correlates with whether your feature works.

2 · the failure mode
Why a 92% on a public benchmark doesn't mean what the chart implies.

A model release post lands in your team chat. The chart says +4 points on MMLU, +2 on GPQA, +9 on SWE-bench Verified. Your team forwards it around. Decision-makers ask whether you should switch models. You don't yet have a useful answer. The chart was optimised for the forwarding step, not the answering step.

Three things turn the chart into noise. The first is contamination: every canonical knowledge benchmark has measurable membership signal in public pretraining corpora. See Sainz et al. (2023) for the cross-benchmark survey, and Roberts et al. (2023) for the time-of-crawl inflation pattern. The second is overfitting: labs train against the benchmark's specific harness (its prompt format, its grading rubric, its answer-letter convention) and the gains rarely transfer. The third is scenario fit. The benchmark's prompts come from one fixed distribution. Yours don't.
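Contamination, at least, is checkable in principle. Below is a minimal sketch of the n-gram membership test that studies in this vein run, with a 13-token window as a representative choice; the function names are mine, and the hard part in practice is getting any slice of the pretraining corpus to check against at all.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All n-token windows in a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(questions: list[str], corpus_chunks: list[str],
                       n: int = 13) -> float:
    """Fraction of benchmark questions sharing any n-gram with the corpus."""
    corpus_grams: set[tuple[str, ...]] = set()
    for chunk in corpus_chunks:
        corpus_grams |= ngrams(chunk, n)
    hits = sum(1 for q in questions if ngrams(q, n) & corpus_grams)
    return hits / max(len(questions), 1)
```

A nonzero rate doesn't prove the model memorised anything, and a zero rate on the slices you can see doesn't prove it didn't. Membership signal is evidence, not a verdict.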

The consequence is unflattering and worth saying plainly. The correlation between “MMLU went up two points” and “our support assistant got better” is somewhere around zero in expectation. You can still read the table — for what it tells you about the lab's priorities that release. You shouldn't read it for the score.

3 · the eight benchmarks
What each canonical public benchmark actually measures.

There are roughly eight benchmarks that show up in every model release post in 2026. They split into three groups — knowledge (MMLU, GPQA), code (HumanEval, SWE-bench), and agents (BFCL, TAU-Bench, OSWorld). HELM sits apart as the meta-eval that argues against single-number leaderboards. Most others were built before frontier models could solve them; the ones that haven't saturated yet will, and the field will quietly retire them.

The chart-friendly question is “which is highest?” The honest one is “what does each one grade by, and does that grading match what your product does in production?” The widget below puts each on its own placard — what it measures, the harness it grades by, and a one-line verdict on whether the score should change anything you do this week.

[Interactive widget: the benchmark museum. Eight placards, one per benchmark. Sample verdict, placard 1 of 8: use as a sanity floor; most product tasks aren't 4-way MCQ, and the score won't track your pass rate.]

The pattern is the same across all eight. Each benchmark grades by a specific harness: exact-match on a multiple-choice answer letter, pass@1 on a held-out unit test, AST-equivalence on a tool call, end-to-end test-suite green on a real repo. The harness, more than the prompts, determines whether the score correlates with anything you care about. SWE-bench Verified is the closest of the eight to real engineering work because its harness is the codebase's own tests. Most of the others are chart-friendly because their harness is cheap to grade, not because it mirrors your product. The harness, not the headline, is what the score actually means.
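To make "grades by a harness" concrete, here are two toy harnesses. The names and the regex are mine, not any benchmark's actual code, but the grading logic is the shape MMLU-style and HumanEval-style harnesses use.

```python
import re
import subprocess
import sys
import tempfile

def grade_mcq(model_output: str, gold_letter: str) -> bool:
    """MMLU-shaped: extract one answer letter, exact-match it."""
    m = re.search(r"\b([ABCD])\b", model_output)
    return m is not None and m.group(1) == gold_letter

def grade_pass_at_1(completion: str, test_code: str) -> bool:
    """HumanEval-shaped: run the completion against held-out asserts.
    Real harnesses sandbox this step; don't run untrusted code bare."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=10)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

grade_mcq rewards answer-letter formatting as much as knowledge; grade_pass_at_1 rewards code that actually runs. Same model, two harnesses, two different meanings of "score."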

4 · when the number means something
When a public benchmark genuinely tracks your product.

The picture isn't entirely nihilistic. There are real cases where a public number does carry signal. Three of them are worth naming, and one coda is worth more than all three.

The first is a new model class. The first time a model cleared 50% on GPQA Diamond was a genuine signal that something different was happening: not because 50 was a magic threshold, but because the direction had changed. When a benchmark moves from "every model fails" to "some models pass," that's information. When it moves from 87% to 91%, that's mostly the standard error talking (the sketch after this paragraph puts numbers on it). The second is that your task is in the harness. If you ship a code-completion feature whose prompts look like HumanEval's (short Python functions, complete-the-body), then HumanEval pass@1 is, modulo contamination, a noisy proxy for you. The third is that the benchmark's harness is structurally similar to yours. If you ship a function-calling feature, BFCL's AST-equivalence harness is structurally close to "did the call go through with the right arguments," and a model that can't do BFCL probably can't do yours either.
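A back-of-envelope on that standard-error claim, assuming a benchmark of roughly two hundred independent questions (GPQA Diamond has 198):

```python
import math

def score_stderr(p: float, n: int) -> float:
    """Binomial standard error of an accuracy score on n questions."""
    return math.sqrt(p * (1 - p) / n)

n = 200
for p in (0.87, 0.91):
    print(f"{p:.0%} ± {1.96 * score_stderr(p, n):.1%} (95% CI)")
# 87% ± 4.7%, 91% ± 4.0%: the intervals overlap heavily.
```

Questions aren't truly independent and two models are usually compared on the same items, so this somewhat overstates the noise for a head-to-head; the point stands that single-digit moves on small benchmarks are thin evidence.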

The coda is more useful than any of the three. The shape of the gain on a release post is more informative than the absolute number. A model release that gains 12 points on SWE-bench Verified and 1 point on MMLU has been tuned for code agents — that tells you something useful about the lab's priorities that release, even if the absolute SWE-bench score is contaminated and the MMLU is saturated. The chart is most useful as a priorities heatmap, not a leaderboard. Read the deltas, not the totals.
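Reading the deltas is mechanical enough to script. The numbers below are invented for illustration; only the shape matters.

```python
previous = {"MMLU": 88.1, "GPQA": 51.0, "SWE-bench Verified": 38.0,
            "HumanEval": 90.2, "BFCL": 71.5}
release = {"MMLU": 89.1, "GPQA": 53.0, "SWE-bench Verified": 50.0,
           "HumanEval": 91.0, "BFCL": 80.0}

# Sort by delta, not by total: the deltas are the priorities heatmap.
for name, delta in sorted(((k, release[k] - previous[k]) for k in previous),
                          key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: {delta:+.1f}")
```

SWE-bench and BFCL jumped while knowledge barely moved: whatever the absolute numbers are worth, this release was tuned for code agents and tool calling.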

Two more terms travel with this conversation and are worth naming once. A holdout is the portion of a dataset deliberately reserved from training — and, crucially for product evals, from your prompt-engineering iteration too. Leakage is the broader term: any path by which test-set information reaches training, whether by direct contamination or by subtler routes like fine-tuning on benchmark-shaped data or eyeballing examples during prompt iteration. Your eval suite earns its name when it has been a holdout from your own iteration, not just from the lab's training run.
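What that holdout discipline looks like for a product eval set, as a minimal sketch: the case structure and names here are hypothetical, but hashing the id (rather than seeding a random shuffle) is what keeps the split stable as you add cases.

```python
import hashlib

cases = [{"id": f"case-{i:03d}", "prompt": "...", "expected": "..."}
         for i in range(200)]  # stand-in for your labeled eval cases

def is_holdout(case_id: str, frac: float = 0.5) -> bool:
    """Deterministic assignment: the same case lands in the same bucket forever."""
    h = int(hashlib.sha256(case_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < frac

dev_set = [c for c in cases if not is_holdout(c["id"])]
holdout = [c for c in cases if is_holdout(c["id"])]
# Iterate prompts against dev_set only. Score holdout once, just before shipping.
```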

5 · the point
Your eval set is the question the leaderboard can't answer.

The chart your boss forwarded is a museum exhibit. The placards tell you what each benchmark was trying to measure when it was built. Most of those placards are now older than the models you're choosing between, and the questions on the wall have been on the wall long enough that every frontier model has seen them. The question your boss is actually about to ask — is this model better for our product? — is not on any placard. It is on your spreadsheet from the first article. Every answer the leaderboard gives you is an opinion. The one your eval set gives you is a number.

Read the chart for what the lab cares about. Then close the tab and open the spreadsheet.