A primer on eval-set construction
How to build an eval set you can actually maintain.
1 · what this is about
An eval set is a dataset, not a script.
An eval set is the dataset your eval suite runs against: a list of inputs, the answers you expect, and the metadata that tells you which row was added when and why. Most teams treat it like a one-off bug script: write twenty cases that come to mind, paste the model output next to them, declare it done.
That works for the first month. Six months later, the suite passes at 98%, customers still complain, and nobody can explain the gap. The set has stopped doing work — and the discipline that would have kept it useful was never put in. The script tells you whether the model passed today; the dataset tells you whether the model is improving across releases.
This is for developers and ML engineers shipping LLM-powered features who already have the three kinds of evals in hand. By the end you'll know what goes in an eval row, the four disciplines that turn a list of prompts into a dataset, and how to grow it from twenty rows to two hundred without losing track.
2 · what's in the set
Three columns and a pile of metadata.
A row in an eval set has three load-bearing parts. The input — the prompt, ideally in the same shape your product feeds the model. The expected — either the exact reference answer, or a rubric of yes-or-no questions a careful grader would ask. The metadata — the tags that turn rows into a dataset.
The metadata is where most teams cut corners and where every downstream pain comes from. At minimum each row carries: failure_mode (which bucket from your error analysis), source (real, synthetic, regression), severity, added_at, and last_passing. Without these tags the file is a list. With them, you can sort, slice, and re-balance — the things that distinguish a dataset from a stack of paper.
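To make the spine concrete, here is a minimal sketch of one row as it might sit in a JSONL file. The field names mirror the tags above, but the exact schema is an assumption, not a standard.

```python
# One eval row as a Python dict (JSONL works well: one row per line).
# Field names mirror the tags above; the schema itself is an assumption.
row = {
    "input": "Customer says their refund hasn't arrived after 14 days. Plan: Pro.",
    "expected": {
        # Either an exact reference answer or a rubric of yes/no questions.
        "rubric": [
            "Does the answer cite the refund-processing window?",
            "Does it avoid promising a specific arrival date?",
        ],
    },
    "metadata": {
        "failure_mode": "refund-policy",  # bucket from the error-analysis taxonomy
        "source": "real",                 # real | synthetic | regression
        "severity": "high",
        "added_at": "2024-11-03",
        "last_passing": "v0.4.2",
    },
}
```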
Every framework — openai/evals, DeepEval, Promptfoo, Braintrust, LangSmith — disagrees on the bells and whistles. They agree on the spine: input, expected, metadata. Pick the framework you like; the dataset outlives it.
3 · the four disciplines
Four disciplines that turn a list into a dataset.
Four properties separate a useful eval set from a coffee-table book. Each is invisible row by row and obvious in aggregate. Each is engineering hygiene, not magic.
3.1 — Coverage.
Coverage is an aggregate problem. A row is not uncovered. A bucket is. The bucket list comes from your error analysis: read a hundred real outputs, code the failures, let the categories saturate. Article 2 walks through it. The taxonomy that comes out is your coverage map.
Picture the support model. Its real failure modes split four ways: ticket triage, refund-policy lookups, multi-turn troubleshooting, escalation decisions. A set with eighteen refund-policy rows and two of everything else looks busy on the dashboard. It tests one thing. Coverage is not about how many rows you have; it is about which buckets they sit in.
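One way to see the imbalance is to count rows per bucket against the taxonomy. A sketch, assuming rows shaped like the example above and the four support buckets as the coverage map:

```python
from collections import Counter

# The coverage map: buckets from the error-analysis taxonomy (example values).
TAXONOMY = {"triage", "refund-policy", "troubleshooting", "escalation"}

def coverage_report(rows):
    counts = Counter(r["metadata"]["failure_mode"] for r in rows)
    for bucket in sorted(TAXONOMY):
        print(f"{bucket:20s} {counts.get(bucket, 0):3d} rows")
    missing = TAXONOMY - set(counts)
    if missing:
        # Zero-row buckets are the blind spots a raw row count hides.
        print("uncovered:", ", ".join(sorted(missing)))
```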
To grow coverage without homogeneity, use Hamel's tuple method: list the dimensions of variation (channel, severity, customer tier, language, scenario), hand-write twenty tuples that pick one value from each, then turn each tuple into a natural-language prompt as a separate step. The separation stops the synthetic phrasing from collapsing into a single voice. Twenty tuples give you twenty realistic cases in buckets the dashboard had a blind spot for.
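A sketch of the two-step shape. The dimension values are placeholders, and the final template stands in for the separate prompt-writing pass, which the tuple method leaves to a person (or a model) working one tuple at a time.

```python
import itertools
import random

# Step 1: pick tuples across the dimensions of variation.
DIMENSIONS = {
    "channel":  ["email", "chat", "phone transcript"],
    "severity": ["minor", "blocking"],
    "tier":     ["free", "pro", "enterprise"],
    "scenario": ["refund", "triage", "escalation"],
}

def sample_tuples(n=20, seed=0):
    combos = list(itertools.product(*DIMENSIONS.values()))
    return random.Random(seed).sample(combos, n)

# Step 2, kept separate so the phrasing doesn't collapse into one voice:
# turn each tuple into a natural-language prompt. A real pass would write
# each prompt individually; the template below is only a stand-in.
def to_prompt(tup):
    channel, severity, tier, scenario = tup
    return (f"[{channel}] A {tier}-tier customer reports a {severity} "
            f"{scenario} problem. Draft the support response.")
```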
3.2 — Balance.
Different sets want different balance. A judge-calibration set wants close to 50:50 pass and fail — a judge that only ever sees passes will calibrate to passes. A regression suite wants the opposite: overweight the cases your last release broke, because that's what you're guarding. A capability set wants whatever distribution matches the customer mix.
Mixing them in one file is the most common reason a dashboard lies. A row tagged source: regression has different stakes than one tagged source: capability; the pass rate that aggregates them tells you neither. Keep them separate, or keep the tag and slice on it before reporting.
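Slicing before reporting can be as small as grouping the last run's results by the source tag. A sketch, assuming results arrive as one boolean per row:

```python
from collections import defaultdict

def pass_rate_by_source(rows, results):
    """rows: eval rows as above; results: list of booleans, one per row."""
    tally = defaultdict(lambda: [0, 0])          # source -> [passed, total]
    for row, passed in zip(rows, results):
        src = row["metadata"]["source"]
        tally[src][0] += int(passed)
        tally[src][1] += 1
    for src, (passed, total) in sorted(tally.items()):
        print(f"{src:12s} {passed}/{total} ({passed / total:.0%})")
    # One aggregate number across regression and capability rows answers
    # neither question; the per-source lines do.
```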
3.3 — Anti-leakage.
Leakage is when an eval prompt has already reached the model. The score on a leaked prompt is not a measurement; the model is quoting an answer it remembers. Three sources, in decreasing order of obvious-and-still-overlooked:
- Public benchmarks. MMLU, GSM8K, HumanEval — their items are everywhere on the open web by now. Any frontier model has trained on them. A 95% score on a leaked benchmark is a memorization test, not a capability test.
- Your own demos. If your team has shown the model the same five prompts fifty times in dev, the model has those prompts in its accumulated context. They are no longer hold-out cases; they are warm-up cases.
- Public repos. Eval files committed to a public repo get crawled. Six months later they're in a training set. Keep eval data in a private repo, prefix the file with a canary string, and never paste rows into a system prompt or a public issue (a sketch of the canary idea follows this list).
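The canary-string rule from the last item can be this small. The string format is an assumption; the point is only that it is unique, greppable, and never legitimately reproducible.

```python
import uuid

# Generated once, committed at the top of the eval file, never published.
# e.g. CANARY = f"EVAL-CANARY-{uuid.uuid4()}"
CANARY = "EVAL-CANARY-<your-uuid-here>"   # placeholder; the real value stays private

def looks_leaked(model_output: str) -> bool:
    # If the model ever emits the canary verbatim, the eval file has reached
    # its training data or its context, and scores from that file are suspect.
    return CANARY in model_output
```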
The narrow word for the survivors is holdout — the small subset you re-grade every release, never used for prompt-tuning. The phrase comes from classical ML, where it means a train-test split; in eval engineering it means the prompts you have not let your team optimize against. If a prompt has been used to tune anything, it is no longer holdout — it is feedback.
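In practice the holdout rule can be one tag plus one filter that the tuning loop never touches. A sketch; the tag name is an assumption.

```python
def split_holdout(rows):
    """Rows tagged holdout are graded at release time only; everything else is
    fair game for prompt-tuning. A row used to tune anything loses the tag."""
    holdout = [r for r in rows if r["metadata"].get("holdout")]
    tuning = [r for r in rows if not r["metadata"].get("holdout")]
    return tuning, holdout
```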
3.4 — Versioning.
An eval set is software. It deserves the same discipline as code. Three rules cover the lifecycle:
Cases enter by evidence, not by feel. When a customer flags a bad answer, that prompt joins the set — along with a failure_mode tag and a one-line note about why. The set grows by what is failing, not by what the author imagined might fail.
Cases retire when the suite passes them three releases running. Prompts that always pass have stopped doing work. Move them to a regression file (so a regression still shows) and add a harder case in the same bucket. A useful eval set is one that occasionally tells you no.
Every change is a commit. Diffs of an eval set read like a release log of what your team prioritized this quarter. New rows under refund-policy mean refund failures were the hot bug; rows retired under triage mean triage stabilized. The git log is the history of your model's actual problems.
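A sketch of the retirement rule, assuming each row keeps a short pass/fail history, newest release last:

```python
def retire_stale_cases(rows, window=3):
    """Split rows that passed the last `window` releases into a regression file;
    everything else stays in the active set."""
    active, regression = [], []
    for row in rows:
        history = row["metadata"].get("history", [])
        if len(history) >= window and all(history[-window:]):
            regression.append(row)   # still guards against regressions
        else:
            active.append(row)       # still doing work in the main set
    return active, regression
```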
4 · sort the candidates
What the discipline looks like at row level.
Put twelve candidates on the table. Some real, some synthetic, some leaked. The job is to decide what ships, what holds for review, and what gets rejected. Cycle each one. Watch the footer.
The lesson lands in the footer. You can't see whether the set is balanced by looking at any row — only by counting them all. A row reads as a single judgment call. The dataset reads as a discipline.
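The judgment calls in the exercise can also be written down. A sketch, where the leaked and reviewed flags are assumptions:

```python
from collections import Counter

def triage(candidate):
    """Ship, hold, or reject one candidate row."""
    meta = candidate["metadata"]
    if meta.get("leaked"):                 # benchmark item, demo prompt, public repo
        return "reject"
    if meta.get("source") == "synthetic" and not meta.get("reviewed"):
        return "hold"                      # synthetic rows wait for a human pass
    if not meta.get("failure_mode"):
        return "hold"                      # untagged rows can't be sliced later
    return "ship"

def footer(candidates):
    # The balance only shows up here, in aggregate, never in a single row.
    return Counter(triage(c) for c in candidates)
```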
5 · how to start, on monday
How to grow this from twenty prompts to two hundred.
The recipe below is small enough to run in an afternoon and durable enough to outlast every framework choice you'll make later.
Twenty rows is enough to start. It isn't enough to ship a product on, but it's enough to make the four disciplines visible — at twenty rows, an unbalanced set is obvious; a leaked prompt is obvious; a missing bucket is obvious. Once you can see them, the rest is mechanical.
6 · the point
What an eval set is for.
A script tells you whether the model passed today. A dataset tells you whether the model is improving across releases — and which failure modes are getting better, worse, or are still uncovered. The disciplines are how the dataset stays a dataset. Coverage, balance, anti-leakage, versioning — write them on the wall. Every story your dashboard tells lives or dies by them.
Open the file.