series · 9 articles · started May 2026
An eval is a test for an LLM feature: a list of inputs, the answers you expect, and a way to score what came back. Most teams shipping AI features skip this step and find out the model regressed from a customer complaint. This series is the working engineer's tour of the field. Article 1 is the primer. Articles 2 through 4 cover the craft of one good eval: looking at the data, calibrating an LLM judge, and building a dataset you can maintain. Articles 5 through 7 scale the practice: wiring evals into CI, reading them with statistical care, and running the production-trace flywheel that keeps the suite alive. Articles 8 and 9 cover what's coming next: agent and multi-turn evals, and what the public benchmarks honestly mean.
an eval is a test for an LLM feature: a list of inputs, the answers you expect, and a way to score what came back. three kinds, and a monday-morning recipe.
9 min
most teams write eval sets by guessing what could go wrong. the fix is reading 100 actual outputs first. open coding, axial coding, the saturation rule, and a monday-morning recipe.
10 min
an llm-as-judge inherits every bias an llm has — position, verbosity, self-preference. calibration is what turns 'another llm scored it' into a measurement.
11 min
an eval set is a dataset, not a script. coverage, balance, anti-leakage, versioning — the four disciplines that turn a list of prompts into something you can ship a product on.
11 min
the worked-example tutorial: 30 prompts, three test types, one CI workflow file. get llm evals into your build pipeline by monday standup with about 30 lines of code.
13 min
error bars on a pass rate, paired comparison, and sample-size planning — the statistics subset that decides whether your eval improvement is real or noise.
12 min
an eval suite without a feedback loop becomes shelfware in three months. sample real traces, anonymize, label, fold back. the loop is mundane; running it is the moat.
10 min
when your model calls tools and decides what to do next, a single grade on the final reply isn't an evaluation — it's a guess. the four checks an agent suite needs, and a python skeleton you can wire in this week.
12 min
the chart your boss screenshots is the lab's marketing surface, not your eval suite. what each public benchmark — mmlu, swe-bench, gpqa, humaneval, bfcl, tau-bench, osworld, helm — actually measures, and when its score really does track your product.
9 min