series · 9 articles · started May 2026
Evals for AI Products
An eval is a test for an LLM feature: a list of inputs, the answers you expect, a way to score what came back. Most teams shipping AI features skip this step — and find out the model regressed via a customer complaint. This series is the working engineer's tour of the field. Article 1 is the primer. Articles 2 through 4 cover the craft of one good eval — looking at the data, calibrating an LLM judge, building a dataset you can maintain. Articles 5 through 7 scale the practice — wiring evals into CI, reading them with statistical care, and running the production-trace flywheel that keeps the suite alive. Articles 8 and 9 cover what's coming next: agent and multi-turn evals, and what the public benchmarks honestly mean.
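That three-part definition — inputs, expected answers, a scorer — fits in a few lines of code. A minimal sketch (all names here are illustrative, not from any particular eval library):

```python
# A minimal eval: a list of inputs, the answers you expect,
# and a way to score what came back. Names are hypothetical.

def exact_match(expected: str, actual: str) -> bool:
    """Score one output: pass if it matches the expected answer."""
    return expected.strip().lower() == actual.strip().lower()

CASES = [
    {"input": "Refund policy for digital goods?",
     "expected": "no refunds after download"},
    {"input": "What is the support email?",
     "expected": "support@example.com"},
]

def run_eval(llm_feature, cases=CASES) -> float:
    """Run every case through the feature and return the pass rate."""
    passed = sum(
        exact_match(c["expected"], llm_feature(c["input"]))
        for c in cases
    )
    return passed / len(cases)

# A stub "feature" standing in for the real LLM call —
# it answers everything the same way, so it passes one of two cases:
rate = run_eval(lambda q: "support@example.com")  # 0.5
```

Real suites swap `exact_match` for fuzzier scorers (and eventually an LLM judge, the subject of article 3), but the shape stays this simple.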
How to know if your AI is actually any good — a primer on evals for LLM products
an eval is a test for an LLM feature: a list of inputs, the answers you expect, and a way to score what came back. three kinds, and a monday-morning recipe.
9 min
How to find the failure modes your eval set will actually catch — a primer on error analysis
most teams write eval sets by guessing what could go wrong. the fix is reading 100 actual outputs first. open coding, axial coding, the saturation rule, and a monday-morning recipe.
10 min
Your LLM-as-judge has a palate too — calibrating the model that grades the model
an llm-as-judge inherits every bias an llm has — position, verbosity, self-preference. calibration is what turns 'another llm scored it' into a measurement.
11 min
How to build an eval set you can actually maintain — a primer on eval-set construction
an eval set is a dataset, not a script. coverage, balance, anti-leakage, versioning — the four disciplines that turn a list of prompts into something you can ship a product on.
11 min
An eval suite by Friday: LLM evals in CI by Monday standup
the worked-example tutorial: 30 prompts, three test types, one CI workflow file. get llm evals into your build pipeline by monday standup with about 30 lines of code.
13 min
When 84% beats 81%: statistics for eval engineers
error bars on a pass rate, paired comparison, and sample-size planning — the statistics subset that decides whether your eval improvement is real or noise.
12 min
Production traces are your eval set — the LLM eval maintenance flywheel
an eval suite without a feedback loop becomes shelfware in three months. sample real traces, anonymize, label, fold back. the loop is mundane; running it is the moat.
10 min
Evals when your model uses tools — a primer on agent and trajectory evals
when your model calls tools and decides what to do next, a single grade on the final reply isn't an evaluation — it's a guess. the four checks an agent suite needs, and a python skeleton you can wire in this week.
12 min
LLM benchmarks, honestly read: MMLU, SWE-bench, GPQA & friends
the chart your boss screenshots is the lab's marketing surface, not your eval suite. what each public benchmark — mmlu, swe-bench, gpqa, humaneval, bfcl, tau-bench, osworld, helm — actually measures, and when its score really does track your product.
9 min