lainlog

series · 9 articles · started May 2026

Evals for AI Products

An eval is a test for an LLM feature: a list of inputs, the answers you expect, a way to score what came back. Most teams shipping AI features skip this step — and find out the model regressed via a customer complaint. This series is the working engineer's tour of the field. Article 1 is the primer. Articles 2 through 4 cover the craft of one good eval — looking at the data, calibrating an LLM judge, building a dataset you can maintain. Articles 5 through 7 scale the practice — wiring evals into CI, reading them with statistical care, and running the production-trace flywheel that keeps the suite alive. Articles 8 and 9 cover what's coming next: agent and multi-turn evals, and what the public benchmarks honestly mean.

  1. How to know if your AI is actually any good — a primer on evals for LLM products

    an eval is a test for an LLM feature: a list of inputs, the answers you expect, and a way to score what came back. three kinds, and a monday-morning recipe.

    9 min
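
    a minimal sketch of that shape, with inputs, expected answers, and a scorer; call_model is a hypothetical stand-in for whatever LLM call your feature actually makes.

        # minimal eval: a list of cases, a scorer, a pass rate
        def call_model(prompt: str) -> str:
            raise NotImplementedError("wire in your LLM feature here")

        EVAL_CASES = [
            {"input": "Refund policy for digital goods?", "expected": "14 days"},
            {"input": "Do you ship to Canada?", "expected": "yes"},
        ]

        def score(output: str, expected: str) -> bool:
            # simplest possible scorer: does the expected string appear in the output?
            return expected.lower() in output.lower()

        def pass_rate() -> float:
            results = [score(call_model(c["input"]), c["expected"]) for c in EVAL_CASES]
            return sum(results) / len(results)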

  2. How to find the failure modes your eval set will actually catch — a primer on error analysis

    most teams write eval sets by guessing what could go wrong. the fix is reading 100 actual outputs first. open coding, axial coding, the saturation rule, and a monday-morning recipe.

    10 min
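
    a toy sketch of the tallying half of open coding: a human assigns a short failure-mode code to each reviewed output, and when new outputs stop producing new codes you have hit saturation. the codes below are made up for illustration, not a taxonomy.

        # tally hand-assigned failure-mode codes from a read of real outputs
        from collections import Counter

        codes = [
            "hallucinated_price", "wrong_tone", "hallucinated_price",
            "ignored_context", "wrong_tone", "hallucinated_price",
        ]  # one code per reviewed output, assigned by a human reader

        for code, n in Counter(codes).most_common():
            print(f"{code}: {n}")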

  3. Your LLM-as-judge has a palate too — calibrating the model that grades the model

    an llm-as-judge inherits every bias an llm has — position, verbosity, self-preference. calibration is what turns 'another llm scored it' into a measurement.

    11 min
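
    one calibration check among several: swap the answer order and see whether the judge's preference survives. ask_judge is a hypothetical stand-in for your judge prompt plus LLM call, returning 'A' or 'B'.

        # position-bias check: ask the same comparison twice with the slots swapped
        def ask_judge(answer_a: str, answer_b: str) -> str:
            raise NotImplementedError("wire in your judge call here; return 'A' or 'B'")

        def position_consistency(pairs: list[tuple[str, str]]) -> float:
            """fraction of pairs where the judge prefers the same answer
            regardless of which slot it appears in"""
            consistent = 0
            for a, b in pairs:
                prefers_a_first = ask_judge(a, b) == "A"   # a shown in slot A
                prefers_a_second = ask_judge(b, a) == "B"  # a shown in slot B
                consistent += prefers_a_first == prefers_a_second
            return consistent / len(pairs)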

  4. How to build an eval set you can actually maintain — a primer on eval-set construction

    an eval set is a dataset, not a script. coverage, balance, anti-leakage, versioning — the four disciplines that turn a list of prompts into something you can ship a product on.

    11 min
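
    a sketch of the eval-set-as-dataset idea: each case is a tagged record in a versioned file, so coverage and balance are things you compute rather than assume. the path and field names are illustrative.

        import json
        from collections import Counter

        def load_eval_set(path: str = "evals/v3/cases.jsonl") -> list[dict]:
            with open(path) as f:
                return [json.loads(line) for line in f if line.strip()]

        def tag_balance(cases: list[dict]) -> Counter:
            # how many cases cover each feature or failure-mode tag
            return Counter(tag for case in cases for tag in case.get("tags", []))

        # a record might look like:
        # {"id": "refund-007", "input": "...", "expected": "...",
        #  "tags": ["refunds", "hallucinated_price"], "source": "prod-trace"}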

  5. An eval suite by Friday: LLM evals in CI by Monday standup

    the worked-example tutorial: 30 prompts, three test types, one CI workflow file. get llm evals into your build pipeline by monday standup with about 30 lines of code.

    13 min
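
    one way the CI end can look, assuming pytest and the same hypothetical call_model as above; the workflow file then only has to run pytest on merge or on a schedule.

        # each eval case becomes a pytest test, so a regression fails the build
        import json
        import pytest

        def call_model(prompt: str) -> str:
            raise NotImplementedError("wire in your LLM feature here")

        with open("evals/cases.jsonl") as f:
            CASES = [json.loads(line) for line in f]

        @pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
        def test_expected_string_present(case):
            output = call_model(case["input"])
            assert case["expected"].lower() in output.lower()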

  6. When 84% beats 81%: statistics for eval engineers

    error bars on a pass rate, paired comparison, and sample-size planning — the statistics subset that decides whether your eval improvement is real or noise.

    12 min
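
    a worked sketch of the error-bar half, using a Wilson score interval on a pass rate (pure stdlib). the 84-vs-81 numbers are the ones from the title; a paired test on the same cases is the sharper check the article covers.

        from math import sqrt

        def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
            # 95% Wilson score interval; better behaved than p ± 1.96*sqrt(p(1-p)/n)
            # when n is small or p is near 0 or 1
            p = passes / n
            denom = 1 + z * z / n
            center = (p + z * z / (2 * n)) / denom
            half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
            return center - half, center + half

        print(wilson_interval(84, 100))  # roughly (0.756, 0.899)
        print(wilson_interval(81, 100))  # roughly (0.722, 0.875)
        # the intervals overlap heavily, so the 3-point gap alone is not
        # evidence of a real improvement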

  7. Production traces are your eval set — the LLM eval maintenance flywheel

    an eval suite without a feedback loop becomes shelfware in three months. sample real traces, anonymize, label, fold back. the loop is mundane; running it is the moat.

    10 min
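
    a sketch of the sample-and-anonymize step; the function names are illustrative, and the PII scrub shown is a placeholder, not real anonymization.

        import random
        import re

        def scrub(text: str) -> str:
            # placeholder anonymization: emails and phone-number-ish strings only
            text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<email>", text)
            text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "<phone>", text)
            return text

        def sample_for_labeling(traces: list[dict], k: int = 20) -> list[dict]:
            # pull k real traces, scrub them, and queue them for human labels
            picked = random.sample(traces, min(k, len(traces)))
            return [
                {"input": scrub(t["input"]), "output": scrub(t["output"]), "label": None}
                for t in picked
            ]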

  8. Evals when your model uses tools — a primer on agent and trajectory evals

    when your model calls tools and decides what to do next, a single grade on the final reply isn't an evaluation — it's a guess. the four checks an agent suite needs, and a python skeleton you can wire in this week.

    12 min
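
    one illustrative trajectory check, not the article's four: did the agent call the expected tools in the expected order before answering? the trace format is an assumption, a list of step dicts with a tool key.

        def tools_called_in_order(trace: list[dict], expected: list[str]) -> bool:
            # subsequence check: every expected tool appears, in order,
            # with extra calls allowed in between
            called = iter(step["tool"] for step in trace if step.get("tool"))
            return all(tool in called for tool in expected)

        trace = [
            {"tool": "search_orders", "args": {"customer_id": "c_42"}},
            {"tool": "issue_refund", "args": {"order_id": "o_9"}},
            {"tool": None, "content": "Your refund for order o_9 is on its way."},
        ]
        assert tools_called_in_order(trace, ["search_orders", "issue_refund"])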

  9. LLM benchmarks honestly read: MMLU, SWE-bench, GPQA & friends

    the chart your boss screenshots is the lab's marketing surface, not your eval suite. what each public benchmark — mmlu, swe-bench, gpqa, humaneval, bfcl, tau-bench, osworld, helm — actually measures, and when its score really does track your product.

    9 min