
part 5 of 9 · evals series

A practical tutorial — Article 5 of the evals primer

An eval suite by Friday.

13 min read

1 · what this is about
What this article is, in one paragraph.

This is the worked example. One product, thirty prompts, three unit tests, two LLM-judge prompts, and one CI workflow file — end-to-end, ready by Monday standup. The whole thing is about thirty lines of code. You won't need a framework, a platform, or a vendor pitch deck. You will need a Friday.

Most teams shipping AI features have an eval set somewhere. It lives in a notebook on someone's laptop, or as a Slack thread of bad outputs, or as a vague intent in a planning doc. It never makes it into the build. The path from I have a notebook of prompts to the build fails when the prompt regresses feels like a quarter of work. It isn't. It's a Friday.

This is for engineers who already accept that evals matter and want the end-to-end recipe — the one that turns a notebook of test cases into a passing CI check. By the end you'll have a working pull-request-blocking eval suite for one real product, and the shape of it generalises to every other LLM feature on your roadmap.

2 · the product
The system under test.

Pick a product. We'll use a B2B SaaS support assistant — a help-desk LLM serving a fictional company that sells subscription software. The assistant handles refund-policy lookups: a customer asks can I get a refund for my Annual Pro plan after 45 days?, the model reads the policy doc, and answers with a citation. One scenario, one product surface. The lessons generalise.

The policy is short. Refunds within 30 days: full. Days 30 to 60: prorated. After 60: denied. Plus an out-of-scope branch for products the policy doesn't cover. Five categories of customer question, all with deterministic right answers if you can read the policy doc.
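
Written out as a doc the model can cite, it might look something like the sketch below. The section numbers are an illustrative layout, not something the article fixes.

evals/policy.md (illustrative)

Section 1.1  Refunds requested within 30 days of purchase: full refund.
Section 1.2  Refunds requested between day 30 and day 60: prorated by days used.
Section 1.3  Refunds requested after day 60: denied.
Section 2.1  Products not covered by this policy: out of scope; refer the customer to their account manager.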

The system is a single Python file. The function takes a customer question and the policy doc; it returns an answer plus a section citation. About fifteen lines of glue around one chat-completion call.

support_bot.py
# support_bot.py — the system under test
import re

from openai import OpenAI

client = OpenAI()

SYSTEM = """You answer customer questions about refunds.
Use only the policy doc provided. Cite the section number.
If the request is outside the policy, refuse politely."""

SECTION_RE = re.compile(r"Section\s+\d+(\.\d+)?")

def _extract_section(text: str) -> str | None:
    """Pull the first 'Section X.Y' reference out of the answer, if any."""
    match = SECTION_RE.search(text)
    return match.group(0) if match else None

def answer(question: str, policy: str) -> dict:
    """Return {answer, citation} for a customer question."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Policy:\n{policy}\n\nQ: {question}"},
        ],
    )
    text = resp.choices[0].message.content or ""
    return {"answer": text, "citation": _extract_section(text)}

The failure mode in production: the model invents a loyalty exception that doesn't exist in the policy. It happens maybe one call in ten — fluent, plausible, wrong. The policy says denied after 60; the model says denied after 60, but loyal customers may receive consideration on a case-by-case basis. No such clause exists. This is the regression we're going to catch.

Everything in this article is a test for that one system. The inputs are the policy doc plus thirty real customer questions. The assertions are about what a correct answer must contain and must not contain. That's it.

If you can hold the system in your head, you can write the tests for it.

3 · friday morning
Pull thirty prompts.

You don't write evals from scratch. You start by reading. The first move on Friday morning is to find thirty real customer questions the assistant has handled — or would handle if it shipped — and copy them into a JSON file. Thirty is the sweet spot. Twenty is too few to surface failure-mode patterns; fifty is more grading than you need to start.

Where the questions come from, in order of value: your trace store first; your support-ticket inbox second; synthetic edge-cases third; the demo prompts your sales engineer wrote, last. Real prompts beat synthetic ones because they carry the shape of how customers actually phrase things — the typos, the three-questions-in-one, the tone. The synthetic ones you write in a quiet office never look like that.

For each prompt write down two things: the input, and the expected behaviour. The expected behaviour is either an exact string the answer must contain (a section number, a refund amount) or a small list of yes-or-no questions a careful grader would ask. Tag each row with one of five categories: refund-in-window, refund-out-of-window, refund-prorated, out-of-scope, escalation. Six prompts per category. Three minutes per row, sustained, for an hour and a half.

The discipline is to put the borderline cases in first. The easy ones — can I get a refund within seven days? — do nothing for you. The model will pass those. Put in the ones where the policy is ambiguous, where two clauses overlap, where the customer is angry or confused. The eval set is where you stress-test the model; it earns its keep on the questions that are hard to answer.

The schema is small enough to fit in a single JSON file. A row is { "input": ..., "expected": ..., "category": ... }. Pytest loads it as a session-scoped fixture; every test asks for it by name and gets the same list.
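
A minimal sketch of that wiring, assuming the rows live in evals/eval_set.json and the policy doc sits next to them (both file names are placeholders):

# evals/conftest.py: shared fixtures for the eval suite (file names are illustrative)
# A row in evals/eval_set.json looks like:
#   {"input": "can I get a refund for my Annual Pro plan after 45 days?",
#    "expected": "Section 1.2",
#    "category": "refund-prorated"}
import json
from pathlib import Path

import pytest

EVALS_DIR = Path(__file__).parent

@pytest.fixture(scope="session")
def eval_set():
    """Load the thirty rows once per test session."""
    return json.loads((EVALS_DIR / "eval_set.json").read_text())

@pytest.fixture(scope="session")
def policy():
    """Path to the policy doc; the tests call .read_text() on it."""
    return EVALS_DIR / "policy.md"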

Set aside ten of the thirty as a holdout. The other twenty are fair game for prompt iteration; you read their failures, you tweak the system prompt, you re-run. The holdout you don't look at while iterating. It's what tells you whether your prompt change helped generally, or just memorised the cases you were debugging. When the holdout score moves up alongside the iteration set, the change is real. When it doesn't, the change is decoration.
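
One way to encode that split is a boolean flag on each row; the flag name is an assumption, not part of the schema above:

# evals/conftest.py, continued: the iteration/holdout split, keyed off a per-row flag
@pytest.fixture(scope="session")
def iteration_set(eval_set):
    """The twenty rows you may read and re-run while tuning the prompt."""
    return [row for row in eval_set if not row.get("holdout", False)]

@pytest.fixture(scope="session")
def holdout_set(eval_set):
    """The ten rows you score but never read during prompt iteration."""
    return [row for row in eval_set if row.get("holdout", False)]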

A second discipline: tag every row with the source it came from — production trace, support ticket, synthetic — and the date you added it. The tags are free (a column in the JSON); they pay off the first time you ask which prompts have we added since shipping? or are the failing rows the ones we recently added, or the ones that have been here from day one?

You now have the suite that fits in one file. Lunch.

4 · friday afternoon
The three test types, applied.

Three test types, in this order: deterministic checks first, then LLM-as-judge for the squishy questions, then a small slice of human grading. Apply them in this order, and stop the moment they catch what you need. Most prompts only need the first kind.

4.1 — Three unit tests.

Three deterministic checks. Each one asks a yes-or-no question you can answer with code, not with another model. Together they cover the failure modes you already know: missing citations, invented exceptions, wrong refusal format.

evals/test_unit.py
# evals/test_unit.py — three deterministic checks
import re
from support_bot import answer

SECTION_RE = re.compile(r"Section\s+\d+(\.\d+)?")

def test_cites_policy_section(eval_set, policy):
    """Every in-scope answer must cite a Section X.Y reference."""
    in_scope = [p for p in eval_set if p["category"] != "out-of-scope"]
    for p in in_scope:
        out = answer(p["input"], policy.read_text())
        assert SECTION_RE.search(out["answer"]), p["input"]

def test_no_invented_loyalty_clause(eval_set, policy):
    """The model must not invent a 'loyalty exception' the policy lacks."""
    out_of_window = [p for p in eval_set if p["category"] == "refund-out-of-window"]
    for p in out_of_window:
        out = answer(p["input"], policy.read_text())
        assert "loyalty" not in out["answer"].lower(), p["input"]

def test_refusal_format(eval_set, policy):
    """Out-of-scope requests must contain 'outside the policy'."""
    out_of_scope = [p for p in eval_set if p["category"] == "out-of-scope"]
    for p in out_of_scope:
        out = answer(p["input"], policy.read_text())
        assert "outside the policy" in out["answer"].lower(), p["input"]

That's the whole unit-test layer. Three asserts, sixty lines. They run in seconds. They catch the loyalty-exception on the day a prompt change introduces it, and they cost nothing to keep running forever. The rule of thumb is simple: if you can express the question as a regex or a function, you can express it as a unit test. The work is to keep noticing which questions can.

[interactive widget: pick the grader · five scenarios. commit to a test type; the answer reveals after you pick.]

The widget above is the rule applied. Tone, intent, multi-part-question handling — those slip past a regex and want a different grader. Citation-existence, refusal-format, arithmetic — those are unit tests every time. The shape of your suite is mostly the deterministic shape; you reach for the more expensive graders only when the question genuinely can't be expressed as code.

4.2 — Two LLM-judge prompts.

For the questions a unit test can't answer, you point a stronger LLM at the output and ask it to grade. Two judge prompts cover most of what unit tests miss for this product: tone, and intent coverage. Tone asks was the refusal polite and professional?; intent coverage asks did the answer address every sub-question the customer actually asked?

A judge prompt is small. Hand it the input, the model's output, and a four-question rubric. Did the output cite a policy section? Was the citation correct? Did it refuse appropriately when out of scope? Was the tone professional? Each yes is a point. The row passes if it scores three or four out of four. The grader returns a JSON blob with the rubric scores and a one-sentence justification per row; you parse it and assert on it the same way pytest asserts on any other return value.
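
A sketch of that grader as a pytest test, assuming gpt-4o as the judge and JSON-mode output. The three-of-four pass rule is the one above; the prompt wording and field names are illustrative:

# evals/test_judge.py: LLM-as-judge rubric grading (prompt wording is illustrative)
import json

from openai import OpenAI
from support_bot import answer

judge_client = OpenAI()

JUDGE_PROMPT = """You are grading a support assistant's answer.
Customer question:
{question}

Assistant answer:
{output}

Answer each rubric question with true or false, then give a one-sentence justification.
Return JSON: {{"cites_section": bool, "citation_correct": bool,
"refuses_when_out_of_scope": bool, "tone_professional": bool,
"justification": str}}"""

def grade(question: str, output: str) -> dict:
    """Ask the judge model for rubric scores on a single output."""
    resp = judge_client.chat.completions.create(
        model="gpt-4o",  # stronger than the system under test
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, output=output)}],
    )
    return json.loads(resp.choices[0].message.content or "{}")

def test_judge_rubric(eval_set, policy):
    """Each row must score at least three of the four rubric points."""
    for p in eval_set:
        out = answer(p["input"], policy.read_text())
        scores = grade(p["input"], out["answer"])
        points = sum(bool(scores.get(k)) for k in (
            "cites_section", "citation_correct",
            "refuses_when_out_of_scope", "tone_professional"))
        assert points >= 3, (p["input"], scores.get("justification"))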

Use a stronger model for the judge than the system under test. The grading model is allowed to be slower and more expensive than the production model — its job is to think carefully about thirty outputs once, not answer five hundred customers a minute. A common pattern is the production model on gpt-4o-mini or equivalent, the judge on gpt-4o or Claude Sonnet. The judge sees no model-name labels — judges show measurable bias toward their own model family if you let them.

The cost discipline matters here. At thirty prompts × two judges × a few cents per call, one full eval run costs a dollar or two in API spend at current frontier-model rates. Acceptable nightly; not acceptable on every commit. Run the unit tests on every PR; run the judges nightly, or on a 1% production sample, or only on the holdout. The CI gate that fires on every push is the deterministic layer; the LLM-judge layer fires on a schedule. Two thresholds, two cadences, one suite.
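
Concretely, one way to split the cadences is a pytest marker on the judge layer; the marker name is an assumption:

# evals/test_judge.py: mark the judge layer so the per-PR job can skip it
import pytest

@pytest.mark.judge   # registered in pytest.ini as "judge: nightly LLM-graded tests"
def test_judge_rubric(eval_set, policy):
    ...

# per-PR job:   pytest evals/ -m "not judge"   (deterministic layer only)
# nightly job:  pytest evals/                  (everything, judges included)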

And don't trust the judge until you've calibrated it. A judge that disagrees with a human grader on three out of thirty cases is reporting something other than quality — probably its own preferences for verbosity, citation density, or its model family's default phrasing. Use the calibration recipe from the judge article (linked at the end): sample thirty graded outputs, have a human re-grade them blind, measure agreement, iterate on the rubric until the number stabilises. Until the agreement number is stable, the dashboard is decorative. A judge isn't truth. It's a faster grader.
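
The agreement number itself is a one-liner once you have both sets of verdicts; a sketch, assuming plain pass/fail verdicts per row:

# calibrate.py: fraction of rows where the judge and a blind human grader agree
def agreement(judge: list[bool], human: list[bool]) -> float:
    """Share of rows graded the same way by the judge and the human."""
    assert len(judge) == len(human)
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

# e.g. agreeing on 27 of 30 rows gives 0.9; iterate on the rubric until this stops moving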

4.3 — Five human-graded rows.

Five prompts get a human grader every Friday. Pick the hardest five — the ones where you don't trust either the regex or the judge. The work is twenty minutes. The output is a sanity check on the LLM-judge: if the human and the judge disagree on a row, the judge prompt or the rubric needs work.

If your suite is small enough that a human can grade the whole thing in an afternoon — and thirty rows is borderline — skip the LLM-judge layer entirely until the suite outgrows it. The cheapest reliable grader is a person reading thirty outputs.

5 · friday evening
Wire it into the build.

A test that doesn't run on every PR is a test that has failed. The eval suite has to run automatically — on the same trigger as the unit tests, with the same authority to block a merge — or it stops mattering within a sprint. People mean to run it. They forget. A workflow file remembers for them.

The mechanic is small. A job triggered by pull_request, paths-filtered to the prompt files and the eval directory, runs pytest evals/, and exits with the suite's exit code. The same exit code that turns a normal pytest run red turns the PR check red. About twenty-five lines of YAML; one file in .github/workflows/.
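
A sketch of that file, assuming the prompt lives in support_bot.py and the suite in evals/ (the paths, Python version, and job name are assumptions):

# .github/workflows/evals.yml: block the PR when the deterministic eval layer fails
name: evals
on:
  pull_request:
    paths:
      - "support_bot.py"
      - "evals/**"
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pytest openai
      - run: python -m pytest evals/ -m "not judge"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}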

The next decision is the threshold. Two flavours. Binary: fail the build on any single test failure. Graded: fail the build only when the pass rate drops below a number, say 80%, or below the previous run's score by some margin. The opinionated answer is binary on unit tests, graded on judge tests. Unit tests are deterministic; one fail is a real fail. Judge tests are noisy; require pass-rate ≥ baseline minus two points or so. A single judge-test flake on a graded threshold is noise; a single unit-test failure is a customer about to file a ticket.

The graded threshold needs a baseline. The simplest implementation: keep a small JSON file in the repo — .evals-baseline.json — with the last shipped pass rate per category. The CI job compares today's run against it and fails if the drop is bigger than the noise floor. When you ship a real improvement, you update the baseline in the same PR. It's about ten lines of Python on top of pytest. No vendor required.
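
A sketch of that comparison, assuming the graded run writes per-category pass rates to a results file (the results-file name and the two-point noise floor are assumptions):

# evals/check_baseline.py: fail when a category's pass rate drops past the noise floor
import json
import sys
from pathlib import Path

NOISE_FLOOR = 0.02  # two percentage points, per the graded-threshold rule above

baseline = json.loads(Path(".evals-baseline.json").read_text())  # {"refund-prorated": 0.83, ...}
current = json.loads(Path("evals/results.json").read_text())     # same shape, from today's run

regressed = [cat for cat, rate in current.items()
             if rate < baseline.get(cat, 0.0) - NOISE_FLOOR]
if regressed:
    print("pass rate regressed in: " + ", ".join(regressed))
    sys.exit(1)
print("pass rates within the noise floor of baseline")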

One trap to avoid. Don't set the threshold at 100% on the unit-test layer if your test set has any rows the model legitimately struggles with — those rows are the ones earning their keep, and you don't want to delete them just to make the build green. Either keep them and accept that they fail (tag them as known-bad with @pytest.mark.xfail), or move them to the judge layer and let the graded threshold absorb the noise. The suite that fails on real bugs and not on known limits is the suite people trust.

A word on framework choice. The pytest-plus-YAML route above is the lightest. If you prefer a config-first tool, Promptfoo's YAML is shaped for the same loop: providers, prompts, tests, assert — declarative, runs from the CLI, integrates with GitHub Actions in about the same number of lines. If you want a hosted dashboard with built-in experiment tracking and the production-trace-to-eval flywheel wired up for you, Braintrust's SDK has Eval(project, data, task, scores) as the central object and a bt eval CLI; LangSmith is similar. The loop is identical across all three. The choice is about what surface you want to maintain — a YAML file, a Python file, or a SaaS dashboard — and what you're willing to pay for. Start with whichever lets you keep the code in your own repo for free; graduate when you outgrow it.

What this looks like in practice is the widget below. A PR opens that tweaks the refund-policy prompt. The eval job fires. Three of thirty fail — the loyalty-exception ones, plus a math drift on the prorated refund. The PR is blocked.

[interactive widget: ci · what a regression looks like. the workflow is paths-filtered; opening this PR triggers the eval job because the prompt file changed.]

The loop has now closed. Every prompt change runs through thirty tests; every regression blocks the PR. No one has to remember anything. No one has to ask Friday-morning-you whether the change is safe. The workflow file is the question and the answer.

6 · monday
What you ship.

By standup you have a number. The number is the pass rate on your thirty-prompt suite, attached to the most recent commit on main. You can name it in a sentence: the assistant is passing 28 of 30 evals; the two failures are refund-out-of-window edge cases we're tracking. Last week that sentence was an opinion. This week it's a link to a GitHub Actions tab.

What you have: a thirty-prompt eval set, a passing CI run, a dashboard (the Actions tab is the dashboard), and a number you can put in a slide. What you don't have: a perfect eval set, an aligned LLM-judge, or a system that catches every bug. You especially don't have a suite that passes 100% — and if you do, by next Friday you should have added rows the model fails on. A suite at 100% has stopped finding things.

The Monday move is to share the link. Send the PR plus the Actions tab to whoever owns the prompt. Send it to the product manager who's been asking did the model get better? Send it to the engineer who shipped the prompt change last week without a way to grade it. From now on, every prompt change is graded. Every model swap is graded. Every deploy comes with a number.

What happens next is the part most teams miss. The eval set isn't a one-time deliverable; it's a row in your weekly cadence. Every time a customer reports a bad answer, that prompt becomes a row in the JSON file. Every time a failure mode shows up in production traces, the pattern becomes a category. The suite grows. The threshold tightens. The number you say in standup gets sharper. Six weeks in, the set is closer to a hundred prompts than thirty, and the questions you're catching are the ones you couldn't have predicted on Friday.

The thing every “we should have evals” conversation is missing is the workflow file. You wrote the workflow file. The conversation is over.

For the three-kinds-of-evals primer behind every step of this tutorial, read the primer. For the calibration recipe — sample thirty, blind-grade, measure delta — read the judge article. Then come back here.

Open .github/workflows/.