
A primer on evals for LLM products

How to know if your AI is actually any good.

9 min read

1 · what this is about
What this article is, in one paragraph.

An eval is a test for an LLM feature. You write down a list of inputs, you write down what a good answer looks like, you run the model, and you score the output. That's the whole idea. The rest of this article is about how to do that well.

Most teams shipping AI features skip this step. They paste a prompt into the playground, look at the answer, decide it reads fine, and ship. That works until it doesn't — usually around the second customer, or the first time someone swaps the model. Then the team is stuck debating whether the new version is “better,” with nothing to point at. Evals are how you get something to point at.

This is for developers and product folks shipping LLM-powered features — chat, search, summarisation, agents, anything where the output isn't deterministic. By the end you'll have a working mental model of what evals are, the three kinds you'll want, and a small recipe you can run on Monday morning with nothing but a spreadsheet.

2 · the failure mode
What goes wrong when you skip this.

Picture a doc-Q&A tool. Customers ask questions about a contract, the model answers, and the answer cites a page number. The team built it in a week. The demo went well. Six weeks later, a customer notices the model cited page 14 of a contract — and the clause is on page 27. The answer reads fluent. The number is wrong.

Without evals, three things are now true. One: nobody knows how often this happens. Two: nobody knows whether the new prompt the team shipped on Tuesday made it worse. Three: the only way to find out is to wait for the next customer to complain. The feedback loop is months long and runs through your support inbox.

With evals, the picture changes. You have, somewhere, a list of twenty real questions. For each one, you know what page the right answer cites. When the model gets a page number wrong, a number on a dashboard moves. When you change the prompt, the number moves. When you swap the model, the number moves. You can tell, before the customer can.
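To make that concrete, the eval set can literally be a spreadsheet. Something like this — the rows here are invented for illustration:

```csv
question,expected_page
What is the termination notice period?,27
Who owns IP created during the engagement?,9
What is the liability cap?,14
```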

That's the whole pitch. The rest is mechanics — what to put in the list, how to score it, and what to do when the list gets too big to grade by hand.

3 · the three kinds
The three kinds of evals.

There are roughly three kinds of evals, and most teams end up using all three. They get progressively more flexible, more expensive, and harder to automate. Start with the cheap ones.

Kind 1 — Unit tests.

A unit test for an LLM is a deterministic check. You decide, in advance, a yes-or-no question you can ask of the output, and you ask it with code, not with another model. Did the SQL the model produced contain a WHERE clause? Is the cited page number a number that exists in the document? Did the answer mention a competitor we asked it not to mention?
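Each of those checks is a few lines of ordinary code. A minimal sketch in Python — the data it checks against (page counts, banned terms) is assumed to come from your own pipeline:

```python
import re

def check_has_where_clause(sql: str) -> bool:
    # The generated SQL must filter; a missing WHERE usually means
    # a full-table scan the prompt was supposed to prevent.
    return re.search(r"\bWHERE\b", sql, re.IGNORECASE) is not None

def check_cited_page_exists(answer: str, doc_page_count: int) -> bool:
    # Every "page N" citation must point at a page the document has.
    pages = [int(p) for p in re.findall(r"page\s+(\d+)", answer, re.IGNORECASE)]
    return bool(pages) and all(1 <= p <= doc_page_count for p in pages)

def check_no_banned_terms(answer: str, banned_terms: list[str]) -> bool:
    # The answer must not mention competitors we asked it not to mention.
    lowered = answer.lower()
    return not any(term.lower() in lowered for term in banned_terms)
```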

These are cheap. They run in milliseconds. You run them on every prompt change and every model swap, the same way you run unit tests in any other codebase. They're also limited: they only catch things you can express as a rule. They won't tell you whether the answer is helpful, on-tone, or factually correct on something the rule can't see.

Use them wherever you can. They catch the dumb regressions — the ones where a prompt change breaks formatting, drops a required field, or stops citing sources. They're also where most teams should start, because they're the cheapest way to turn vibes into a number.

Kind 2 — LLM-as-judge.

When the question you want to ask isn't a regex — was the tone polite? did the answer cover all three points the user asked about? — you reach for a second LLM. You give it the input, the output, and a rubric (a short list of yes-or-no questions), and ask it to grade.
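A minimal sketch of what that looks like. `complete(prompt)` is a stand-in for whatever client you use to call the judge model, and the rubric questions are invented examples:

```python
JUDGE_PROMPT = """You are grading an assistant's answer.

User question:
{question}

Assistant's answer:
{answer}

Answer each question with exactly YES or NO, one per line:
1. Is the tone polite?
2. Does the answer cover every point the user asked about?
3. Does the answer cite its sources?
"""

def judge(question: str, answer: str) -> list[bool]:
    # complete() is a placeholder: swap in your own LLM client call.
    raw = complete(JUDGE_PROMPT.format(question=question, answer=answer))
    return ["YES" in line.upper() for line in raw.splitlines() if line.strip()]
```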

This scales. One judge model can grade thousands of outputs in the time a human grades ten. It's also biased in ways you have to measure. The judge prefers longer answers. The judge scores its own model family higher than rivals. The judge agrees with whatever the question implies. None of this is fixable by writing a better prompt — but all of it is measurable.

Use LLM-as-judge for the questions a unit test can't answer, but only after you've checked its agreement with a human grader on a small sample. Skip it if your eval set is small enough that a human can grade the whole thing in an afternoon — that's cheaper than calibrating a judge, and more accurate.
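Checking that agreement doesn't need anything fancy. A sketch, assuming you've stored the human grades and the judge grades as parallel lists of booleans; the thresholds in the comment are a rule of thumb of mine, not gospel:

```python
def agreement(human_grades: list[bool], judge_grades: list[bool]) -> float:
    # Fraction of outputs where the judge's grade matches the human's.
    assert len(human_grades) == len(judge_grades) and human_grades
    return sum(h == j for h, j in zip(human_grades, judge_grades)) / len(human_grades)

# On the calibration sample of thirty, roughly:
#   >= 0.9  -> trust the judge for this rubric
#   <  0.9  -> fix the rubric, or keep grading by hand
```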

Kind 3 — Human eval.

Sometimes the only thing that knows whether the answer is good is a person. Tone. Helpfulness. Whether the model actually answered the question the user asked, instead of the question it preferred to answer. Whether the summary leaves out something important. These are questions where a human grader is the gold standard, and where any model — including the judge — is fundamentally downstream of human judgment.

Human eval is slow and expensive. It also doesn't scale: ten people grading a hundred outputs each is a full week of work. So you do it sparingly and in two places. First, on a small fixed set of hard examples you re-grade every release — the ones where you really want to know whether you've regressed. Second, on the calibration sample for the LLM-judge, where you only need to grade thirty.

Skip human eval if a unit test or a calibrated LLM-judge can answer the question. Reach for it when the question is squishy enough that you don't trust either — and accept that some questions will always be squishy. That's not a failure of the eval system. That's the shape of the work.

4 · the recipe
How to start, on Monday.

You don't need a framework, a platform, or a vendor. You need a spreadsheet and an hour. The recipe is the one from section 1, made concrete: put twenty real questions in a column, write what a good answer looks like next to each, run the model, and mark each row pass or fail. It's deliberately small — small enough that you can run it before lunch and ship the result before the day ends.

Twenty prompts is enough to start. It's not enough to ship a product on, but it's enough to turn vibes into a number. Once you have the number, the rest of the practice — adding cases, calibrating a judge, scheduling human reviews — is mechanical. The hard part is the first row.
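If you'd rather run the loop in code than by hand, it fits in half a page. A sketch, assuming a CSV shaped like the one in section 2 and an `ask_model(question)` stand-in for your actual prompt-and-model pipeline:

```python
import csv
import re

def ask_model(question: str) -> str:
    # Placeholder: call your actual prompt + model here.
    raise NotImplementedError

def run_evals(path: str) -> float:
    passed = total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            answer = ask_model(row["question"])
            # Score: does the answer cite the page the spreadsheet expects?
            cited = re.search(r"page\s+(\d+)", answer, re.IGNORECASE)
            ok = cited is not None and int(cited.group(1)) == int(row["expected_page"])
            passed += ok
            total += 1
            print(f"{'PASS' if ok else 'FAIL'}  {row['question']}")
    print(f"{passed}/{total} passed")
    return passed / total
```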

One warning. If the suite ever passes 100%, the suite is probably broken — not because your model is perfect, but because the prompts in it have stopped being hard. Keep adding cases the model fails on. A useful eval set is one that occasionally tells you no.

5 · the point
The point.

The reason to do any of this isn't to catch bugs, although you will. It isn't to ship faster, although you will. The reason is to be able to answer the question your boss is going to ask. Is the new version better? Is the model we're paying for actually worth it? Did the change you made on Tuesday help? Without an eval set, every answer is an opinion. With one, every answer is a number — and you can argue about the rubric instead of arguing about your taste.

Open the spreadsheet.