1 · what this is about
What error analysis is, in one paragraph.
Error analysis is the part of evals where you read your model's outputs and write down what's wrong with them. Not in aggregate. One trace at a time, in plain English. The last article was about what an eval set is. This one is about choosing what goes in it — and the only honest way to choose is to read what your model has been doing.
Most teams skip this step. They read a guide that says look at your data, they nod, and they go back to their code. The advice is correct; it has just never come with an instruction manual. Looking at the data has a procedure. It is borrowed from qualitative research, it has a stop signal at around one hundred traces, and it is the cheapest unit of useful work you can do this week.
This is for developers and tech leads shipping LLM-powered features who already have an eval set or are about to build one. By the end you'll have the working procedure, the saturation rule, and a Monday-morning recipe you can run with a spreadsheet and an hour.
2 · the failure mode
What goes wrong when you skip this.
Picture a Friday morning. The team has decided to ship evals. Someone opens a spreadsheet and asks, what could go wrong with this product? Twenty rows later, the eval set exists. It runs. It passes 95%. The team ships.
Two weeks later a customer complains. The model quoted a 14-day refund window as 30 days. Nobody on the team had thought to test refund-window factuality, because nobody on the team had read a trace where the model did this. The eval set passed because the eval set was a list of the team's guesses. It was never a list of the model's actual failures.
The fix is structural, not a matter of being more careful. You cannot list the failure modes you haven't seen. Imagination produces a flat distribution of generic worries. Reading produces a sharp distribution of the specific things your product gets wrong — the ones that repeat, the ones that cluster, the ones a real eval row can be written against.
So the order is reversed from how most teams do it. You read first. The eval set is what falls out. The rest of this article is the procedure for the reading.
3 · the mechanism
Open coding, then axial coding, then stop.
The procedure has two passes and a stop rule. The first pass is uncomfortable because it asks you to read without a system. The second pass is where the system shows up. The stop rule tells you when reading has stopped paying.
Pass 1 — open coding.
Open coding is the first pass. You read one trace, and you write a short note in plain English about what's wrong with it. "Quoted 30 days for the refund window; policy says 14." "Stayed in chat after the user asked for a human three times." "Misclassified a billing ticket as feature-request." Then you move to the next trace and do it again.
You do not write down a category. You do not set up a taxonomy. You do not even know yet what the categories are. The categories are what you're trying to learn. Pre-imposing them is the failure mode of error analysis itself — you end up labelling traces against your guesses again, and the whole point was to get past your guesses.
The widget below shows the rhythm. Twelve traces from a B2B support assistant — refund-policy lookups, ticket triage, multi-turn troubleshooting, escalation decisions. Read each one, pick the chip that names what's wrong, and watch the tally at the bottom. You're running a mini version of pass one.
Twelve isn't a hundred. The rhythm is the rhythm, though: read, name what's wrong in your own words, move on. Do that ninety more times and the categories piling up in the corner of the page will tell you what your product's failure modes are. You will know things about your model that nobody on your team knows.
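Off the page, the same pass is nothing more than a loop and a notes column. Here is a minimal sketch in Python; the file names and field names are placeholders, not anything the procedure prescribes:

```python
import csv
import json

# Pass 1, open coding: read one trace, write one plain-English note, move on.
# Assumes a traces.jsonl file with {"id": ..., "output": ...} per line; the
# file name and field names are illustrative.
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

with open("notes.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["trace_id", "note"])
    for trace in traces[:100]:  # the rough hundred-trace reading budget
        print(f"\n--- trace {trace['id']} ---")
        print(trace["output"])
        note = input("What's wrong, in your own words? (blank if nothing) ")
        writer.writerow([trace["id"], note])
```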
Pass 2 — axial coding.
After roughly a hundred traces, you stop reading and start clustering. Axial coding is the pass where you group similar notes into named buckets. Notes that say variations of the same thing collapse into one bucket. Notes that don't fit anywhere become their own bucket, or they get flagged as outliers and set aside.
What you end up with is a small taxonomy: usually 4 to 8 buckets, each with at least three instances. A bucket with two instances is an outlier worth remembering, not a category worth testing for. The taxonomy is product-specific. Two assistants trained on the same model can have entirely different ones.
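If the notes live in a spreadsheet, the clustering step is mostly counting. A sketch, assuming you have gone back and filled in a bucket column by hand; the three-instance threshold is the one from the paragraph above:

```python
import csv
from collections import Counter

# Pass 2, axial coding: tally the hand-named buckets and separate real
# categories from outliers. Assumes notes.csv now has a "bucket" column
# that a human filled in after clustering; the column name is illustrative.
with open("notes.csv") as f:
    rows = list(csv.DictReader(f))

counts = Counter(row["bucket"] for row in rows if row.get("bucket"))

taxonomy = {bucket: n for bucket, n in counts.items() if n >= 3}  # worth testing for
outliers = {bucket: n for bucket, n in counts.items() if n < 3}   # worth remembering

print("taxonomy:", taxonomy)
print("outliers:", outliers)
```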
The taxonomy is also where the eval set comes from. Each bucket becomes one or more rows, and each bucket names the test type that catches it. Format violation is a regex unit test. Hallucinated policy is an LLM-judge with the policy doc in context. Missed escalation is squishy enough to need a human grader on a small fixed sample. The taxonomy is the eval set's table of contents.
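What a row looks like depends on the bucket. A hedged sketch of the two mechanical cases above; the ticket-id format, the judge prompt, and both model-call stubs are invented for illustration, so swap in your own client calls:

```python
import re

def run_assistant(prompt: str) -> str:
    """Stand-in for your model call; replace with your own client."""
    return "Your refund is being processed under TICKET-4821."

def call_judge_model(prompt: str) -> str:
    """Stand-in for the judge-model call; replace with your own client."""
    return "PASS: the reply does not contradict the policy."

# Bucket "format violation" -> a plain regex unit test.
def test_reply_cites_ticket_id():
    reply = run_assistant("Where is my refund for order 118?")
    assert re.search(r"TICKET-\d+", reply), "reply must cite a ticket id"

# Bucket "hallucinated policy" -> an LLM judge with the policy doc in context.
JUDGE_PROMPT = """You are checking a support reply against the policy below.

Policy:
{policy}

Reply:
{reply}

Does the reply contradict the policy? Answer PASS or FAIL, then explain."""

def judge_policy_grounding(reply: str, policy: str) -> str:
    return call_judge_model(JUDGE_PROMPT.format(policy=policy, reply=reply))
```

The missed-escalation bucket stays with a human grader on a small fixed sample; not every bucket compresses into code.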
The stop rule — saturation.
You stop reading when reading stops teaching you anything. The technical name is saturation: the point at which new traces stop introducing new failure modes and only repeat the ones you've already named. The empirical floor across the practitioner literature is around a hundred traces. The heuristic for when to stop, from Hamel's evals-FAQ, is twenty consecutive traces without a new category.
Saturation is a stop signal, not a target. It tells you the cost of the next trace has finally exceeded its information value. Past that point you should be running your model on the eval set you just built, not still reading. Reading more becomes a way to avoid shipping — and your data has already told you what you needed.
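The twenty-in-a-row heuristic is easy to check after the fact. A sketch that replays your notes in reading order and reports where you could have stopped; during the actual read you keep the same tally in your head:

```python
def saturation_point(buckets_in_reading_order, window=20):
    """Return the trace count at which `window` consecutive traces added no
    new bucket, or None if the reading never saturated. Bucket names are
    whatever you assigned in pass two; the names here are illustrative."""
    seen = set()
    run_without_new = 0
    for count, bucket in enumerate(buckets_in_reading_order, start=1):
        if bucket and bucket not in seen:
            seen.add(bucket)          # a new failure mode resets the run
            run_without_new = 0
        else:
            run_without_new += 1      # repeat (or nothing wrong) extends the run
            if run_without_new >= window:
                return count
    return None

# e.g. saturation_point(["hallucinated policy", "missed escalation", ...])
```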
4 · the recipe
How to start, on Monday.
You don't need a tool. You need a spreadsheet and an afternoon. The recipe is the smallest version of the procedure that still produces a useful taxonomy: pull roughly a hundred recent traces, read each one and write a plain-English note about what's wrong, stop once twenty traces in a row teach you nothing new, cluster the notes into buckets, and write eval rows against the buckets. Eugene Yan calls this the scientific method in disguise, and that framing is more accurate than it sounds — observe, annotate, hypothesise, test.
The work is unglamorous. It is also work nobody else can do for you. A contractor or a model can label traces against an existing taxonomy. None of them can build the taxonomy in the first place. Building the taxonomy is how you learn what your product is doing. The labels are downstream of that knowledge, and the knowledge does not transfer through a handoff.
5 · the point
The point.
Without error analysis, your eval set is a portrait of your team's imagination. With it, your eval set is a portrait of your product. The difference shows up in one number — the rate at which the customer surprises you. Without it, the customer surprises you first. With it, the eval set surprises you first, and the customer finds a hardened model.
The taxonomy you build today is the dataset you maintain for the next year. Open the spreadsheet, and start reading.