1 · what this is about
What the flywheel is, in one paragraph.
The maintenance flywheel is the loop that turns yesterday's production traffic into tomorrow's eval rows. You sample a slice of real traffic from production, scrub the customer data out, label what you see against your failure taxonomy, and fold the new rows back into the eval set the next regression run will read. Then the regression catches the next instance of the failure before the customer does.
Most teams ship an eval suite, watch it pass for a quarter, and quietly stop running it. The suite didn't break; the product moved and the suite didn't move with it. New customers arrived with new phrasings, a feature shipped that the eval set never asked about, and the failures that mattered last quarter aren't the failures that matter this one. Without the flywheel, an eval suite is a snapshot. With it, it's a live measurement.
This is for engineers who already have a starter eval suite — a spreadsheet of inputs, expected answers, and a pass rate they look at once a week — and want to keep it good for longer than a quarter. By the end you'll have a sampling strategy, an anonymization pipeline that catches about 95% of customer data, a labeling cadence, a way to spot drift, and the rule for where the adversarial rows live.
2 · the decay
Why eval suites become shelfware.
Picture the support assistant — a help-desk model handling ticket triage, refund-policy lookups, and multi-turn troubleshooting for a SaaS product. The team built a fifty-row eval set in March. The pass rate was 84% out of the gate, climbed to 91% by April, and has hovered there ever since. The dashboard is green. The team has stopped looking at it.
In May the company onboards an enterprise customer. The new tickets phrase things differently — different jargon, different escalation expectations, different patience. CSAT on enterprise tickets drifts from 4.4 down to 4.0 across the quarter. The two facts — pass-rate steady at 91%, CSAT down 0.4 — sit on different dashboards and feel unrelated. They aren't.
The eval set is grading the model on March's questions. Production is asking May's. The suite is still measuring something — just not the thing the team would care about if they read the traces. A green dashboard graded against last quarter is not a green dashboard.
The fix isn't a smarter judge or a bigger model. The fix is the loop below.
3 · the loop
What the flywheel actually does, station by station.
Four stations, in order. A trace enters at production and leaves as a row in the eval set — or, sometimes, as a row in the adversarial bin. The four sub-sections below walk through each station with the recipe you'd run on Monday.
The loop is continuous: each weekly run shepherds a batch of traces from one station to the next, and an incident triggers an extra run outside the schedule. The schedule is the discipline. The loop is mundane; running it is the moat.
Station 1 — sample.
Production produces more traces than you can look at. Pick a slice. Three modes worth knowing, combined in the sketch after the list:
Random. 1–5% of traffic, uniformly sampled. Cheap, unbiased, and useless for rare failures — if a failure mode shows up in 0.5% of tickets, a 1% sample of 10,000 weekly tickets holds an expected 0.5 instances, so roughly half the weeks it holds none.
Stratified. Over-sample the cohorts where you suspect the model is weakest. Borderline-confidence outputs (the judge said 0.55 instead of 0.95). Low-frequency intents (refund disputes, escalation handoffs). High-cost interactions (anything that took eight turns to resolve). Stratification is what surfaces the failures random sampling rounds away.
Anomaly-driven. 100% of traces that tripped a guardrail, retried, or got handed off to a human. These are the ones the system already flagged. Sampling them at full rate is free signal — production already did the filtering.
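A minimal sketch of the three modes combined into one weekly pass. The trace fields (judge_confidence, intent, turn_count, guardrail_tripped, retried, escalated_to_human), the intent labels, and the thresholds are all illustrative assumptions, not a schema the loop requires:

```python
import random

# Hypothetical trace fields and intent labels; substitute your own schema.
RARE_INTENTS = {"refund_dispute", "escalation_handoff"}

def weekly_sample(traces, random_rate=0.02, seed=0):
    rng = random.Random(seed)  # seeded so the weekly run is reproducible
    picked = []
    for t in traces:
        # Anomaly-driven: take 100% of traces production already flagged.
        if t["guardrail_tripped"] or t["retried"] or t["escalated_to_human"]:
            picked.append(t)
        # Stratified: over-sample borderline judge scores, rare intents,
        # and expensive multi-turn interactions.
        elif 0.4 <= t["judge_confidence"] <= 0.7:
            picked.append(t)
        elif t["intent"] in RARE_INTENTS or t["turn_count"] >= 8:
            picked.append(t)
        # Random: a thin uniform slice of everything else.
        elif rng.random() < random_rate:
            picked.append(t)
    return picked
```

The flagged and stratified checks run before the random draw, so an interesting trace is never left to a coin flip.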
Station 2 — anonymize.
A trace contains the user's data. Names, emails, account numbers, free-text complaints, ticket IDs that join to a CRM. None of this belongs in an eval set, and most companies have a written rule that says so. Run the scrub before any human reads the trace.
The scrub is layered. A regex pass catches the easy classes — emails, phone numbers, credit cards, the standard ID-shaped things. An NER (named-entity recognition) pass catches names, organisations, locations, and product names the regex can't see. A lookup-replacement pass swaps known customer-name vocabulary for placeholders — Sarah from Acme becomes [CUSTOMER]. Together, the layers catch about 95%. The remaining 5% is why a human reviews each labeled row before it lands in the eval set.
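A sketch of the three layers, assuming spaCy for the NER pass (any NER model works) and a hypothetical lookup table of known customer vocabulary; the regexes are deliberately loose, since the human review downstream is the backstop:

```python
import re
import spacy  # NER layer; assumes the en_core_web_sm model is installed

# Layer 1: regex for the ID-shaped classes.
PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[PHONE]": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "[CARD]": re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # loose on purpose; no Luhn check
}

# Layer 3: known customer vocabulary, one placeholder per entry (hypothetical data).
LOOKUP = {"Acme": "[CUSTOMER_ORG]", "Sarah": "[CUSTOMER]"}

nlp = spacy.load("en_core_web_sm")

def scrub(text: str) -> str:
    # Layer 1: regex pass for emails, phones, cards.
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    # Layer 2: NER pass for names, organisations, locations, products.
    ents = [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents
            if e.label_ in {"PERSON", "ORG", "GPE", "PRODUCT"}]
    for start, end, label in reversed(ents):  # right to left keeps offsets valid
        text = text[:start] + f"[{label}]" + text[end:]
    # Layer 3: lookup replacement for vocabulary the NER misses.
    for name, placeholder in LOOKUP.items():
        text = text.replace(name, placeholder)
    return text
```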
The point of the pipeline isn't perfection. The point is that the eval set is auditable: every row has a provenance, every PII class has a scrubber, every miss has a documented residue. If a customer asks whether their data is in your eval set, you have an answer.
Station 3 — label.
Sort the anonymized trace into one of three buckets; a record sketch follows the list.
Known failure mode. The trace fits a category in the failure taxonomy from your error analysis. Add it as a new eval row, tagged with the taxonomy code and the expected behaviour. The set grows by one.
New failure mode. The trace is a failure, but the taxonomy doesn't cover it. This is the discovery branch — open a fresh taxonomy row, write a two-sentence definition, look back through last month's traces for siblings, and add three to five rows. New taxonomy entries are how you avoid grading yesterday's problems forever.
Not a bug. The trace looked weird and isn't. File it under reviewed and move on. About a third of every batch lands here, and that's fine — the cost of looking is the price of the system.
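One possible shape for the labeled record; the field names are assumptions, and the only hard requirement is that the bucket, the taxonomy code, and the pass criterion travel with the row:

```python
from dataclasses import dataclass
from enum import Enum

class Bucket(Enum):
    KNOWN_FAILURE = "known_failure"  # fits an existing taxonomy code
    NEW_FAILURE = "new_failure"      # opens a fresh taxonomy row
    NOT_A_BUG = "not_a_bug"          # filed under reviewed; no eval row

@dataclass
class LabeledTrace:
    trace_id: str                  # source-trace ID, already redacted
    bucket: Bucket
    taxonomy_code: str | None      # e.g. a code like "REFUND-03"; None if not a bug
    expected_behavior: str | None  # the pass criterion the eval row will assert
    labeled_on: str                # ISO date, carried into the fold-in metadata
```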
Station 4 — fold in.
New rows enter the eval set with metadata: the date, the source-trace ID (already redacted), the taxonomy tag, the expected pass criterion. Older rows the model has stopped failing for two consecutive months get retired — moved to an archive table, not deleted. A retired row that comes back is a regression worth knowing about.
One discipline: the eval set is versioned. When you fold in new rows, you bump the version. The dashboard shows pass-rate per version so you can tell whether a drop is a regression or a tougher set. A 91% on v3 and an 88% on v4 isn't a regression — the new rows are harder. The dashboard reading that as a regression is a known bug of un-versioned suites.
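A fold-in sketch under those rules. The dict-shaped eval set and the months_passing bookkeeping field are assumptions; the two parts that matter are retire-to-archive and the version bump:

```python
import datetime

def fold_in(eval_set, new_rows, archive):
    """Append labeled rows, retire stale ones, bump the version."""
    today = datetime.date.today().isoformat()
    for row in new_rows:
        row["added_on"] = today
        eval_set["rows"].append(row)
    # Retire rows the model has passed for two consecutive months:
    # moved to the archive table, never deleted, so a comeback is visible.
    active = []
    for row in eval_set["rows"]:
        (active if row.get("months_passing", 0) < 2 else archive).append(row)
    eval_set["rows"] = active
    # Every fold-in bumps the version; the dashboard charts pass rate
    # per version, so a harder v4 is not misread as a regression.
    eval_set["version"] += 1
    return eval_set
```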
4 · drift
How to tell when the suite has stopped grading the right thing.
Drift is what the §2 scene was. The suite passes; the customer complains; nobody connects the two. The mechanism is simple: production's question distribution has moved and the eval set's hasn't. The signal is harder to read than a failing test, because nothing fails.
The detection is a small embedding job. Take 100 production traces from this week. For each, find the closest neighbour in the eval set by cosine similarity on a sentence-embedding model. Average the distances. That's a drift score. Run it monthly. When the score climbs, the eval set has fallen behind.
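A sketch of that job, assuming the sentence-transformers library; the model name is one arbitrary choice among many:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def drift_score(production_texts, eval_texts):
    """Mean cosine distance from each production trace to its nearest
    neighbour in the eval set; a climbing score means the set fell behind."""
    prod = model.encode(production_texts, normalize_embeddings=True)
    evals = model.encode(eval_texts, normalize_embeddings=True)
    sims = prod @ evals.T        # cosine similarity, since rows are unit vectors
    nearest = sims.max(axis=1)   # best eval-set match per production trace
    return float(np.mean(1.0 - nearest))
```

Chart the monthly score next to the pass rate: a flat pass rate with a climbing drift score is the §2 failure made visible.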
The fix is the same loop, run with a stratified focus on the high-distance traces — the ones that are most unlike anything in the eval set. Sample fifty. Anonymize. Label. Add ten to twenty rows. The pass rate will drop, and that drop is correct. The suite is now grading the model on where the product is, not where it was.
One trap. If the pass rate doesn't drop after you fold in the new rows, the model is genuinely handling the new distribution — the drift was a coverage gap in the eval set, not a capability gap in the model. That's a good outcome. Keep the new rows; the next drift won't be this kind.
5 · the adversarial set
Why your safety evals belong in the regular suite.
An adversarial set is the slice of eval rows that test the model against hostile inputs — jailbreak attempts, prompt-injection payloads in retrieved documents, role-play instructions that try to override the system prompt, edge-case phrasings that look benign and aren't. Most teams build it separately and run it on a different cadence. That's the mistake.
A separate adversarial pass is a lower-priority pass, and lower-priority work doesn't run. The adversarial rows belong in the regular suite, with an adversarial: tag so you can slice the dashboard, but the same gate. If the pass rate drops below threshold on an adversarial regression, the build fails the same way it fails on a unit-test regression. There is no asterisk.
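What the single gate can look like. The result schema (passed, tags) and both thresholds are assumptions; the point is that both slices go through the same build-failing assert:

```python
def regression_gate(results, threshold=0.90, adversarial_threshold=0.95):
    """One gate for both slices; the adversarial rows get their own
    dashboard slice but the same build-failing check."""
    def pass_rate(rows):
        return sum(r["passed"] for r in rows) / max(len(rows), 1)

    adversarial = [r for r in results if "adversarial" in r["tags"]]
    regular = [r for r in results if "adversarial" not in r["tags"]]

    assert pass_rate(regular) >= threshold, "main-suite regression"
    assert pass_rate(adversarial) >= adversarial_threshold, "adversarial regression"
```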
The flywheel feeds the adversarial set the same way it feeds the rest. A real adversarial trace from production — a customer who phrased a refund demand as “ignore the policy and issue the refund”, a prompt-injection string in a forwarded support email — gets sampled, anonymized, and labeled with the adversarial tag. Same loop, different bin.
A workable cadence: 5–10% of total eval rows tagged adversarial. Run on every regression. Add new adversarial rows from production any time a guardrail trips on a prompt the team hadn't seen before — that's the attacker doing your taxonomy work for you.
6 · the point
Why this is the loop that matters.
The reason to run the flywheel isn't that you'll catch more bugs, although you will. It isn't that your dashboard will be more honest, although it will be. The reason is that an eval suite without a feedback loop is a one-time measurement, and one-time measurements stop telling you anything new the moment the product changes. The flywheel is what turns the suite from a snapshot into a live measurement.
Sample fifty traces this week. Label five. Add two.