
part 8 of 9 · evals series →

A primer on agent and trajectory evals

Evals when your model uses tools.

12 min read

1 · what this is about: What changes when the model can act.

An agent is a model that gets to make decisions in a loop. It calls a tool, reads the result, decides what to do next, and keeps going until it thinks it's done. The support assistant in the rest of this article is a small one — it can call lookup_subscription, process_refund, and escalate_to_human, and it has to decide which to call and in what order.

The eval techniques from the rest of this series — pass rates, rubrics, error analysis — were built around a model that takes one input and produces one output. Once the model can act, that shape stops fitting. The final reply can read fluent and on-policy while the path the model took to get there leaked a customer's data, called the wrong tool, or quietly ignored an error. Agent eval grades the path, not the destination.

This is for engineers shipping tool-using models — support assistants, code agents, research agents, anything that takes more than one step before it stops. By the end you'll have a working mental model of what a trajectory is, the four checks an agent suite needs, and a Python skeleton you can wire into your runner this week.

2 · the failure mode: The reply was right. The behaviour was wrong.

A customer named Dana asks for a refund. The assistant needs to find Dana's subscription, check the refund policy, process the refund, and reply. Four steps, three tools. The reply that comes back is fluent: You're eligible — $39 has been refunded to the card on file. The amount is right. The tone is right. A grader scoring the final message would mark it pass.

Watch the trace. At step one, the assistant looked up Dana's subscription by name and got back one match. At step two, it looked up the same subscription again — but this time it passed the subscription id sub_4f12 as the new user_id argument, overwriting the conversation's user reference. From step three on, the assistant was acting on a subscription record whose owner was no longer the customer it was replying to. The refund went through. The receipt went to the wrong address. The reply read fine.

Without a way to grade the path, three things are now true. You can't tell how often this happens. You can't tell whether the prompt change you shipped on Tuesday made it more common. The only signal is a customer noticing they got someone else's receipt — and by then the leak has already happened. The endpoint check stopped looking before the bug appeared.

3 · the trajectory: Grade the whole arc, not the last frame.

The unit of evaluation for an agent is the trajectory: the full sequence of decisions, tool calls, results, and state deltas the agent produces between the user's message and the assistant's last reply. Anthropic's Demystifying Evals for AI Agents calls it the transcript; LangSmith and most observability tools call it the trace; the three names mean the same thing. A trajectory eval reads it the way a code reviewer reads a pull request — step by step, looking for the move that doesn't fit.

Each step in a trajectory has four parts. The thought — what the model said it was going to do, if you ask it to think out loud. The tool call — the function name and arguments the model emitted. The result — what the runtime returned. The state delta — what changed in the agent's working memory after the result came back. The widget below walks the six steps of the refund scene from §2; click through them and watch where step 2 goes wrong.
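As a concrete sketch, the compromised step from that trace might be logged as a record like the one below; the field names are illustrative rather than any particular framework's schema, and the thought and result values are invented for the example.

```python
# One trajectory step as a plain record. Field names and values are illustrative.
step_2 = {
    "thought": "I have a match; look the subscription up again to confirm the plan tier.",
    "tool_call": {
        "name": "lookup_subscription",
        # The §2 bug: a subscription id passed where a user id belongs.
        "arguments": {"user_id": "sub_4f12"},
    },
    "result": {"subscription_id": "sub_4f12", "plan": "pro", "owner": "not Dana"},
    # The conversation's user reference is overwritten; every later step inherits it.
    "state_delta": {"user_id": "sub_4f12"},
}
```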

[Interactive widget: trajectory inspector. Step through the 6-step refund trace; each step shows the agent's thought, the tool it called, the result, and the state after.]

The endpoint check passes. The trajectory check fails at step 2 — wrong record, persisted, no recovery. Everything downstream is contaminated, but every individual move after step 2 is locally reasonable: read the right policy, run the math, send a reply. That's the shape of a propagating failure. The wrong path that lands on the right answer is a bug waiting for a different prompt.

4 · the four checks: What an agent eval suite is grading for.

Most production agent suites end up running four kinds of check. They overlap; you skip the ones your task can't fail. None of them is enough on its own. The point is that they're asking different questions.

Check 1 — Outcome.

Did the agent end up in the right state? For the support assistant: did the refund actually process, was the ticket closed, did the reply name the right amount? Outcome checks are the cheapest. They're also what an endpoint eval was already doing.

Run them. They catch the dumb regressions — a prompt change that starts forgetting to call the close-ticket tool, a model swap that stops citing the policy. They're also where most teams start, and where most teams wrongly stop.

Check 2 — Trajectory.

Was each step a sensible move given the state? A trajectory check mixes two graders. A rule-based one (step 1 must call lookup_subscription before any other tool; the same subscription id must not appear as a user_id argument later). And an LLM-judge that reads the trace and asks did this sequence of decisions follow the policy doc? TAU-Bench takes the state-based variant of this check — it compares the database at the end of a conversation against an annotated goal state, so any trajectory that lands the right writes passes. BFCL v3's multi-turn evaluator is the canonical reference for the strict trajectory variant — it inspects which functions, in what order, with what arguments, against an expected sequence of calls.

The trade-off between the two graders is the same trade-off as unit tests vs LLM-as-judge from Article 1, one level up. Rule-based trajectory checks are cheap and unambiguous, but they lock you to one canonical path; the agent that solves the same task by a different valid order will fail. LLM-judge trajectory grading handles variance, but it inherits judge bias and needs the same calibration discipline as a single-output judge. Most production suites run rule-based checks on the constraints that must hold (no PII leak, no out-of-scope tool) and an LLM-judge on whether the path read as reasonable.
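For the judge half of that mix, a minimal sketch looks like the prompt below; the rubric wording, the llm callable, and the JSON output shape are all assumptions to adapt, not a calibrated template.

```python
# Hypothetical LLM-judge for trajectory grading. The llm callable, the rubric
# wording, and the output schema are assumptions; calibrate before trusting it.
TRAJECTORY_JUDGE_PROMPT = """\
You are reviewing an agent's full trace against the support policy below.

<policy>
{policy}
</policy>

<trace>
{trace}
</trace>

For each step, decide whether the move was reasonable given the state at that
point. Then answer overall: did the sequence of decisions follow the policy?
Reply with JSON: {{"verdict": "pass" | "fail", "first_bad_step": int | null,
"reason": "<one sentence>"}}.
"""

def judge_trajectory(llm, policy: str, steps: list[dict]) -> dict:
    trace = "\n".join(f"step {i + 1}: {s}" for i, s in enumerate(steps))
    return llm(TRAJECTORY_JUDGE_PROMPT.format(policy=policy, trace=trace))
```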

Check 3 — Tool use.

Was the right tool called with the right arguments? The Berkeley Function-Calling Leaderboard's methodology — AST-based exact-match — is the standard here. The agent's tool call is parsed into an abstract syntax tree; the expected call is parsed the same way; the trees are compared. A wrong argument fails the call even when the final answer happens to land correctly.

AST-match works for arguments that are structured — ids, enums, numbers, dates. For free-text arguments — a search query, a body of text, a tool input that's itself a paragraph — it's brittle, and you reach for an LLM-judge instead. Mixed graders, all the way down.
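As a rough illustration of the AST-match idea using Python's standard ast module (this is the shape of the check, not BFCL's actual evaluator, and the ids in the example are made up):

```python
import ast

def parse_call(call_src: str) -> tuple[str, dict]:
    """Parse a tool call written as Python source into (name, keyword arguments)."""
    node = ast.parse(call_src, mode="eval").body
    assert isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    return node.func.id, {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}

def ast_match(actual_src: str, expected_src: str) -> bool:
    # Same function name, same argument names and values -> match.
    return parse_call(actual_src) == parse_call(expected_src)

# A wrong user_id fails the call even though the amount is right.
assert not ast_match(
    'process_refund(user_id="sub_4f12", amount=39)',
    'process_refund(user_id="u_1001", amount=39)',
)
```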

Check 4 — Recovery.

What does the agent do when a tool returns an error? Three possibilities, and only two are safe. It can retry (good). It can escalate or surface the failure to the user (good). It can ignore the error and confidently make up a result (bad). The third is the failure mode that costs you customers; it's also the one a casual outcome check will miss, because a confidently-confabulated reply still looks right.

A recovery probe is a small, deliberate set of prompts where the tool runtime is rigged to return an error. The eval checks whether the agent's next move is a retry, an escalation, or a hallucinated success. Anthropic's agent-eval cookbook treats this as a routine row in the suite, not a special pass. Treat tool errors as test inputs, not edge cases.
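A sketch of what that probe can look like, assuming a fake tool runtime you control and a step log shaped like the one in §3; FakeRuntime and the classifier below are illustrative, not a library API.

```python
# Hypothetical recovery probe: the refund tool is rigged to fail on its first
# call, and the grader classifies the agent's next move from the step log.
class FakeRuntime:
    def __init__(self):
        self.refund_attempts = 0

    def process_refund(self, **kwargs):
        self.refund_attempts += 1
        if self.refund_attempts == 1:
            raise RuntimeError("rate_limited: try again later")
        return {"status": "refunded", **kwargs}

def classify_recovery(steps: list[dict], reply: str) -> str:
    error_idx = next(i for i, s in enumerate(steps) if s.get("error"))
    after = steps[error_idx + 1:]
    if any(s["tool"] == "process_refund" for s in after):
        return "retry"                # safe
    if any(s["tool"] == "escalate_to_human" for s in after):
        return "escalate"             # safe
    if "refund" in reply.lower() and "unable" not in reply.lower():
        return "confabulated"         # the bad case: claimed success anyway
    return "surfaced_failure"         # safe: told the user it didn't go through
```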

5 · cross-model: Same suite, three models, three trajectories.

A cross-model eval is the same trajectory suite, run against more than one model backend, with the prompts and tool runtime held constant. Same tasks, same grader, swap the runner. Most observability platforms — LangSmith, Braintrust, Weave — let you wire this in a few lines once the suite exists.

The temptation is to read the leaderboard and pick the highest number. The temptation is wrong. The widget below runs the refund task against three models. Click through them and read the verdicts row at the bottom.

[Interactive widget: cross-model comparison on the task "refund a pro subscription". Model α verdicts: Outcome PASS (the reply is the right shape); Recovery FAIL (the refund tool errored and the model lied to the user).]

Model α scores higher on outcome than β does, but β recovers from the rate-limited refund tool and α confabulates a refund confirmation that didn't happen. A pass-rate ranking would call α the best of the three; the trajectory eval shows γ is the one to ship. A higher pass rate on outcome is not the same as a better model.

Two practical notes. Cross-model evals are paired — same prompts, same expected trajectories, just swap the runner — so the right statistical test for the delta is McNemar's, not a two-sample comparison. Cross-model is also where pass^k starts to matter: a model that succeeds 80% on a single trial may succeed only 33% across five back-to-back trials, and an agent that's flaky across runs is a different product than one that succeeds reliably.
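Both points are cheap to make concrete; a sketch, assuming paired per-task pass/fail lists for two models (the result lists are invented, the mcnemar call is statsmodels'):

```python
from statsmodels.stats.contingency_tables import mcnemar

# pass^k: probability that all k back-to-back trials succeed.
def pass_k(single_trial_rate: float, k: int) -> float:
    return single_trial_rate ** k

print(round(pass_k(0.80, 5), 2))  # 0.33: flaky across runs, fine on a single trial

# McNemar's test on paired per-task outcomes (illustrative data).
alpha_pass = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
beta_pass  = [1, 0, 0, 1, 1, 1, 1, 1, 1, 1]
table = [[0, 0], [0, 0]]
for a, b in zip(alpha_pass, beta_pass):
    table[a][b] += 1              # the discordant cells drive the test
print(mcnemar(table, exact=True).pvalue)
```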

6 · wire it in: The smallest trajectory grader you can ship.

You don't need a framework or a vendor to start. You need a list of trajectories, a few rules, and an hour. The Python below is a sketch — three checks, one per dimension worth grading first. Translate the language to taste; the shape is the point.
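A minimal version of that skeleton, assuming each trajectory is logged as a dict with steps, reply, and end_state fields; the field names, the expected-calls format, and the rules themselves are illustrative, to be swapped for the constraints your own task can actually violate.

```python
from dataclasses import dataclass

# Assumed shape: trajectory = {"steps": [...], "reply": str, "end_state": dict},
# where each step is {"tool": str, "args": dict, "result": dict | None, "error": str | None}.

@dataclass
class Verdict:
    outcome: bool
    trajectory: bool
    tool_use: bool

def check_outcome(end_state: dict, reply: str) -> bool:
    # Did the run end in the right state, and does the reply name the right amount?
    refunded = end_state.get("refund_amount")
    return refunded is not None and f"${refunded}" in reply

def check_trajectory(steps: list[dict]) -> bool:
    # Rule 1: the first tool call must be lookup_subscription.
    if not steps or steps[0]["tool"] != "lookup_subscription":
        return False
    # Rule 2: a subscription id must never reappear as a user_id argument (the §2 leak).
    sub_ids = {
        s["result"].get("subscription_id")
        for s in steps
        if s["tool"] == "lookup_subscription" and isinstance(s.get("result"), dict)
    }
    sub_ids.discard(None)
    return not any(s.get("args", {}).get("user_id") in sub_ids for s in steps)

def check_tool_use(steps: list[dict], expected_calls: list[dict]) -> bool:
    # Strict variant: exact-match the (tool, args) sequence against the expected calls.
    actual = [(s["tool"], s.get("args", {})) for s in steps]
    expected = [(c["tool"], c.get("args", {})) for c in expected_calls]
    return actual == expected

def grade(trajectory: dict, expected_calls: list[dict]) -> Verdict:
    return Verdict(
        outcome=check_outcome(trajectory["end_state"], trajectory["reply"]),
        trajectory=check_trajectory(trajectory["steps"]),
        tool_use=check_tool_use(trajectory["steps"], expected_calls),
    )
```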

The 20-to-50-tasks rule from Anthropic's agent-eval guidance is the same rule from Articles 2 and 7: the best eval prompts come from real failures, not from your imagination. The discipline scales; the unit changes. The unit, now, is the trajectory.

Three things this skeleton deliberately leaves out, that you add as your suite matures. First, an LLM-judge for the long-tail questions — was the reply on-policy? That uses the calibration recipe from Article 3, applied to trajectories instead of single replies. Second, paired statistics — when you start comparing models or prompt variants, the McNemar's test from Article 6 is the right tool for the delta. Third, the flywheel — the trajectories you grade today should keep growing from production traces, sampled and anonymised per Article 7. The agent suite is the same suite, with a different unit of measurement and a few more rules.

7 · the point: The path is the product.

The reason to grade trajectories isn't that endpoint grading is wrong. It's that endpoint grading was adequate when the model was a function — one input, one output, score the output. Once the model became a planner — once it could call tools, retrieve context, observe results, decide what to do next — the unit of evaluation moved up a level. The trajectory is the new function call. The path is the product.

Read the transcripts.