A primer on calibrating an LLM-as-judge

Your LLM-as-judge has a palate too.

may 3, 2026·11 min read

1 · what this is aboutThe judge is another model, not a measurement.#

An LLM-as-judge is the second model you point at the first model’s output and ask, was that any good?It’s the most useful tool in the eval kit. It is also the one most teams misuse first. Once you have a judge, the dashboard moves on its own. The temptation is to assume the dashboard is now telling you the truth.

It isn’t. The judge is a model. It picks up the same biases any model has — it prefers what it sees first, it prefers more words, it ranks its own family higher. What an LLM-as-judge tells you is a property of the judge and the rubric, not of the output alone. Two judges with the same rubric, or one judge with two rubrics, will return different verdicts on the same answer.

This article is for engineers and product folks who already have a working judge — or are about to write one — and want to know when to trust it. By the end you’ll have a working mental model of what a judge actually measures, the three biases worth knowing by name, and a Monday-morning recipe for calibrating your judge against a human grader on thirty rows.

2 · the failure modeWhat happens when the judge isn’t calibrated.#

Picture a support assistant. It triages tickets, looks up refunds, walks customers through the basic troubleshooting tree, escalates when it gets stuck. The team adds an LLM-as-judge to score the outputs against a one-line rubric: did the answer address the user’s question?

The pass-rate the next morning is 71 %. The team tightens the prompt. Pass-rate climbs to 79. They swap in a larger model. 84. They reword the rubric. 88. The dashboard is going up and to the right. The team ships.

A month later, complaints arrive. Customers are getting confidently wrong answers about cancellation windows. The assistant invents encryption claims it can’t back up. Tone is drifting cute on serious questions. None of this shows up on the dashboard, because the rubric only ever asked one question and the judge only ever answered it. The team didn’t move quality. They moved a number.

The fix isn’t a smarter judge. The fix is a calibrated one — and calibration is a procedure, not a prompt.

3 · the three biasesThree biases the judge has, by name.#

Before you can calibrate a judge, you have to know what it’s biased about. The literature is dense; three biases show up in every paper and every production post-mortem. Learn these names. They make the failure modes legible.

is the judge preferring whichever answer it sees first. On pairwise comparison — A or B? — the answer in slot A wins more often than 50 %. The fix is mechanical: run the comparison twice, swap the order, and only count agreements. Disagreements are where the bias lives; ties go to whichever answer kept its verdict across the swap.

is the judge preferring whichever answer is longer, regardless of whether the extra length carried more information. This one is hard to defeat by prompt engineering — the judge has been trained on data where longer answers are usually better. The working fix is to bound length in the rubric itself: under 80 words, no fluff sentences, scored as a separate yes/no axis.

is the judge ranking its own model family higher than rivals. If you ask GPT-4 to grade GPT-4 against Claude, GPT-4 wins about a quarter more often than a human grader would call it. The fix is the simplest of the three — don’t use a model from the same family as the judge that scores it. If you have to, swap in a second-family judge for calibration runs and treat the divergence as the bias estimate.

None of these biases are a reason not to use an LLM-as-judge. They are reasons to make the judge prove itself before you trust the dashboard. The proof is calibration, and the mechanical part of calibration is the rubric. The rubric is the judge. The model is the executor. Two rubrics, one model — different answers.

judge · same outputs, two rubrics4/5 pass

Five outputs, two rubrics, three flips. The model is fixed; the outputs are fixed; the verdict isn’t. The lenient rubric asks one question (did it address the user?) and finds 80 % pass. The strict rubric asks four (factual? on-voice? in-scope? sourced?) and finds 40 %. Both numbers are real. Neither, on its own, is a measurement.

What turns either of those numbers into a measurement is the step the lenient-rubric team skipped: comparing the judge’s pass/fail against a human grader on the same rows. The gap between the two — the — is the only number that tells you whether the dashboard reflects reality.

4 · the recipeHow to calibrate the judge, on Monday.#

The recipe below is the smallest version that works. It costs about an hour of your time and an hour of one domain expert’s. It produces a number — the calibration delta — that you can put on the dashboard next to the pass-rate, and that anyone in the room can argue about.

Note —

Sample 30 outputs the judge has scored. Stratify if you can — half the judge called pass, half fail. Random within each stratum. Thirty rows is small enough to grade by hand and big enough to see the bias.
Strip the metadata; blind-grade. One spreadsheet, two columns: prompt and output. No judge score, no model name, no row order matching the original. The grader is one domain expert — the person whose taste you want the judge to inherit.
Score on a binary axis. Rubric first, score after. Factual? On-voice? In-scope? Sourced?Three to four axes, yes or no on each. Not 1–5. Not “quality.” Pick axes that map to your failure taxonomy from error analysis.
Compute the delta.Per axis, count rows where judge and human disagree. Divide by 30. A working floor we’ve seen across calibrated judges is 10–15 % — anything higher and the rubric or the judge prompt needs work before that axis is on the dashboard. Tighten the floor as you grow the sample.
Re-run weekly.The judge doesn’t drift; your product does. New features land, new failure modes appear, and a rubric calibrated last quarter no longer covers them. Thirty rows a week is the cheapest moving-window check you can run.

One warning. If you ever notice the calibration delta is improving every week without anyone changing the rubric, the rubric has stopped being hard. It’s drifting toward what the judge is good at — which is, mechanically, what the rubric rewards. Add cases the judge gets wrong. A useful rubric is one that occasionally tells your judge it’s wrong.

5 · the pointThe point.#

A judge prompt isn’t written. It’s fitted— to one rubric, against one human grader, on a held-out sample of thirty rows. Anything else is theatre. You can ship a judge in fifteen minutes; you cannot trust one until you’ve calibrated it. The dashboard between those two states looks identical. The dashboard is only a measurement after the calibration step. Before that, it’s a number-generator that happens to have an opinion.

Open the spreadsheet next to the judge prompt.