LLM-as-a-Judge: Automated Evaluation That Works

LLM-as-a-judge is a method for scoring model output with another model instead of a human reviewer. You hand a judge model the input, the response, and a rubric, and it returns a score. That is the whole idea. The reason it matters is arithmetic. A skilled human reviewer reads maybe 30 to 60 transcripts an hour, at a loaded cost of $0.50 to $2.00 per careful judgment. A judge call on a small model costs around $0.002 and returns in under 2 seconds, so even run one at a time it finishes dozens of calls in the time the human takes to read one transcript. At 50,000 conversations a week, one of those numbers is a budget line and the other is a rounding error.
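As a rough check on that arithmetic, the sketch below plugs in the figures quoted above: 50,000 conversations a week, about $0.002 per judge call, and $0.50 to $2.00 per human judgment. The names are illustrative and the prices are this article's estimates, not quotes from any provider.

```python
# Weekly evaluation cost at the volumes quoted above.
# All figures are the article's rough estimates; names are illustrative.
CONVERSATIONS_PER_WEEK = 50_000
JUDGE_COST_PER_CALL = 0.002          # dollars, small judge model
HUMAN_COST_RANGE = (0.50, 2.00)      # dollars per careful human judgment


def weekly_cost(cost_per_item: float, volume: int = CONVERSATIONS_PER_WEEK) -> float:
    """Cost of scoring every conversation once at a fixed per-item price."""
    return round(cost_per_item * volume, 2)


judge_weekly = weekly_cost(JUDGE_COST_PER_CALL)
human_weekly = tuple(weekly_cost(c) for c in HUMAN_COST_RANGE)

print(f"judge: ${judge_weekly:,.0f} per week")
print(f"human: ${human_weekly[0]:,.0f} to ${human_weekly[1]:,.0f} per week")
```

Under these assumptions, full coverage costs about $100 a week with a judge and $25,000 to $100,000 a week with human reviewers.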

So teams reach for it, wire up a prompt that says "rate this answer 1 to 10," point it at production traffic, and trust the dashboard. Then a quarter later someone notices the scores never move, or they move for reasons nobody can explain. The judge was real. The trust was not. This is a guide to closing that gap — what an LLM judge actually measures, the three biases that quietly corrupt it, and how to calibrate one until its numbers mean something.

We have spent the better part of two years building and breaking these pipelines, and almost every failure traces back to the same root: a judge deployed before anyone checked whether it agreed with a human.

Why Human Review Stopped Being an Option

Start with the thing the judge replaces. Manual review is the gold standard and always will be. A domain expert reading a transcript catches nuance no rubric encodes. The problem is throughput.

A reviewer working carefully gets through maybe 30 to 60 transcripts an hour. A mid-size support deployment generates 8,000 to 20,000 conversations a day. The math doesn't close. You would need a team of around 40 people doing nothing but reading chat logs, and any smaller team falls behind: by the time it finished Monday's traffic, it would already be Thursday. Quality signal that arrives three days late is not a quality signal. It's a post-mortem.
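The headcount estimate can be reproduced from the two ranges in that paragraph. This is a back-of-the-envelope sketch: the 8-hour reviewing day is an assumption added here, and the function name is illustrative.

```python
import math

# Reviewer headcount needed to read every conversation the same day.
# Throughput and volume ranges come from the paragraph above; the 8-hour
# reviewing day is an assumption added for illustration.
TRANSCRIPTS_PER_HOUR = (30, 60)
CONVERSATIONS_PER_DAY = (8_000, 20_000)
HOURS_PER_DAY = 8


def reviewers_needed(daily_volume: int, per_hour: int, hours: int = HOURS_PER_DAY) -> int:
    """Smallest whole number of reviewers that clears one day's volume."""
    return math.ceil(daily_volume / (per_hour * hours))


# Low volume with fast reviewers, then high volume with slow reviewers.
best_case = reviewers_needed(CONVERSATIONS_PER_DAY[0], TRANSCRIPTS_PER_HOUR[1])
worst_case = reviewers_needed(CONVERSATIONS_PER_DAY[1], TRANSCRIPTS_PER_HOUR[0])
print(best_case, worst_case)
```

With these inputs the range runs from 17 to 84 full-time reviewers, so a team of about 40 sits in the middle of it.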

Sampling helps but lies by omission. Review 1% and you are blind to the failure that lives in the other 99% — the one specific intent, the one customer segment, the one prompt variant that quietly regressed. The whole reason you instrument LLM observability is to see every turn. Evaluating 1% of those turns throws that visibility away the moment you need it most.

LLM-as-a-judge is the only thing that runs at the same scale as the traffic. That is its entire claim to existence. It is not better than a human. It is present when a human cannot be.

What an LLM Judge Is Actually Made Of

A judge is three things, and getting any one of them wrong quietly poisons the output.

[Figure: an LLM judge. User input, model response, and a yes/no rubric feed a schema-bound judge model, which returns a typed score (numeric, categorical, or boolean) that is written back to the trace store, linked to the prompt version and model under test.]

The judge model

The model doing the scoring. It does not have to be the biggest model you own — for most rubric checks a mid-tier model is plenty — but it must reliably return structured output. If your judge sometimes answers "I'd rate this around a 7, though it depends" instead of {"score": 7}, your pipeline breaks on parse errors at 3am. Pick a model with a real structured-output or function-calling mode and constrain it to a schema.
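Even with a provider's structured-output mode switched on, it can help to validate the reply on receipt as a second line of defence. The sketch below assumes the judge is asked for a JSON object with a single integer `score` field; the function name, the exception class, and the 1 to 10 range are all illustrative choices, not part of any particular API.

```python
import json


class JudgeOutputError(ValueError):
    """Raised when a judge reply does not match the expected schema."""


def parse_score(raw: str, lo: int = 1, hi: int = 10) -> int:
    """Accept only a JSON object of the form {"score": <int in [lo, hi]>}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise JudgeOutputError(f"not JSON: {raw!r}") from exc
    if not isinstance(data, dict) or set(data) != {"score"}:
        raise JudgeOutputError(f"unexpected shape: {data!r}")
    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise JudgeOutputError(f"score must be an integer: {score!r}")
    if not lo <= score <= hi:
        raise JudgeOutputError(f"score out of range: {score}")
    return score
```

A reply of `{"score": 7}` parses to `7`, while the free-text reply quoted above raises `JudgeOutputError`, which the pipeline can count and alert on instead of crashing.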

The criteria

The rubric. This is where most bad judges go wrong. "Rate the helpfulness from 1 to 10" feels specific and is almost meaningless: ask five people to score the same answer on that scale and you can easily get five different numbers, several points apart. A judge inherits that ambiguity and adds noise on top.

The fix is to make criteria binary and concrete. Not "rate helpfulness 1-10" but "Does the response answer every part of the question? yes/no. Does it invent any fact not present in the retrieved context? yes/no. Does it tell the user to do something unsafe? yes/no." Each question has one defensible answer. A judge scoring a checklist of yes/no questions is dramatically more consistent than the same judge guessing at a 10-point scale.
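One way to represent such a checklist in code is shown below. It is a minimal sketch: the `Criterion` type, the keys, and the helper names are hypothetical, and the three questions are the examples from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    key: str
    question: str
    passing_answer: bool  # the answer a good response should receive


# Hypothetical rubric built from the three example questions above.
RUBRIC = [
    Criterion("complete", "Does the response answer every part of the question?", True),
    Criterion("hallucination", "Does it invent any fact not present in the retrieved context?", False),
    Criterion("unsafe", "Does it tell the user to do something unsafe?", False),
]


def render_checklist(rubric: list[Criterion]) -> str:
    """Format the rubric as numbered yes/no questions for the judge prompt."""
    return "\n".join(f"{i}. {c.question} (yes/no)" for i, c in enumerate(rubric, 1))


def failed_criteria(answers: dict[str, bool], rubric: list[Criterion]) -> list[str]:
    """Return keys of criteria whose judged answer differs from the passing one."""
    return [c.key for c in rubric if answers[c.key] != c.passing_answer]
```

Keeping the expected answer next to each question matters because "yes" is the good outcome for completeness and the bad outcome for the other two checks.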

The score type

What comes back. Three shapes cover almost everything:

Numeric — a bounded number, usually 0 to 1, for graded dimensions like faithfulness or relevance.
Categorical — a label from a fixed set, like correct / partially_correct / wrong.
Boolean — a yes/no for a policy check, like "contains PII" or "refused appropriately."

Type your scores explicitly and store them next to the trace that produced them. The moment a quality metric drops, you want to pull the exact conversations that caused it, filtered by score, linked to the prompt version and model that produced them.
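A typed score record might look like the following. The field names, the `kind` strings, and the category set are assumptions made for this sketch, not the schema of any specific trace store.

```python
from dataclasses import dataclass
from typing import Union

ScoreValue = Union[float, str, bool]

# Example label set taken from the categorical example above.
CATEGORIES = {"correct", "partially_correct", "wrong"}


@dataclass(frozen=True)
class Score:
    trace_id: str        # the trace this score belongs to
    name: str            # e.g. "faithfulness"
    kind: str            # "numeric", "categorical", or "boolean"
    value: ScoreValue
    prompt_version: str
    model: str

    def __post_init__(self) -> None:
        # Reject values that do not match the declared score type.
        if self.kind == "numeric":
            ok = (isinstance(self.value, (int, float))
                  and not isinstance(self.value, bool)
                  and 0.0 <= self.value <= 1.0)
        elif self.kind == "categorical":
            ok = isinstance(self.value, str) and self.value in CATEGORIES
        elif self.kind == "boolean":
            ok = isinstance(self.value, bool)
        else:
            ok = False
        if not ok:
            raise ValueError(f"{self.kind} score rejects value {self.value!r}")
```

Rejecting a mistyped value at write time is cheaper than discovering months later that a "numeric" column contains a mix of 0 to 1 values and 1 to 10 values.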

The Three Biases That Will Burn You

Here is the part the demos skip. LLM judges have systematic, measurable biases. They were documented carefully in the 2023 MT-Bench work by Zheng and colleagues, and they have not gone away because the architecture that causes them has not changed. If you do not control for these, your scores are confidently wrong.

Position bias

Show a judge two responses and ask which is better, and it favors the one you showed first — purely because it was first. In the MT-Bench study the effect was large enough that swapping the order flipped the verdict a meaningful fraction of the time. The control is brutal but simple: run every pairwise comparison twice, A-then-B and B-then-A, and only count it as a win if the same answer wins both orderings. Disagreements get marked a tie. You pay double the judge calls and you get a number you can defend.
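The order-swap control described above can be sketched as a small wrapper. The judge signature, which takes a question and two answers and returns "first" or "second", is an assumption for illustration, and the two toy judges stand in for real model calls.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns
# "first" or "second". The signature is an assumption for this sketch.
PairwiseJudge = Callable[[str, str, str], str]


def debiased_verdict(judge: PairwiseJudge, question: str, a: str, b: str) -> str:
    """Run both orderings; return "A", "B", or "tie" if the verdicts disagree."""
    forward = judge(question, a, b)    # A shown first
    backward = judge(question, b, a)   # B shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def knows_answer(question: str, first: str, second: str) -> str:
    """Toy judge that ignores order and picks the answer containing '42'."""
    return "first" if "42" in first else "second"
```

A purely position-biased judge such as `always_first` collapses to "tie" on every pair, while an order-independent judge such as `knows_answer` returns the same winner whichever answer is shown first.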

Verbosity bias

Judges like long answers. Given two responses of equal correctness, the wordier one tends to score higher, even when the extra words add nothing — one analysis found a strong, consistent lean toward length regardless of quality. If your judge rubric does not explicitly say "do not reward length; a complete answer in two sentences beats a padded one in eight," you are training your product to ramble.
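One hedged way to check whether a judge rewards length is to correlate word count with judge score over a set of responses that reviewers rated as equally correct. This diagnostic is an addition for illustration and is not taken from the studies cited here; the function names are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for a constant sample")
    return cov / (sx * sy)


def length_score_correlation(responses: list[str], scores: list[float]) -> float:
    """Correlate word count with judge score across a set of responses."""
    lengths = [float(len(r.split())) for r in responses]
    return pearson(lengths, scores)
```

On equally correct answers, a correlation near zero is what a length-neutral judge should produce. A clearly positive value suggests the rubric needs an explicit instruction not to reward length.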

Self-enhancement bias

A judge tends to prefer text written by itself or by models in its own family. In the MT-Bench study, GPT-4 favored its own answers with a roughly 10% higher win rate and Claude-v1 favored its own by about 25%, although the authors cautioned that their data was too limited to settle the question. The implication is uncomfortable: if you evaluate your GPT-powered product with a GPT judge, you have a conflict of interest baked into the measurement. Where it matters, cross-judge: score with a model from a different family than the one you ship.
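Cross-judging can be enforced in configuration rather than left to memory. The sketch below uses made-up model identifiers and family labels; a real deployment would fill the mapping from its own model inventory.

```python
# Hypothetical model identifiers and family labels, for illustration only.
MODEL_FAMILY = {
    "product-model-v2": "family_a",
    "judge-small": "family_a",
    "judge-alt": "family_b",
}


def pick_cross_family_judge(product_model: str, candidates: list[str]) -> str:
    """Return the first candidate judge whose family differs from the product's."""
    product_family = MODEL_FAMILY[product_model]
    for judge in candidates:
        if MODEL_FAMILY[judge] != product_family:
            return judge
    raise LookupError(f"no candidate judge outside {product_family}")
```

Raising an error when no cross-family judge is available makes the conflict of interest a visible configuration failure rather than a silent default.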

Pointwise, Pairwise, or Reference-Based

There are three ways to ask a judge a question, and they answer different questions.

Mode	What you ask	Best for	The catch
Pointwise	"Score this one answer, 0–1"	Production monitoring at scale	Absolute scores drift; a 0.7 today and a 0.7 next month may not mean the same thing
Pairwise	"Which of these two is better?"	Comparing a new prompt or model against the current one	Costs 2× from order-swapping; gives you a preference, not an absolute
Reference-based	"How close is this to the known-good answer?"	Cases with a ground-truth answer	You need the references, which is the expensive part

Most mature pipelines run pointwise judges in production for continuous monitoring, and switch to pairwise when they are deciding whether a change actually improved things. Reference-based grading is reserved for the curated eval set you freeze and re-run on every release.

Calibrating a Judge You Can Trust

A judge you have not calibrated is a random number generator with good manners. Calibration is the work that turns it into a measurement. It is not glamorous, and it is the step everyone skips.

The anchor is human agreement. The MT-Bench result that made this whole field credible was a specific one: a strong judge agreed with human preferences over 80% of the time — and critically, two humans only agreed with each other about 81% of the time on the same task. The judge was, within the noise, as consistent with people as people were with each other. That is the bar. You do not need a judge that matches a Platonic ideal of correctness. You need one that agrees with your experts about as often as your experts agree with one another.

To get there:

1. Have domain reviewers label a few hundred real examples by hand. This is your ground truth. It is the most valuable artifact in the whole pipeline and the one teams refuse to spend three days building.
2. Run your candidate judge against those same examples.
3. Measure agreement. If the judge matches your humans on 80%+ of cases, ship it. If it sits at 60%, your rubric is ambiguous or your model is too weak. Fix the rubric first; it is almost always the rubric.
4. Re-check quarterly. Models get silently updated, traffic patterns shift, and a judge that was calibrated in March can drift by September.
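The agreement measurement in the steps above can be sketched in a few lines. Raw agreement is the number the text refers to. Cohen's kappa is an addition here, not something the source prescribes: it corrects for chance agreement, which matters when one label dominates the set.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for chance; 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Probability that both raters pick the same label independently.
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1.0 - expected)
```

For example, if humans label 8 of 10 cases "yes" and the judge labels 7 of those same 8 "yes" and all the rest "no", raw agreement is 0.9 while kappa is about 0.74.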

Teams with years of experience running these pipelines treat the human-labeled set as a permanent asset, version it like code, and grow it every time the judge and a human disagree in production. The disagreements are the gold. Each one is a case your rubric did not anticipate.

How to Build It: Six Steps

1. Start with one evaluator on your highest-traffic check. Do not try to score twelve dimensions on day one. Pick the single failure mode that hurts most (usually faithfulness or task completion) and build one judge for it.
2. Write the rubric as yes/no questions. Replace every 1-10 scale with a checklist of binary, concrete criteria. Test the rubric on ten hand-picked hard cases before it touches production.
3. Constrain the output to a schema. Force structured output. No free text. Parse failures should be impossible by construction, not handled after the fact.
4. Validate against a live preview before deploy. Run the judge against real historical traces and read the results yourself. You are looking for cases where the judge's score and your gut disagree; those are rubric bugs.
5. Control for the biases. Order-swap every pairwise call. Tell the rubric not to reward length. Use a cross-family judge where self-preference would matter.
6. Close the loop with the trace store. Every judge run should write its own trace, scored at the observation level and linked back to the prompt version and model under test, so a drop in quality is one query away from the conversations that caused it.
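The last step, keeping a quality drop one query away from its conversations, might look like this against an in-memory list. A real trace store would run the same filter as a database query; the record fields, identifiers, and threshold are hypothetical.

```python
from typing import NamedTuple


class JudgeRun(NamedTuple):
    trace_id: str          # conversation or observation that was scored
    prompt_version: str    # version of the prompt under test
    model: str             # model under test
    score_name: str
    value: float           # numeric score in [0, 1]


def low_scoring_traces(runs: list[JudgeRun], score_name: str,
                       prompt_version: str, threshold: float) -> list[str]:
    """Trace ids for one prompt version whose score falls below the threshold."""
    return [r.trace_id for r in runs
            if r.score_name == score_name
            and r.prompt_version == prompt_version
            and r.value < threshold]


runs = [
    JudgeRun("t-1", "v7", "model-a", "faithfulness", 0.92),
    JudgeRun("t-2", "v8", "model-a", "faithfulness", 0.41),
    JudgeRun("t-3", "v8", "model-a", "faithfulness", 0.88),
]
print(low_scoring_traces(runs, "faithfulness", "v8", threshold=0.5))
```

Here the query prints `['t-2']`, the one conversation on prompt version v8 that scored below the threshold.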

Where Judges Still Lie

Be honest about the ceiling. An LLM judge is excellent at scalable, rubric-shaped questions: is this faithful to the source, is it on-topic, did it refuse when it should have. It is weak exactly where humans are strong: genuine novelty, subtle domain expertise, anything where the "correct" answer is contested by experts.

For high-stakes domains — medical, legal, financial advice — a judge is a triage filter, not a verdict. Use it to surface the 3% of conversations that need a human and route those to a review queue, where the scores land in the same data model as the automated ones. The goal was never to remove the human. It was to make sure the human looks at the right 3%.
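A triage filter of this kind can be as simple as routing the lowest-scoring slice to the review queue. The sketch assumes a numeric score where lower means worse and uses the 3% figure from the text as the default. Ranking purely by score is one possible policy, not the only one.

```python
import math


def review_queue(scores: dict[str, float], fraction: float = 0.03) -> list[str]:
    """Return the lowest-scoring trace ids, covering `fraction` of all traces."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Round before ceil so float noise (e.g. 7.000000000000001) adds no extra item.
    k = math.ceil(round(len(scores) * fraction, 9))
    # Sort by score, then by id, so ties are broken deterministically.
    ranked = sorted(scores, key=lambda trace_id: (scores[trace_id], trace_id))
    return ranked[:k]
```

With 100 scored traces and the default fraction, the three lowest-scoring conversations are returned for human review.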

Frequently Asked Questions

What is LLM-as-a-judge?

LLM-as-a-judge is an evaluation technique where one language model scores the output of another against a defined rubric, returning a numeric, categorical, or boolean score. It runs asynchronously at production scale — thousands of traces per minute at roughly $0.002 per call — which is what makes continuous quality monitoring affordable where human review is not.

Is LLM-as-a-judge accurate?

A well-calibrated judge agrees with human raters over 80% of the time on rubric-shaped tasks — about the same rate two humans agree with each other. Accuracy depends almost entirely on the rubric: binary, concrete yes/no criteria produce reliable judges, while open-ended 1-10 scales produce noise. It is reliable for faithfulness, relevance, and policy checks, and weak for contested or highly specialized judgments.

What are the main biases in LLM judges?

Three are well documented: position bias (favoring whichever response is shown first), verbosity bias (favoring longer answers regardless of quality), and self-enhancement bias (favoring text from the judge's own model family). Each is controllable — order-swapping, explicit length-neutral rubrics, and cross-family judging respectively.

How is LLM-as-a-judge different from agent evaluation?

LLM-as-a-judge scores a single generation — one answer, one summary, one classification. Agent evaluation scores a multi-step trajectory: the sequence of tool calls, retrievals, and decisions an agent makes to reach an outcome. The judge is one scoring tool used inside the larger agent-evaluation harness, not a replacement for it.

Should I use the same model as judge and as the product?

Avoid it where the score carries weight. Self-enhancement bias means a model grades its own family a few points high, which is a conflict of interest in your measurement. Use a judge from a different model family than the one you ship, or at minimum validate that the two agree on your human-labeled set.

How many human labels do I need to calibrate a judge?

A few hundred carefully labeled real examples is enough to measure agreement reliably for one dimension. Quality matters far more than quantity: 300 examples labeled by a domain expert beat 3,000 labeled by a rushed contractor. Grow the set over time by adding every case where the judge and a human disagree in production.
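A rough way to see why a few hundred labels is a workable size is the standard normal-approximation margin of error for a proportion. This calculation is an addition for illustration: it assumes independently sampled examples and is not taken from the cited papers.

```python
import math


def agreement_margin(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for a proportion."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("p must be in [0, 1] and n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


# Margin of error around an observed 80% agreement at three set sizes.
for n in (100, 300, 1000):
    print(n, round(agreement_margin(0.80, n), 3))
```

At an observed 80% agreement, the margin is about 7.8 points with 100 labels, 4.5 points with 300, and 2.5 points with 1,000, so going from 300 to 1,000 labels buys relatively little extra precision.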

References

Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (the foundational study of judge agreement and bias). https://arxiv.org/abs/2306.05685
Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." https://arxiv.org/abs/2303.16634
Ragas — evaluation framework for retrieval-augmented generation, including model-graded faithfulness. https://github.com/explodinggradients/ragas
OpenAI Evals — framework for evaluating LLMs and LLM systems. https://github.com/openai/evals
Anthropic — guidance on building and grading evaluations for production systems. https://www.anthropic.com/research