LLM evaluation is the discipline of measuring whether a model-powered system actually does its job — before you pick a model, before every release, and continuously in production. Every mature program reduces to three layers: offline capability evals that select the model, regression evals that gate each change, and online monitoring that scores real traffic. The measurement itself has gotten cheap. An automated judge call runs about $0.002 against $0.50 to $2.00 for a careful human judgment, so budget is no longer the constraint. The constraint is the program: who owns the eval dataset, which numbers block a release, and what happens when they move.

Most writing about LLM evaluation gets this backwards: long metric catalogs, near silence on the program. We have watched teams build beautiful dashboards that changed nothing, because no number on them ever blocked a deploy.

This guide is the map, not the territory. We publish deep dives on judge calibration, agent trajectory testing, RAG evaluation, and production observability; this page connects them — one taxonomy, one table mapping system type to metrics, and the organizational spine that metric catalogs skip.

Why Most LLM Evaluation Programs Fail

Ask a team to show their evaluation and you get a dashboard. Ask which number blocks a release and you get a pause.

That pause is the diagnosis.

We call the underlying failure the metrics zoo: twelve metrics, tastefully charted, none wired to a decision. The release call is still made the old way — a product manager eyeballing five outputs at 6pm on deploy day.

The second failure is quieter. The eval set (inputs, expected behavior, labels) lives in a spreadsheet on one engineer's laptop, unversioned and not linked to any of the model upgrades it was used to test. When that engineer changes teams, the company's memory of what "good" means walks out the door with them.

Both failures share a root: treating evaluation as a tooling problem when it is an asset problem. Tools churn. In 2026 alone, OpenAI acquired promptfoo in March and deprecated its own Evals platform in June (full shutdown November 30, 2026), while Stanford's HELM framework entered maintenance mode. Your labeled dataset is different: a few hundred cases that encode what your product must do, labeled by your own experts and versioned like code, outlive every framework migration. The dataset is the program. Everything else is plumbing.

The Three Layers of LLM Evaluation

Every evaluation activity answers one of three questions, and mixing them up is how teams end up running benchmarks when they need a regression test.

Diagram of the three layers of LLM evaluation: offline capability evals answer whether a model can do the task at all and run at model selection; regression evals answer whether a change made the product worse and run in CI on every prompt, model, or config change; online monitoring answers whether quality is holding on real traffic and runs continuously on sampled production traces.

| Layer | The question it answers | When it runs | Typical method |
| --- | --- | --- | --- |
| Offline capability evals | Can this model do the task at all? | Model selection, major version upgrades | Public benchmarks plus a task-specific accuracy set |
| Regression evals | Did this change make the product worse? | Every prompt, model, or config change, in CI | Frozen, versioned eval set scored against the production baseline |
| Online monitoring | Is quality holding on real traffic? | Continuously, on sampled production traces | Calibrated judges plus user signals |

Offline capability evals: choosing the horse

Capability evals compare models in general — the world of public benchmarks. EleutherAI's lm-evaluation-harness is the standard open-source runner; suites like HELM aimed for breadth. Their one job: shortlisting a model family.

Two warnings. Public benchmarks leak into training data, so scores inflate. And your domain is not on the test — a 0.92 on a public QA benchmark says nothing about your refund policy or contract corpus. Anthropic's guidance on building evals is blunt: define specific, measurable success criteria for your task before testing anything. Capability evals shortlist the model. Your own set decides.

Regression evals: guarding the barn

The layer most teams are missing, and the one with the highest return. A regression eval runs a frozen, versioned set of cases — real inputs, machine-checkable success conditions where possible — on every change: prompt edits, model bumps, retrieval config, tool schemas. All of it. LLM systems have no clean line between code and config, so the gate trips on both.

Gate against the current production baseline, not an absolute floor. "Is v9 worse than v8 on the frozen set" is answerable and useful. "Is v9 above 0.8" is a number someone made up in a meeting.
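A minimal sketch of a baseline-relative gate, assuming both versions have been scored pass/fail on the same frozen cases. The names `pass_rate` and `regressed` and the `tolerance` parameter are illustrative assumptions, not part of any framework.

```python
from typing import Mapping


def pass_rate(results: Mapping[str, bool]) -> float:
    """Fraction of cases that passed."""
    if not results:
        raise ValueError("empty result set")
    return sum(results.values()) / len(results)


def regressed(baseline: Mapping[str, bool],
              candidate: Mapping[str, bool],
              tolerance: float = 0.0) -> bool:
    """True if the candidate pass rate is below the baseline by more than `tolerance`."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both versions must be scored on the same frozen cases")
    return pass_rate(candidate) < pass_rate(baseline) - tolerance


# Hypothetical results for two prompt versions on a four-case frozen set.
v8 = {"case-1": True, "case-2": True, "case-3": False, "case-4": True}
v9 = {"case-1": True, "case-2": False, "case-3": False, "case-4": True}
print(regressed(v8, v9))
```

The comparison never mentions an absolute threshold: the only question asked is whether v9 is worse than v8 on identical inputs.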

Online monitoring: reality's vote

No curated set anticipates what real users type. Online monitoring continuously runs your evaluators against a sample of live traces, plus the signals only production generates: retries, escalations, thumbs-down. More on it below.

What to Measure, by System Type

The metrics that matter depend on what you built. A summarizer, a RAG pipeline, and a tool-calling agent fail in different ways. The table is the dispatch layer; each row has a full guide behind it.

| System type | Core question | Primary metrics | Scoring method | Deep dive |
| --- | --- | --- | --- | --- |
| Single generation (summarize, classify, draft) | Is the output correct, faithful, on-policy? | Task accuracy, faithfulness, policy-violation rate | Deterministic checks plus a calibrated judge | LLM-as-a-judge |
| RAG pipeline | Did retrieval fetch the right chunks? Is the answer grounded in them? | Recall@k, precision@k, faithfulness, context recall | IR metrics on a labeled gold set; judge for groundedness | RAG evaluation |
| Agent | Right outcome, sane path, sane cost? | Task success rate, trajectory correctness, cost per task, p95 latency | End-state assertions, trace rules, judge for subjective dimensions | Agent evaluation |

Single-generation systems

The simplest case, and where you learn the scoring hierarchy that applies everywhere: deterministic checks first, judges second. If a verifiable answer exists — an exact value, a schema, a label — assert on it programmatically; deterministic checks are cheap, stable, and impossible to sweet-talk. Save the judge for the genuinely subjective dimensions.
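The hierarchy can be sketched as a scorer that only spends a judge call once the cheap checks pass. This is an illustration under assumptions: the output is assumed to be JSON with a `label` field, and `judge` is a hypothetical callable standing in for whatever calibrated judge you use.

```python
import json
from typing import Callable, Optional


def score_output(output: str,
                 expected_label: Optional[str],
                 judge: Callable[[str], bool]) -> dict:
    """Run deterministic checks first; call the judge only if they all pass."""
    # Check 1: the output must parse as a JSON object with a "label" field.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "invalid_json", "judge_called": False}
    if not isinstance(parsed, dict) or "label" not in parsed:
        return {"passed": False, "reason": "missing_label", "judge_called": False}
    # Check 2: exact match wherever a verifiable answer exists.
    if expected_label is not None and parsed["label"] != expected_label:
        return {"passed": False, "reason": "wrong_label", "judge_called": False}
    # Only now spend a judge call on the subjective dimension.
    return {"passed": judge(output), "reason": "judge", "judge_called": True}


def stub_judge(text: str) -> bool:
    """Placeholder for a real judge call; always approves."""
    return True


print(score_output('{"label": "refund"}', "refund", stub_judge))
print(score_output("not json", "refund", stub_judge))
```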

RAG pipelines: split the scorecard

Most RAG failures are retrieval failures, so grade retrieval and generation separately. Retrieval is a pure information-retrieval problem — recall@k and precision@k against a labeled gold set, no LLM required, milliseconds to run. Generation is graded conditioned on what was retrieved; if recall is low, no prompt engineering will save you. A focused gold set of 150 to 300 real queries with careful labels beats thousands of synthetic ones. Metrics, gold-set construction, and component ablations live in the enterprise RAG evaluation guide, with architecture context in the enterprise RAG pillar.
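For reference, the two retrieval metrics take only a few lines. This sketch assumes each chunk has a unique string ID and that the gold set lists the relevant chunk IDs per query; the IDs below are made up for illustration.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("gold set has no relevant chunks for this query")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top k results that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


# Hypothetical ranked retrieval output and gold labels for one query.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
print(recall_at_k(retrieved, relevant, k=5))
print(precision_at_k(retrieved, relevant, k=5))
```

Here two of the three relevant chunks are retrieved in the top five, so recall@5 is 2/3 and precision@5 is 2/5. The missing chunk `c8` is a retrieval failure no generation prompt can repair.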

Agents: score the trajectory, not just the answer

An agent does not produce one answer — it produces a sequence of tool calls and decisions, and outcome and path are separate axes. An agent can reach the right answer by a wrong path. Agents are also non-deterministic, so a single pass/fail run is noise; run each case several times and report rates. And error compounds: a step that is 90% reliable, chained ten deep, drops end-to-end success below 35%. Per-step accuracy is a trap. Trajectory scoring, sandboxed tools, and replay testing live in the agent evaluation guide, with wider context in the agentic AI pillar.
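The compounding claim is plain arithmetic if every step must succeed and step failures are assumed independent, which real agents only approximate. The sketch below also shows reporting a rate over repeated runs; the run outcomes are invented for illustration.

```python
from typing import Sequence


def chained_success(step_reliability: float, steps: int) -> float:
    """End-to-end success when every step must succeed and failures are independent."""
    return step_reliability ** steps


def success_rate(run_outcomes: Sequence[bool]) -> float:
    """Fraction of repeated runs of one case that succeeded."""
    if not run_outcomes:
        raise ValueError("need at least one run")
    return sum(run_outcomes) / len(run_outcomes)


print(round(chained_success(0.9, 10), 3))  # 0.349

# Hypothetical outcomes of running the same agent case five times.
print(success_rate([True, False, True, True, False]))  # 0.6
```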

Judges and the Calibration Bar

You cannot label faithfulness at scale by hand, so every serious pipeline ends up with an LLM evaluator scoring outputs against a rubric. Two rules before you hand any decision to one.

Rule one: binary rubrics. "Rate helpfulness 1 to 10" produces noise — five humans give the same answer five different scores. A checklist of concrete yes/no questions produces signal.
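One way to represent a binary rubric, sketched with made-up questions. The `RubricItem` type and the rule that every item must pass are assumptions for illustration; your own rubric may weight or group items differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RubricItem:
    key: str
    question: str  # must be answerable with yes or no


# Illustrative checklist; real items come from your own product requirements.
RUBRIC: List[RubricItem] = [
    RubricItem("answers_question", "Does the response answer the user's question?"),
    RubricItem("cites_policy", "Does the response cite the relevant policy section?"),
    RubricItem("no_invented_facts", "Is every factual claim supported by the provided context?"),
]


def score_checklist(answers: Dict[str, bool]) -> dict:
    """Combine yes/no answers into a verdict; every item must be answered."""
    missing = [item.key for item in RUBRIC if item.key not in answers]
    if missing:
        raise ValueError(f"unanswered rubric items: {missing}")
    failed = [item.key for item in RUBRIC if not answers[item.key]]
    return {"passed": not failed, "failed_items": failed}


print(score_checklist({"answers_question": True,
                       "cites_policy": False,
                       "no_invented_facts": True}))
```

The output names which question failed, which is what makes a checklist actionable where a "6 out of 10" is not.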

Rule two: the calibration bar is human agreement, and the bar has a number. In the MT-Bench study, a strong judge agreed with human preferences over 80% of the time — while two humans agreed with each other about 81% of the time on the same task. Within the noise, judge and human were equally consistent. That is the standard: measure agreement on a few hundred human-labeled cases before a judge's scores gate anything.
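Measuring that agreement is a short computation once the human labels exist. The sketch below assumes binary verdicts from both judge and human on the same cases; the function names are hypothetical and the labels are invented. It computes raw agreement only and does not correct for chance agreement.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of cases where the judge verdict matches the human verdict."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def judge_may_gate(judge_labels: Sequence[bool],
                   human_labels: Sequence[bool],
                   bar: float = 0.80) -> bool:
    """True if judge-human agreement meets the calibration bar."""
    return agreement_rate(judge_labels, human_labels) >= bar


# Ten hypothetical cases; the judge disagrees with the human on two of them.
human = [True, True, False, True, False, True, True, False, True, True]
judge = [True, True, False, True, True, True, True, False, False, True]
print(agreement_rate(judge, human), judge_may_gate(judge, human))
```

In practice the sample should be the few hundred labeled cases the text calls for; ten cases are far too few to trust the estimate.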

Judges also carry three documented biases — position, verbosity, and self-preference. The LLM-as-a-judge guide owns the mitigations and the full calibration workflow. The one-sentence version: an uncalibrated judge is not a measurement, it is decoration.

Wiring LLM Evaluation into the Release Path

An eval suite that cannot fail a build is a dashboard with extra steps.

Diagram of the evaluation release loop: a prompt or model change passes a CI regression gate scored against the frozen eval set, ships through canary to production, where sampled judge scores and judge-human disagreements are harvested back into the next tagged version of the eval set — the dataset is the versioned asset and the tools around it are replaceable.

The gate design

  • Trigger on every change — prompts, model versions, retrieval config, tool schemas. Config is code here.
  • Tier the suite. Cheap deterministic checks on every commit; the full judged run on pull request and release.
  • Gate on a basket, not a blended score. Task success at or above baseline, no faithfulness regression beyond tolerance, p95 latency and cost-per-task ceilings, zero policy violations. One averaged number hides exactly the regression you most need to see.
  • Account for non-determinism. Repeat runs and gate on rates, or engineers learn to ignore a flaky gate.
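The basket can be sketched as a function that returns every failed condition rather than one blended score. The `RunMetrics` fields mirror the basket described above; the tolerance and ceiling values are placeholder assumptions that each team would set for itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunMetrics:
    task_success: float       # rate over repeated runs, 0..1
    faithfulness: float       # judged rate, 0..1
    p95_latency_s: float
    cost_per_task_usd: float
    policy_violations: int


def gate_failures(baseline: RunMetrics,
                  candidate: RunMetrics,
                  faithfulness_tolerance: float = 0.01,
                  max_p95_latency_s: float = 5.0,
                  max_cost_per_task_usd: float = 0.05) -> List[str]:
    """Return the list of failed gate conditions; an empty list means ship."""
    failures = []
    if candidate.task_success < baseline.task_success:
        failures.append("task_success below baseline")
    if candidate.faithfulness < baseline.faithfulness - faithfulness_tolerance:
        failures.append("faithfulness regression beyond tolerance")
    if candidate.p95_latency_s > max_p95_latency_s:
        failures.append("p95 latency above ceiling")
    if candidate.cost_per_task_usd > max_cost_per_task_usd:
        failures.append("cost per task above ceiling")
    if candidate.policy_violations > 0:
        failures.append("policy violations present")
    return failures


# Hypothetical numbers: task success improved, but two other conditions fail.
baseline = RunMetrics(0.82, 0.91, 3.2, 0.03, 0)
candidate = RunMetrics(0.84, 0.88, 3.0, 0.03, 1)
print(gate_failures(baseline, candidate))
```

In this example an averaged score could easily rise because task success went up, while the faithfulness drop and the policy violation would go unseen.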

The arithmetic is friendlier than teams expect. A 500-case regression suite scored on six judged dimensions is 3,000 judge calls — about $6 per full run at $0.002 a call. Run it on every pull request for a month and you still have not spent what one incident review costs.
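The same arithmetic as a few lines of Python, using the per-judgment prices quoted in this guide. `Decimal` is used only to keep the dollar amounts exact.

```python
from decimal import Decimal

CASES = 500
JUDGED_DIMENSIONS = 6
JUDGE_COST_USD = Decimal("0.002")      # per automated judge call
HUMAN_COST_LOW_USD = Decimal("0.50")   # per careful human judgment, low end
HUMAN_COST_HIGH_USD = Decimal("2.00")  # per careful human judgment, high end

calls = CASES * JUDGED_DIMENSIONS
judge_run_cost = calls * JUDGE_COST_USD
human_run_cost_low = calls * HUMAN_COST_LOW_USD
human_run_cost_high = calls * HUMAN_COST_HIGH_USD

print(f"{calls} judgments per full run")
print(f"judge: ${judge_run_cost}  human: ${human_run_cost_low} to ${human_run_cost_high}")
```

That is $6 for a full judged run against $1,500 to $6,000 for the same 3,000 judgments made by hand.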

The tooling

The current toolbox, verified as of July 2026: promptfoo runs declarative YAML evals with a GitHub Action that posts pass/fail comments on the pull request (now owned by OpenAI, still open source). DeepEval gives you pytest-style assertions so LLM evaluations sit next to your unit tests. Langfuse ties datasets and experiments to production traces. Pick any and hold it loosely — the dataset is portable; the framework is not the asset.

The wiring with the highest payoff after CI is version-linked comparison on real traffic: every generation records the prompt version that produced it, so you can compare score distributions for v7 and v8 on production inputs. The mechanics live in the prompt versioning guide.
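A minimal sketch of that comparison, assuming each logged generation is a record with a `prompt_version` and a judge `score` field. Both field names and the sample records are assumptions for illustration, not the schema of any particular tool.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def scores_by_version(records: Iterable[dict]) -> Dict[str, dict]:
    """Group judge scores by the prompt version that produced each output."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        grouped[record["prompt_version"]].append(record["score"])
    return {version: {"n": len(scores), "mean": mean(scores)}
            for version, scores in grouped.items()}


# Hypothetical production records scored by the same judge.
records = [
    {"prompt_version": "v7", "score": 1.0},
    {"prompt_version": "v7", "score": 0.0},
    {"prompt_version": "v7", "score": 1.0},
    {"prompt_version": "v7", "score": 1.0},
    {"prompt_version": "v8", "score": 1.0},
    {"prompt_version": "v8", "score": 1.0},
]
print(scores_by_version(records))
```

The sample counts are reported next to the means on purpose: a two-trace mean for v8 is not yet evidence of anything.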

Online Monitoring and Drift

Offline evals only test the questions you thought to ask. Production is where the other questions live.

Online monitoring runs your calibrated judges against a sample of live traces, every score linked to the prompt version and model that produced the output. That link turns a score drop into a diagnosis — the failing conversations are one query away. Drift arrives from three directions: silently updated models, shifting traffic, and a corpus growing past what the eval set covers.
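Sampling live traces can be made deterministic by hashing the trace ID, so that the same trace is always in or out of the sample regardless of which worker sees it. This is one possible approach, not a prescribed one; the `trace_id` format and the 5% rate are assumptions.

```python
import hashlib


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces based on a hash of the ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical trace IDs; count how many land in a 5% sample.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if in_sample(t, rate=0.05)]
print(len(sampled))
```

Each sampled trace would then be scored by the judge and stored with its prompt version and model, which is the link the diagnosis depends on.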

The judge-human disagreements are the gold. Route a small weekly sample to human review; every disagreement is a candidate case for the next eval-set version. Instrumentation lives in the LLM observability guide; for multi-step agent traces, see the agent observability glossary entry.
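Harvesting the disagreements can be as simple as a filter over the weekly review sample. The record fields (`input`, `output`, `judge_label`, `human_label`) are hypothetical; the human label is treated as ground truth for the candidate case.

```python
from typing import Iterable, List


def harvest_disagreements(reviewed: Iterable[dict]) -> List[dict]:
    """Turn judge-human disagreements into candidate cases for the next eval-set version."""
    candidates = []
    for record in reviewed:
        if record["judge_label"] != record["human_label"]:
            candidates.append({
                "input": record["input"],
                "output": record["output"],
                "label": record["human_label"],  # the human verdict wins
                "source": "judge_human_disagreement",
            })
    return candidates


# Hypothetical weekly review sample.
reviewed = [
    {"input": "q1", "output": "a1", "judge_label": True, "human_label": True},
    {"input": "q2", "output": "a2", "judge_label": True, "human_label": False},
    {"input": "q3", "output": "a3", "judge_label": False, "human_label": False},
]
print(harvest_disagreements(reviewed))
```

Candidates still need a review step before they enter a tagged dataset version; the filter only produces the queue.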

The Program That Sticks

Everything above is mechanics. Programs die for organizational reasons, so an LLM evaluation program needs a spine: an owner, a versioned asset, and a cadence.

One named owner

Not a committee, not "the team." One staff engineer or applied scientist owns the eval dataset, the gate thresholds, and the calendar. Committees produce metrics zoos. Owners produce release gates.

The eval set is a versioned asset

Tag it, freeze it, changelog it, review label changes the way you review code. Every production incident becomes an eval case within a week of the postmortem. Refresh a slice quarterly from live traffic. The expensive part is not compute. It is the two or three days of expert labeling that the first few hundred cases require: the highest-return line in the entire GenAI budget, and the step teams most reliably refuse to fund.
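One lightweight way to make "frozen" checkable is to record a content fingerprint alongside each tag, so any silent edit to a case or label shows up as a changed hash. This is a sketch of one option, assuming each case is a JSON-serializable dict with a unique `id`; it does not replace version control.

```python
import hashlib
import json
from typing import List


def dataset_fingerprint(cases: List[dict]) -> str:
    """SHA-256 over a canonical JSON encoding; independent of case order."""
    canonical = json.dumps(
        sorted(cases, key=lambda case: case["id"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical two-case eval set.
v1 = [
    {"id": "case-1", "input": "Can I get a refund after 30 days?", "label": "no"},
    {"id": "case-2", "input": "Where is my order?", "label": "lookup"},
]
relabeled = [dict(v1[0], label="yes"), v1[1]]

print(dataset_fingerprint(v1) == dataset_fingerprint(list(reversed(v1))))
print(dataset_fingerprint(v1) == dataset_fingerprint(relabeled))
```

A CI job can compare the fingerprint of the checked-out set against the value recorded in the changelog for that tag and fail if they differ.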

Cadence and the budget line

Recalibrate the judge quarterly — the model behind it will change whether you asked for it or not. Triage judge-human disagreements weekly. Put eval tokens on the budget as a named line item: the number is small, but invisible costs are the first ones cut.

Standing the whole thing up, in order:

  1. Write the decision rules first. Which numbers gate a release, at what thresholds, agreed before any tooling is chosen. One page.
  2. Build the first eval set from production traffic. A few hundred real cases, over-sampled from the hard tail, each with a machine-checkable success condition where one exists.
  3. Version it like code. Tag v1, freeze it, changelog every addition. This is the baseline all future changes are judged against.
  4. Automate the scoring. Deterministic checks first; one calibrated judge for the subjective dimension that hurts most, validated against human labels — 80% agreement is the bar.
  5. Put it in CI and let it fail builds. Deterministic subset on commit, full judged run on pull request and release, thresholds compared to the production baseline.
  6. Close the loop from production. Sample live traffic with the same judges, triage disagreements weekly, fold incidents into the next dataset version, recalibrate quarterly.

Six steps, one owner, one versioned dataset. When you need a level deeper, the four spokes — judges, agents, RAG, and monitoring — pick up where this map leaves off.

Frequently Asked Questions

What is LLM evaluation?

LLM evaluation is the practice of measuring whether an LLM-powered system meets defined quality bars, organized in three layers: offline capability evals that compare models, regression evals that gate every release against a frozen test set, and online monitoring that scores sampled production traffic. Mature programs combine deterministic checks, calibrated LLM judges, and targeted human review in one loop.

How is LLM evaluation different from LLM benchmarks?

Benchmarks — MMLU-style suites, HELM, the lm-evaluation-harness task library — are the offline capability layer: standardized model comparisons. They shortlist a model, but public scores suffer from training-data contamination and rarely resemble your domain. The regression set you build from your own traffic is the instrument that actually gates a release.

How many test cases does an LLM evaluation set need?

A few hundred carefully labeled real cases is enough to start, and beats thousands of sloppy synthetic ones. RAG gold sets follow the same rule — roughly 150 to 300 stratified queries with labeled relevant chunks. Grow the set by folding in production incidents and judge-human disagreements, not by bulk generation.

Can LLM evaluation be fully automated?

Mostly, not entirely. Deterministic checks and calibrated judges can score every release and a sample of all production traffic without a human in the loop. Humans stay in three places: labeling the ground-truth set, recalibrating the judge quarterly, and reviewing the high-stakes slice where a judge is a triage filter rather than a verdict.

What metrics should gate an LLM release?

A basket, never one number: task success at or above the production baseline on the frozen set, no faithfulness regression beyond tolerance, p95 latency and cost-per-task ceilings, and zero policy violations. Gate per stratum as well, so a regression on hard cases cannot hide behind an easy-case average.
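Per-stratum gating can be sketched as a comparison of success rates within each slice of the frozen set. The stratum names and rates below are invented, and `stratum_regressions` is a hypothetical helper, not a function from any library.

```python
from typing import Dict, List


def stratum_regressions(baseline: Dict[str, float],
                        candidate: Dict[str, float],
                        tolerance: float = 0.0) -> List[str]:
    """Return the strata where the candidate is below baseline by more than `tolerance`."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both versions must report the same strata")
    return sorted(name for name in baseline
                  if candidate[name] < baseline[name] - tolerance)


# Hypothetical task-success rates per stratum.
baseline = {"easy": 0.95, "hard": 0.60}
candidate = {"easy": 0.99, "hard": 0.50}
print(stratum_regressions(baseline, candidate))  # ['hard']
```

If easy cases dominate the set, the blended success rate for this candidate could go up even though the hard slice got worse.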

How is evaluating agents different from evaluating a single LLM call?

A single call is scored as one input-output pair against a reference or rubric. An agent is scored as a trajectory — the sequence of tool calls and decisions — where outcome and process are separate axes, runs are non-deterministic, and tool side-effects require sandboxed environments. That adds metrics a single call never needs, like trajectory correctness and cost per task.

References

  1. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." https://arxiv.org/abs/2306.05685
  2. Anthropic — "Define success criteria and build evaluations." https://docs.anthropic.com/en/docs/test-and-evaluate/develop-tests
  3. promptfoo — declarative LLM evaluation with CI/CD integration. https://github.com/promptfoo/promptfoo
  4. Ragas — evaluation framework for retrieval-augmented generation pipelines. https://github.com/explodinggradients/ragas
  5. EleutherAI — lm-evaluation-harness, the standard open-source benchmark runner. https://github.com/EleutherAI/lm-evaluation-harness
  6. DeepEval — pytest-style LLM evaluation framework with CI support. https://github.com/confident-ai/deepeval