AI agent evaluation is harder than model evaluation because an agent does not produce one answer you can score against a reference — it takes a sequence of actions, calls tools, reads their results, and decides what to do next. Where you grade a language model on a single input-output pair, you have to grade an agent on an entire trajectory that branches differently every run, mutates external state through tool side-effects, and can reach the right answer by a wrong path (or the wrong answer by a plausible one).
That distinction is the whole reason a separate discipline of agent testing exists. A 0.92 on some QA benchmark tells you nothing about whether your agent will refund the wrong customer, loop forever on a flaky API, or burn forty tool calls on a three-call job. This article is a practitioner's guide to an AI agent evaluation harness that catches those failures before production: what to measure, how to build the eval set, how offline and online testing fit together, where LLM-as-a-judge helps and where it lies, and how to gate all of it in CI.
Why Agents Are Hard to Evaluate
Three properties of agentic systems break the assumptions that traditional ML evaluation relies on.
Non-determinism. Even at temperature zero, agents are not reproducible end to end. Tool results change (a search returns different documents today), upstream models get silently updated, and the slightest perturbation in retrieved context fans out into a different plan. A single pass/fail run is statistical noise. You need to run each eval case n times and report distributions — success rate, p50/p95 latency, cost variance — not point estimates.
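A minimal sketch of that repeated-run reporting. It assumes a hypothetical `run_once` callable that executes one eval case and returns a `RunResult`; both names are illustrative and not taken from any framework. The nearest-rank percentile used here is one simple choice among several.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunResult:
    success: bool
    latency_s: float
    cost_usd: float


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: List[RunResult]) -> dict:
    """Collapse repeated runs of one case into a distribution summary."""
    latencies = [r.latency_s for r in results]
    costs = [r.cost_usd for r in results]
    return {
        "n": len(results),
        "success_rate": sum(r.success for r in results) / len(results),
        "latency_p50": percentile(latencies, 50),
        "latency_p95": percentile(latencies, 95),
        "cost_mean": statistics.mean(costs),
        "cost_stdev": statistics.stdev(costs) if len(costs) > 1 else 0.0,
    }


def evaluate_case(run_once: Callable[[], RunResult], n: int = 20) -> dict:
    """Run the same case n times (run_once is a hypothetical agent runner)."""
    return summarize([run_once() for _ in range(n)])
```

With small n the p95 is effectively the worst run, so treat it as a rough tail indicator until the run count grows.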
Multi-step trajectories. The unit of evaluation is a path, not a label. An agent that books the right flight after seven redundant searches and two backtracks is "correct" on outcome but broken on efficiency and trajectory. Conversely, an agent can emit a perfectly reasonable-looking final message while having skipped the verification step your policy requires. Outcome scoring and process scoring are different axes and you need both.
Tool side-effects. Agents act. They write to databases, send emails, charge cards, open tickets. This makes evaluation dangerous (you cannot freely run a refund agent against production Stripe) and stateful (the same test gives different results depending on what the previous step did). Serious agent testing therefore requires sandboxed or mocked tools, deterministic fixtures, and the ability to assert on attempted actions, not just realized ones.
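One way to assert on attempted actions is to route every tool call through a recording sandbox. The sketch below is illustrative only: `SandboxedTools`, its `policy` callback, and the fixture layout are assumptions made for this example, not the API of any real tool framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolCall:
    tool: str
    args: Dict[str, Any]
    allowed: bool


@dataclass
class SandboxedTools:
    """Serves deterministic fixtures and records every attempted call."""
    policy: Callable[[str, Dict[str, Any]], bool]   # True if the call is in policy
    fixtures: Dict[str, Any]                         # canned result per tool name
    attempts: List[ToolCall] = field(default_factory=list)

    def call(self, tool: str, args: Dict[str, Any]) -> Any:
        allowed = self.policy(tool, args)
        # Record the attempt before deciding whether to serve it.
        self.attempts.append(ToolCall(tool, dict(args), allowed))
        if not allowed:
            return {"error": "blocked_by_policy"}
        return self.fixtures[tool]

    def blocked(self) -> List[ToolCall]:
        """Attempted calls that the policy refused."""
        return [c for c in self.attempts if not c.allowed]
```

A test can then assert that `blocked()` is empty for a well-behaved run, or that a specific out-of-policy call was attempted and stopped, without any real side-effect ever happening.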
A fourth, quieter problem: error compounding. A 90%-reliable step looks fine alone, but chain ten of them and, if failures are independent, end-to-end success falls to roughly 35%. Per-step accuracy is seductive and misleading; you must measure the full chain. Treating the agent as a black box (as covered in the agentic AI pillar) only works if your harness can also crack it open and inspect the steps.
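The compounding arithmetic is easy to check. This sketch assumes every step must succeed and that step failures are independent, which is a simplification; real agents can recover from some errors and correlate on others.

```python
def chain_success(per_step: float, steps: int) -> float:
    """End-to-end success probability when every step must succeed independently."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach an end-to-end target over `steps` steps."""
    return target ** (1 / steps)


print(round(chain_success(0.90, 10), 3))       # 0.349
print(round(required_per_step(0.90, 10), 4))   # 0.9895
```

Read the second number as the practical lesson: under these assumptions, a ten-step agent that should succeed 90% of the time end to end needs each step to be about 99% reliable.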
The Dimensions Worth Measuring
"Did it work?" is necessary but nowhere near sufficient. A production-grade agent evaluation scores at least seven dimensions; skip any one and a regression in it ships silently.
- Task success / goal completion — did the agent achieve the user's actual goal, measured against a verifiable end-state (the row exists, the ticket is closed, the answer matches the gold value) rather than whether the agent claims success?
- Trajectory & tool-call correctness — did it call the right tools, with valid arguments, in a sensible order? Catches the case where the answer is right but the process is unsafe, wasteful, or non-reproducible.
- Faithfulness / grounding — are claims supported by tool outputs and retrieved context rather than hallucinated? This is the bridge to RAG evaluation (see the enterprise RAG pillar), where frameworks like Ragas plug in.
- Cost per task — total tokens and tool invocations per job. The most under-measured agent metric, and the one that quietly destroys unit economics at scale.
- Latency — end-to-end and per-step. Report p95, not the mean; agent tail latency is brutal because one slow tool or one extra reasoning loop dominates.
- Safety / policy adherence — did it stay inside guardrails: no out-of-policy actions, no PII leakage, refusal where required, no destructive tool calls without confirmation?
- Robustness — does it degrade gracefully under adversarial inputs, malformed tool responses, ambiguous instructions, and prompt-injection embedded in tool output?
Here is how each dimension maps to a practical method and the raw signal you collect:
| Dimension | Primary method | Signal |
|---|---|---|
| Task success | Programmatic end-state assertion | Pass/fail vs. gold final state |
| Trajectory correctness | Tool-call trace matching / sequence rules | Expected vs. actual tool calls + args |
| Faithfulness | LLM-as-judge + grounding checks (Ragas-style) | Claim-to-source attribution score |
| Cost per task | Telemetry aggregation | Tokens + tool-call count per task |
| Latency | Span timing | p50 / p95 end-to-end + per step |
| Safety / policy | Rule checks + red-team suites | Policy-violation count, refusal rate |
| Robustness | Perturbation & injection suites | Success delta under perturbation |
The principle: prefer programmatic, deterministic checks wherever a verifiable end-state or trace rule exists, and reserve judged evaluation for genuinely subjective dimensions (faithfulness, helpfulness, tone). Deterministic checks are cheaper, more stable, and harder to game.
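As an example of a deterministic trace rule, the sketch below checks that the expected tool calls appear in order, allows extra calls in between, and flags forbidden tools or a blown call budget. The `(tool, args)` tuple shape and the function names are assumptions for illustration; adapt them to whatever your trace format records.

```python
from typing import Any, Dict, List, Optional, Set, Tuple

Call = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def call_matches(expected: Call, actual: Call) -> bool:
    """Same tool, and every expected argument is present with the same value."""
    exp_tool, exp_args = expected
    act_tool, act_args = actual
    return exp_tool == act_tool and all(
        k in act_args and act_args[k] == v for k, v in exp_args.items()
    )


def follows_expected_order(expected: List[Call], actual: List[Call]) -> bool:
    """True if the expected calls occur in order; unrelated calls may interleave."""
    remaining = iter(actual)  # shared iterator, so each match consumes the prefix
    return all(any(call_matches(e, a) for a in remaining) for e in expected)


def check_trajectory(
    expected: List[Call],
    actual: List[Call],
    forbidden: Optional[Set[str]] = None,
    max_calls: Optional[int] = None,
) -> List[str]:
    """Return a list of human-readable problems; empty means the trajectory passes."""
    problems = []
    if not follows_expected_order(expected, actual):
        problems.append("expected tool calls missing or out of order")
    used = sorted({tool for tool, _ in actual if tool in (forbidden or set())})
    if used:
        problems.append(f"forbidden tools called: {used}")
    if max_calls is not None and len(actual) > max_calls:
        problems.append(f"{len(actual)} tool calls exceeds budget of {max_calls}")
    return problems
```

Matching on a subset of arguments keeps the rule tolerant of incidental parameters while still pinning the ones that matter, such as the order or customer ID.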
Building a Gold-Standard Eval Set and Rubric
Your harness is only as good as the cases it runs. A serious AI agent evaluation set is a curated, version-controlled artifact, not a handful of demo prompts.
- Mine real traffic. The best cases come from production logs and support transcripts. Sample broadly, then over-sample the hard tail: ambiguous requests, multi-constraint tasks, the inputs that already caused incidents.
- Define the end-state, not just the answer. Specify a machine-checkable success condition per case — the exact value, the DB state, the allowed tool sequences. "Looks good" is not a success condition.
- Stratify by difficulty and category. Tag cases (simple lookup, multi-hop, requires-clarification, adversarial) and track success per stratum, so a regression in hard cases is not masked by easy ones.
- Keep a frozen regression set and a rotating fresh set. The frozen set guards against regressions; the rotating set guards against overfitting to the frozen one.
- Include negative and red-team cases. Inputs where the correct behavior is to refuse, escalate, or clarify — and prompt-injection payloads where the correct behavior is to ignore the injected instruction.
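The list above can be sketched as a small data model. Everything here (`EvalCase`, its fields, the stratum names) is a hypothetical schema for illustration, with the end-state expressed as key/value pairs that must hold in the final sandbox state.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class EvalCase:
    case_id: str
    prompt: str
    stratum: str                      # e.g. "simple-lookup", "multi-hop", "adversarial"
    expected_state: Dict[str, Any]    # machine-checkable end-state
    in_frozen_set: bool = True        # frozen regression set vs. rotating fresh set


def passed(case: EvalCase, final_state: Dict[str, Any]) -> bool:
    """Pass only if every expected key holds the expected value in the final state."""
    return all(
        k in final_state and final_state[k] == v
        for k, v in case.expected_state.items()
    )


def success_by_stratum(
    cases: List[EvalCase], final_states: Dict[str, Dict[str, Any]]
) -> Dict[str, float]:
    """Success rate per stratum, so hard-case regressions are not averaged away."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.stratum] += 1
        wins[case.stratum] += passed(case, final_states.get(case.case_id, {}))
    return {stratum: wins[stratum] / totals[stratum] for stratum in totals}
```

Refusal and red-team cases fit the same shape: the expected end-state is simply that the refusal flag is set and no write occurred.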
The rubric is the second half. For judged dimensions, write a rubric with concrete, near-binary criteria rather than a vague 1–10 scale. "Score helpfulness 1–10" produces noise; "Did the response (a) directly address the asked question, (b) cite a supporting source, (c) avoid unsupported claims? Answer yes/no for each" produces signal. Binary, additive checklist criteria tend to yield much closer agreement between a human rater and an automated judge than holistic scores do.
Anchor the rubric against humans. Have domain experts label a few hundred cases as ground truth, and continuously measure how well your automated scorer agrees with those labels. Our own GenAI assessment framework uses exactly this loop: humans set the bar, automation scales it, and the two are reconciled on a sample every cycle.
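Measuring scorer-versus-human agreement can be as simple as the sketch below. It reports raw agreement plus Cohen's kappa, which corrects for agreement expected by chance; kappa is one common choice for this and is not something the framework mentioned above is known to prescribe. Labels are assumed to be binary pass/fail verdicts on the same cases.

```python
from typing import Sequence


def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of cases where the automated scorer matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Chance-corrected agreement for binary labels: (po - pe) / (1 - pe)."""
    observed = agreement_rate(human, judge)
    n = len(human)
    p_human = sum(human) / n      # share of cases the human marked as pass
    p_judge = sum(judge) / n      # share of cases the judge marked as pass
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:           # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

Raw agreement looks flattering when most cases pass, which is why a chance-corrected number is worth tracking alongside it.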
Offline vs. Online Evaluation
You need both. They answer different questions and catch different failures.
Offline: regression suites and replay
Offline evaluation runs your eval set against a candidate agent build in a sandbox before release. Two complementary modes:
- Regression suites — the full curated set executed against mocked/sandboxed tools, scored on every dimension, run on every change to prompt, model, tool schema, or orchestration logic. This is your safety net and your CI gate.
- Replay (trajectory replay) — re-run recorded production traces (cached tool responses) against a new agent version to see whether decisions changed, holding the environment fixed. Replay isolates the agent's reasoning from environmental noise, which is the only way to get a near-deterministic diff between two versions.
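A minimal sketch of trajectory replay, assuming a recorded trace shaped as `{"input": ..., "tool_calls": [{"tool", "args", "result"}, ...]}` and agents exposed as plain callables `agent(task, tools)`. Both shapes are assumptions for this example. A cache miss is treated as a signal in its own right: the new version asked for something the old one never did.

```python
import json
from typing import Any, Callable, Dict, List


class CacheMiss(Exception):
    """The agent requested a tool call that is not in the recorded trace."""


class ReplayTools:
    """Serves recorded tool results so the environment is held fixed."""

    def __init__(self, recorded: List[Dict[str, Any]]):
        self._cache = {self._key(c["tool"], c["args"]): c["result"] for c in recorded}
        self.calls: List[str] = []

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> str:
        # Canonical JSON so argument order does not affect the lookup.
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def call(self, tool: str, args: Dict[str, Any]) -> Any:
        key = self._key(tool, args)
        self.calls.append(key)
        if key not in self._cache:
            raise CacheMiss(key)
        return self._cache[key]


def replay_diff(trace: Dict[str, Any], old_agent: Callable, new_agent: Callable) -> Dict[str, Any]:
    """Run both versions against the same cached trace and compare their behavior."""
    outcomes: Dict[str, Any] = {}
    for name, agent in (("old", old_agent), ("new", new_agent)):
        tools = ReplayTools(trace["tool_calls"])
        try:
            decision = agent(trace["input"], tools)
        except CacheMiss as miss:
            decision = f"diverged at {miss}"
        outcomes[name] = {"decision": decision, "calls": tools.calls}
    outcomes["changed"] = outcomes["old"] != outcomes["new"]
    return outcomes
```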
Offline is fast, cheap, safe, and reproducible. Its limit: mocks are not reality — real tools fail in ways your fixtures do not, and real users phrase things your set never anticipated.
Online: shadow, canary, A/B
Online evaluation tests against live (or live-mirrored) traffic.
- Shadow mode — the new agent runs on real inputs in parallel with the incumbent; its outputs are logged and scored but never shown to users and its tool writes are stubbed. Zero user risk, real-traffic signal.
- Canary — route a small percentage of real traffic to the new version, watch the dimension metrics and guardrail violations, and auto-roll-back on a threshold breach.
- A/B / interleaving — split traffic to compare versions on outcome metrics and online proxies (task completion, escalation rate, thumbs-up, retry rate).
The honest rule: offline catches regressions, online catches reality. Ship nothing on offline scores alone, and never run a blind A/B without an offline gate in front of it.
LLM-as-a-Judge: When It Works and Where It Lies
Using a strong LLM to grade agent outputs ("LLM-as-a-judge," popularized by the MT-Bench / Chatbot Arena work, arXiv:2306.05685) is indispensable for subjective dimensions at scale. It is also full of well-documented failure modes you must actively control for.
Known biases:
- Position bias — judges favor whichever response is shown first. Mitigate by swapping order and averaging both directions.
- Verbosity / length bias — longer answers are scored higher regardless of quality.
- Self-preference / self-enhancement bias — a judge tends to prefer outputs from its own model family. Use a different model family for the judge than for the agent where you can.
- Sycophancy and leniency — judges over-agree and grade generously, compressing the score range.
How to make a judge trustworthy:
- Give it a tight rubric with binary criteria.
- Require a short reason before the verdict (reasoning-then-score beats score-only).
- Provide a reference answer when one exists (reference-guided grading is far more reliable).
- Calibrate against human labels and report the agreement rate.
- Use judges for relative comparisons over absolute scores.
Where a deterministic check exists, use it instead. Treat the judge as a noisy instrument you keep calibrated, not an oracle.
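The order-swapping mitigation for position bias can be sketched as follows. The `Judge` signature is hypothetical: it stands in for whatever wrapper you put around your judge model, and is assumed to return the probability that the first response shown is the better one. The tie margin is likewise an arbitrary placeholder.

```python
from typing import Callable

# Hypothetical judge wrapper: (question, first_response, second_response) -> float in [0, 1],
# the probability that the FIRST response shown is better.
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    a_shown_first = judge(question, a, b)
    b_shown_first = judge(question, b, a)
    return (a_shown_first + (1.0 - b_shown_first)) / 2


def verdict(preference: float, margin: float = 0.1) -> str:
    """Declare a winner only when the averaged preference clears the margin."""
    if preference > 0.5 + margin:
        return "a"
    if preference < 0.5 - margin:
        return "b"
    return "tie"
```

A judge that simply favors whatever it sees first averages out to 0.5 and lands on "tie", which is the behavior you want from a biased instrument.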
Component vs. End-to-End Evaluation
Evaluate at two altitudes, because each hides the other's failures.
Component (unit) evals isolate one part: the planner's decomposition, a single tool's argument formatting, the retriever's recall, the router's classification, the final summarizer's faithfulness. Fast, cheap, and they pinpoint where a regression lives. When end-to-end success drops, component evals tell you which link in the chain broke.
End-to-end (integration) evals run the whole agent on a real task and score the outcome and trajectory. They are the only thing that catches emergent failures — error compounding, plan-level mistakes, tool-interaction bugs — that every component passing in isolation will miss.
Run both. Component evals in CI on every commit for speed; end-to-end evals on every release and nightly. Pair each known production incident with both a component test (the broken step) and an end-to-end case (the user-visible symptom).
Gating Agent Quality in CI
Evaluation that does not block a bad merge is theater. The point of the harness is to make agent quality a release gate, the way unit tests gate code.
- Trigger the eval suite on every change to prompts, model version, tool definitions, or orchestration logic — agents have no "code vs. config" line, so config changes must trip the gate too.
- Run the fast deterministic subset on every commit (component checks, trajectory rules, cheap assertions) and the full judged suite on PR / pre-release.
- Gate on a basket of thresholds, not one number — task success ≥ X%, p95 latency ≤ Y, cost per task ≤ Z, zero policy violations, no faithfulness regression beyond tolerance.
- Compare against the current production baseline, not an absolute floor, and fail the build on a statistically meaningful regression in any gated dimension.
- Account for non-determinism — run each case multiple times and gate on the rate with confidence intervals, so flaky variance does not redden a healthy build (or greenlight a sick one).
- Make failures legible — surface which cases regressed, with full trajectories. A red gate with no trace is one engineers learn to bypass.
Budget the suite: a full judged run is slow and costs real tokens, so tier it (commit → PR → release → nightly) to keep the loop fast while coverage stays broad.
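A sketch of a basket-of-thresholds gate that accounts for run-to-run variance. It uses a Wilson score interval on the success rate and fails only when the candidate's upper bound sits below the production baseline; that is one reasonable reading of "statistically meaningful regression", not the only one. `SuiteResult` and the limit values are hypothetical and should come from your own baseline.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


@dataclass
class SuiteResult:
    successes: int
    runs: int
    p95_latency_s: float
    cost_per_task_usd: float
    policy_violations: int


def gate(candidate: SuiteResult, baseline_success_rate: float,
         max_p95_latency_s: float, max_cost_usd: float) -> List[str]:
    """Return the reasons the build should fail; an empty list means it passes."""
    failures = []
    _, upper = wilson_interval(candidate.successes, candidate.runs)
    if upper < baseline_success_rate:
        failures.append(
            f"task success regressed: upper bound {upper:.3f} < baseline {baseline_success_rate:.3f}"
        )
    if candidate.p95_latency_s > max_p95_latency_s:
        failures.append(f"p95 latency {candidate.p95_latency_s}s > {max_p95_latency_s}s")
    if candidate.cost_per_task_usd > max_cost_usd:
        failures.append(f"cost per task ${candidate.cost_per_task_usd} > ${max_cost_usd}")
    if candidate.policy_violations > 0:
        failures.append(f"{candidate.policy_violations} policy violation(s)")
    return failures
```

In a CI job, the script would print the failures and exit non-zero when the list is non-empty, ideally linking each failure to the regressed cases and their trajectories.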
Observability Hooks: Closing the Loop
Pre-production evaluation and production observability are the same instrumentation viewed at two points in time. If your agent emits structured traces — every step, tool call, argument, tool result, token count, and latency as spans — you get offline eval scoring, online monitoring, and replay corpora from one pipeline.
Concretely, instrument these hooks (the agent observability glossary entry covers the data model in depth):
- Step-level spans with inputs, outputs, tool name, arguments, result, tokens, and duration — the atomic unit both your evaluator and your dashboards consume.
- Trajectory IDs that stitch all steps of one task into a replayable trace.
- Online scorers that run a sampled subset of your offline checks against live traffic, so production drift becomes a metric, not a surprise.
- Tool-boundary capture — record attempted and realized actions, since a blocked unsafe call is a critical safety signal a results-only log discards.
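The hooks above can be sketched as one span record plus two helpers. The field names are illustrative assumptions and deliberately not the OpenTelemetry GenAI attribute names; map them onto whatever tracing schema you actually emit.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class StepSpan:
    trajectory_id: str            # stitches all steps of one task together
    step: int
    tool: Optional[str]           # None for a pure reasoning step
    arguments: Dict[str, Any]
    result: Any
    tokens: int
    duration_s: float
    attempted: bool = True
    realized: bool = True         # False when the call was blocked at the tool boundary


def stitch(spans: List[StepSpan]) -> Dict[str, List[StepSpan]]:
    """Group spans by trajectory ID and order them into replayable traces."""
    trajectories: Dict[str, List[StepSpan]] = defaultdict(list)
    for span in spans:
        trajectories[span.trajectory_id].append(span)
    for steps in trajectories.values():
        steps.sort(key=lambda s: s.step)
    return dict(trajectories)


def trajectory_metrics(steps: List[StepSpan]) -> Dict[str, Any]:
    """Per-task cost, latency, and safety signals derived from the same spans."""
    return {
        "tokens": sum(s.tokens for s in steps),
        "tool_calls": sum(1 for s in steps if s.tool is not None),
        "duration_s": sum(s.duration_s for s in steps),
        "blocked_calls": [s.tool for s in steps if s.attempted and not s.realized],
    }
```

Because the offline evaluator and the online scorer both read `StepSpan` records, the same metric code can run in CI and against sampled production traffic.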
This is also where the production case for a registry shows up. An agent platform that fronts tools through a governed gateway — the pattern in MCP gateway auth and discovery — gets tool-call telemetry, identity, and policy enforcement at the boundary for free. Jarvis Registry, for example, treats those gateway traces as both the audit log and the replay corpus, so the same spans that prove compliance also feed the regression suite. The asset that lets you evaluate the agent is the same asset that lets you operate it.
Frequently Asked Questions
How is agent evaluation different from LLM evaluation?
LLM evaluation scores a single input-output pair against a reference; agent evaluation scores a multi-step trajectory of tool calls whose outcome and process must both be judged, often across many runs because agents are non-deterministic and mutate external state. You add dimensions that model evaluation never needs (trajectory correctness, cost per task, tool-call safety), and you need sandboxed tools because the agent takes real actions.
What metrics should I track for AI agents in production?
At minimum: task success rate, trajectory / tool-call correctness, faithfulness (grounding), cost per task (tokens + tool calls), latency at p95, and policy-violation count. Track them as distributions across repeated runs, compare each release against the production baseline, and alert on guardrail metrics like out-of-policy actions or PII leakage rather than only on aggregate quality.
Is LLM-as-a-judge reliable enough for production gating?
It is reliable for subjective dimensions if you control its biases — position, verbosity, and self-preference — with a tight binary rubric, order-swapped scoring, a reference answer, and continuous calibration against human labels. For anything with a verifiable end-state (task success, tool-call correctness), use a deterministic programmatic check instead; reserve the judge for faithfulness, helpfulness, and tone.
How do I build a gold-standard eval set for agents?
Mine real production logs and support transcripts, over-sample the hard tail and incident-causing cases, and write a machine-checkable success condition (end-state, not just the answer) for each. Keep a frozen regression set plus a rotating fresh set, stratify by difficulty, include red-team and refusal cases, and anchor judged criteria against a few hundred human-labeled examples.
Should I run agent evaluations offline or online?
Both, because they catch different failures. Offline regression and replay suites run in CI against sandboxed tools to catch regressions cheaply and safely before release; online shadow, canary, and A/B testing run against real traffic to catch the gaps your mocks and curated inputs never anticipated. Gate releases on offline scores, then validate the survivor with a guarded online rollout.
Which agent evaluation frameworks and benchmarks should I look at?
For tracing and judged evals, LangSmith and OpenAI Evals are common starting points; for RAG-style faithfulness, Ragas. For benchmarking agent capability, look at τ-bench (tool-agent-user interaction), SWE-bench (software tasks), AgentBench (multi-environment), and HELM for holistic model assessment. Treat public benchmarks as capability sanity checks, not substitutes for an eval set built from your tasks and tools.
Citations and References
- OpenAI Evals — framework for evaluating LLMs and LLM systems. https://github.com/openai/evals
- LangSmith — tracing, evaluation, and monitoring for LLM and agent applications. https://docs.smith.langchain.com/
- Ragas — evaluation framework for retrieval-augmented generation pipelines. https://github.com/explodinggradients/ragas
- Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." https://arxiv.org/abs/2306.05685
- τ-bench (tau-bench): "A Benchmark for Tool-Agent-User Interaction in Real-World Domains." https://github.com/sierra-research/tau-bench
- SWE-bench — "Can Language Models Resolve Real-World GitHub Issues?" https://www.swebench.com/
- AgentBench — "Evaluating LLMs as Agents." https://github.com/THUDM/AgentBench
- HELM (Holistic Evaluation of Language Models), Stanford CRFM. https://crfm.stanford.edu/helm/
- OpenTelemetry — Semantic Conventions for Generative AI / agent spans. https://opentelemetry.io/docs/specs/semconv/gen-ai/