RAG Evaluation: Measure Retrieval Quality

Most RAG failures are retrieval failures. The model did not hallucinate because it is dumb; it hallucinated because the right chunk was never in the context window. You cannot improve what you do not measure, and that is the whole discipline of RAG evaluation: retrieval quality is the dominant lever on answer quality, it is measurable with decades-old information-retrieval math, and teams that skip the measurement step ship systems whose accuracy they can only describe with adjectives.

This is for the architects and platform engineers who own a retrieval-augmented generation pipeline in production, or are about to. It is opinionated about one thing: you should be able to put a number on retrieval quality — recall, precision, ranking — separately from the model that consumes it, before that pipeline touches a user. What follows is the metric set, the gold set, the component ablations, and the CI gating that get you there.

Why retrieval is the hardest part of the stack

Generation gets the attention because it is visible. But in a typical enterprise deployment the generator is a frozen commercial model; the retrieval layer is the part you actually built, and therefore the part that is wrong. Retrieval is hard for reasons that compound:

Vocabulary mismatch. The user asks about "termination for cause" and the relevant clause says "material breach." Dense embeddings help, but they are trained on web-scale text, not your contract corpus, and they smear domain jargon.
Chunking destroys context. A fixed 512-token splitter cuts a table in half and orphans a heading from its body, producing chunks that are individually retrievable but meaningless.
The long tail. Aggregate accuracy looks fine because most queries are easy lookups. The multi-hop, ambiguous, and adversarial ones are the queries that matter, and they vanish in a single averaged score.
Silent failure. When retrieval misses, the generator does not error — it confabulates a fluent, plausible, wrong answer. There is no stack trace.

That last point is why measurement is non-negotiable: a retrieval miss looks exactly like a success until a human checks the source. If you deploy agents that chain retrieval calls — see our agentic RAG glossary entry — the miss compounds across hops, and the failure surface outgrows your ability to eyeball it.

Separate retrieval evaluation from generation evaluation

The single most useful structural decision in RAG evaluation is to treat the pipeline as two systems with two scorecards.

Retrieval evaluation asks: given a query, did the system fetch the right context? It is a pure information-retrieval problem: no LLM, milliseconds to run, deterministic, graded against a labeled set of relevant chunks. This is the classic IR setting that TREC has scored for more than thirty years.

Generation evaluation asks: given the retrieved context, did the model produce a faithful, relevant answer? This is fuzzier, often requires an LLM judge, and is only meaningful conditioned on the retrieval result.

Why insist on the split? Because a bad end-to-end answer has two possible causes and you must know which one. If recall is low, no amount of prompt engineering will save you — the answer is not in the context to begin with. If recall is high but groundedness is poor, your generator or prompt is the problem and your retriever is fine. A single "RAG score" tells you the patient is sick but not which organ; diagnose retrieval first, because it is cheaper to measure and more often the culprit.

Retrieval metrics and what each one tells you

Retrieval evaluation borrows directly from IR. Each metric answers a different question, and you want several because they trade off.

Metric	What it measures	When to use it
Recall@k	Of all the chunks that are relevant, what fraction landed in the top k	Your north star — RAG can only use what it retrieves. Set k to the context budget
Precision@k	Of the top k retrieved, what fraction are actually relevant	When noise/distractor chunks degrade the generator or inflate token cost
Hit rate@k	Did at least one relevant chunk make the top k (equals recall@k only when a query has exactly one relevant chunk)	Single-fact QA where one good chunk is enough; coarse but intuitive
MRR (Mean Reciprocal Rank)	Reciprocal of the rank of the first relevant chunk, averaged over all queries	When the first good result matters most (the generator weights early context)
nDCG@k	Ranking quality with graded relevance and position discounting	When relevance is not binary and ordering inside the top k matters

A few notes that save grief:

Recall@k gates everything. If the relevant chunk is not in the top k, every downstream metric is moot. Tune k to whatever fits your context window, then optimize recall there.
MRR vs nDCG. MRR cares only about the first relevant hit and treats relevance as binary. nDCG handles graded relevance and discounts by position, closer to how a generator weights its context. Use MRR for single-answer lookup, nDCG for richer multi-document synthesis.
Precision is not free to ignore. High recall with low precision stuffs the window with distractors — costing tokens and pulling the generator toward a confidently wrong but well-supported-looking answer.
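
The five metrics in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes each retrieved list holds unique chunk IDs in rank order, that `relevant` is the labeled set from your gold set, and that `grades` maps chunk IDs to graded relevance (unlabeled chunks count as 0). The function names are illustrative.

```python
import math
from typing import Dict, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant chunks (denominator is k)."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant chunk is in the top k, else 0.0."""
    return 1.0 if any(c in relevant for c in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk; 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], grades: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    dcg = sum(
        grades.get(chunk_id, 0.0) / math.log2(position + 2)
        for position, chunk_id in enumerate(retrieved[:k])
    )
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(position + 2) for position, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR is the mean of `reciprocal_rank` over every query in the gold set; the other functions are likewise averaged per query, ideally per stratum rather than globally.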

Generation and answer metrics

Once retrieval is solid, you grade the answer. These metrics are newer and several require an LLM judge.

Faithfulness / groundedness. Is every claim in the answer supported by the retrieved context? This is the headline metric for hallucination. The common implementation decomposes the answer into atomic claims and checks each against the context — the approach popularized by Ragas.
Answer relevance. Does the answer actually address the question, regardless of grounding? An answer can be perfectly faithful to the context and still not answer what was asked.
Context precision. Of the retrieved chunks, how many were useful for the final answer, weighted toward the top of the ranking — a retrieval-quality signal computed from the generation side.
Context recall. Did the retrieved context contain everything needed to produce the ground-truth answer? Requires a reference answer; it is the generation-side mirror of recall@k.
Citation accuracy. When the answer cites a source, does that source actually support the claim? Critical wherever users click through to provenance — and the failure mode regulators care about most.
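
To make the claim-decomposition idea concrete, here is a rough sketch of the scoring step only. It is not the Ragas API: `ClaimVerdict` and `faithfulness` are hypothetical names, and the sketch assumes something upstream (an LLM judge or a human) has already split the answer into atomic claims and marked each one as supported or not.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClaimVerdict:
    claim: str       # one atomic claim extracted from the answer
    supported: bool  # True if the retrieved context supports the claim


def faithfulness(verdicts: List[ClaimVerdict]) -> Optional[float]:
    """Share of atomic claims that the retrieved context supports."""
    if not verdicts:
        return None  # no claims to check, so the score is undefined
    return sum(v.supported for v in verdicts) / len(verdicts)
```

Citation accuracy can be scored with the same shape: one verdict per cited claim, where `supported` records whether the cited source backs it.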

The relationship worth internalizing: context precision and context recall are retrieval diagnostics measured through the lens of the answer. If faithfulness is low but context recall is high, the context was sufficient and the generator failed; if context recall is low, retrieval failed.
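
That triage rule is simple enough to encode. The sketch below is one way to do it; the function name and both thresholds are assumptions chosen for illustration, and you would set them from your own baseline scores.

```python
def diagnose(
    context_recall: float,
    faithfulness: float,
    recall_floor: float = 0.8,        # assumed threshold, tune per corpus
    faithfulness_floor: float = 0.9,  # assumed threshold, tune per corpus
) -> str:
    """Attribute a bad answer to the retriever or the generator."""
    # Check retrieval first: if the context was insufficient,
    # the generator never had a chance.
    if context_recall < recall_floor:
        return "retrieval"
    if faithfulness < faithfulness_floor:
        return "generation"
    return "ok"
```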

Building a gold-standard evaluation set

Every metric above is meaningless without labels. The gold set is the most expensive and most valuable artifact in your RAG program, and there is no way to buy your way out of building one for your corpus. It has three parts:

Real queries. Pull them from production logs, support tickets, and search history — not from a brainstorm and not exclusively from an LLM. Synthetic queries (LLM-generated from your documents) are fine for bootstrapping volume, but they are systematically easier than real ones because they were written by reading the answer.
Labeled relevant chunks. For each query, a human (ideally a domain expert) identifies which chunks are relevant, and at what grade if you want nDCG. This is the labor-intensive part. Aim for breadth across query types — single-fact, multi-hop, comparative, ambiguous — over raw count. A focused 150–300 query set with careful labels beats 5,000 sloppy ones.
Hard negatives. Deliberately include chunks that are topically close but wrong: right product, wrong region; right clause, prior version; right entity, different fiscal year. They separate a re-ranker that works from one that just reshuffles obviously-relevant results, and a test set without them flatters every component.
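
One possible shape for a gold-set record, sketched as a Python dataclass. Every field name here is an assumption made for illustration; the point is only that the query, its stratum, its graded relevant chunks, and its hard negatives live together in one versionable record.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class GoldQuery:
    query_id: str
    text: str
    stratum: str              # e.g. "single-fact", "multi-hop", "comparative", "ambiguous"
    source: str               # e.g. "production-log", "support-ticket", "synthetic"
    relevant: Dict[str, int]  # chunk_id -> relevance grade (1 = relevant, higher = better)
    hard_negatives: List[str] = field(default_factory=list)  # topically close but wrong


def validate(q: GoldQuery) -> List[str]:
    """Return a list of labeling problems; an empty list means the record is usable."""
    problems = []
    if not q.relevant:
        problems.append("no relevant chunks labeled")
    if any(grade <= 0 for grade in q.relevant.values()):
        problems.append("relevant chunk with non-positive grade")
    overlap = set(q.relevant) & set(q.hard_negatives)
    if overlap:
        problems.append(f"chunks labeled both relevant and negative: {sorted(overlap)}")
    return problems
```

Recording `source` keeps synthetic queries distinguishable from real ones, so scores on each can be reported separately.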

Practical guidance:

Report per-stratum, never just a global average. The average hides the multi-hop collapse that is your real risk.
Version the gold set like code. A metric movement is only interpretable against a fixed test set, so bump a version whenever you relabel.
Budget for drift. Refresh a slice of the gold set quarterly from recent production queries.
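
The first two points can be sketched together: group per-query scores by stratum, and derive a short fingerprint from the labels so that any relabeling visibly changes the gold-set version. The helper names and the record layout (dicts with a `query_id` key) are assumptions for illustration; a content hash is one option among several for versioning.

```python
import hashlib
import json
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def per_stratum(scores: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average a per-query metric within each stratum instead of globally."""
    buckets = defaultdict(list)
    for stratum, score in scores:
        buckets[stratum].append(score)
    return {stratum: mean(values) for stratum, values in buckets.items()}


def gold_set_fingerprint(records: List[dict]) -> str:
    """Stable short hash of the labels; changes whenever any label changes."""
    ordered = sorted(records, key=lambda r: r["query_id"])
    canonical = json.dumps(ordered, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

With three single-fact queries scoring 1.0, 1.0, and 0.9 and two multi-hop queries scoring 0.2 and 0.4, the global mean is 0.7 while the multi-hop stratum sits at 0.3, which is exactly the collapse a single average hides.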

Component ablations: measure each part in isolation

A RAG retriever is a pipeline — chunker, embedding model, vector index, optional sparse/hybrid layer, optional re-ranker. End-to-end recall tells you the pipeline's score, not which component to fix. Ablation does: hold the gold set fixed, change one component, re-measure. The components worth ablating, and the question each answers:

Chunking strategy. Fixed-size vs recursive/structure-aware vs semantic, plus size and overlap. Does recall@k change when I stop splitting tables and headings? Usually the highest-leverage, most-overlooked knob.
Embedding model. Swap the dense encoder. Does a domain-tuned or larger model lift recall on my hard queries? The MTEB leaderboard ranks general retrieval, but your corpus is the only benchmark that decides — use MTEB to shortlist, your gold set to choose.
Hybrid search. Add lexical (BM25) alongside dense and fuse. Does keyword matching recover the exact-term and rare-entity queries that dense retrieval smears? Most production vector databases document a hybrid mode for exactly this.
Re-ranker. Add a cross-encoder over the top-N candidates. Does re-ranking lift precision@k and MRR without hurting recall? Re-rankers improve ordering, not the candidate pool — if recall@N is already low, a re-ranker cannot save you, which is why you ablate it last.
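
The hybrid-search item says "fuse" without naming a method. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption about how you might combine a dense ranking with a BM25 ranking; your vector database may use a different fusion scheme. The constant 60 is the conventional default for RRF, and the function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores 1 / (k + rank) in every list that contains it;
    the scores are summed and chunks are sorted by the total.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk ID for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
```

Because RRF uses only ranks, it sidesteps the problem that dense similarity scores and BM25 scores live on different scales. The ablation question stays the same: does recall@k on exact-term and rare-entity queries go up when the fused list replaces the dense-only list?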

Two rules make ablation honest:

One variable at a time. Change the chunker and the embedding model together and you have learned nothing about either.
Ablate in pipeline order. Recall first (chunking, embedding, hybrid), then ranking (re-ranker). A re-ranker measured on a low-recall candidate pool looks useless; fix the pool before you grade the sort.
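
The one-variable rule is easy to enforce mechanically. The sketch below assumes a pipeline configuration is a flat dict of component names and that `evaluate` is a function you supply that builds the pipeline for a config and returns a metric such as recall@k on the fixed gold set. Both are hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def changed_keys(baseline: Config, variant: Config) -> List[str]:
    """Names of the components that differ between two configs."""
    keys = set(baseline) | set(variant)
    return sorted(key for key in keys if baseline.get(key) != variant.get(key))


def ablate(
    baseline: Config,
    variants: Dict[str, Config],
    evaluate: Callable[[Config], float],
) -> Dict[str, float]:
    """Score each variant against the baseline, refusing multi-variable changes."""
    base_score = evaluate(baseline)
    deltas = {}
    for name, variant in variants.items():
        diff = changed_keys(baseline, variant)
        if len(diff) != 1:
            raise ValueError(f"variant {name!r} changes {diff}; change exactly one component")
        deltas[name] = evaluate(variant) - base_score
    return deltas
```

Running the recall-stage variants (chunker, embedder, hybrid) first and promoting the winner to the new baseline before testing a re-ranker follows the pipeline-order rule.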

Measure cost too: a re-ranker that adds a few points of nDCG and tens of milliseconds of latency is a different decision in an interactive chat versus a batch job.

LLM-as-judge for groundedness, and its caveats

You cannot label faithfulness at scale by hand, so the field uses a strong LLM as the grader — the "LLM-as-a-judge" pattern. It works well for groundedness and answer relevance and is the engine under most of Ragas's generation metrics. It is also genuinely flawed, and shipping it naively will mislead you.

Known biases, with mitigations:

Position bias. Judges tend to favor an answer because of where it appears in a pairwise comparison, most often the first slot. Judge each pair in both orders and average.
Verbosity bias. Judges reward longer answers. Control answer length, or instruct the judge to ignore length.
Self-preference. A judge prefers answers from its own model family. Use a different model as judge than as generator where you can.
Calibration drift. A model upgrade silently changes your scores. Pin the judge model and version, and treat a judge change as a test-set change.
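
The position-bias mitigation can be sketched as follows. The `Judge` signature is an assumption: a callable that takes the question and two answers and returns how strongly it prefers the first one shown, on a 0 to 1 scale. In practice this would wrap a call to your pinned judge model.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer) -> preference
# for the first answer shown, between 0.0 and 1.0.
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Preference for answer A, averaged over both presentation orders."""
    a_shown_first = judge(question, answer_a, answer_b)  # preference for A when A is first
    b_shown_first = judge(question, answer_b, answer_a)  # preference for B when B is first
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0
```

A judge that always gives the first slot 0.7 regardless of content comes out at 0.5 after averaging, which is the correct "no real preference" reading.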

These limitations are documented in the LLM-as-a-judge literature, notably the MT-Bench / Chatbot Arena work (Zheng et al., arXiv:2306.05685). The practical posture: use the judge for groundedness because there is no cheaper alternative at scale, but validate it against human labels on a sample. Compute judge-versus-human agreement on a few hundred examples; if it is low, the verdicts are not trustworthy for your domain — a judge you have not validated is a number generator, not a measurement.
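
The text above does not prescribe an agreement statistic. Cohen's kappa is one reasonable choice because it corrects raw agreement for what two raters would agree on by chance; the sketch below assumes the judge and the human each assigned one categorical label (for example grounded or not grounded) to the same examples, in the same order.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge_labels: Sequence[Hashable], human_labels: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Raw percent agreement can look high simply because most answers are grounded; kappa discounts that, which matters when the failures you care about are the rare class.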

Offline versus online evaluation

Everything so far is offline — a fixed gold set, scored in CI, deterministic. Offline eval is necessary and insufficient: it tells you whether answers match your labels, not whether real users find them useful. Online evaluation closes that gap with production signals:

Implicit feedback. Click-through on cited sources, dwell time, copy events, and query reformulation rate (a reformulation often signals a failed first answer).
Explicit feedback. Thumbs up/down, star ratings, "was this helpful."
Answer acceptance. In an agentic or copilot setting, did the user accept the generated artifact, edit it, or discard it — the strongest signal you get.

Online signals are noisy, biased (people rarely click thumbs-down), and lagging, but they are the only ground truth for usefulness. Run both: offline eval as the pre-ship gate, online eval as the post-ship reality check, with production failures fed back into the gold set so the offline suite keeps getting harder. Capturing those signals reliably is an observability problem, covered in our agent observability glossary entry.
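
As a rough illustration of one implicit signal, the sketch below estimates query reformulation rate from session logs. Everything about it is an assumption: sessions are taken to be ordered lists of query strings, and a "reformulation" is approximated as a follow-up query whose word overlap (Jaccard similarity) with the previous one meets a threshold. A production definition would likely also use timestamps and embedding similarity.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two queries, from 0.0 to 1.0."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def reformulation_rate(sessions: List[List[str]], threshold: float = 0.5) -> float:
    """Fraction of all queries that were followed by a near-duplicate rewrite."""
    total = sum(len(queries) for queries in sessions)
    if total == 0:
        return 0.0
    reformulated = sum(
        1
        for queries in sessions
        for current, following in zip(queries, queries[1:])
        if jaccard(current, following) >= threshold
    )
    return reformulated / total
```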

A note on scope: this evaluates retrieval, not how the agent decides when to retrieve. When a retrieval-augmented pattern beats a tool-calling one is a separate decision, laid out in our MCP vs RAG comparison, and the deeper architecture of enterprise retrieval systems lives in our Enterprise RAG pillar.

CI gating and drift monitoring

Evaluation that runs once before launch and never again is theater. Wire the fast, deterministic offline suite into the same place your unit tests live.

CI gating. On every change that touches the pipeline (new embedding model, chunker config, prompt edit, index rebuild), run the offline suite and gate the merge on it:

Compute recall@k, precision@k, MRR, and nDCG against the pinned gold set.
Compute faithfulness and answer relevance with the pinned judge model.
Fail the build on a regression beyond a tolerance, per stratum — a drop on multi-hop queries hidden behind a flat overall score is the one you most need to catch.
Report the per-stratum scorecard as a build artifact so reviewers see the trade-off the change made.
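
The gate itself can be a small comparison between two scorecards. This sketch assumes a scorecard is a nested dict of stratum, then metric, then score, produced by the pinned gold set; the function name and the 0.02 tolerance are illustrative assumptions, not recommended values.

```python
from typing import Dict, List

Scorecard = Dict[str, Dict[str, float]]  # stratum -> metric name -> score


def find_regressions(baseline: Scorecard, candidate: Scorecard, tolerance: float = 0.02) -> List[str]:
    """List every per-stratum metric that dropped by more than the tolerance."""
    failures = []
    for stratum, metrics in baseline.items():
        for metric, base_value in metrics.items():
            new_value = candidate.get(stratum, {}).get(metric)
            if new_value is None:
                failures.append(f"{stratum}/{metric}: missing from candidate")
            elif base_value - new_value > tolerance:
                failures.append(f"{stratum}/{metric}: {base_value:.3f} -> {new_value:.3f}")
    return failures
```

In a CI job, a non-empty result would be printed and turned into a non-zero exit code so the merge is blocked. Because the comparison runs per stratum, a multi-hop drop fails the build even when single-fact gains keep the overall average flat.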

Drift monitoring. In production, the pipeline degrades even when the code does not:

Data drift. New documents enter the corpus and the query distribution shifts. Re-run offline eval on a schedule and feed fresh production queries into the gold set.
Model drift. A managed embedding endpoint or judge model is upgraded under you. Pin versions where the provider allows it; alarm on score movement where it does not.
Online-metric alarms. Treat a sustained drop in answer acceptance or a spike in query reformulation as a production incident, and trace it to a retrieval regression.
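
A sustained drop can be detected with a simple rolling comparison. The sketch below is one possible rule, with every parameter an assumption: it takes one value per day for a metric where higher is better (such as answer acceptance), compares the most recent days against the preceding baseline window, and fires only if every recent day is below the baseline by more than the allowed relative drop.

```python
from statistics import mean
from typing import Sequence


def sustained_drop(
    daily_values: Sequence[float],
    baseline_days: int = 14,        # assumed window, tune to your traffic
    recent_days: int = 3,           # assumed window, tune to your traffic
    max_relative_drop: float = 0.10,
) -> bool:
    """True if each of the last `recent_days` values is well below the baseline mean."""
    if len(daily_values) < baseline_days + recent_days:
        return False  # not enough history to judge
    baseline_window = daily_values[-(baseline_days + recent_days):-recent_days]
    baseline = mean(baseline_window)
    if baseline <= 0:
        return False
    floor = baseline * (1.0 - max_relative_drop)
    return all(value < floor for value in daily_values[-recent_days:])
```

Requiring every recent day to breach the floor filters out single-day noise. For a metric where a spike is the bad direction, such as query reformulation rate, the same check can be run on the negated or inverted series.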

As one production reference point, the editorial pipeline behind ASCENDING's Jarvis Registry treats retrieval recall as a release gate the same way a unit-test suite gates code — a regression on the labeled set blocks the deploy. The plumbing that exposes those retrievers as governed, discoverable tools is the gateway layer described in MCP gateway auth and discovery. The lesson generalizes: if retrieval quality is not a gate, it silently regresses, and the first to notice is a user reading a confidently wrong answer.

Frequently asked questions

What is the difference between recall@k and context recall in RAG evaluation?

Recall@k is a pure retrieval metric: of all chunks labeled relevant in your gold set, what fraction appeared in the top k results. Context recall (as defined by Ragas) is measured from the generation side — whether the retrieved context contained everything needed to produce the ground-truth answer, which requires a reference answer. Use recall@k to tune the retriever directly; use context recall when you only have end-to-end traces.

Do I need a gold-standard dataset to evaluate RAG, or can I rely on LLM-as-a-judge alone?

You need a gold set. LLM-as-a-judge can score faithfulness and relevance without labels, but it only sees what was retrieved, not what should have been, so it cannot measure whether retrieval fetched the right chunks — and recall is where most RAG systems actually fail.

How many queries does a RAG evaluation set need to be statistically useful?

There is no universal number, but coverage of query types matters more than raw volume. A carefully labeled set of roughly 150–300 real queries — stratified across single-fact, multi-hop, comparative, and ambiguous cases and including hard negatives — gives more signal than thousands of synthetic ones written by reading the answers.

Should I evaluate retrieval and generation separately or just measure end-to-end answer quality?

Separately, and retrieval first. End-to-end quality tells you the system is wrong but not why; splitting the scorecard tells you whether the relevant context was missing (a retrieval problem no prompt can fix) or present-but-misused (a generation problem). Retrieval is also cheaper and faster to measure, since it needs no LLM.

Can I trust an LLM judge for groundedness scoring in a regulated domain?

Only after you validate it against human labels in that domain. LLM judges carry documented position, verbosity, and self-preference biases, and their calibration drifts on model upgrades. Compute judge-versus-expert agreement on a few hundred examples; if it is low, tune the rubric or fall back to human review for the high-stakes slice.

How do I know whether a re-ranker is actually helping my RAG pipeline?

Ablate it: hold the gold set and every other component fixed, measure precision@k and MRR with and without the re-ranker, and confirm recall@k does not drop. A re-ranker only reorders the candidate pool, so if recall@N is already low it cannot help — which is why you fix chunking, embeddings, and hybrid search first.

Citations and References

Ragas — RAG evaluation framework (faithfulness, answer relevance, context precision/recall). GitHub: https://github.com/explodinggradients/ragas
Ragas documentation — metric definitions and usage. https://docs.ragas.io
BEIR — heterogeneous benchmark for zero-shot information retrieval. GitHub: https://github.com/beir-cellar/beir
MTEB — Massive Text Embedding Benchmark (embedding/retrieval model leaderboard). GitHub: https://github.com/embeddings-benchmark/mteb
Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (2023). arXiv: https://arxiv.org/abs/2306.05685
TREC — Text REtrieval Conference, the reference setting for IR metrics (recall, precision, MRR, nDCG). https://trec.nist.gov
Normalized Discounted Cumulative Gain (nDCG) — definition and formulation. https://en.wikipedia.org/wiki/Discounted_cumulative_gain
Mean Reciprocal Rank (MRR) — definition. https://en.wikipedia.org/wiki/Mean_reciprocal_rank