Measuring RAG Quality: Retrieval Evaluation Beyond Vibes

Every team that builds a RAG system runs some form of evaluation. In most cases that evaluation consists of a developer or domain expert asking the system questions they expect it to handle, reviewing the outputs qualitatively, and concluding that the system is ready. This is a reasonable way to catch obvious failures during development. It is not a quality assurance method that works in production.

Qualitative assessment at development time misses the failure modes that appear in production: queries the team did not anticipate, documents that extract poorly, retrieval quality that varies across topic areas, and quality degradation that accumulates gradually after deployment. A system that passes a qualitative development review can still be systematically failing a class of user queries that nobody tested.

Building a rigorous evaluation framework is mostly the work of building the evaluation dataset. That is the hard part. The measurement infrastructure that uses it is comparatively straightforward.

What retrieval evaluation actually measures

Retrieval evaluation answers the question: for a given query, does the retrieval system return the chunks that contain the answer? It measures the retrieval layer independently of the generation layer, separating the problem of finding the right content from the problem of generating a good response given that content.

The standard metrics are recall@k (the fraction of relevant chunks that appear in the top k retrieved results), Mean Reciprocal Rank (the reciprocal of the rank at which the first relevant chunk appears, averaged over queries), and Normalised Discounted Cumulative Gain (how well the retrieved list is ordered, with highly relevant chunks ranked above less relevant ones). Each of these measures a different dimension of retrieval quality, and degradation in any of them affects response quality in predictable ways: low recall means the model never sees the content it needs, while poor ranking pushes relevant chunks down the list, where they may fall outside the context passed to the model.
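As a minimal sketch of how these three metrics can be computed for a single query, assuming chunks are identified by string ids and that relevance labels come from the ground truth dataset. The function names and the graded `gains` mapping are illustrative choices, not the API of any particular library.

```python
import math
from typing import Dict, Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def reciprocal_rank(retrieved: Sequence[str], relevant: Iterable[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant_set:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """NDCG@k with graded relevance; chunks missing from `gains` count as 0.

    Assumes `retrieved` contains no duplicate chunk ids.
    """
    dcg = sum(
        gains.get(chunk_id, 0.0) / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_reciprocal_rank(per_query_rr: Sequence[float]) -> float:
    """MRR is the average of the per-query reciprocal ranks."""
    return sum(per_query_rr) / len(per_query_rr) if per_query_rr else 0.0
```

For a retrieved list `["c3", "c1", "c7", "c2"]` with relevant chunks `c1` and `c2`, recall@3 is 0.5 (only `c1` is in the top three) and the reciprocal rank is 0.5 (the first relevant chunk sits at rank 2).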

Retrieval evaluation requires a ground truth dataset: a set of queries paired with the chunks that correctly answer them. Building this dataset means identifying the correct chunks for each query from the actual knowledge base, and it is the bulk of the evaluation work. It cannot be automated without a validated reference system, and validating such a system would itself require the labelled dataset, which makes the problem circular. It has to be done by domain experts who can verify correctness against the source material.

Generation quality: what to measure and how

Once retrieval quality is established, generation evaluation measures whether the model produces good responses given the retrieved context. The dimensions that matter most in production are faithfulness, meaning whether the response accurately reflects the retrieved content without adding unsupported information, and relevance, meaning whether the response actually addresses what the user asked.

Faithfulness is the hallucination metric. A response is unfaithful when it contains claims that are not supported by the retrieved chunks, either because the model confabulated them or because it stretched a partial match too far. Measuring faithfulness requires comparing the response against the retrieved context, which can be done by a secondary model evaluation step or by human review.
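One way to turn this comparison into a number is to split the response into claims and report the fraction that the retrieved context supports. The sketch below assumes a naive sentence split as the claim boundary and takes the support check as a parameter: `is_supported` is a hypothetical callable standing in for the secondary model or the human reviewer, not a real library function.

```python
import re
from typing import Callable, List, Tuple


def split_claims(response: str) -> List[str]:
    """Naive claim splitter: one claim per sentence (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def faithfulness(
    response: str,
    context: str,
    is_supported: Callable[[str, str], bool],
) -> Tuple[float, List[str]]:
    """Return (fraction of supported claims, list of unsupported claims).

    `is_supported(claim, context)` is supplied by the caller, e.g. a
    judge model call or a recorded human label.
    """
    claims = split_claims(response)
    if not claims:
        # An empty response makes no unsupported claims.
        return 1.0, []
    unsupported = [c for c in claims if not is_supported(c, context)]
    return 1.0 - len(unsupported) / len(claims), unsupported
```

Returning the unsupported claims alongside the score keeps the result reviewable: a reviewer can see which sentence lowered the score rather than only that it was lowered.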

Completeness, meaning whether the response addresses all aspects of the query that the retrieved content could support, is a third dimension that matters for knowledge-intensive applications. A response that correctly answers part of a multi-part question, without noting that the other parts were not addressed, has a quality problem that faithfulness and relevance scores will not catch.

Building the evaluation dataset

The evaluation dataset is the most important artefact in RAG quality management. A good dataset is representative, specific, and maintained. It covers the range of query types, topics, and user populations the system will serve. Each query has identified source chunks and a reference answer. The dataset is updated when the knowledge base changes significantly.
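A minimal sketch of what one dataset record and its container might look like, under the assumption that chunks have stable string ids. The class and field names (`EvalExample`, `EvalDataset`, `topic`, `query_type`) are hypothetical; the two helper methods illustrate checking representativeness and spotting labels that a knowledge base change has invalidated.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class EvalExample:
    query: str
    relevant_chunk_ids: Tuple[str, ...]  # source chunks identified by an expert
    reference_answer: str
    topic: str
    query_type: str


@dataclass
class EvalDataset:
    version: str
    examples: List[EvalExample] = field(default_factory=list)

    def coverage(self) -> Counter:
        """Count examples per (topic, query_type) to expose thin areas."""
        return Counter((e.topic, e.query_type) for e in self.examples)

    def stale_examples(self, current_chunk_ids: Iterable[str]) -> List[EvalExample]:
        """Examples whose labelled chunks no longer exist in the knowledge base."""
        current = set(current_chunk_ids)
        return [e for e in self.examples if not set(e.relevant_chunk_ids) <= current]
```

Running `stale_examples` against the current chunk ids after a re-index gives a concrete list of records that need relabelling, which is one way to make "updated when the knowledge base changes" checkable.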

Building it requires domain expertise. The person creating the dataset needs to be able to identify the correct answer to a query from the knowledge base. That requires knowing the material well enough to recognise a correct answer when they see one. This is rarely purely an engineering task.

The evaluation dataset should be treated as a product artefact with the same governance as the knowledge base and the application code. It should be versioned, reviewed when the knowledge base changes, and expanded when new query types are identified in production. A dataset that was representative at launch but has not been updated to reflect six months of production query data is an evaluation instrument measuring the wrong things.

Automated evaluation with LLM judges

Manual evaluation of every production query is not feasible at scale. LLM-based evaluation, where a large language model scores responses against criteria, enables automated evaluation at production volumes. The judge model receives the query, the retrieved context, and the generated response, then produces scores for faithfulness, relevance, and completeness.
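The judge step can be sketched as a prompt builder, a strict parser, and a thin wrapper around the model call. Everything here is an assumption made for illustration: the 1 to 5 scale, the JSON reply format, the prompt wording, and `call_model`, a hypothetical callable that takes a prompt string and returns the judge model's raw text.

```python
import json
from typing import Callable, Dict, Sequence

CRITERIA = ("faithfulness", "relevance", "completeness")


def build_judge_prompt(query: str, context_chunks: Sequence[str], response: str) -> str:
    """Assemble the three inputs the judge needs into one prompt."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, start=1))
    return (
        "Score the RESPONSE from 1 (poor) to 5 (excellent) on faithfulness to "
        "the CONTEXT, relevance to the QUERY, and completeness.\n"
        'Reply with JSON only, for example '
        '{"faithfulness": 4, "relevance": 5, "completeness": 3}.\n\n'
        f"QUERY:\n{query}\n\nCONTEXT:\n{context}\n\nRESPONSE:\n{response}\n"
    )


def parse_judge_scores(raw: str) -> Dict[str, int]:
    """Parse and validate the judge reply; raise ValueError on anything odd."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name}: {value!r}")
        scores[name] = value
    return scores


def judge(
    query: str,
    context_chunks: Sequence[str],
    response: str,
    call_model: Callable[[str], str],
) -> Dict[str, int]:
    """Run one judgement; `call_model` is the caller's model client."""
    return parse_judge_scores(call_model(build_judge_prompt(query, context_chunks, response)))
```

Rejecting malformed replies rather than guessing at them keeps bad judge output from silently entering the metrics; failed parses can be counted and retried separately.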

LLM judges are not perfect evaluators. They have systematic biases, including favouring longer and more confident responses. They can be misled by plausible but incorrect content, and they are not reliable on domain-specific factual questions unless they have relevant domain knowledge. They are, however, substantially better than no evaluation, and they scale in ways that human review does not.

The right approach is calibration: measuring the agreement between LLM judge scores and human expert scores on a sample of queries, understanding where the judge is reliable and where it is not, and using human review for the categories where LLM evaluation is weakest. A calibrated automated evaluation system, supplemented by targeted human review, provides coverage that neither approach achieves alone.
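As a rough sketch of the calibration step, assuming each sampled query has a category, a judge label, and a human expert label (for example pass or fail). The function names and the 0.8 threshold are illustrative. Raw agreement does not correct for chance agreement; a chance-corrected statistic such as Cohen's kappa is a common alternative.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Sample = Tuple[str, Hashable, Hashable]  # (category, judge_label, human_label)


def agreement_by_category(samples: Iterable[Sample]) -> Dict[str, float]:
    """Share of samples in each category where judge and human agree."""
    totals: Dict[str, int] = defaultdict(int)
    matches: Dict[str, int] = defaultdict(int)
    for category, judge_label, human_label in samples:
        totals[category] += 1
        matches[category] += int(judge_label == human_label)
    return {category: matches[category] / totals[category] for category in totals}


def categories_needing_human_review(
    agreement: Dict[str, float], threshold: float = 0.8
) -> List[str]:
    """Categories where the judge agrees with experts too rarely to trust."""
    return sorted(c for c, rate in agreement.items() if rate < threshold)
```

The output of the second function is the routing rule: queries in those categories go to human reviewers, and the rest can rely on the judge scores.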

Evaluation as a continuous process

Evaluation at launch is necessary but not sufficient. RAG system quality can degrade after launch through several mechanisms: model updates (by the provider or through a version change) that alter generation behaviour; knowledge base changes that introduce documents that extract or chunk poorly; query distribution shift as users discover new use cases; and retrieval quality degradation as the index grows and retrieval noise increases.

Running the offline evaluation pipeline regularly catches quality regressions before they accumulate. It should run after every significant change and on a periodic schedule regardless of changes. Tracking the metrics over time makes trends visible: a gradual decline in recall@5 over three months is invisible in a single evaluation but apparent in a time series.
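To illustrate why the time series matters, the sketch below fits a least-squares slope to a metric's history and compares it with the largest drop between consecutive runs. It assumes evenly spaced runs and a plain list of recall@5 values; the function names are hypothetical.

```python
from typing import Sequence


def trend_per_run(values: Sequence[float]) -> float:
    """Least-squares slope of the metric against run index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def largest_single_drop(values: Sequence[float]) -> float:
    """Biggest decrease between two consecutive runs (0.0 if none)."""
    drops = [a - b for a, b in zip(values, values[1:])]
    return max(max(drops, default=0.0), 0.0)
```

With a history such as `[0.90, 0.89, 0.88, 0.87, 0.86, 0.85]`, no single run drops by more than about 0.01, so a per-run alert set at 0.02 never fires, yet the fitted slope of roughly -0.01 per run adds up to a five point decline across the series.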

The most important discipline is running evaluation before releasing changes, not after. A retrieval pipeline change that improves performance on the queries that motivated the change but degrades performance on others is visible in the evaluation dataset before deployment. It is visible in user complaints after.
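A pre-release check of this kind can be sketched as a comparison of per-category scores between the current pipeline and the candidate. The function name, the category keys, and the 0.02 tolerance are assumptions for illustration; the scores would come from running both pipelines over the evaluation dataset.

```python
from typing import Dict, Optional, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Categories where the candidate scores worse than baseline by more
    than `tolerance`, or where the candidate has no score at all.

    Returns {category: (baseline_score, candidate_score_or_None)}.
    """
    regressions = {}
    for category, base_score in baseline.items():
        cand_score = candidate.get(category)
        if cand_score is None or base_score - cand_score > tolerance:
            regressions[category] = (base_score, cand_score)
    return regressions


# Hypothetical recall@5 by query category, before and after a change.
baseline = {"pricing": 0.80, "returns": 0.85}
candidate = {"pricing": 0.90, "returns": 0.78}

# The average improves (0.825 to 0.84) while "returns" regresses.
blocked = find_regressions(baseline, candidate)
```

Comparing per category rather than on the overall average is the design choice that matters here: in the example the average rises even though one category gets clearly worse, which is exactly the case an aggregate score would hide.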

Measurement is what separates RAG improvement from RAG guessing

Teams that have a rigorous evaluation framework can improve their RAG systems deliberately: identifying where quality is weakest, making targeted changes, and measuring whether those changes improved things. Teams that do not evaluate rigorously make changes and hope for the best.

The investment in evaluation infrastructure is also the investment that makes the system auditable. A system that can demonstrate its quality with metrics, over time, against defined criteria is a system that can be trusted, not merely used.