Production RAG: Beyond the Chatbot Demo

Retrieval-augmented generation (RAG) is genuinely useful. For organisations with large bodies of internal knowledge, such as policy documents, technical manuals, case notes, and contracts, the ability to ask questions in natural language and get answers grounded in the actual documents is a meaningful capability. The proof of concept is relatively quick to build, and when it works, it can be impressive.

The production system is a different problem.

A production RAG system has to handle documents it was not tuned on, queries it was not tested for, and users who do not know what the system knows. It has to stay accurate as the knowledge base changes. It has to enforce access controls so that one user cannot inadvertently retrieve another user's documents. It has to be cheap enough to operate at scale. It also has to be good enough that users trust it, which means it has to be honest about when it does not know something rather than generating a plausible answer that is wrong.

Each of those requirements is solvable. But each one involves design decisions that do not appear in the tutorials, and getting them wrong creates problems that compound.

Document ingestion is harder than it looks

Every RAG system starts with ingestion: getting your documents into a form the retrieval system can search. In a tutorial, this means loading a PDF and splitting it into chunks. In production, it means processing a corpus that contains PDFs scanned at varying quality, Word documents with complex formatting, Excel files that encode information spatially in ways that lose meaning when read linearly, HTML exports with navigation menus embedded in the content, and files with inconsistent naming, duplicates, and versions that contradict each other.

Each of these document types requires different handling. Scanned PDFs need OCR, and the quality of the OCR determines the quality of everything downstream. Tabular data in Excel or PDFs needs a parser that understands that the value in a cell derives its meaning from its row and column headers. Extracting it as plain text loses the context that makes it answerable.

Beyond format handling, the ingestion pipeline needs to be reliable and repeatable. Documents change. New documents are added. Old documents are superseded. A production ingestion pipeline tracks what has been processed, detects changes, handles failures without corrupting the index, and provides a clear view of what the retrieval system currently knows.
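
As a minimal sketch of the "tracks what has been processed, detects changes" part, one common approach is to keep a manifest of content hashes and compare it with the current state of the corpus on each run. The names here (plan_ingestion, the manifest layout, the document ids) are illustrative assumptions, not part of any particular framework.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Stable fingerprint of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_ingestion(current: dict[str, bytes], manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the corpus as it is now with the hashes recorded at the last run.

    `current` maps a (hypothetical) document id to its raw bytes.
    `manifest` maps document id to the hash stored when it was last ingested.
    """
    plan: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": [], "deleted": []}
    for doc_id, data in current.items():
        digest = content_hash(data)
        if doc_id not in manifest:
            plan["new"].append(doc_id)
        elif manifest[doc_id] != digest:
            plan["changed"].append(doc_id)
        else:
            plan["unchanged"].append(doc_id)
    # Anything in the manifest that no longer exists must be removed from the index.
    plan["deleted"] = [doc_id for doc_id in manifest if doc_id not in current]
    return plan
```

In a setup like this, the manifest would only be updated after a document has been successfully indexed, so that a failed run is retried rather than silently skipped.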

The ingestion pipeline is not the exciting part of a RAG system. It is, in our experience, the part that determines whether the retrieval quality is actually usable.

Document Ingestion Pipelines for RAG: Getting the Foundation Right →

Chunking strategy and retrieval quality

Once documents are ingested, they need to be split into chunks for embedding and retrieval. The way you chunk determines what the system can and cannot answer accurately.

Fixed-size chunking, where text is split every N tokens with an overlap of M tokens, is simple and often used in tutorials because it requires no document understanding. It also produces chunks that split sentences in the middle, separate headings from the content they introduce, and fragment tables or lists in ways that make the individual chunks nearly meaningless.
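
To make the failure mode concrete, here is a small sketch of splitting every N tokens with an overlap of M. Whitespace-separated words stand in for model tokens, which is a simplifying assumption; a real pipeline would use the tokenizer of its embedding model.

```python
def fixed_size_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


text = ("Refunds are processed within 14 days. "
        "Requests after that window need manager approval.")
for chunk in fixed_size_chunks(text.split(), size=6, overlap=2):
    print(" ".join(chunk))
```

With these settings the middle chunk starts inside the first sentence and ends inside the second, so neither rule is stated completely in it.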

The production alternatives are more work but substantially better: semantic chunking that splits at natural boundaries, hierarchical chunking that maintains parent-child relationships so that a retrieved chunk can be expanded to include surrounding context, and document-structure-aware chunking that treats a heading and its content as a single unit.

The right chunking strategy depends on the document types and the kinds of queries you are handling. A corpus of short policy documents has different characteristics from a corpus of long technical manuals. A system expected to answer specific factual lookups has different requirements from one expected to synthesise across multiple documents. There is no universal answer, and defaulting to fixed-size chunking because it is the first option in the tutorial is a reliable way to get mediocre retrieval quality.
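
A rough sketch of structure-aware chunking follows. It assumes the ingestion step has already produced text in which headings are marked with leading "#" characters (a hypothetical intermediate format, not something every parser emits). Each chunk keeps its heading path, which is the parent information a hierarchical scheme would use to expand context.

```python
def chunk_by_headings(text: str) -> list[dict]:
    """Group each heading with the body text beneath it and record the heading path."""
    chunks: list[dict] = []
    path: list[str] = []   # current heading trail, e.g. ["Policy", "4.2 Thresholds"]
    body: list[str] = []   # lines collected under the current heading

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"heading_path": list(path), "text": content})
        body.clear()

    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            flush()  # close the section that was open before this heading
            level = len(stripped) - len(stripped.lstrip("#"))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(stripped[level:].strip())
        else:
            body.append(line)
    flush()
    return chunks
```

A long section would still need to be split further, but splitting inside a section while carrying the same heading path keeps each piece interpretable on its own.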

Chunking Strategies for Retrieval Quality: What the Tutorials Don't Tell You →

When vector search is not enough

Vector similarity search, where the system embeds a query, embeds documents, and finds the nearest neighbours by cosine distance, works well for queries that are semantically related to the content you are searching for. It is less reliable for specific lookups.
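
For readers who have not looked inside a vector search, the core operation can be sketched in a few lines. The vectors below are hand-written toy values rather than real embeddings, and a production system would use an approximate nearest-neighbour index instead of scanning every entry.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k chunk ids whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```

Nothing in this ranking knows about exact identifiers such as a section number, which is why a lookup query can miss the passage it names.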

If a user asks about "section 4.2 of the procurement policy," they are not asking a semantic question. They are asking for a specific thing. A vector search will return the most semantically similar content, which may or may not include section 4.2 of the procurement policy. Keyword search, using BM25 or similar, would find it reliably. The most robust production retrieval systems use both, combining vector and keyword search and reranking the combined results.

Hybrid retrieval is more complex than pure vector search. It requires a retrieval architecture that can run both query types, combine the result sets, and apply a reranker that scores each candidate against the original query. That reranker adds latency and cost, which need to be managed. But for corpora where users need to find specific things as well as semantically similar things, which is most real enterprise knowledge bases, the quality improvement is significant enough to be worth it.
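
One widely used way to combine the two result sets is reciprocal rank fusion (RRF), sketched below. This is offered as an illustration of the "combine" step only; it is not necessarily the method any given system uses, and the reranker that scores each candidate against the query is not shown. The constant k = 60 is the conventional default, and the chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one list.

    Each chunk scores 1 / (k + rank) for every list it appears in, so chunks
    ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


vector_hits = ["c1", "c2", "c3"]    # hypothetical output of the vector search
keyword_hits = ["c4", "c2", "c1"]   # hypothetical output of BM25
candidates = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because RRF uses only ranks, it avoids having to normalise BM25 scores against cosine similarities, which live on different scales. The fused list is what would then be passed to the reranker.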

Hybrid Search in RAG: When Vector Search Alone Isn't Enough →

Evaluation: knowing whether the system is actually working

RAG evaluation is difficult, and most teams underinvest in it. The usual approach is: build the system, try some test queries, decide it seems good enough, ship it. The system's actual performance on the broader distribution of real queries is largely unknown until users encounter it.

Production RAG evaluation requires two separate pipelines. Retrieval evaluation asks whether the retrieved chunks are relevant to the query. Are we finding the right documents? Generation evaluation asks whether the generated answer is accurate, complete, and properly grounded in the retrieved content. Is the LLM using the documents correctly rather than hallucinating alongside them?

These two components fail for different reasons and need to be measured separately. A strong retrieval system paired with a generation step that ignores the retrieved documents in favour of training data produces inaccurate answers. A generation step that faithfully uses retrieved content, paired with poor retrieval, also produces inaccurate answers. Measuring only end-to-end answer quality does not tell you which component is failing.

The evaluation infrastructure is easiest to build before the system goes live, because it requires a ground truth dataset: query and answer pairs with known correct answers against your specific corpus. Building that dataset is work, but it is the only way to know whether the system is actually working rather than appearing to work on the queries you tried.
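
On the retrieval side, a ground truth dataset makes standard metrics straightforward to compute. The sketch below calculates recall@k and mean reciprocal rank; the dataset layout and the `retrieve` callable are assumptions made for illustration, and generation quality would need its own, separate checks.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(ground_truth: list[dict], retrieve: Callable[[str], list[str]], k: int = 5) -> dict[str, float]:
    """Average both metrics over examples shaped like {"query": str, "relevant": set of chunk ids}."""
    if not ground_truth:
        raise ValueError("ground truth dataset is empty")
    recalls, rranks = [], []
    for example in ground_truth:
        retrieved = retrieve(example["query"])
        recalls.append(recall_at_k(retrieved, example["relevant"], k))
        rranks.append(reciprocal_rank(retrieved, example["relevant"]))
    n = len(ground_truth)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(rranks) / n}
```

Running this on every change to chunking, embedding model, or retrieval settings gives a number to compare, rather than an impression.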

Measuring RAG Quality: Retrieval Evaluation Beyond Vibes →

Multi-tenancy and data isolation

In any RAG system where multiple users or client organisations access a shared vector store, data isolation is a hard requirement. A retrieval that surfaces one client's documents in another client's query results is a serious incident, however unlikely it looks in a test environment.

Vector databases do not automatically enforce row-level security. The mechanism for filtering results to a specific tenant needs to be designed and implemented at the retrieval layer, tested explicitly, and maintained as the system evolves. Metadata filters applied at query time are the most common approach: tagging every chunk with a tenant identifier and filtering on it. This needs to be implemented consistently across the entire ingestion pipeline, or you end up with chunks that are either untagged or incorrectly tagged.
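
The shape of that design can be sketched with an in-memory stand-in for the vector store. The class and field names are hypothetical, and the term-count scoring is only a placeholder for real similarity search. The point is that the tenant filter is a required argument of the search call and that untagged chunks are rejected at ingestion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    text: str


class TenantScopedIndex:
    """Toy index in which every write and every read must name a tenant."""

    def __init__(self) -> None:
        self._chunks: list[Chunk] = []

    def add(self, chunk: Chunk) -> None:
        if not chunk.tenant_id:
            raise ValueError(f"chunk {chunk.chunk_id!r} has no tenant_id")
        self._chunks.append(chunk)

    def search(self, tenant_id: str, query: str, k: int = 5) -> list[Chunk]:
        if not tenant_id:
            raise ValueError("tenant_id is required for every search")
        # Filter first, so another tenant's chunks are never scoring candidates.
        candidates = [c for c in self._chunks if c.tenant_id == tenant_id]
        terms = query.lower().split()
        # Placeholder relevance: number of query terms present in the chunk text.
        candidates.sort(key=lambda c: -sum(term in c.text.lower() for term in terms))
        return candidates[:k]
```

An explicit test that queries as one tenant and asserts that no result belongs to another is cheap to write and worth keeping in the regression suite.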

There are cases where separate vector stores per tenant are the appropriate architecture: better isolation, simpler access control, and in some cases better retrieval quality because the retrieval space is smaller. This comes at higher operational overhead and cost, which makes it the right choice for some implementations and wrong for others. The decision should be made explicitly, with awareness of the tradeoffs, not defaulted to based on what happened to be easiest to implement first.

Multi Tenant RAG: Enforcing Data Isolation When Multiple Clients Share a System →

Grounding, citation, and accountability

One of the core promises of RAG is that answers are grounded in specific source documents, which means those sources can be shown, verified, and held accountable. Delivering on that promise in a way that is operationally useful requires deliberate engineering, not simply a list of source document names at the bottom of an answer.

Useful citation means showing specific passages, sections, or page numbers, rather than only the document name. It means giving users a direct way to navigate to the source. In regulated contexts, it means ensuring the cited content is exactly what the model used, not a later rationalisation.

The architecture implications are significant. Chunk metadata needs to carry enough information to construct a useful citation. The generation step needs to be prompted and constrained to cite specific retrieved content rather than paraphrase freely. The UI needs to surface citations in a form that users will actually check, not a tooltip that nobody opens.
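
As an illustration of what "enough information" might look like, the sketch below attaches section and page metadata to each chunk and checks that a passage quoted in an answer really appears in the chunk it cites. The field names and the verbatim-match check are assumptions chosen for simplicity; paraphrased claims would need a more tolerant check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceChunk:
    chunk_id: str
    document_title: str
    section: str
    page: Optional[int]  # None when the source format has no pages
    text: str


def format_citation(chunk: SourceChunk) -> str:
    """Build a citation a user can follow: document, section, and page if known."""
    page = f", p. {chunk.page}" if chunk.page is not None else ""
    return f"{chunk.document_title}, {chunk.section}{page}"


def _normalise(s: str) -> str:
    """Lower-case and collapse whitespace so line wrapping does not break matching."""
    return " ".join(s.split()).lower()


def quote_is_grounded(quote: str, chunk: SourceChunk) -> bool:
    """True if the quoted passage occurs verbatim in the cited chunk."""
    return _normalise(quote) in _normalise(chunk.text)
```

A check like this can run after generation and before display, so that an answer whose citation does not match its source is flagged rather than shown as verified.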

Systems that do this well tend to earn sustained trust. Users develop confidence in the outputs because they can verify them, and occasionally do. Systems that do it badly, where citations are present but not useful, tend to erode trust the moment someone discovers the citation does not match the answer.

Keeping the knowledge base current

A RAG system is only as good as the knowledge it has access to. Documents change, new documents are added, old documents are superseded, and a system that does not track these changes will progressively diverge from reality.

The production ingestion pipeline needs a strategy for each of these cases. New documents should be detected and ingested automatically or on a defined schedule. Changed documents should trigger re-ingestion of only the affected content, not reprocessing of the entire corpus. Deleted or superseded documents need to have their embeddings removed. A system that continues returning content from retracted policies or outdated versions is a liability.
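
One way to limit the work to the affected content is to key each chunk by a hash of its text and compare the keys stored for a document with the keys of its freshly chunked version. The sketch below assumes the index can report the chunk keys it holds per document; the function names are hypothetical.

```python
import hashlib


def chunk_key(doc_id: str, chunk_text: str) -> str:
    """Deterministic key for a chunk, derived from its document id and exact text."""
    return hashlib.sha256(f"{doc_id}\x00{chunk_text}".encode("utf-8")).hexdigest()


def diff_chunks(doc_id: str, stored_keys: set[str], new_chunks: list[str]) -> tuple[dict[str, str], set[str]]:
    """Work out which chunks to embed and which stored embeddings to delete.

    Returns (to_embed, to_delete): `to_embed` maps key -> text for chunks not yet
    in the index; `to_delete` holds keys whose text no longer exists in the document.
    """
    new_by_key = {chunk_key(doc_id, text): text for text in new_chunks}
    to_embed = {key: text for key, text in new_by_key.items() if key not in stored_keys}
    to_delete = stored_keys - set(new_by_key)
    return to_embed, to_delete
```

Passing an empty `new_chunks` list for a deleted or superseded document yields every stored key in `to_delete`, which covers the removal case with the same code path.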

Versioning adds complexity but is often necessary in regulated contexts: the ability to query the state of the knowledge base as it existed at a specific point in time. Some document types, including policies, contracts, and regulatory guidance, change in ways that matter for historical records, and the system needs to distinguish between "what does this document say now" and "what did this document say when this decision was made."
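
A simple way to support both questions is to store a validity interval with each document version and filter on it at query time. The sketch below assumes date-level granularity and a half-open interval (valid from a date, up to but not including the date it was superseded); both choices are illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DocumentVersion:
    doc_id: str
    version: int
    valid_from: date
    valid_to: Optional[date]  # exclusive end date; None means still current
    text: str


def versions_as_of(versions: list[DocumentVersion], as_of: date) -> list[DocumentVersion]:
    """Return the versions that were in force on the given date."""
    return [
        v for v in versions
        if v.valid_from <= as_of and (v.valid_to is None or as_of < v.valid_to)
    ]
```

In a vector store, the same two fields would typically be carried as chunk metadata so that the date filter is applied alongside any tenant filter before similarity ranking.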

Private RAG: keeping sensitive data off the public cloud

For organisations with strict data residency requirements or handling genuinely sensitive content, such as legal privilege, medical records, or commercially sensitive IP, sending document content and queries to an external API is not an option, regardless of the API provider's attestations about data handling.

Private RAG means running the full pipeline within your own infrastructure: embedding model, LLM, and vector store. This was operationally difficult two years ago and is now achievable by organisations that are not large enough to have their own AI research teams. Self-hosted models at the sizes needed for most enterprise RAG use cases can run on a single high-memory GPU server. Open-weight embedding models can match or exceed the performance of closed API equivalents on domain retrieval tasks. Self-hosted vector databases handle the scale required by most enterprise knowledge bases without the complexity of managing a multi-node cluster.

The tradeoffs are real: higher upfront infrastructure cost, operational responsibility for model serving, and the need to stay current on model improvements without relying on a vendor's update cycle. But for applications where the alternative is not building the system at all, private deployment is often the only viable path.

The gap between demo and production

A RAG demo involves a handful of clean documents, a set of queries you chose because you knew they would work, and an LLM that is allowed to use its full training data to fill gaps in the retrieved content. It is a useful tool for establishing feasibility. It is not evidence that a production system will work.

The production system has thousands of documents, some of which are low quality. It has queries from users who do not know what the system knows. It has access controls that need to be enforced correctly every time. It has a cost model that needs to stay within budget. It has an evaluation pipeline that tells someone when quality drifts.

Getting from demo to production is a significant engineering effort. Not because the technology is exotic, but because every component of the pipeline needs to be production quality, and most of the work is in the parts that do not appear in the demo. The organisations that do it well tend to be the ones who understood that from the start, rather than discovering it after the demo had set expectations.