RAG Architecture Decisions That Actually Matter in Production

A RAG prototype is quick to build and easy to show. The architectural decisions that determine whether it survives production are often made casually during that prototype phase. That is usually not because engineers are careless, but because the constraints of production are not visible in a prototype environment.

By the time those decisions become visibly expensive, they are already embedded in the system in ways that are hard to change without significant rework. Retrieval quality drops on the real corpus. Costs exceed the budget projections. Tenant isolation fails a security review.

Understanding which decisions actually matter before they are made changes what kind of system gets built.

The decisions that determine your retrieval quality

The embedding model is the first consequential decision. Different embedding models have different domain strengths: a model trained on general web text handles conversational queries well but may perform poorly on technical documentation, legal language, or domain terminology. Testing embedding models against a representative sample of your actual corpus and actual query types is the only way to know which one fits. Generic benchmarks are not enough.
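One way to run that comparison is to score each candidate model on recall@k over a small labelled set drawn from your own corpus. The sketch below is illustrative only: `Embedder` is a hypothetical wrapper around whichever model is under test, and `toy_embed`, `VOCAB`, and the sample documents are stand-ins so the example runs on its own.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Embedder = Callable[[str], Vector]  # hypothetical: wraps whichever model you are testing


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed: Embedder, docs: dict[str, str],
                labelled_queries: list[tuple[str, str]], k: int = 3) -> float:
    """Fraction of queries whose known-relevant doc id appears in the top k."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in labelled_queries:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(labelled_queries)


# Toy stand-in embedder for demonstration only: word counts over a tiny vocabulary.
VOCAB = ["refund", "invoice", "timeout", "retry", "contract", "clause"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


docs = {
    "billing": "refund policy and invoice disputes",
    "network": "timeout and retry behaviour for the client",
    "legal": "contract termination clause",
}
queries = [
    ("how do I get a refund", "billing"),
    ("client retry on timeout", "network"),
    ("termination clause", "legal"),
]

score = recall_at_k(toy_embed, docs, queries, k=1)
print(f"recall@1 = {score:.2f}")
```

Swapping `toy_embed` for a wrapper around each real candidate model, and the toy documents for a sample of the real corpus and real queries, gives one comparable number per model.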

The choice between a single large vector index and multiple smaller ones matters at scale and in systems serving multiple tenants. A single large index is simpler to operate and query, but retrieval quality can degrade as the index grows because more candidates mean more noise in the results. Multiple indices per domain or per tenant can improve quality but add operational complexity and cost.

The reranking layer is where retrieval quality can be meaningfully improved at the cost of added latency and compute. A reranker scores each retrieved candidate against the original query using a model optimised for relevance scoring. Rerankers of this kind typically evaluate the query and the candidate together instead of comparing two precomputed vectors, so they generally rank more accurately than cosine similarity alone, but they are slower. Whether the quality improvement justifies the added latency depends on the use case, and that tradeoff should be made explicitly.
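A minimal sketch of making that tradeoff explicit: cap how many first-stage candidates are reranked and measure the time the step adds. `RelevanceScorer` is a hypothetical signature for whatever reranking model is used, and `overlap_scorer` is a toy stand-in so the example runs without a model.

```python
import time
from typing import Callable

# Hypothetical signature: returns a relevance score for a (query, passage) pair.
RelevanceScorer = Callable[[str, str], float]


def rerank(query: str, candidates: list[str], score: RelevanceScorer,
           top_n: int = 3, max_candidates: int = 20) -> tuple[list[str], float]:
    """Rerank at most max_candidates first-stage results; return top_n and elapsed seconds."""
    start = time.perf_counter()
    pool = candidates[:max_candidates]  # the cap bounds added latency and compute
    ranked = sorted(pool, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:top_n], time.perf_counter() - start


def overlap_scorer(query: str, passage: str) -> float:
    """Toy scorer for the demo only: fraction of query words present in the passage."""
    q_words = set(query.lower().split())
    return len(q_words & set(passage.lower().split())) / len(q_words)


first_stage = [
    "reset your password from the login page",
    "api rate limits reset every minute",
    "how to rotate an api key",
]
top, elapsed = rerank("rotate api key", first_stage, overlap_scorer, top_n=1)
print(top, f"{elapsed * 1000:.3f} ms")
```

Logging the elapsed time next to an offline quality score makes it possible to decide, per use case, how large `max_candidates` can be before the latency cost outweighs the gain.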

The generation layer: where most production failures happen

The prompt that works in development stops working reliably in production for two predictable reasons: the retrieved content differs from what was used in development, and user queries differ from the test queries. Both of these are guaranteed to happen.

Production prompt engineering requires designing prompts that hold up when retrieval quality varies. The prompt needs to tell the model clearly how to handle cases where the retrieved content is irrelevant, insufficient, or contradictory. A prompt tuned only on strong retrieval gives the model no guidance when retrieval is weak, so it may answer confidently from poor context instead of degrading in a controlled, visible way.
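As an illustration, a prompt builder can spell out each of those cases. The wording of `SYSTEM_RULES` and the `build_prompt` helper are assumptions made for this sketch, not a recommended standard; the exact instructions need to be tuned against your own model and corpus.

```python
# Illustrative instructions only; tune the wording for your model and domain.
SYSTEM_RULES = """Answer using only the numbered context passages below.
- If the passages do not address the question, reply exactly: "I don't have enough information to answer that."
- If the passages only partly answer it, give the supported part and state what is missing.
- If passages contradict each other, describe the disagreement and cite both passage numbers.
Cite passage numbers for every claim."""


def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt whose instructions cover empty, weak, and conflicting retrieval."""
    if passages:
        context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    else:
        context = "(no passages retrieved)"
    return f"{SYSTEM_RULES}\n\nContext:\n{context}\n\nQuestion: {question}"


print(build_prompt("What is the refund window?", ["Refunds are accepted within 30 days."]))
```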

Prompt regression is a specific risk when the system evolves. A change to the retrieval pipeline changes what content appears in the context window. A model update changes how the model responds to the same prompt. Either of these can silently degrade output quality. Testing the generation layer against a fixed evaluation set after any meaningful change is the only way to catch this reliably.
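A regression gate of this kind can be sketched in a few lines. Everything named here is hypothetical: `AnswerFn` stands for the full retrieval-plus-generation pipeline, `GradeFn` for whatever scoring method you trust, and the two sample pipelines simulate behaviour before and after a change.

```python
from typing import Callable

# Hypothetical types: AnswerFn runs the full pipeline; GradeFn scores an answer in [0, 1].
AnswerFn = Callable[[str], str]
GradeFn = Callable[[str, str], float]


def regression_check(answer_fn: AnswerFn, grade_fn: GradeFn,
                     eval_set: list[tuple[str, str]],
                     baseline: float, tolerance: float = 0.02) -> tuple[bool, float]:
    """Return (passed, mean_score); fails when the mean drops below baseline - tolerance."""
    scores = [grade_fn(answer_fn(q), expected) for q, expected in eval_set]
    mean = sum(scores) / len(scores)
    return mean >= baseline - tolerance, mean


def contains_grade(answer: str, expected: str) -> float:
    """Toy grader for the demo: 1.0 if the expected string appears in the answer."""
    return 1.0 if expected.lower() in answer.lower() else 0.0


EVAL_SET = [
    ("What is the refund window?", "30 days"),
    ("Which plan includes SSO?", "Enterprise"),
]


def old_pipeline(question: str) -> str:
    """Stand-in for the system before a change."""
    if "refund" in question:
        return "Refunds are accepted within 30 days."
    return "SSO is on the Enterprise plan."


def new_pipeline(question: str) -> str:
    """Stand-in for the system after a change that lost a detail."""
    if "refund" in question:
        return "Refunds are accepted within 30 days."
    return "SSO is available."


passed_old, score_old = regression_check(old_pipeline, contains_grade, EVAL_SET, baseline=1.0)
passed_new, score_new = regression_check(new_pipeline, contains_grade, EVAL_SET, baseline=1.0)
print(passed_old, score_old, passed_new, score_new)
```

Running a check like this in CI after a retrieval change, prompt edit, or model update turns a silent quality drop into a failed build.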

The evaluation architecture: the part most teams skip

A production RAG system without evaluation is a system you are running blind. You can see uptime and latency. You cannot see whether the answers are good.

The evaluation architecture has two components: an offline evaluation pipeline and an online evaluation mechanism. The offline pipeline runs a fixed set of queries with known correct answers against the system, producing scores that can be tracked over time and compared across versions. The online mechanism samples production traffic and routes it for manual review or automated scoring, providing a continuous signal on live quality.
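For the online half, one common approach is to sample a fixed fraction of requests deterministically, so the same request is always either in or out of the review set. The sketch below assumes each request has a stable `request_id`; `review_queue` is a stand-in for wherever sampled traffic is actually sent.

```python
import hashlib


def should_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of requests for quality review."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < sample_rate


review_queue: list[dict[str, str]] = []  # stand-in for a real review or scoring queue


def record_interaction(request_id: str, query: str, answer: str,
                       sample_rate: float = 0.05) -> None:
    """Queue the interaction for manual review or automated scoring if it is sampled."""
    if should_sample(request_id, sample_rate):
        review_queue.append({"id": request_id, "query": query, "answer": answer})


for i in range(10_000):
    record_interaction(f"req-{i}", "example query", "example answer")

print(f"sampled {len(review_queue)} of 10000 requests")
```

Hashing the identifier instead of calling a random number generator keeps the decision reproducible across services and retries.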

Building this after the system is live is substantially harder than building it before launch, because the evaluation dataset is the bulk of the work. That dataset needs representative queries with verified answers against your specific corpus. The data that goes into it is best collected and checked during development, while the correct answers are still easy to establish and verify.

Measuring RAG Quality: Retrieval Evaluation Beyond Vibes →

The decisions you cannot easily reverse

Some architectural choices are easy to change later. Others accumulate dependencies that make change expensive.

The vector store is harder to migrate than it appears. Moving an index between vector databases means exporting and reloading every vector along with its metadata, and if the stored vectors cannot be exported, or the embedding model changes at the same time, it means embedding the entire corpus again. Either way, the time and cost grow with corpus size. Choosing a vector store based solely on what is easy to get started with, then discovering its access control model or scaling characteristics do not match production requirements, is an expensive mistake.
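When a migration does involve embedding the corpus again, a rough estimate shows how the bill scales. This is back-of-envelope arithmetic under stated assumptions: the price and throughput figures below are placeholders, not real vendor numbers, and `estimate_reembedding` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class ReembedEstimate:
    total_tokens: int
    cost_usd: float
    hours: float


def estimate_reembedding(num_chunks: int, avg_tokens_per_chunk: int,
                         usd_per_million_tokens: float,
                         tokens_per_second: float) -> ReembedEstimate:
    """Linear estimate: tokens drive both the cost and the wall-clock time."""
    total = num_chunks * avg_tokens_per_chunk
    return ReembedEstimate(
        total_tokens=total,
        cost_usd=total / 1_000_000 * usd_per_million_tokens,
        hours=total / tokens_per_second / 3600,
    )


# Placeholder inputs, not real pricing or throughput figures.
small = estimate_reembedding(100_000, 500, usd_per_million_tokens=0.10,
                             tokens_per_second=50_000)
large = estimate_reembedding(50_000_000, 500, usd_per_million_tokens=0.10,
                             tokens_per_second=50_000)

for label, est in (("small", small), ("large", large)):
    print(f"{label}: {est.total_tokens:,} tokens, ${est.cost_usd:,.2f}, {est.hours:.1f} h")
```

The estimate covers only the embedding calls. Re-indexing, validation, and running two stores in parallel during cutover add to it.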

Tenant isolation architecture is the other decision that is very hard to reverse. Building a system where all tenants share an index with metadata filtering is different from building one where tenants have separate indices. Converting between them requires ingesting all content again and reworking all the retrieval code. This decision should be made based on the strictest security and isolation requirements, not the average case.
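The difference between the two designs shows up in the shape of the retrieval code. In this simplified sketch a substring match stands in for vector search, and `SharedIndex`, `PerTenantIndices`, and the tenant names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    tenant_id: str
    text: str


@dataclass
class SharedIndex:
    """One index for all tenants; isolation depends on every query applying the filter."""
    chunks: list[Chunk] = field(default_factory=list)

    def search(self, tenant_id: str, term: str) -> list[str]:
        # Omitting this tenant_id check anywhere in the code base leaks data across tenants.
        return [c.text for c in self.chunks
                if c.tenant_id == tenant_id and term in c.text]


@dataclass
class PerTenantIndices:
    """One index per tenant; a query only ever reaches the index it was routed to."""
    indices: dict[str, list[str]] = field(default_factory=dict)

    def add(self, tenant_id: str, text: str) -> None:
        self.indices.setdefault(tenant_id, []).append(text)

    def search(self, tenant_id: str, term: str) -> list[str]:
        return [t for t in self.indices.get(tenant_id, []) if term in t]


shared = SharedIndex([
    Chunk("acme", "acme pricing sheet"),
    Chunk("globex", "globex pricing sheet"),
])
separate = PerTenantIndices()
separate.add("acme", "acme pricing sheet")
separate.add("globex", "globex pricing sheet")

shared_hits = shared.search("acme", "pricing")
separate_hits = separate.search("acme", "pricing")
print(shared_hits, separate_hits)
```

Both return the same results here. What differs is the failure mode: in the shared design a single missing filter exposes every tenant's data, while in the separate design the data layout and ingestion path are different, which is why converting between them means ingesting everything again.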

Multi Tenant RAG: Enforcing Data Isolation When Multiple Clients Share a System →

Architecture decisions compound

Good architectural decisions in a RAG system are those that can be revisited cheaply when requirements change. Bad ones are those that accumulate dependencies until changing them requires rebuilding the system.

Most of the decisions in this piece fall into one category or the other, and the categorisation is not always intuitive. Treating them as consequential from the start is the difference between a system that evolves gracefully and one that needs to be rebuilt once it has become genuinely useful.