Vector search is the foundation of most RAG systems. It works by finding chunks of content that are semantically similar to the query: content that means the same thing even if it uses different words. This is genuinely useful for the majority of retrieval tasks. It is also systematically poor at a specific and common class of queries where the user is looking for exact terminology, specific identifiers, proper nouns, or numerical values.
A RAG system built on pure vector search will handle 'what is our refund policy for software licences?' well and handle 'find the clause about termination in contract MSA-2024-0447' poorly. The latter query has a correct answer in the knowledge base, but the answer depends on exact match rather than semantic similarity. Vector search is not designed for exact match.
Hybrid search combines vector search with keyword search to close this gap. Understanding when it matters, and how to implement it without overengineering, determines whether it is the right investment for a specific system.
What vector search is bad at
Vector search converts both the query and the document chunks into numerical vectors called embeddings, then retrieves the chunks whose embeddings are closest to the query embedding. The closeness is a measure of semantic similarity. Chunks that are about the same topic, or use related concepts, score highly even when they use different words from the query.
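As a rough illustration of that retrieval step, the sketch below ranks chunks by cosine similarity to a query vector. The three-dimensional vectors are hand-written stand-ins, not output from a real embedding model, and the names `cosine_similarity` and `vector_search` are hypothetical. Cosine similarity is one common closeness measure; a real system may use another.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query_vec, chunk_vecs, top_k=3):
    """Return the top_k (chunk_id, similarity) pairs, most similar first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "embeddings" (assumption: real ones have hundreds of dimensions
# and come from an embedding model, not from hand-picked numbers).
chunks = {
    "refund_policy": [0.9, 0.1, 0.0],
    "termination_terms": [0.2, 0.9, 0.1],
    "office_hours": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]

results = vector_search(query, chunks, top_k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

Nothing in this scoring looks at the characters of the query. Two chunks with nearly identical vectors are nearly indistinguishable to it, whatever identifiers they contain.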
This fails for queries where the user knows the exact term and that exact term is what distinguishes the right answer from related but wrong answers. A product catalogue has entries for 'Model X-450 Pro' and 'Model X-450 Standard'. A query for 'X-450 Pro specifications' via vector search may retrieve both, because the two products are semantically similar. The user wanted the Pro. The ambiguity is invisible in vector similarity scores.
The same problem applies to numeric values, dates, codes, reference numbers, and any other token where meaning is carried by the exact characters rather than by semantic content. 'Clause 12.3(b)' means a specific thing that is different from 'Clause 12.3(a)', and embedding similarity cannot reliably distinguish them.
What keyword search gives you
Keyword search (BM25 and its variants) works on term statistics rather than meaning. It scores documents based on how often the query terms appear in them, weighted by how rare those terms are across the corpus and adjusted for document length so that long documents do not win simply by containing more words. Documents that contain the exact query terms, especially rare terms, score highly. Documents that contain semantically related terms but not the exact query terms score lower.
This is the inverse of vector search's failure mode. BM25 handles 'MSA-2024-0447' well because that string appears verbatim in the matching document and rarely elsewhere. It handles 'our cancellation policy' poorly because the policy document may be titled 'Subscription Termination and Withdrawal Terms' and never use the word 'cancellation'.
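The behaviour described above can be seen in a minimal BM25 sketch. This is a simplified version for illustration, not any particular library's implementation: the tokeniser, the `k1` and `b` defaults, and the smoothed IDF form are assumptions, and the three documents are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep hyphenated identifiers such as "msa-2024-0447" whole.
    return re.findall(r"[a-z0-9\-]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a higher weight (smoothed, always positive IDF).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores

docs = {
    "msa": "Master services agreement MSA-2024-0447 termination clause",
    "nda": "Non-disclosure agreement NDA-2023-0112 confidentiality clause",
    "policy": "Subscription termination and withdrawal terms",
}

identifier_scores = bm25_scores("termination clause in MSA-2024-0447", docs)
concept_scores = bm25_scores("our cancellation policy", docs)
print(max(identifier_scores, key=identifier_scores.get))
print(concept_scores["policy"])
```

The identifier query ranks the matching contract first because the rare token appears verbatim. The 'cancellation' query shares no token with the termination terms document, so that document scores zero.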
The complementarity is clean. Vector search handles natural language queries about concepts. Keyword search handles queries about specific terms, identifiers, and exact phrases. Most live query populations contain both types, which is why production RAG systems that rely on a single retrieval method tend to have a predictable failure pattern that correlates with query type.
Implementing hybrid search: the practical approach
The most widely used fusion approach for hybrid search is Reciprocal Rank Fusion (RRF). Both retrieval systems run against the query independently and produce a ranked list of results. RRF combines the two ranked lists into a single ranking by summing, for each document, the reciprocal of its rank in each list, usually with a small constant added to the rank so that the very top positions do not dominate. Documents that appear highly ranked in both lists end up at the top of the combined list. Documents that appear in one list but not the other are still included, but they receive a contribution from that one list only and so tend to rank below documents that both systems agree on.
RRF is simple to implement and, because it uses only ranks and never raw scores, robust to differences in score scale between the two systems. It needs little tuning beyond the constant in the rank formula. It is a reasonable default for most production systems.
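A minimal sketch of RRF follows. Each document's fused score is the sum of 1 / (k + rank) over the lists it appears in. The constant `k = 60` is a commonly used default and an assumption here, as are the function name and the document ids.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Only ranks are used, so the raw scores of the underlying
    retrievers never need to be comparable.
    """
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical outputs of the two retrievers for one query, best first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused_ranking = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused_ranking)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because it is near the top of both lists. `doc_d` and `doc_c` each appear in only one list, so they receive a single contribution and fall below the documents both retrievers returned.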
Score normalisation and fusion is an alternative for cases where you want to weight the two retrieval signals differently. Each system's raw scores are first rescaled to a common range, then combined with weights. You might apply more weight to vector search for a knowledge base of prose documents, and more weight to BM25 for search over structured content. This requires careful calibration and tends to be justified only when retrieval quality analysis on the specific corpus and query population shows that the simple RRF default is underperforming.
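One way to sketch weighted fusion is min-max normalisation followed by a weighted sum. Min-max is only one possible normalisation, and the function names, the `vector_weight` parameter, and the scores below are all illustrative assumptions.

```python
def min_max_normalise(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores equal: treat every document as an equally strong match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def weighted_fusion(vector_scores, keyword_scores, vector_weight=0.5):
    """Combine normalised scores; a missing document counts as 0 in that list."""
    v = min_max_normalise(vector_scores)
    kw = min_max_normalise(keyword_scores)
    combined = {
        doc_id: vector_weight * v.get(doc_id, 0.0)
        + (1 - vector_weight) * kw.get(doc_id, 0.0)
        for doc_id in set(v) | set(kw)
    }
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)

# Made-up raw scores on very different scales (cosine similarity vs BM25).
vector_scores = {"doc_a": 0.82, "doc_b": 0.80, "doc_c": 0.60}
keyword_scores = {"doc_b": 12.0, "doc_d": 4.0}

balanced = weighted_fusion(vector_scores, keyword_scores, vector_weight=0.5)
vector_only = weighted_fusion(vector_scores, keyword_scores, vector_weight=1.0)
print(balanced[0][0], vector_only[0][0])
```

The weight changes the outcome: with equal weighting `doc_b` leads, while trusting vector search alone puts `doc_a` first. That sensitivity is why this approach needs calibration against measured retrieval quality.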
Query routing as an alternative
Rather than fusing the results of two retrieval systems for every query, query routing classifies each query before retrieval and directs it to the retrieval method that will serve it best. Queries that contain identifiers, codes, or quoted phrases are directed to keyword search. Queries that are natural language descriptions of concepts are directed to vector search. Queries that are genuinely ambiguous use both.
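The routing rules above can be sketched as a simple heuristic classifier. The regular expressions, the `max_extra_words` threshold, and the rule that an identifier surrounded by several other words counts as ambiguous are assumptions made for illustration. A production router would need to be validated against real queries, and might be a trained classifier instead.

```python
import re

# Hypothetical patterns: codes like "MSA-2024-0447" or "X-450",
# and clause-style numbers like "12.3(b)".
IDENTIFIER_PATTERN = re.compile(
    r"\b[A-Za-z]{1,10}-\d[\w-]*|\b\d+\.\d+(?:\([a-z]\))?"
)
# Double-quoted phrases signal that the user wants an exact match.
QUOTED_PATTERN = re.compile(r'"[^"]+"')

def route_query(query, max_extra_words=3):
    """Return 'keyword', 'vector', or 'hybrid' for a query string."""
    has_exact_signal = bool(
        IDENTIFIER_PATTERN.search(query) or QUOTED_PATTERN.search(query)
    )
    if not has_exact_signal:
        return "vector"
    # Count the words left once identifiers and quoted phrases are removed.
    remainder = QUOTED_PATTERN.sub(" ", IDENTIFIER_PATTERN.sub(" ", query))
    extra_words = len(remainder.split())
    # Mostly an identifier: keyword search. Identifier plus a lot of
    # natural language: ambiguous, so use both retrievers.
    return "keyword" if extra_words <= max_extra_words else "hybrid"

for q in [
    "what is our refund policy for software licences?",
    "X-450 Pro specifications",
    "find the clause about termination in contract MSA-2024-0447",
]:
    print(route_query(q), "<-", q)
```

With these rules the first query goes to vector search, the second to keyword search, and the third to both, since it mixes an identifier with a conceptual description.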
Routing reduces compute cost compared to running both retrieval systems on every query, which matters at high request volumes. It also allows the retrieval methods to be tuned independently: the keyword index parameters can be optimised without affecting vector search, and vice versa.
The constraint is that the routing classifier needs to be reliable. A misclassified query gets worse results than hybrid fusion would have provided, whether that means a terminology query sent to vector search or a conceptual query sent to keyword search. Building and validating the routing classifier adds complexity, which is why hybrid fusion is usually the right starting point, with routing as a later optimisation.
When hybrid search is not the answer
Hybrid search adds operational complexity: two retrieval systems to maintain, a fusion layer to build and monitor, and an evaluation strategy that accounts for both retrieval modes. This complexity is justified when the retrieval quality problem it solves is real and material. It is not justified when it is added speculatively, before retrieval quality has been measured against the actual query population.
The first step is always to characterise the query population. If the overwhelming majority of queries are natural language conceptual queries, vector search tuning may close the quality gap without the operational overhead of hybrid retrieval. That tuning might involve embedding model selection, chunking strategy, or reranking. If the query population contains a significant proportion of identifier and exact phrase queries, hybrid search is justified.
Retrieval quality measurement comes before retrieval architecture decisions. Building hybrid search before understanding whether vector search is actually failing on the real query population is an optimisation for a problem you have not confirmed exists.
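A minimal sketch of that measurement is recall@k reported separately per query type, so that a weakness on identifier queries is not hidden by a good overall average. The evaluation set, the query type labels, and the canned retriever results below are invented for illustration. In practice `retrieve` would call the real retrieval system and the labels would come from the real query population.

```python
from collections import defaultdict

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def recall_by_query_type(eval_set, retrieve, k=5):
    """Average recall@k for each query type in a labelled evaluation set."""
    per_type = defaultdict(list)
    for example in eval_set:
        retrieved = retrieve(example["query"])
        per_type[example["type"]].append(
            recall_at_k(retrieved, example["relevant"], k)
        )
    return {qtype: sum(vals) / len(vals) for qtype, vals in per_type.items()}

# Toy labelled queries (assumption: hand-labelled with type and relevant ids).
eval_set = [
    {"query": "what is our refund policy?", "type": "conceptual",
     "relevant": ["doc_refund"]},
    {"query": "contract MSA-2024-0447 termination", "type": "identifier",
     "relevant": ["doc_msa_0447"]},
    {"query": "X-450 Pro specifications", "type": "identifier",
     "relevant": ["doc_x450_pro"]},
]

# Stand-in for a real retriever: canned ranked results per query.
canned_results = {
    "what is our refund policy?": ["doc_refund", "doc_returns"],
    "contract MSA-2024-0447 termination": ["doc_msa_0112", "doc_policy"],
    "X-450 Pro specifications": ["doc_x450_standard", "doc_x450_pro"],
}

report_k1 = recall_by_query_type(eval_set, canned_results.get, k=1)
report_k2 = recall_by_query_type(eval_set, canned_results.get, k=2)
print(report_k1)
print(report_k2)
```

A gap between query types in a report like this is the evidence that justifies hybrid retrieval. Without such a gap, the case for the extra complexity has not been made.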
Retrieval quality determines system quality
The generation layer of a RAG system cannot produce good answers from bad retrieval. Improving retrieval quality, whether through vector search tuning, hybrid search, or both, is the highest leverage investment in a RAG system that has a quality problem.
The decision about which retrieval approach to use should be driven by measurement against the actual query population, not by what is used in tutorials or what is simplest to implement. What is easy to demonstrate and what retrieves well for real users are different problems.
More in this series
- RAG Architecture Decisions That Actually Matter in Production
- Document Ingestion Pipelines for RAG: Getting the Foundation Right
- Chunking Strategies for Retrieval Quality: What the Tutorials Don't Tell You
- Hybrid Search in RAG: When Vector Search Alone Is Not Enough
- Measuring RAG Quality: Retrieval Evaluation Beyond Vibes
- Multi Tenant RAG: Enforcing Data Isolation When Multiple Clients Share a System
- Keeping RAG Knowledge Bases Fresh: Ingestion, Versioning, and Drift (coming soon)
- Grounding and Citation in RAG Outputs: Making AI Accountable (coming soon)
- Private RAG Deployments: Running LLMs Without Sending Data to the Cloud (coming soon)