Chunking Strategies for Retrieval Quality: What the Tutorials Don't Tell You

Chunking means splitting source documents into pieces that can be embedded and retrieved independently. It is one of the least intuitive parts of RAG system design. The choice of strategy affects which questions the system can answer accurately, how much context the retrieved chunks carry, and how gracefully retrieval degrades when the query is imprecise.

Fixed-size chunking, where text is split every N tokens with some overlap, is the default in most tutorials and the starting point in most implementations. It is also the strategy that tends to produce the worst retrieval quality on real document corpora, because it ignores the structure of the document entirely.

Understanding what the alternatives look like, and why they matter, is a prerequisite for designing a production RAG pipeline that actually works.

Why fixed-size chunking fails at scale

Fixed-size chunking splits text at arbitrary token boundaries without regard for where the meaning ends. The result is chunks that frequently start in the middle of a sentence, separate a heading from the paragraph it introduces, cut a table in half, or pack the end of one topic and the beginning of another into a single chunk with no thematic coherence.

Such chunks produce retrievals where the returned content is technically from the right area of the document but does not contain the full context needed to answer the query. A question about a specific policy rule retrieves a chunk that contains part of the rule: the premise but not the condition, or the condition but not the exception.
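
The failure mode above is easy to reproduce. The sketch below is a minimal illustration, not a reference implementation: it assumes whitespace-separated words as a stand-in for model tokens, and the function name `fixed_size_chunks`, the sample policy text, and the `size` and `overlap` values are hypothetical choices made for the example.

```python
def fixed_size_chunks(text: str, size: int = 12, overlap: int = 3) -> list[str]:
    """Split text every `size` tokens, repeating `overlap` tokens between neighbours."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()  # assumption: whitespace words stand in for model tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical policy text: a premise, a condition, and an exception.
policy = (
    "Refunds are available within 30 days of purchase. "
    "The item must be unused and in its original packaging. "
    "Clearance items are excluded from this policy."
)

for chunk in fixed_size_chunks(policy, size=12, overlap=3):
    print(repr(chunk))
```

With these settings the first chunk ends at "The item must be", so it carries the premise and only half of the condition, and the second chunk cuts the exception off mid-sentence.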

On a small corpus, this often does not noticeably degrade quality, because retrieval returns enough chunks that the complete information lands somewhere in the context window. On a large corpus, with thousands of candidate chunks, partial and mixed-topic fragments crowd the top results, and the cosine distance between query and chunk stops being a reliable relevance indicator.

Semantic and structural chunking

The alternative to fixed-size chunking is splitting at natural document boundaries. For most structured documents, this means splitting at section headings, paragraph breaks, or completed thoughts. Those boundaries exist because the document's author placed them there to indicate a change in topic or argument.

Structural chunking uses the document's own markup to drive splitting decisions. For HTML or Word documents, headings define natural splits. For PDFs where the layout can be recovered, section breaks and page breaks provide similar guidance. The resulting chunks tend to have higher internal coherence because each chunk covers a complete idea, which improves embedding quality and retrieval precision.
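
As a rough illustration of heading-driven splitting, the sketch below uses Python's standard `html.parser` module to start a new chunk at every `h1`, `h2`, or `h3` tag and keep each heading together with the text it introduces. The class name `HeadingSplitter`, the helper `structural_chunks`, the choice of heading levels, and the sample page are assumptions made for the example; real documents usually need more care with nested markup, lists, and tables.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3"}  # assumption: these levels mark chunk boundaries


class HeadingSplitter(HTMLParser):
    """Collect one chunk per heading, keeping the heading with its body text."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[dict[str, str]] = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # A new heading closes the previous chunk and opens the next one.
            self.chunks.append({"heading": "", "body": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text:
            return
        if not self.chunks:
            # Text before the first heading gets its own untitled chunk.
            self.chunks.append({"heading": "", "body": ""})
        key = "heading" if self._in_heading else "body"
        chunk = self.chunks[-1]
        chunk[key] = f"{chunk[key]} {text}".strip()


def structural_chunks(html: str) -> list[dict[str, str]]:
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.chunks


# Hypothetical sample page.
page = """
<h1>Refund policy</h1>
<p>Refunds are available within 30 days.</p>
<p>Items must be unused.</p>
<h2>Exceptions</h2>
<p>Clearance items are excluded.</p>
"""

for chunk in structural_chunks(page):
    print(chunk["heading"], "->", chunk["body"])
```

Storing the heading alongside the body is a design choice rather than a requirement: it lets the ingestion step prepend the heading to the text that gets embedded, so a chunk is never separated from the title that gives it meaning.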

Semantic chunking uses embedding similarity to detect topic shifts: when successive sentences become semantically distant, that is where to split. This works well for documents without explicit structure markers, such as long prose or transcripts, where structural chunking has too few boundaries to work with. The cost is additional compute during ingestion and the risk of splitting coherent passages too aggressively when the embedding model detects local variation rather than actual topic change.
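
The splitting rule can be sketched in a few lines. To keep the example self-contained, the `embed` function below is a toy stand-in that counts lowercase words; a real pipeline would call an embedding model instead. The function names, the sample transcript, and the threshold of 0.2 are all assumptions for illustration, and a usable threshold has to be tuned per corpus and per model.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk when adjacent sentences fall below `threshold` similarity."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        else:
            chunks.append([sentence])  # topic shift (or first sentence): new chunk
    return chunks


# Hypothetical transcript with one topic shift after the second sentence.
transcript = [
    "The refund window is 30 days.",
    "After the refund window closes, store credit is offered.",
    "Shipping takes five business days.",
    "Express shipping takes two days.",
]

for chunk in semantic_chunks(transcript, threshold=0.2):
    print(chunk)
```

Comparing only adjacent sentences keeps ingestion cheap, but it is also where the over-splitting risk comes from: a single off-topic sentence in the middle of a coherent passage is enough to trigger a boundary.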

Hierarchical chunking and context expansion

Hierarchical chunking maintains parent-child relationships between chunks: a document is split into sections, and each section into smaller chunks. The retrieval indexes the small chunks for precision, but when a small chunk is retrieved, the system can expand it to include its parent section for additional context.

This addresses a fundamental tension in chunk size selection: small chunks are more precise in retrieval because the embedding better represents the specific content, but they provide less context for generation. Large chunks provide more context but are worse for retrieval. Hierarchical chunking avoids the tradeoff by using small chunks for retrieval and larger units for generation.

The implementation complexity is higher than flat chunking. The ingestion pipeline needs to maintain and store the hierarchy, and the retrieval layer needs to expand retrieved chunks before passing them to the generation layer. This complexity is worth it for corpora where context around a retrieved passage is consistently important for question answering, such as technical manuals, legal documents, and complex policies.
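
A minimal sketch of the two pieces described above, the stored hierarchy and the expansion step, might look like this. It assumes sections are already available as a mapping from section ID to text and that children are single sentences; `ChildChunk`, `build_children`, `expand_to_parents`, and the sample sections are hypothetical names and data, and the retrieval step itself is faked by picking two children by hand.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildChunk:
    chunk_id: str
    parent_id: str  # link back to the section this chunk came from
    text: str


def build_children(sections: dict[str, str]) -> list[ChildChunk]:
    """Split each section into sentence-level children that remember their parent."""
    children = []
    for parent_id, body in sections.items():
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
        for i, sentence in enumerate(sentences):
            children.append(ChildChunk(f"{parent_id}#{i}", parent_id, sentence))
    return children


def expand_to_parents(hits: list[ChildChunk], sections: dict[str, str]) -> list[str]:
    """Swap retrieved children for their parent sections, deduplicated in rank order."""
    seen: set[str] = set()
    expanded = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            expanded.append(sections[hit.parent_id])
    return expanded


# Hypothetical corpus of two sections.
sections = {
    "refunds": (
        "Refunds are available within 30 days. "
        "The item must be unused. "
        "Clearance items are excluded."
    ),
    "shipping": "Standard shipping takes five days. Express shipping takes two days.",
}

children = build_children(sections)   # these small chunks are what gets embedded
hits = [children[2], children[0]]     # pretend the retriever returned these two
context = expand_to_parents(hits, sections)
print(context)
```

Deduplicating by parent matters in practice: when several small chunks from the same section are retrieved, the section should be passed to the generation layer once, not once per hit.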

Matching strategy to corpus

No single chunking strategy is optimal for all document types and all query types. The right approach is to characterise your corpus and your expected query distribution before choosing a strategy, or before defaulting to the tutorial option.

Short, independent documents such as FAQs, policy summaries, and product specifications often work well with simple structural chunking at the document level. Each document becomes a chunk, or is split into a small number of chunks at top-level headings, and the resulting chunks are already coherent.

Long, dense documents such as technical manuals, regulatory guidance, and case notes typically benefit from hierarchical chunking, where small chunks provide retrieval precision and their parent sections supply the surrounding context.

The quickest way to evaluate a chunking strategy is to build a small evaluation dataset: a handful of representative queries with known answers in the corpus. Comparing retrieval precision across strategies usually makes it clear whether the strategy is finding the right content, and which failure modes are most common.
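
Such a comparison can be sketched as follows. The example scores each strategy by hit rate at k, meaning the fraction of queries whose top k retrieved chunks contain the expected answer text, which is a simple proxy rather than a full precision measurement. The word-overlap `retrieve` function is a toy stand-in for a real embedding-based retriever, and the chunk sets, queries, and function names are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by how many query words they share."""
    q = words(query)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:k]


def hit_rate(chunks: list[str], eval_set: list[tuple[str, str]], k: int = 1) -> float:
    """Fraction of queries whose top-k chunks contain the expected answer text."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one query")
    hits = sum(
        any(answer.lower() in chunk.lower() for chunk in retrieve(query, chunks, k))
        for query, answer in eval_set
    )
    return hits / len(eval_set)


# Hypothetical output of two chunking strategies over the same source text.
strategies = {
    "fixed": [
        "Refunds are available within 30 days of purchase. The item must",
        "be unused. Clearance items are excluded from refunds.",
    ],
    "structural": [
        "Refunds are available within 30 days of purchase.",
        "The item must be unused.",
        "Clearance items are excluded from refunds.",
    ],
}

# Hypothetical evaluation set: (query, text the retrieved chunk must contain).
eval_set = [
    ("How many days do I have to get refunds?", "within 30 days"),
    ("Must the item be unused?", "must be unused"),
]

for name, chunks in strategies.items():
    print(name, hit_rate(chunks, eval_set, k=1))
```

On this toy data the fixed variant misses the second query because the expected phrase is split across two chunks, which is exactly the kind of failure an evaluation set is meant to surface. Reading the misses individually usually says more about a strategy than the aggregate score does.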

Chunking is a retrieval engineering problem, not a setup step

The chunking strategy is often treated as a configuration detail: pick some parameters, move on. In practice, it is one of the architectural decisions with the greatest effect on retrieval quality, and one of the hardest to change after the index has been built and the system is live.

Getting it right requires understanding the corpus, testing against representative queries, and selecting a strategy that reflects the actual structure of the documents you are indexing. That is more work than accepting the tutorial default, and it is where much of the quality difference between RAG prototypes and production-quality RAG systems comes from.