Document Ingestion Pipelines for RAG: Getting the Foundation Right

In most RAG tutorials, document ingestion is step one and takes about three lines of code: load the file, split it, embed it. In production, it is an ongoing engineering concern that sets the quality ceiling for everything downstream. Retrieval can only return what is in the index, and generation can only work with what retrieval returns, so any defect introduced at ingestion is carried through every later stage.

The complexity of production ingestion comes from the documents themselves. Real enterprise document corpora contain a variety of formats, quality levels, and structural conventions that require different handling. That handling needs to be reliable, repeatable, and maintainable under change.

Getting this right early is substantially cheaper than retrofitting a better pipeline after the system is live with a degraded knowledge base.

What real document corpora look like

Tutorial RAG systems ingest clean PDFs. Production document corpora contain scanned PDFs from ten years ago, processed with OCR at varying quality, with headers that got mangled and tables that became uninterpretable text. They contain Word documents with tracked changes, comments, and formatting that exists to make the document look right on screen but encodes no semantic information. They contain Excel files where the most important data is in a table, and the table's meaning depends on column headers that are visually obvious but semantically ambiguous when converted to plain text.

Each document type needs handling that reflects how information is actually structured in that type. A PDF parser that extracts text without understanding page layout will produce content where a footnote at the bottom of the page is spliced into the middle of a sentence that continues onto the next page. An HTML parser that does not strip navigation menus and footers will embed repeated boilerplate into the content of every page it processes.
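As a minimal sketch of the HTML case, the following uses Python's standard-library html.parser to drop text nested inside navigation and footer elements. The BOILERPLATE_TAGS set and the names MainTextExtractor and extract_main_text are assumptions made for illustration. Real sites often mark boilerplate with class names or ids rather than semantic tags, so treat this as a starting point, not a complete cleaner.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate. This set is an assumption;
# tune it to the markup your sources actually use.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a boilerplate element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self):
        return "\n".join(self._parts)


def extract_main_text(html: str) -> str:
    """Return the page text with boilerplate elements removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Called on a page with a nav element, a main element, and a footer element, extract_main_text returns only the text from the main element, so the menu and copyright line are not embedded alongside every page's content.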

It is worth investing specifically in the right handling for each format in your corpus. Generic document loaders handle the average case. Real corpora are full of difficult cases, and those are the ones that produce bad retrievals.

OCR is not a solved problem

For corpora that include scanned documents, OCR quality is the dominant factor in retrieval quality on those documents. A scan at 72 DPI typically produces text that is barely recognisable. A scan at 300 DPI with good contrast usually produces text that is accurate enough to be useful. A scan of a scan loses fidelity with each generation and may not be recoverable at any quality setting.

This means the ingestion pipeline needs to make OCR quality visible, rather than run OCR on everything and hope. A document that produces very low-confidence OCR output should be flagged for review, not silently ingested as if the OCR succeeded. Otherwise the index contains documents that will be retrieved but whose content is garbled, producing answers that look plausible but are drawn from corrupted source material.
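One way to make that gate explicit is sketched below. It assumes the OCR engine reports a confidence between 0.0 and 1.0 for each recognised word, which many engines do in some form. The OcrResult type, the ocr_disposition function, and all three thresholds are hypothetical and would need calibrating against a hand-checked sample of your own documents.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class OcrResult:
    doc_id: str
    text: str
    word_confidences: list[float]  # assumed to be 0.0-1.0, one per word


# Illustrative thresholds only; calibrate against reviewed samples.
MIN_MEAN_CONFIDENCE = 0.80    # below this average, send to review
LOW_CONF_WORD = 0.50          # a word below this counts as "low confidence"
MAX_LOW_CONF_FRACTION = 0.20  # tolerated share of low-confidence words


def ocr_disposition(result: OcrResult) -> str:
    """Return "ingest" or "review" for one OCR'd document."""
    if not result.word_confidences:
        # No recognised words at all is itself a signal worth a human look.
        return "review"
    avg = mean(result.word_confidences)
    low = sum(1 for c in result.word_confidences if c < LOW_CONF_WORD)
    low_fraction = low / len(result.word_confidences)
    if avg < MIN_MEAN_CONFIDENCE or low_fraction > MAX_LOW_CONF_FRACTION:
        return "review"
    return "ingest"
```

The point of the sketch is the explicit second outcome: a document routed to "review" is recorded somewhere a person will see it, instead of entering the index as if it were clean.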

For organisations with significant scanned document backlogs, the realistic options are: improving the scanning process for new documents, running the historical backlog through a stronger OCR pipeline with manual review of uncertain outputs, or accepting that scanned document retrieval will be limited until the quality floor improves. Any of these is a legitimate choice. Making no explicit choice, then discovering the quality problem through user feedback, is not.

Making ingestion reliable and repeatable

A production ingestion pipeline is not a script you run once. Documents change. New documents are added. Old documents are superseded. The pipeline needs to run continuously or on a defined schedule, detect what has changed, and update the index accordingly without manual intervention.

That requires tracking what has been ingested and in what state. A document that has been ingested once should not be fully processed again on every run unless it has changed. A document that has been deleted from the source should have its embeddings removed from the index, not left to be retrieved as if it still exists.
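A common way to track state is to store a content hash per document and compare it on each run. The sketch below assumes the source can be listed as a mapping from document id to raw bytes and that the pipeline keeps a mapping from document id to the hash it last ingested. The names SyncPlan and plan_sync are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(data: bytes) -> str:
    """Stable fingerprint of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class SyncPlan:
    to_add: list[str] = field(default_factory=list)
    to_update: list[str] = field(default_factory=list)
    to_delete: list[str] = field(default_factory=list)
    unchanged: list[str] = field(default_factory=list)


def plan_sync(source: dict[str, bytes], ingested: dict[str, str]) -> SyncPlan:
    """Compare the current source against the recorded ingestion state.

    source:   document id -> current raw bytes
    ingested: document id -> hash recorded at the last successful ingestion
    """
    plan = SyncPlan()
    for doc_id, data in sorted(source.items()):
        digest = content_hash(data)
        if doc_id not in ingested:
            plan.to_add.append(doc_id)
        elif ingested[doc_id] != digest:
            plan.to_update.append(doc_id)
        else:
            plan.unchanged.append(doc_id)
    # Anything recorded as ingested but no longer in the source must have
    # its chunks removed from the index.
    plan.to_delete = sorted(set(ingested) - set(source))
    return plan
```

Hashing raw bytes means a re-saved file with identical text may still register as changed. Hashing the extracted text instead avoids that, at the cost of parsing every document on every run.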

Ingestion pipelines fail silently in ways that other parts of the system do not. A document that fails to ingest produces no error to the end user. It simply does not appear in retrieval results. Without monitoring that makes failed ingestion visible, you can build up a growing gap between what documents exist and what the index knows about without realising it.
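Two small pieces make that gap measurable: recording every failure with its reason, and periodically comparing the set of source documents with the set the index actually contains. The sketch below assumes a caller-supplied ingest_one function that raises on failure. The names ingest_all and coverage_report are hypothetical.

```python
from typing import Callable


def ingest_all(
    documents: dict[str, bytes],
    ingest_one: Callable[[str, bytes], None],
) -> dict[str, str]:
    """Ingest each document and return {doc_id: error} for the failures."""
    failures: dict[str, str] = {}
    for doc_id, data in documents.items():
        try:
            ingest_one(doc_id, data)
        except Exception as exc:
            # Broad on purpose: one bad file should not stop the run,
            # but it must be recorded rather than swallowed.
            failures[doc_id] = f"{type(exc).__name__}: {exc}"
    return failures


def coverage_report(source_ids: set[str], indexed_ids: set[str]) -> dict:
    """Summarise the gap between what exists and what the index holds."""
    missing = sorted(source_ids - indexed_ids)
    if source_ids:
        coverage = len(source_ids & indexed_ids) / len(source_ids)
    else:
        coverage = 1.0
    return {"missing": missing, "coverage": coverage}
```

Emitting the coverage figure as a metric and alerting when it drops gives the silent failure a place to show up before users notice missing answers.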

Metadata is not optional

Every chunk in the vector index should carry metadata: at minimum, a reference to the source document, a location within that document (page number, section heading, or both), and any access control tags required for multi-tenant retrieval.

This metadata is what makes citation useful rather than decorative. A retrieval that returns a passage from 'Q4 Procurement Policy.pdf' is better than one that returns an anonymous chunk. A retrieval that returns 'Section 4.2 of Q4 Procurement Policy.pdf (page 12)' is better still. Users who can verify the source develop trust in the system. Users who receive answers with no usable attribution develop uncertainty even when the answers are correct.
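The minimum metadata described above can be sketched as a small record attached to each chunk, with a helper that turns it into a citation string. The ChunkMetadata type, its field names, and format_citation are assumptions for illustration. Most vector stores accept an equivalent dictionary of metadata per vector.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    source_document: str               # e.g. a file name or document id
    page: Optional[int] = None         # location, if the format has pages
    section: Optional[str] = None      # location, if headings were parsed
    access_tags: frozenset = frozenset()  # used for access filtering


def format_citation(meta: ChunkMetadata) -> str:
    """Build the most specific citation the available metadata allows."""
    citation = meta.source_document
    if meta.section is not None:
        citation = f"Section {meta.section} of {citation}"
    if meta.page is not None:
        citation = f"{citation} (page {meta.page})"
    return citation
```

With a section and page set, the helper produces the fuller form of citation. With neither, it falls back to the document name alone.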

Metadata also enables the filtering required for tenant access control, so query results only include chunks the querying user is authorised to see. This filtering needs to be applied at the retrieval layer, which means the metadata must be present and correctly set at ingestion time. Retrofitting it to a corpus that was ingested without it means processing the entire corpus again.
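The filter itself is a simple predicate, sketched here in plain Python. In a real deployment it would normally be expressed as a metadata filter inside the vector store query, so that it is applied before the top results are selected. The Chunk type, the authorised_chunks function, and the matching rule (a chunk is visible if it shares at least one tag with the user, and untagged chunks are denied) are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Chunk:
    text: str
    source_document: str
    access_tags: frozenset  # set at ingestion time


def authorised_chunks(
    candidates: Iterable[Chunk],
    user_tags: Iterable[str],
) -> list[Chunk]:
    """Keep only chunks that share at least one access tag with the user.

    Chunks with no tags are excluded (fail closed), so a chunk ingested
    without access metadata is never shown by accident.
    """
    allowed = frozenset(user_tags)
    return [c for c in candidates if c.access_tags & allowed]
```

Failing closed is the design choice that matters here: a chunk that was ingested without tags becomes invisible, which is a recoverable bug, instead of becoming visible to everyone.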

The unglamorous part that determines everything

Ingestion is the part of a RAG system that nobody wants to show in a demo. There is nothing impressive about a pipeline that parses PDFs correctly and handles OCR failures. The impressive part is the retrieval and generation.

But retrieval and generation quality is bounded by what is in the index, and what is in the index is determined by the ingestion pipeline. Building it properly, with reliability, monitoring, format awareness, and maintainability, is what allows the impressive parts to work in production as well as in a demo against five clean documents.