What Happens During the Ingestion Phase of RAG

The ingestion phase is the critical first step in any Retrieval-Augmented Generation pipeline. It transforms raw, unstructured data into semantically searchable vector embeddings stored in a database. Every downstream process – retrieval quality, response accuracy, and system reliability – depends on how well ingestion is executed. 1)

Document Loading

The ingestion pipeline begins by collecting and loading source documents from diverse origins into the processing system. 2)

Common Data Sources

Document Loaders

Frameworks like LangChain provide specialized loaders for each format: PyPDFLoader for PDFs, WebBaseLoader for web pages, CSVLoader for tabular data, and DirectoryLoader for batch processing entire folders. 4) Loaders are responsible for reading raw sources, extracting text, and attaching initial metadata. Common loader failures include missing or inconsistent metadata, navigation menus mixed into text content, headers and footers treated as body text, and PDFs with broken text ordering. 5)
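To make the loader's job concrete, here is a minimal, standard-library-only sketch of what a batch loader such as DirectoryLoader does under the hood; the `load_directory` function and its dictionary shape are illustrative assumptions, not LangChain's actual API.

```python
from pathlib import Path

def load_directory(root: str, glob: str = "*.txt") -> list[dict]:
    """Minimal stand-in for a framework loader: read raw sources,
    extract text, and attach initial metadata per document."""
    documents = []
    for path in sorted(Path(root).glob(glob)):
        text = path.read_text(encoding="utf-8", errors="replace")
        documents.append({
            "page_content": text,
            # metadata travels with the chunk all the way to the vector store
            "metadata": {"source": str(path), "bytes": path.stat().st_size},
        })
    return documents
```

In a real pipeline you would use the framework loader for the format at hand; the point is that every loader ultimately emits (text, metadata) pairs like these.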

Document Parsing

Parsing extracts structured content from the loaded documents, including text bodies, tables, headings, sections, and embedded elements like images. 6)

Pre-processing cleans and formats data to make it suitable for downstream steps. This may include tokenization into words or sub-words, removal of special characters and formatting artifacts, normalization of whitespace and encoding, and change tracking for incremental updates. 7)
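The cleaning steps above can be sketched with the standard library alone; the exact rules (which characters to strip, how aggressively to collapse whitespace) are assumptions that should be tuned per corpus.

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Clean raw extracted text before chunking."""
    text = unicodedata.normalize("NFKC", text)   # normalize encoding variants
    text = text.replace("\u00ad", "")            # drop soft hyphens (a PDF artifact)
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)       # limit consecutive blank lines
    return text.strip()
```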

Chunking Strategies

Large documents must be split into smaller, manageable chunks to optimize retrieval precision. If an entire document is embedded as one vector, important details are diluted across the whole embedding and retrieval quality suffers drastically. 8)

Chunk Size

Typical chunk sizes range from 500 to 2000 tokens or characters. Smaller chunks (500-1000 tokens) enhance precision for fine-grained retrieval, while larger chunks (up to 4000 tokens) capture broader context but risk introducing noise. The optimal size should match the embedding model limits and be tested empirically for domain-specific performance. 9)

Chunk Overlap

Overlap of 10-20% of the chunk size (typically 100-200 tokens for a 1000-token chunk) preserves context across chunk boundaries and reduces information loss from splits. Zero overlap suits simple texts, while 15-25% overlap aids semantic continuity in complex documents with cross-referencing content. 10)
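A sliding-window chunker with overlap can be sketched in a few lines. This version counts characters for simplicity; production pipelines usually count tokens, and frameworks offer smarter splitters (e.g. recursive splitting on paragraph boundaries), so treat this as an illustration of the size/overlap mechanics only.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Fixed-size chunking with overlap: consecutive chunks share
    `overlap` characters so context survives the split boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # window advances by size minus overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

With a 1000-character chunk and 150-character overlap, the window advances 850 characters at a time, matching the 10-20% overlap guideline above.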

Chunking Methods

Embedding Generation

Chunks are converted into dense vector embeddings – numerical arrays that capture semantic meaning – using pre-trained transformer models. 12)

Embedding Models

Popular embedding models include OpenAI text-embedding-ada-002 (1536 dimensions), Sentence-BERT/SBERT, E5, and Cohere Embed. The model choice affects the dimensionality, quality, and computational cost of embeddings. 13) These models encode semantic meaning via transformer architectures, capturing contextual relationships so that conceptually similar text maps to nearby points in vector space. 14)
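The geometric claim here (similar text maps to nearby points) is what cosine similarity measures at query time. Real embeddings come from a model such as those listed above; the toy vectors below just demonstrate the metric itself.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 for
    semantically similar chunks, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```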

Dimensionality and Optimization

Embeddings typically range from 768 to 1536 dimensions. Dimensionality reduction techniques like PCA or quantization can compress vectors by 50-90% with minimal accuracy loss, reducing storage requirements and improving query speed. 15)
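As one concrete compression technique, scalar quantization maps each float32 component to an int8 plus a per-vector scale, roughly a 4x storage reduction. This is a simplified sketch; vector databases implement quantization internally with more sophisticated schemes (e.g. product quantization).

```python
def quantize_int8(vec: list[float]) -> tuple[list[int], float]:
    """Map floats into [-127, 127] with a per-vector scale factor."""
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return [round(x / scale) for x in vec], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Approximate reconstruction; error is bounded by the scale."""
    return [v * scale for v in q]
```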

Vector Storage

The final step stores embeddings alongside original text chunks and metadata in a vector database optimized for similarity search. 16)

Indexing and Search

Vector databases use indexing algorithms like HNSW (graph-based, high recall) and IVF (partition-based, scalable) to enable sub-second approximate nearest neighbor searches across millions of vectors. 18)
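What HNSW and IVF approximate is exact top-k nearest-neighbor search, shown below as a brute-force scan. This is the baseline the indexes avoid: it is correct but linear in the number of vectors, which is why ANN structures exist.

```python
import heapq
import math

def top_k(query: list[float],
          index: list[tuple[str, list[float]]],
          k: int = 3) -> list[tuple[float, str]]:
    """Exact nearest-neighbor search by cosine similarity over
    (chunk_id, vector) pairs. O(n) per query; ANN indexes like
    HNSW/IVF trade a little recall for sub-linear lookups."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return heapq.nlargest(k, ((cos(query, v), cid) for cid, v in index))
```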

Best Practices

See Also

References