====== What Happens During the Ingestion Phase of RAG ======

The ingestion phase is the critical first step in any Retrieval-Augmented Generation pipeline. It transforms raw, unstructured data into semantically searchable vector embeddings stored in a database. Everything downstream -- retrieval quality, response accuracy, and system reliability -- depends on how well ingestion is executed. ((source [[https://dev.to/parth_sarthisharma_105e7/loaders-splitters-embeddings-how-bad-chunking-breaks-even-perfect-rag-systems-29j3|DEV Community - How Bad Chunking Breaks RAG Systems]]))

===== Document Loading =====

The ingestion pipeline begins by collecting and loading source documents from diverse origins into the processing system. ((source [[https://www.infoworld.com/article/2336099/retrieval-augmented-generation-step-by-step.html|InfoWorld - RAG Step by Step]]))

==== Common Data Sources ====

  * Local files: PDF, DOCX, TXT, CSV, Markdown
  * Web pages and REST APIs
  * Databases and data warehouses
  * Cloud storage (S3, GCS, Azure Blob Storage)
  * Enterprise systems: SharePoint, Confluence, Notion, Slack

((source [[https://medium.com/@derrickryangiggs/rag-pipeline-deep-dive-ingestion-chunking-embedding-and-vector-search-abd3c8bfc177|Giggs - RAG Pipeline Deep Dive]]))

==== Document Loaders ====

Frameworks like LangChain provide specialized loaders for each format: PyPDFLoader for PDFs, WebBaseLoader for web pages, CSVLoader for tabular data, and DirectoryLoader for batch processing entire folders. ((source [[https://medium.com/%40atnoforgenai/the-complete-guide-to-rag-retrieval-augmented-generation-for-production-applications-fbfdc18b2757|ATNO - Complete Guide to RAG]])) Loaders are responsible for reading raw sources, extracting text, and attaching initial metadata. Common loader failures include missing or inconsistent metadata, navigation menus mixed into text content, headers and footers treated as body text, and PDFs with broken text ordering.
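At its core, a loader's contract is simple: read a raw source, extract text, and attach initial metadata. A minimal, framework-free sketch of that contract (the ''load_directory'' helper is illustrative, not a real LangChain API, and only handles plain-text files):

```python
from pathlib import Path

def load_directory(root: str, pattern: str = "*.txt") -> list[dict]:
    """Toy document loader: read plain-text files from a folder and
    attach initial metadata (source path, filename) to each document."""
    docs = []
    for path in sorted(Path(root).glob(pattern)):
        docs.append({
            "text": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path), "filename": path.name},
        })
    return docs
```

Real loaders such as PyPDFLoader add format-specific extraction on top of this contract; the failure modes listed above are what happens when that extraction step goes wrong.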
((source [[https://dev.to/parth_sarthisharma_105e7/loaders-splitters-embeddings-how-bad-chunking-breaks-even-perfect-rag-systems-29j3|DEV Community - How Bad Chunking Breaks RAG Systems]]))

===== Document Parsing =====

Parsing extracts structured content from the loaded documents, including text bodies, tables, headings, sections, and embedded elements like images. ((source [[https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/rag-time-journey-2-data-ingestion-and-search-practices-for-the-ultimate-rag-retr/4392157|Microsoft - RAG Data Ingestion and Search Practices]])) Pre-processing then cleans and formats the data so it is suitable for downstream steps. This may include tokenization into words or sub-words, removal of special characters and formatting artifacts, normalization of whitespace and encoding, and change tracking for incremental updates. ((source [[https://dev.to/busycaesar/rag-explained-ingestion-of-data-21pc|DEV Community - RAG Explained: Ingestion of Data]]))

===== Chunking Strategies =====

Large documents must be split into smaller, manageable **chunks** to optimize retrieval precision. If an entire document is embedded as one vector, important details are diluted and retrieval precision suffers drastically. ((source [[https://mbrenndoerfer.com/writing/document-chunking-rag-strategies-retrieval|Brenndoerfer - Document Chunking for RAG]]))

==== Chunk Size ====

Typical chunk sizes range from 500 to 2000 tokens or characters. Smaller chunks (500-1000 tokens) enhance precision for fine-grained retrieval, while larger chunks (up to 4000 tokens) capture broader context but risk introducing noise. The optimal size should fit within the embedding model's input limit and be tested empirically for domain-specific performance.
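The size/overlap trade-off can be sketched with a minimal fixed-size character chunker (illustrative only; real splitters typically work on tokens and respect sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks, with consecutive chunks
    sharing `overlap` characters to preserve context across boundaries."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
        start += chunk_size - overlap  # step forward, keeping the overlap
    return chunks
```

A recursive splitter applies the same idea but first tries to break at paragraph and sentence boundaries before falling back to raw character positions.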
((source [[https://www.infoworld.com/article/2336099/retrieval-augmented-generation-step-by-step.html|InfoWorld - RAG Step by Step]]))

==== Chunk Overlap ====

Overlap of 10-20% of the chunk size (typically 100-200 tokens for a 1000-token chunk) preserves context across chunk boundaries and reduces information loss from splits. Zero overlap suits simple texts, while 15-25% overlap aids semantic continuity in complex documents with cross-referencing content. ((source [[https://www.infoworld.com/article/2336099/retrieval-augmented-generation-step-by-step.html|InfoWorld - RAG Step by Step]]))

==== Chunking Methods ====

  * **Fixed-size splitting**: Divide by character or token count -- simple but may break mid-sentence
  * **Recursive splitting**: Hierarchically split by paragraph, sentence, then character boundaries -- preserves structure
  * **Semantic chunking**: Group by meaning or topic boundaries -- best for context preservation but computationally expensive
  * **Structure-based splitting**: Use document headings, sections, and formatting as natural break points

((source [[https://mbrenndoerfer.com/writing/document-chunking-rag-strategies-retrieval|Brenndoerfer - Document Chunking for RAG]]))

===== Embedding Generation =====

Chunks are converted into **dense vector embeddings** -- numerical arrays that capture semantic meaning -- using pre-trained transformer models. ((source [[https://www.infoworld.com/article/2336099/retrieval-augmented-generation-step-by-step.html|InfoWorld - RAG Step by Step]]))

==== Embedding Models ====

Popular embedding models include OpenAI text-embedding-ada-002 (1536 dimensions), Sentence-BERT/SBERT, E5, and Cohere Embed. The model choice affects the dimensionality, quality, and computational cost of embeddings.
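The geometric intuition -- similar meaning maps to nearby vectors -- can be illustrated with a deliberately crude bag-of-words stand-in for a real embedding model (the tiny ''VOCAB'' and ''toy_embed'' are illustrative; production systems use the transformer models listed above):

```python
import math
from collections import Counter

VOCAB = ["cat", "dog", "pet", "stock", "market", "price"]

def toy_embed(text: str) -> list[float]:
    """Crude stand-in for a transformer embedding: word counts over a tiny vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

With this toy, "my pet dog" scores high against "the dog is a pet" and zero against "stock market price" -- the same nearness property a real embedding model provides over open vocabulary and context.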
((source [[https://www.chanl.ai/blog/rag-from-scratch-retrieval-augmented-generation-typescript-python|Chanl - RAG from Scratch]]))

These models encode semantic meaning via transformer architectures, capturing contextual relationships so that conceptually similar text maps to nearby points in vector space. ((source [[https://www.intersystems.com/resources/retrieval-augmented-generation/|InterSystems - RAG]]))

==== Dimensionality and Optimization ====

Embeddings typically range from 768 to 1536 dimensions. Dimensionality reduction techniques like PCA or quantization can compress vectors by 50-90% with minimal accuracy loss, reducing storage requirements and improving query speed. ((source [[https://www.intersystems.com/resources/retrieval-augmented-generation/|InterSystems - RAG]]))

===== Vector Storage =====

The final step stores embeddings alongside original text chunks and metadata in a **vector database** optimized for similarity search. ((source [[https://www.infoworld.com/article/2336099/retrieval-augmented-generation-step-by-step.html|InfoWorld - RAG Step by Step]]))

==== What Gets Stored ====

  * **Vector embedding**: The dense numerical representation of each chunk
  * **Original text**: The raw chunk content for retrieval and display
  * **Metadata**: Source file, page number, section heading, author, timestamps, and custom tags

((source [[https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/rag-time-journey-2-data-ingestion-and-search-practices-for-the-ultimate-rag-retr/4392157|Microsoft - RAG Data Ingestion and Search Practices]]))

==== Indexing for Fast Search ====

Vector databases use indexing algorithms like HNSW (graph-based, high recall) and IVF (partition-based, scalable) to enable sub-second approximate nearest neighbor searches across millions of vectors.
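Putting storage and search together, a toy in-memory store clarifies what each record holds and what a query returns (the ''MiniVectorStore'' name is illustrative, and the brute-force exact scan is what real vector databases replace with HNSW or IVF indexes):

```python
import math

class MiniVectorStore:
    """Toy in-memory vector store: keeps (embedding, text, metadata) records
    and answers top-k queries by exact brute-force cosine similarity."""

    def __init__(self):
        self.records = []

    def add(self, embedding, text, metadata):
        """Store one chunk: its vector, original text, and metadata together."""
        self.records.append({"embedding": embedding, "text": text, "metadata": metadata})

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query, k=3):
        """Return the k records most similar to the query vector."""
        ranked = sorted(self.records,
                        key=lambda r: self._cosine(query, r["embedding"]),
                        reverse=True)
        return ranked[:k]
```

The linear scan is O(n) per query; ANN indexes like HNSW trade a small amount of recall for sub-linear search time at scale.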
((source [[https://qdrant.tech/articles/what-is-rag-in-ai/|Qdrant - What is RAG in AI]]))

===== Best Practices =====

  * Extract rich metadata (timestamps, authors, categories) during loading for downstream filtering
  * Validate chunk quality post-splitting by reviewing samples
  * Use domain-tuned embedding models when available for specialized content
  * Implement automated re-ingestion pipelines triggered by data changes
  * A/B test chunking strategies using retrieval accuracy metrics like hit rate and faithfulness
  * Handle tables, images, and code blocks with specialized processors rather than generic text splitting

((source [[https://dev.to/parth_sarthisharma_105e7/loaders-splitters-embeddings-how-bad-chunking-breaks-even-perfect-rag-systems-29j3|DEV Community - How Bad Chunking Breaks RAG Systems]]))

===== See Also =====

  * [[retrieval_augmented_generation|Retrieval-Augmented Generation]]
  * [[how_to_build_a_rag_pipeline|How to Build a RAG Pipeline]]
  * [[agentic_rag|Agentic RAG]]
  * [[vector_db_comparison|Vector Database Comparison]]
  * [[rag_phases|Phases of a RAG System]]
  * [[rag_retrieval_phase|How Does the Retrieval Phase Work in RAG]]
  * [[vector_database_rag|Role of a Vector Database in AI RAG Architecture]]

===== References =====