====== How to Build a RAG Pipeline ======

An end-to-end practical guide to building Retrieval-Augmented Generation (RAG) pipelines. This guide walks through every step from document ingestion to answer generation, with working code for both LlamaIndex and LangChain.

===== Pipeline Overview =====

<code>
graph LR
    A[📄 Documents] --> B[Parse & Chunk]
    B --> C[Embed]
    C --> D[Vector DB]
    E[🔍 User Query] --> F[Embed Query]
    F --> D
    D --> G[Retrieve Top-K]
    G --> H[Rerank]
    H --> I[LLM Generate]
    I --> J[✅ Answer]
</code>

===== Step 1: Choose and Load Documents =====

Your RAG pipeline starts with loading source documents. Common formats include PDF, Markdown, HTML, and plain text.

**LlamaIndex:**

<code python>
from llama_index.core import SimpleDirectoryReader

# Loads all supported file types from a directory
docs = SimpleDirectoryReader("data/").load_data()
print(f"Loaded {len(docs)} documents")
</code>

**LangChain:**

<code python>
from langchain_community.document_loaders import DirectoryLoader

# DirectoryLoader parses files with the unstructured package by default
loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()
print(f"Loaded {len(docs)} documents")
</code>

===== Step 2: Parse and Chunk =====

Chunking splits documents into retrievable units. The strategy you choose directly impacts retrieval quality.

==== Chunking Strategies Compared ====

^ Strategy ^ How It Works ^ Best For ^ Trade-offs ^
| **Fixed-size** | Uniform token/character splits with overlap | Simple documents, quick prototypes | Ignores semantic boundaries |
| **Recursive** | Hierarchical splits (paragraph → sentence → token) | Most general-purpose use cases | Few; good balance of speed and quality |
| **Semantic** | Groups sentences by embedding similarity | Long narratives, topical documents | Compute-intensive, slower |
| **Parent-child** | Small chunks for retrieval, linked to larger parent chunks | High-precision needs | More complex indexing |

**Recommendation:** Start with recursive chunking (size=1024, overlap=200) for most use cases.

==== Recursive Chunking (Recommended Default) ====

**LlamaIndex:**

<code python>
from llama_index.core.node_parser import SentenceSplitter

# LlamaIndex's sentence-aware splitter is its closest recursive-style default
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)
print(f"Created {len(nodes)} chunks")
</code>

**LangChain:**

<code python>
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=200
)
splits = text_splitter.split_documents(docs)
print(f"Created {len(splits)} chunks")
</code>

==== Semantic Chunking ====

<code python>
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile"
)
semantic_chunks = semantic_splitter.split_documents(docs)
</code>

===== Step 3: Choose an Embedding Model =====

Embeddings convert text chunks into vectors for similarity search.

^ Model ^ Provider ^ Dimensions ^ Cost ^ Best For ^
| text-embedding-3-small | OpenAI | 1536 | $0.02/M tokens | Prototypes, budget-friendly |
| text-embedding-3-large | OpenAI | 3072 | $0.13/M tokens | High-precision production |
| embed-v3 | Cohere | 1024 | $0.10/M tokens | Multilingual applications |
| BGE-large-en-v1.5 | Open-source (BAAI) | 1024 | Free (self-hosted) | On-prem, cost-sensitive |
| nomic-embed-text-v1.5 | Open-source (Nomic) | 768 | Free (self-hosted) | Edge devices, lightweight |

**Recommendation:** Use BGE for open-source/self-hosted deployments; use OpenAI text-embedding-3-small for quick iteration.
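If you go the self-hosted route from the table above, BGE models run locally via the sentence-transformers ecosystem. Below is a minimal sketch using LangChain's Hugging Face integration; it assumes the ''langchain-huggingface'' and ''sentence-transformers'' packages are installed, and the query-instruction prefix is the one the BGE authors recommend:

<code python>
from langchain_huggingface import HuggingFaceEmbeddings

# Downloads BAAI/bge-large-en-v1.5 from the Hugging Face Hub on first use
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

# The BGE authors recommend prefixing queries (not documents) with an
# instruction string for short-query-to-passage retrieval
query = (
    "Represent this sentence for searching relevant passages: "
    "What is retrieval augmented generation?"
)
vector = embeddings.embed_query(query)
print(len(vector))  # 1024 dimensions, matching the table above
</code>

The resulting ''embeddings'' object drops into any LangChain snippet in this guide in place of ''OpenAIEmbeddings''; the LlamaIndex equivalent is ''HuggingFaceEmbedding'' from the ''llama-index-embeddings-huggingface'' package.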
===== Step 4: Store in a Vector Database =====

^ Database ^ Type ^ Strengths ^ Best For ^
| ChromaDB | Local/embedded | Simple API, free, in-memory | Prototyping (<1M vectors) |
| Pinecone | Managed cloud | Serverless scaling, hybrid search | Production, multi-tenant |
| Weaviate | Open-core | GraphQL API, modules, multi-modal | Semantic + keyword apps |
| Qdrant | Open-source | Fast HNSW, filtering, payloads | High-throughput, filtering |
| pgvector | Postgres extension | SQL joins, ACID compliance | Existing Postgres setups |

===== Step 5: Query and Retrieve =====

Retrieve the top-K most relevant chunks for a given query. The query is embedded with the same model used at indexing time, then matched against the stored vectors; both full pipelines below show this step in context.

===== Step 6: Generate an Answer =====

Pass the retrieved context plus the user query to an LLM. Both frameworks assemble the prompt for you, as the full pipelines below show.

===== Full Pipeline: LlamaIndex Approach =====

<code bash>
pip install llama-index llama-index-embeddings-openai llama-index-vector-stores-chroma llama-index-postprocessor-cohere-rerank chromadb
</code>

<code python>
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.postprocessor.cohere_rerank import CohereRerank
import chromadb

# Step 1: Load documents
docs = SimpleDirectoryReader("data/").load_data()

# Step 2: Chunk with the sentence-aware splitter
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)

# Steps 3-4: Embed and store in ChromaDB
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("rag_demo")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes,
    embed_model=embed_model,
    storage_context=storage_context
)

# Steps 5-6: Query with reranking
llm = OpenAI(model="gpt-4o-mini")
reranker = CohereRerank(top_n=3, model="rerank-english-v3.0")
query_engine = index.as_query_engine(
    llm=llm,
    node_postprocessors=[reranker]
)
response = query_engine.query("What are the main concepts in my documents?")
print(response)
</code>

===== Full Pipeline: LangChain Approach =====

<code bash>
pip install langchain langchain-community langchain-openai langchain-chroma langchain-cohere
</code>

<code python>
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.chains import RetrievalQA

# Step 1: Load documents
loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()

# Step 2: Chunk
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=200
)
splits = text_splitter.split_documents(docs)

# Steps 3-4: Embed and store in ChromaDB
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Step 5: Retrieve with reranking (over-fetch 20, rerank down)
compressor = CohereRerank(model="rerank-english-v3.0")
retriever = ContextualCompressionRetriever(
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
    base_compressor=compressor
)

# Step 6: Generate
llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa_chain.invoke({"query": "What are the main concepts?"})
print(result["result"])
</code>
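Both pipelines return only the final answer by default. To check what was actually retrieved (useful when debugging Steps 5-6), ''RetrievalQA'' accepts a ''return_source_documents'' flag; a small sketch against the LangChain pipeline above, where the ''source'' metadata key is the one ''DirectoryLoader'' sets (other loaders may differ):

<code python>
# Rebuild the chain so it also returns the retrieved chunks
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What are the main concepts?"})
print(result["result"])
for doc in result["source_documents"]:
    # Show which file each grounding chunk came from
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
</code>

Both pipelines also lean on Cohere's hosted reranker. If you need to stay fully self-hosted, a cross-encoder from ''sentence-transformers'' does the same job locally; the ''rerank'' helper below is illustrative rather than a library API, and the trade-offs are covered in the Reranking section further down:

<code python>
from sentence_transformers import CrossEncoder

# Cross-encoders score each (query, chunk) pair jointly, which is what
# makes them more precise than plain embedding similarity
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=3):
    scores = reranker.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Over-fetch candidates from the vector store built above, keep the best
query = "What are the main concepts?"
candidates = vectorstore.as_retriever(search_kwargs={"k": 20}).invoke(query)
top_docs = rerank(query, candidates)
</code>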
===== Advanced Techniques =====

==== Hybrid Search (Semantic + Keyword) ====

Combine dense vector search with sparse keyword (BM25) search for better recall, especially on proper nouns and exact terms.

<code python>
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 for keyword matching (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 10

# Vector search for semantic matching
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with a weighted ensemble
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]  # Favor semantic
)

results = hybrid_retriever.invoke("specific technical term")
</code>

==== HyDE (Hypothetical Document Embeddings) ====

Generate a hypothetical answer first, embed that, then retrieve. This can improve retrieval for vague or short queries by 5-15%.

<code python>
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def hyde_retrieve(query, vectorstore, llm):
    # Generate a hypothetical answer to the query
    hypothetical = llm.invoke(
        f"Write a short paragraph answering: {query}"
    ).content
    # Embed the hypothetical answer and retrieve against it
    results = vectorstore.similarity_search(hypothetical, k=5)
    return results

docs_for_query = hyde_retrieve("What is RAG?", vectorstore, llm)
</code>

==== Parent-Child Chunking ====

Use small chunks for precise retrieval but return the larger parent chunk to the LLM for full context. In LlamaIndex, the hierarchical node parser plus the auto-merging retriever implement this pattern:

<code python>
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# Parse into a hierarchy: large parent chunks and small child (leaf) chunks
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[1024, 256])
nodes = node_parser.get_nodes_from_documents(docs)
leaf_nodes = get_leaf_nodes(nodes)

# Store all nodes (parents included) so parents can be looked up at query time
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Index only the small leaf chunks for precise retrieval
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# Swap retrieved leaves for their parent when enough sibling leaves match
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=5),
    storage_context=storage_context
)
</code>

===== Reranking =====

Reranking refines the top-K retrieved results (e.g., 20 → 5) using cross-encoder models for more precise relevance scoring. This typically improves precision by 10-20%.

^ Reranker ^ Type ^ Cost ^ Best For ^
| Cohere Rerank v3 | API | $2/1K searches | Production, multilingual |
| ms-marco-MiniLM-L-6-v2 | Open-source | Free | Self-hosted, <100 docs/query |

===== Environment Variables =====

Both approaches require API keys set as environment variables:

<code bash>
export OPENAI_API_KEY="your-openai-key"
export COHERE_API_KEY="your-cohere-key"
</code>

===== Decision Guide =====

<code>
graph TD
    A[Starting a RAG pipeline?] --> B{Prototype or Production?}
    B -->|Prototype| C[ChromaDB + OpenAI small]
    B -->|Production| D{Scale?}
    D -->|< 1M docs| E[Qdrant + BGE]
    D -->|> 1M docs| F[Pinecone + OpenAI large]
    C --> G{Need multilingual?}
    E --> G
    F --> G
    G -->|Yes| H[Use Cohere embed-v3]
    G -->|No| I[Keep current embeddings]
    H --> J[Add Cohere Rerank]
    I --> J
    J --> K[Deploy & Monitor]
</code>

===== See Also =====

  * [[how_to_add_memory_to_an_agent|How to Add Memory to an Agent]]
  * [[how_to_evaluate_an_agent|How to Evaluate an Agent]]
  * [[how_to_use_mcp|How to Use MCP]]

{{tag>rag retrieval-augmented-generation llamaindex langchain vector-database embeddings how-to}}