An end-to-end practical guide to building Retrieval Augmented Generation (RAG) pipelines. This guide walks through every step from document ingestion to answer generation, with working code for both LlamaIndex and LangChain.
Your RAG pipeline starts with loading source documents. Common formats include PDF, Markdown, HTML, and plain text.
LlamaIndex:
```python
from llama_index.core import SimpleDirectoryReader

# Loads all supported file types from a directory
docs = SimpleDirectoryReader("data/").load_data()
print(f"Loaded {len(docs)} documents")
```
LangChain:
```python
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()
print(f"Loaded {len(docs)} documents")
```
Chunking splits documents into retrievable units. The strategy you choose directly impacts retrieval quality.
| Strategy | How It Works | Best For | Trade-offs |
|---|---|---|---|
| Fixed-size | Uniform token/character splits with overlap | Simple documents, quick prototypes | Ignores semantic boundaries |
| Recursive | Hierarchical splits (paragraph → sentence → token) | Most general-purpose use cases | Good balance of speed and quality |
| Semantic | Groups sentences by embedding similarity | Long narratives, topical documents | Compute-intensive, slower |
| Parent-child | Small chunks for retrieval, linked to larger parent chunks | High-precision needs | More complex indexing |
Recommendation: Start with recursive chunking (size=1024, overlap=200) for most use cases.
LlamaIndex:
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)
print(f"Created {len(nodes)} chunks")
```
LangChain:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=200,
)
splits = text_splitter.split_documents(docs)
print(f"Created {len(splits)} chunks")
```
For semantic chunking, LangChain provides an experimental SemanticChunker that splits where embedding similarity between sentences drops:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Splits at points where sentence-to-sentence similarity falls below a percentile threshold
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
)
semantic_chunks = semantic_splitter.split_documents(docs)
```
Embeddings convert text chunks into vectors for similarity search.
| Model | Provider | Dimensions | Cost | Best For |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | $0.02/M tokens | Prototypes, budget-friendly |
| text-embedding-3-large | OpenAI | 3072 | $0.13/M tokens | High-precision production |
| embed-v3 | Cohere | 1024 | $0.10/M tokens | Multilingual applications |
| BGE-large-en-v1.5 | Open-source (BAAI) | 1024 | Free (self-hosted) | On-prem, cost-sensitive |
| nomic-embed-text-v1.5 | Open-source (Nomic) | 768 | Free (self-hosted) | Edge devices, lightweight |
Recommendation: Use BGE for open-source / self-hosted. Use OpenAI text-embedding-3-small for quick iteration.
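For the self-hosted route, here is a minimal sketch that runs BGE locally through LangChain. It assumes the `langchain-huggingface` and `sentence-transformers` packages are installed; any model name from the table above can be swapped in.

```python
from langchain_huggingface import HuggingFaceEmbeddings

# Runs BGE locally; no API key or per-token cost
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

vector = embeddings.embed_query("What is retrieval augmented generation?")
print(len(vector))  # should print 1024 for BGE-large
```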
| Database | Type | Strengths | Best For |
|---|---|---|---|
| ChromaDB | Local/embedded | Simple API, free, in-memory | Prototyping (<1M vectors) |
| Pinecone | Managed cloud | Serverless scaling, hybrid search | Production, multi-tenant |
| Weaviate | Open-core | GraphQL API, modules, multi-modal | Semantic + keyword apps |
| Qdrant | Open-source | Fast HNSW, filtering, payloads | High-throughput, filtering |
| pgvector | Postgres extension | SQL joins, ACID compliance | Existing Postgres setups |
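The stores above are largely interchangeable behind LangChain's vector store interface. As a sketch, swapping the Chroma store used later in this guide for an in-memory Qdrant instance might look like this (assumes `qdrant-client` and `langchain-community` are installed, and reuses the `splits` and `embeddings` objects from the LangChain pipeline below):

```python
from langchain_community.vectorstores import Qdrant

# ":memory:" spins up an ephemeral local instance; point `url` at a server for production
qdrant_store = Qdrant.from_documents(
    splits,
    embeddings,
    location=":memory:",
    collection_name="rag_demo",
)
```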
At query time, embed the user query and retrieve the top-K most relevant chunks from the vector store.
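As a minimal sketch, dense top-K retrieval in LangChain is a single call. This assumes a `vectorstore` like the Chroma instance built in the complete pipeline below:

```python
# Return the 5 chunks whose embeddings are closest to the query embedding
query = "What are the main concepts in my documents?"
top_chunks = vectorstore.similarity_search(query, k=5)
for chunk in top_chunks:
    print(chunk.page_content[:100])
```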
Finally, pass the retrieved context plus the user query to an LLM to generate a grounded answer.
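A minimal generation step, assuming the `top_chunks` and `query` from the retrieval sketch above and an OpenAI key in the environment; the prompt wording is illustrative, not a fixed template:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# Concatenate retrieved chunks into a context block and ask the LLM to answer from it
context = "\n\n".join(chunk.page_content for chunk in top_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
answer = llm.invoke(prompt)
print(answer.content)
```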
```bash
pip install llama-index llama-index-embeddings-openai llama-index-vector-stores-chroma llama-index-postprocessor-cohere-rerank chromadb
```
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.postprocessor.cohere_rerank import CohereRerank
import chromadb

# Step 1: Load documents
docs = SimpleDirectoryReader("data/").load_data()

# Step 2: Chunk with recursive splitter
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)

# Steps 3-4: Embed and store in ChromaDB
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("rag_demo")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes,
    embed_model=embed_model,
    storage_context=storage_context,
)

# Steps 5-6: Query with reranking
llm = OpenAI(model="gpt-4o-mini")
reranker = CohereRerank(top_n=3, model="rerank-english-v3.0")
query_engine = index.as_query_engine(
    llm=llm,
    node_postprocessors=[reranker],
)
response = query_engine.query("What are the main concepts in my documents?")
print(response)
```
```bash
pip install langchain langchain-community langchain-openai langchain-chroma langchain-cohere
```
```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.chains import RetrievalQA

# Step 1: Load documents
loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()

# Step 2: Chunk
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=200,
)
splits = text_splitter.split_documents(docs)

# Steps 3-4: Embed and store in ChromaDB
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# Step 5: Retrieve with reranking
compressor = CohereRerank(model="rerank-english-v3.0")
retriever = ContextualCompressionRetriever(
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
    base_compressor=compressor,
)

# Step 6: Generate
llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa_chain.invoke({"query": "What are the main concepts?"})
print(result["result"])
```
Combine dense vector search with sparse keyword (BM25) search for better recall, especially on proper nouns and exact terms.
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package

# BM25 for keyword matching
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 10

# Vector search for semantic matching
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with a weighted ensemble (favor semantic)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],
)
results = hybrid_retriever.invoke("specific technical term")
```
Hypothetical Document Embeddings (HyDE): generate a hypothetical answer first, embed that instead of the raw query, then retrieve. This improves retrieval for vague or short queries by 5-15%.
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def hyde_retrieve(query, vectorstore, llm):
    # Generate a hypothetical answer to the query
    hypothetical = llm.invoke(
        f"Write a short paragraph answering: {query}"
    ).content
    # Embed the hypothetical answer and retrieve against it
    results = vectorstore.similarity_search(hypothetical, k=5)
    return results
```
Use small chunks for precise retrieval but return the larger parent chunk to the LLM for full context.
```python
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# Build a hierarchy: large parent chunks split into small child (leaf) chunks
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[1024, 256])
nodes = node_parser.get_nodes_from_documents(docs)
leaf_nodes = get_leaf_nodes(nodes)

# Store all nodes so parents can be looked up at query time
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Index only the small leaf chunks for precise retrieval
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# AutoMergingRetriever replaces retrieved leaves with their parent when enough siblings match
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=5),
    storage_context=storage_context,
)
```
Reranking refines the top-K retrieved results (e.g., 20 → 5) using cross-encoder models for more precise relevance scoring. This typically improves precision by 10-20%.
| Reranker | Type | Cost | Best For |
|---|---|---|---|
| Cohere Rerank v3 | API | $2/M docs | Production, multilingual |
| ms-marco-MiniLM-L-6-v2 | Open-source | Free | Self-hosted, <100 docs/query |
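For the self-hosted option, here is a minimal cross-encoder reranking sketch using the `sentence-transformers` package (assumed installed); `query` and `retrieved_docs` are placeholders for your query string and whatever your retriever returned:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each (query, document) pair, then keep the 3 highest-scoring documents
pairs = [(query, doc.page_content) for doc in retrieved_docs]
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, retrieved_docs), key=lambda pair: pair[0], reverse=True)
top_docs = [doc for _, doc in ranked[:3]]
```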
Both approaches require API keys set as environment variables:
```bash
export OPENAI_API_KEY="your-openai-key"
export COHERE_API_KEY="your-cohere-key"
```