====== How to Build a RAG Pipeline ======
An end-to-end practical guide to building Retrieval-Augmented Generation (RAG) pipelines. This guide walks through every step from document ingestion to answer generation, with working code for both LlamaIndex and LangChain.
===== Pipeline Overview =====
graph LR
A[📄 Documents] --> B[Parse & Chunk]
B --> C[Embed]
C --> D[Vector DB]
E[🔍 User Query] --> F[Embed Query]
F --> D
D --> G[Retrieve Top-K]
G --> H[Rerank]
H --> I[LLM Generate]
I --> J[✅ Answer]
===== Step 1: Choose and Load Documents =====
Your RAG pipeline starts with loading source documents. Common formats include PDF, Markdown, HTML, and plain text.
**LlamaIndex:**
from llama_index.core import SimpleDirectoryReader
# Loads all supported file types from a directory
docs = SimpleDirectoryReader("data/").load_data()
print(f"Loaded {len(docs)} documents")
**LangChain:**
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()
print(f"Loaded {len(docs)} documents")
===== Step 2: Parse and Chunk =====
Chunking splits documents into retrievable units. The strategy you choose directly impacts retrieval quality.
==== Chunking Strategies Compared ====
^ Strategy ^ How It Works ^ Best For ^ Trade-offs ^
| **Fixed-size** | Uniform token/character splits with overlap | Simple documents, quick prototypes | Ignores semantic boundaries |
| **Recursive** | Hierarchical splits (paragraph → sentence → token) | Most general-purpose use cases | Few downsides; good balance of speed and quality |
| **Semantic** | Groups sentences by embedding similarity | Long narratives, topical documents | Compute-intensive, slower |
| **Parent-child** | Small chunks for retrieval, linked to larger parent chunks | High-precision needs | More complex indexing |
**Recommendation:** Start with recursive chunking (size=1024, overlap=200) for most use cases.
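If you only need the fixed-size baseline from the first row of the table, a short sketch with LlamaIndex's TokenTextSplitter (chunk sizes here are illustrative, not tuned):
from llama_index.core.node_parser import TokenTextSplitter
# Uniform token-based splits with overlap; ignores semantic boundaries
fixed_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)
fixed_nodes = fixed_splitter.get_nodes_from_documents(docs)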
==== Recursive Chunking (Recommended Default) ====
**LlamaIndex:**
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)
print(f"Created {len(nodes)} chunks")
**LangChain:**
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1024,
chunk_overlap=200
)
splits = text_splitter.split_documents(docs)
print(f"Created {len(splits)} chunks")
==== Semantic Chunking ====
**LangChain** (uses the experimental langchain_experimental package):
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
semantic_splitter = SemanticChunker(
OpenAIEmbeddings(model="text-embedding-3-small"),
breakpoint_threshold_type="percentile"
)
semantic_chunks = semantic_splitter.split_documents(docs)
===== Step 3: Choose an Embedding Model =====
Embeddings convert text chunks into vectors for similarity search.
^ Model ^ Provider ^ Dimensions ^ Cost ^ Best For ^
| text-embedding-3-small | OpenAI | 1536 | $0.02/M tokens | Prototypes, budget-friendly |
| text-embedding-3-large | OpenAI | 3072 | $0.13/M tokens | High-precision production |
| embed-v3 | Cohere | 1024 | $0.10/M tokens | Multilingual applications |
| BGE-large-en-v1.5 | Open-source (BAAI) | 1024 | Free (self-hosted) | On-prem, cost-sensitive |
| nomic-embed-text-v1.5 | Open-source (Nomic) | 768 | Free (self-hosted) | Edge devices, lightweight |
**Recommendation:** Use BGE for open-source / self-hosted. Use OpenAI text-embedding-3-small for quick iteration.
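Either option takes a couple of lines to instantiate. A minimal sketch, assuming the langchain-openai and langchain-huggingface packages (the latter pulls in sentence-transformers for local BGE inference):
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
# Hosted option: cheap and fast for iteration
openai_embed = OpenAIEmbeddings(model="text-embedding-3-small")
# Self-hosted option: BGE runs locally via sentence-transformers
bge_embed = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
vector = openai_embed.embed_query("retrieval augmented generation")
print(len(vector))  # 1536 dimensions for text-embedding-3-small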
===== Step 4: Store in a Vector Database =====
^ Database ^ Type ^ Strengths ^ Best For ^
| ChromaDB | Local/embedded | Simple API, free, in-memory | Prototyping (<1M vectors) |
| Pinecone | Managed cloud | Serverless scaling, hybrid search | Production, multi-tenant |
| Weaviate | Open-core | GraphQL API, modules, multi-modal | Semantic + keyword apps |
| Qdrant | Open-source | Fast HNSW, filtering, payloads | High-throughput, filtering |
| pgvector | Postgres extension | SQL joins, ACID compliance | Existing Postgres setups |
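The full pipelines below use ChromaDB. As a sketch of how little changes when you swap stores, here is the same ingest step against Qdrant, assuming the langchain-qdrant and qdrant-client packages plus the splits and embeddings objects from Steps 2-3:
from langchain_qdrant import QdrantVectorStore
# In-memory Qdrant for experimentation; pass url="http://localhost:6333" for a running server
vectorstore = QdrantVectorStore.from_documents(
    documents=splits,
    embedding=embeddings,
    location=":memory:",
    collection_name="rag_demo"
)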
===== Step 5: Query and Retrieve =====
Retrieve the top-K most relevant chunks for a given query.
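A minimal sketch with LangChain, assuming the persisted Chroma store built in the full pipeline below; k controls how many chunks come back:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
# Reconnect to the Chroma collection persisted in Step 4
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)
# Top-K similarity search; scores allow a relevance cutoff if needed
results = vectorstore.similarity_search_with_score(
    "What are the main concepts in my documents?", k=5
)
for doc, score in results:
    print(f"{score:.3f}  {doc.page_content[:80]}")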
===== Step 6: Generate an Answer =====
Pass the retrieved context plus the user query to an LLM.
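As a sketch of what happens under the hood (the prompt wording and the generate_answer helper below are illustrative, not a library API), concatenate the retrieved chunks into a context block and ask the model to answer from it:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
def generate_answer(query, retrieved_docs):
    # Stuff the retrieved chunks into a single context block
    context = "\n\n".join(doc.page_content for doc in retrieved_docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm.invoke(prompt).content
# Example: reuse the documents retrieved in Step 5 (scores dropped)
print(generate_answer("What are the main concepts?", [doc for doc, _ in results]))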
===== Full Pipeline: LlamaIndex Approach =====
pip install llama-index llama-index-embeddings-openai llama-index-vector-stores-chroma llama-index-postprocessor-cohere-rerank chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.postprocessor.cohere_rerank import CohereRerank
import chromadb
# Step 1: Load documents
docs = SimpleDirectoryReader("data/").load_data()
# Step 2: Chunk with recursive splitter
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)
# Step 3-4: Embed and store in ChromaDB
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("rag_demo")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
nodes,
embed_model=embed_model,
storage_context=storage_context
)
# Step 5-6: Query with reranking
llm = OpenAI(model="gpt-4o-mini")
reranker = CohereRerank(top_n=3, model="rerank-english-v3.0")
query_engine = index.as_query_engine(
llm=llm,
node_postprocessors=[reranker]
)
response = query_engine.query("What are the main concepts in my documents?")
print(response)
===== Full Pipeline: LangChain Approach =====
pip install langchain langchain-community langchain-openai langchain-chroma langchain-cohere unstructured
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.chains import RetrievalQA
# Step 1: Load documents
loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()
# Step 2: Chunk
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1024,
chunk_overlap=200
)
splits = text_splitter.split_documents(docs)
# Step 3-4: Embed and store in ChromaDB
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=splits,
embedding=embeddings,
persist_directory="./chroma_db"
)
# Step 5: Retrieve with reranking
compressor = CohereRerank(model="rerank-english-v3.0")
retriever = ContextualCompressionRetriever(
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
base_compressor=compressor
)
# Step 6: Generate
llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa_chain.invoke({"query": "What are the main concepts?"})
print(result["result"])
===== Advanced Techniques =====
==== Hybrid Search (Semantic + Keyword) ====
Combine dense vector search with sparse keyword (BM25) search for better recall, especially on proper nouns and exact terms.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# BM25 for keyword matching (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 10
# Vector for semantic matching
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Combine with weighted ensemble
hybrid_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.3, 0.7] # Favor semantic
)
results = hybrid_retriever.invoke("specific technical term")
==== HyDE (Hypothetical Document Embeddings) ====
Generate a hypothetical answer first, embed that, then retrieve. Improves retrieval for vague or short queries by 5-15%.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
def hyde_retrieve(query, vectorstore, llm):
# Generate hypothetical answer
hypothetical = llm.invoke(
f"Write a short paragraph answering: {query}"
).content
# Embed the hypothetical answer for retrieval
results = vectorstore.similarity_search(hypothetical, k=5)
return results
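Usage is then a single call, assuming the vectorstore and llm objects from the full LangChain pipeline above and an illustrative query:
# Retrieval now matches against a hypothetical answer instead of the raw query
docs = hyde_retrieve("How should I pick a chunk size?", vectorstore, llm)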
==== Parent-Child Chunking ====
Use small chunks for precise retrieval but return the larger parent chunk to the LLM for full context.
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
# Build a hierarchy of large parent chunks with small child (leaf) chunks
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[1024, 256])
nodes = node_parser.get_nodes_from_documents(docs)
leaf_nodes = get_leaf_nodes(nodes)
# Store every node (parents included) so retrieved children can be merged upward
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
# Index only the small leaf chunks for precise retrieval
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
# AutoMergingRetriever swaps retrieved children for their parent chunk when
# enough siblings match, so the LLM sees the fuller context
retriever = AutoMergingRetriever(index.as_retriever(similarity_top_k=6), storage_context)
===== Reranking =====
Reranking refines the top-K retrieved results (e.g., 20 → 5) using cross-encoder models for more precise relevance scoring. This typically improves precision by 10-20%.
^ Reranker ^ Type ^ Cost ^ Best For ^
| Cohere Rerank v3 | API | $2/M docs | Production, multilingual |
| ms-marco-MiniLM-L-6-v2 | Open-source | Free | Self-hosted, <100 docs/query |
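For the self-hosted route, a minimal sketch with the sentence-transformers CrossEncoder; the candidate passages here are placeholders for your retrieved chunks:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What are the main concepts?"
candidates = ["chunk one ...", "chunk two ...", "chunk three ..."]  # top-K chunks from retrieval
# Score every (query, passage) pair, then keep the best 3
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:3]
for passage, score in reranked:
    print(f"{score:.2f}  {passage[:60]}")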
===== Environment Variables =====
Both approaches require API keys set as environment variables:
export OPENAI_API_KEY="your-openai-key"
export COHERE_API_KEY="your-cohere-key"
===== Decision Guide =====
graph TD
A[Starting a RAG pipeline?] --> B{Prototype or Production?}
B -->|Prototype| C[ChromaDB + OpenAI small]
B -->|Production| D{Scale?}
D -->|< 1M docs| E[Qdrant + BGE]
D -->|> 1M docs| F[Pinecone + OpenAI large]
C --> G{Need multilingual?}
E --> G
F --> G
G -->|Yes| H[Use Cohere embed-v3]
G -->|No| I[Keep current embeddings]
H --> J[Add Cohere Rerank]
I --> J
J --> K[Deploy & Monitor]
===== See Also =====
* [[how_to_add_memory_to_an_agent|How to Add Memory to an Agent]]
* [[how_to_evaluate_an_agent|How to Evaluate an Agent]]
* [[how_to_use_mcp|How to Use MCP]]
{{tag>rag retrieval-augmented-generation llamaindex langchain vector-database embeddings how-to}}