How to Build a RAG Pipeline

An end-to-end practical guide to building Retrieval Augmented Generation (RAG) pipelines. This guide walks through every step from document ingestion to answer generation, with working code for both LlamaIndex and LangChain.

Pipeline Overview

graph LR
    A[📄 Documents] --> B[Parse & Chunk]
    B --> C[Embed]
    C --> D[Vector DB]
    E[🔍 User Query] --> F[Embed Query]
    F --> D
    D --> G[Retrieve Top-K]
    G --> H[Rerank]
    H --> I[LLM Generate]
    I --> J[✅ Answer]

Step 1: Choose and Load Documents

Your RAG pipeline starts with loading source documents. Common formats include PDF, Markdown, HTML, and plain text.

LlamaIndex:

from llama_index.core import SimpleDirectoryReader
 
# Loads all supported file types from a directory
docs = SimpleDirectoryReader("data/").load_data()
print(f"Loaded {len(docs)} documents")

LangChain:

from langchain_community.document_loaders import DirectoryLoader
 
loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()
print(f"Loaded {len(docs)} documents")

Step 2: Parse and Chunk

Chunking splits documents into retrievable units. The strategy you choose directly impacts retrieval quality.

Chunking Strategies Compared

| Strategy | How It Works | Best For | Trade-offs |
| --- | --- | --- | --- |
| Fixed-size | Uniform token/character splits with overlap | Simple documents, quick prototypes | Ignores semantic boundaries |
| Recursive | Hierarchical splits (paragraph → sentence → token) | Most general-purpose use cases | Good balance of speed and quality |
| Semantic | Groups sentences by embedding similarity | Long narratives, topical documents | Compute-intensive, slower |
| Parent-child | Small chunks for retrieval, linked to larger parent chunks | High-precision needs | More complex indexing |

Recommendation: Start with recursive chunking (chunk_size=1024, chunk_overlap=200) for most use cases. Note that LlamaIndex's SentenceSplitter measures chunk_size in tokens, while LangChain's RecursiveCharacterTextSplitter measures it in characters, so identical settings produce different chunk lengths.

LlamaIndex:

from llama_index.core.node_parser import SentenceSplitter
 
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)
print(f"Created {len(nodes)} chunks")

LangChain:

from langchain_text_splitters import RecursiveCharacterTextSplitter
 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=200
)
splits = text_splitter.split_documents(docs)
print(f"Created {len(splits)} chunks")

Semantic Chunking

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
 
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile"
)
semantic_chunks = semantic_splitter.split_documents(docs)

Step 3: Choose an Embedding Model

Embeddings convert text chunks into vectors for similarity search.

| Model | Provider | Dimensions | Cost | Best For |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | OpenAI | 1536 | $0.02/M tokens | Prototypes, budget-friendly |
| text-embedding-3-large | OpenAI | 3072 | $0.13/M tokens | High-precision production |
| embed-v3 | Cohere | 1024 | $0.10/M tokens | Multilingual applications |
| BGE-large-en-v1.5 | Open-source (BAAI) | 1024 | Free (self-hosted) | On-prem, cost-sensitive |
| nomic-embed-text-v1.5 | Open-source (Nomic) | 768 | Free (self-hosted) | Edge devices, lightweight |

Recommendation: Use BGE for open-source / self-hosted. Use OpenAI text-embedding-3-small for quick iteration.
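
For illustration, here is a minimal LangChain sketch of wiring up either option; it assumes the langchain-openai and langchain-huggingface packages are installed (the latter pulls in sentence-transformers).

from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
 
# Hosted option: cheap and fast for iteration
openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
 
# Self-hosted option: BGE runs locally, no API key required
bge_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
 
# Both expose the same interface
vector = openai_embeddings.embed_query("What is RAG?")
print(len(vector))  # 1536 dimensions for text-embedding-3-small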

Step 4: Store in a Vector Database

| Database | Type | Strengths | Best For |
| --- | --- | --- | --- |
| ChromaDB | Local/embedded | Simple API, free, in-memory | Prototyping (<1M vectors) |
| Pinecone | Managed cloud | Serverless scaling, hybrid search | Production, multi-tenant |
| Weaviate | Open-core | GraphQL API, modules, multi-modal | Semantic + keyword apps |
| Qdrant | Open-source | Fast HNSW, filtering, payloads | High-throughput, filtering |
| pgvector | Postgres extension | SQL joins, ACID compliance | Existing Postgres setups |
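
For illustration, here is a minimal sketch using ChromaDB's native client; texts and vectors stand in for the chunk texts and embeddings produced in Steps 2-3 (the framework-level wiring is shown in the full pipelines below).

import chromadb
 
# Persist chunks and their embeddings to a local ChromaDB collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("rag_demo")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(texts))],
    documents=texts,
    embeddings=vectors,
)
print(f"Stored {collection.count()} chunks")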

Step 5: Query and Retrieve

Retrieve the top-K most relevant chunks for a given query.
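
A minimal LangChain sketch, assuming the Chroma vectorstore built in Step 4 (the query string is illustrative):

# Fetch the 5 chunks most similar to the query
retrieved_docs = vectorstore.similarity_search(
    "How do I configure chunk overlap?", k=5
)
for doc in retrieved_docs:
    print(doc.page_content[:100])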

Step 6: Generate an Answer

Pass the retrieved context plus the user query to an LLM.
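
A minimal sketch of the generation step, reusing retrieved_docs from Step 5 (the prompt wording is illustrative):

from langchain_openai import ChatOpenAI
 
llm = ChatOpenAI(model="gpt-4o-mini")
 
# Stuff the retrieved chunks into the prompt as context
context = "\n\n".join(doc.page_content for doc in retrieved_docs)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: How do I configure chunk overlap?"
)
answer = llm.invoke(prompt).content
print(answer)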

Full Pipeline: LlamaIndex Approach

pip install llama-index llama-index-embeddings-openai llama-index-vector-stores-chroma llama-index-postprocessor-cohere-rerank chromadb

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.postprocessor.cohere_rerank import CohereRerank
import chromadb
 
# Step 1: Load documents
docs = SimpleDirectoryReader("data/").load_data()
 
# Step 2: Chunk with recursive splitter
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)
 
# Step 3-4: Embed and store in ChromaDB
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("rag_demo")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes,
    embed_model=embed_model,
    storage_context=storage_context
)
 
# Step 5-6: Query with reranking
llm = OpenAI(model="gpt-4o-mini")
reranker = CohereRerank(top_n=3, model="rerank-english-v3.0")
query_engine = index.as_query_engine(
    llm=llm,
    node_postprocessors=[reranker]
)
 
response = query_engine.query("What are the main concepts in my documents?")
print(response)

Full Pipeline: LangChain Approach

pip install langchain langchain-community langchain-openai langchain-chroma langchain-cohere unstructured

from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.chains import RetrievalQA
 
# Step 1: Load documents
loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()
 
# Step 2: Chunk
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=200
)
splits = text_splitter.split_documents(docs)
 
# Step 3-4: Embed and store in ChromaDB
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
 
# Step 5: Retrieve with reranking
compressor = CohereRerank(model="rerank-english-v3.0")
retriever = ContextualCompressionRetriever(
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
    base_compressor=compressor
)
 
# Step 6: Generate
llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa_chain.invoke({"query": "What are the main concepts?"})
print(result["result"])

Advanced Techniques

Hybrid Search (Semantic + Keyword)

Combine dense vector search with sparse keyword (BM25) search for better recall, especially on proper nouns and exact terms.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
 
# BM25 for keyword matching (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 10
 
# Vector for semantic matching
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
 
# Combine with weighted ensemble
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]  # Favor semantic
)
 
results = hybrid_retriever.invoke("specific technical term")

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer to the query first, embed that answer, and retrieve against it instead of the raw query. This typically improves retrieval for vague or short queries by 5-15%.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
 
llm = ChatOpenAI(model="gpt-4o-mini")
 
def hyde_retrieve(query, vectorstore, llm):
    # Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a short paragraph answering: {query}"
    ).content
 
    # Embed the hypothetical answer for retrieval
    results = vectorstore.similarity_search(hypothetical, k=5)
    return results
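
For example, assuming the vectorstore from Step 4 (the query is illustrative):

docs = hyde_retrieve("best chunk size for long reports?", vectorstore, llm)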

Parent-Child Chunking

Use small chunks for precise retrieval but return the larger parent chunk to the LLM for full context.

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
 
# Parse into a hierarchy: large parent chunks containing small child chunks
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[1024, 256])
nodes = node_parser.get_nodes_from_documents(docs)
leaf_nodes = get_leaf_nodes(nodes)
 
# Keep the full hierarchy in the docstore so parents can be looked up later
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
 
# Index only the small leaf (child) nodes for precise retrieval
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
 
# AutoMergingRetriever swaps retrieved children for their parent chunk
# when enough siblings match, so the LLM sees fuller context
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context,
)

Reranking

Reranking refines the top-K retrieved results (e.g., 20 → 5) using cross-encoder models for more precise relevance scoring. This typically improves precision by 10-20%.

| Reranker | Type | Cost | Best For |
| --- | --- | --- | --- |
| Cohere Rerank v3 | API | $2/M docs | Production, multilingual |
| ms-marco-MiniLM-L-6-v2 | Open-source | Free | Self-hosted, <100 docs/query |
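
A minimal self-hosted sketch using the sentence-transformers CrossEncoder (the model name comes from the table above; retrieved_docs stands in for the first-pass retrieval results):

from sentence_transformers import CrossEncoder
 
# Score each (query, chunk) pair with a cross-encoder, then keep the best 5
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "How does chunk overlap affect retrieval quality?"
candidates = [doc.page_content for doc in retrieved_docs]
scores = reranker.predict([(query, text) for text in candidates])
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_chunks = [text for _, text in ranked[:5]]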

Environment Variables

Both approaches require API keys set as environment variables:

export OPENAI_API_KEY="your-openai-key"
export COHERE_API_KEY="your-cohere-key"

Decision Guide

graph TD
    A[Starting a RAG pipeline?] --> B{Prototype or Production?}
    B -->|Prototype| C[ChromaDB + OpenAI small]
    B -->|Production| D{Scale?}
    D -->|< 1M docs| E[Qdrant + BGE]
    D -->|> 1M docs| F[Pinecone + OpenAI large]
    C --> G{Need multilingual?}
    E --> G
    F --> G
    G -->|Yes| H[Use Cohere embed-v3]
    G -->|No| I[Keep current embeddings]
    H --> J[Add Cohere Rerank]
    I --> J
    J --> K[Deploy & Monitor]

See Also

rag · retrieval-augmented-generation · llamaindex · langchain · vector-database · embeddings · how-to