How to Build a RAG Pipeline

An end-to-end practical guide to building Retrieval Augmented Generation (RAG) pipelines. This guide walks through every step from document ingestion to answer generation, with working code for both LlamaIndex and LangChain.

Pipeline Overview

graph LR
    A[📄 Documents] --> B[Parse & Chunk]
    B --> C[Embed]
    C --> D[Vector DB]
    E[🔍 User Query] --> F[Embed Query]
    F --> D
    D --> G[Retrieve Top-K]
    G --> H[Rerank]
    H --> I[LLM Generate]
    I --> J[✅ Answer]

Step 1: Choose and Load Documents

Your RAG pipeline starts with loading source documents. Common formats include PDF, Markdown, HTML, and plain text.

LlamaIndex:

from llama_index.core import SimpleDirectoryReader
 
# Loads all supported file types from a directory
docs = SimpleDirectoryReader("data/").load_data()
print(f"Loaded {len(docs)} documents")

LangChain:

from langchain_community.document_loaders import DirectoryLoader
 
loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()
print(f"Loaded {len(docs)} documents")

Step 2: Parse and Chunk

Chunking splits documents into retrievable units. The strategy you choose directly impacts retrieval quality.

Chunking Strategies Compared

| Strategy | How It Works | Best For | Trade-offs |
| --- | --- | --- | --- |
| Fixed-size | Uniform token/character splits with overlap | Simple documents, quick prototypes | Ignores semantic boundaries |
| Recursive | Hierarchical splits (paragraph → sentence → token) | Most general-purpose use cases | Good balance of speed and quality |
| Semantic | Groups sentences by embedding similarity | Long narratives, topical documents | Compute-intensive, slower |
| Parent-child | Small chunks for retrieval, linked to larger parent chunks | High-precision needs | More complex indexing |

Recommendation: Start with recursive chunking (chunk_size=1024, chunk_overlap=200) for most use cases. Note that LlamaIndex's SentenceSplitter measures chunk_size in tokens, while LangChain's RecursiveCharacterTextSplitter measures it in characters, so identical settings produce different chunk lengths.

LlamaIndex:

from llama_index.core.node_parser import SentenceSplitter
 
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)
print(f"Created {len(nodes)} chunks")

LangChain:

from langchain_text_splitters import RecursiveCharacterTextSplitter
 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=200
)
splits = text_splitter.split_documents(docs)
print(f"Created {len(splits)} chunks")

Semantic Chunking

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
 
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile"
)
semantic_chunks = semantic_splitter.split_documents(docs)

Step 3: Choose an Embedding Model

Embeddings convert text chunks into vectors for similarity search.

| Model | Provider | Dimensions | Cost | Best For |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | OpenAI | 1536 | $0.02/M tokens | Prototypes, budget-friendly |
| text-embedding-3-large | OpenAI | 3072 | $0.13/M tokens | High-precision production |
| embed-v3 | Cohere | 1024 | $0.10/M tokens | Multilingual applications |
| BGE-large-en-v1.5 | Open-source (BAAI) | 1024 | Free (self-hosted) | On-prem, cost-sensitive |
| nomic-embed-text-v1.5 | Open-source (Nomic) | 768 | Free (self-hosted) | Edge devices, lightweight |

Recommendation: Use BGE for open-source / self-hosted. Use OpenAI text-embedding-3-small for quick iteration.
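
For illustration, here is a minimal LangChain sketch of wiring up either option; it assumes the langchain-openai and langchain-huggingface packages are installed (the latter pulls in sentence-transformers).

from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
 
# Hosted option: cheap and fast for iteration
openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
 
# Self-hosted option: BGE runs locally, no API key required
bge_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
 
# Both expose the same interface
vector = openai_embeddings.embed_query("What is RAG?")
print(len(vector))  # 1536 dimensions for text-embedding-3-small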

Step 4: Store in a Vector Database

| Database | Type | Strengths | Best For |
| --- | --- | --- | --- |
| ChromaDB | Local/embedded | Simple API, free, in-memory | Prototyping (<1M vectors) |
| Pinecone | Managed cloud | Serverless scaling, hybrid search | Production, multi-tenant |
| Weaviate | Open-core | GraphQL API, modules, multi-modal | Semantic + keyword apps |
| Qdrant | Open-source | Fast HNSW, filtering, payloads | High-throughput, filtering |
| pgvector | Postgres extension | SQL joins, ACID compliance | Existing Postgres setups |
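
For illustration, here is a minimal sketch using ChromaDB's native client; texts and vectors stand in for the chunk texts and embeddings produced in Steps 2-3 (the framework-level wiring is shown in the full pipelines below).

import chromadb
 
# Persist chunks and their embeddings to a local ChromaDB collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("rag_demo")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(texts))],
    documents=texts,
    embeddings=vectors,
)
print(f"Stored {collection.count()} chunks")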

Step 5: Query and Retrieve

Retrieve the top-K most relevant chunks for a given query.
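
A minimal LangChain sketch, assuming the Chroma vectorstore built in Step 4 (the query string is illustrative):

# Fetch the 5 chunks most similar to the query
retrieved_docs = vectorstore.similarity_search(
    "How do I configure chunk overlap?", k=5
)
for doc in retrieved_docs:
    print(doc.page_content[:100])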

Step 6: Generate an Answer

Pass the retrieved context plus the user query to an LLM.
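
A minimal sketch of the generation step, reusing retrieved_docs from Step 5 (the prompt wording is illustrative):

from langchain_openai import ChatOpenAI
 
llm = ChatOpenAI(model="gpt-4o-mini")
 
# Stuff the retrieved chunks into the prompt as context
context = "\n\n".join(doc.page_content for doc in retrieved_docs)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: How do I configure chunk overlap?"
)
answer = llm.invoke(prompt).content
print(answer)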

Full Pipeline: LlamaIndex Approach

pip install llama-index llama-index-embeddings-openai llama-index-vector-stores-chroma llama-index-postprocessor-cohere-rerank chromadb

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.postprocessor.cohere_rerank import CohereRerank
import chromadb
 
# Step 1: Load documents
docs = SimpleDirectoryReader("data/").load_data()
 
# Step 2: Chunk with recursive splitter
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs)
 
# Step 3-4: Embed and store in ChromaDB
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("rag_demo")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes,
    embed_model=embed_model,
    storage_context=storage_context
)
 
# Step 5-6: Query with reranking
llm = OpenAI(model="gpt-4o-mini")
reranker = CohereRerank(top_n=3, model="rerank-english-v3.0")
query_engine = index.as_query_engine(
    llm=llm,
    node_postprocessors=[reranker]
)
 
response = query_engine.query("What are the main concepts in my documents?")
print(response)

Full Pipeline: LangChain Approach

pip install langchain langchain-community langchain-openai langchain-chroma langchain-cohere unstructured

from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.chains import RetrievalQA
 
# Step 1: Load documents
loader = DirectoryLoader("data/", glob="**/*.txt")
docs = loader.load()
 
# Step 2: Chunk
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=200
)
splits = text_splitter.split_documents(docs)
 
# Step 3-4: Embed and store in ChromaDB
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
 
# Step 5: Retrieve with reranking
compressor = CohereRerank(model="rerank-english-v3.0")
retriever = ContextualCompressionRetriever(
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
    base_compressor=compressor
)
 
# Step 6: Generate
llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa_chain.invoke({"query": "What are the main concepts?"})
print(result["result"])

Advanced Techniques

Hybrid Search (Semantic + Keyword)

Combine dense vector search with sparse keyword (BM25) search for better recall, especially on proper nouns and exact terms.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
 
# BM25 for keyword matching (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 10
 
# Vector for semantic matching
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
 
# Combine with weighted ensemble
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]  # Favor semantic
)
 
results = hybrid_retriever.invoke("specific technical term")

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer to the query first, embed that answer, and retrieve against it instead of the raw query. This typically improves retrieval for vague or short queries by 5-15%.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
 
llm = ChatOpenAI(model="gpt-4o-mini")
 
def hyde_retrieve(query, vectorstore, llm):
    # Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a short paragraph answering: {query}"
    ).content
 
    # Embed the hypothetical answer for retrieval
    results = vectorstore.similarity_search(hypothetical, k=5)
    return results
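
For example, assuming the vectorstore from Step 4 (the query is illustrative):

docs = hyde_retrieve("best chunk size for long reports?", vectorstore, llm)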

Parent-Child Chunking

Use small chunks for precise retrieval but return the larger parent chunk to the LLM for full context.

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
 
# Parse into a hierarchy: large parent chunks containing small child chunks
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[1024, 256])
nodes = node_parser.get_nodes_from_documents(docs)
leaf_nodes = get_leaf_nodes(nodes)
 
# Keep the full hierarchy in the docstore so parents can be looked up later
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
 
# Index only the small leaf (child) nodes for precise retrieval
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
 
# AutoMergingRetriever swaps retrieved children for their parent chunk
# when enough siblings match, so the LLM sees fuller context
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context,
)

Reranking

Reranking refines the top-K retrieved results (e.g., 20 → 5) using cross-encoder models for more precise relevance scoring. This typically improves precision by 10-20%.

| Reranker | Type | Cost | Best For |
| --- | --- | --- | --- |
| Cohere Rerank v3 | API | $2/M docs | Production, multilingual |
| ms-marco-MiniLM-L-6-v2 | Open-source | Free | Self-hosted, <100 docs/query |
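
A minimal self-hosted sketch using the sentence-transformers CrossEncoder (the model name comes from the table above; retrieved_docs stands in for the first-pass retrieval results):

from sentence_transformers import CrossEncoder
 
# Score each (query, chunk) pair with a cross-encoder, then keep the best 5
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "How does chunk overlap affect retrieval quality?"
candidates = [doc.page_content for doc in retrieved_docs]
scores = reranker.predict([(query, text) for text in candidates])
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_chunks = [text for _, text in ranked[:5]]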

Environment Variables

Both approaches require API keys set as environment variables:

export OPENAI_API_KEY="your-openai-key"
export COHERE_API_KEY="your-cohere-key"

Decision Guide

graph TD
    A[Starting a RAG pipeline?] --> B{Prototype or Production?}
    B -->|Prototype| C[ChromaDB + OpenAI small]
    B -->|Production| D{Scale?}
    D -->|< 1M docs| E[Qdrant + BGE]
    D -->|> 1M docs| F[Pinecone + OpenAI large]
    C --> G{Need multilingual?}
    E --> G
    F --> G
    G -->|Yes| H[Use Cohere embed-v3]
    G -->|No| I[Keep current embeddings]
    H --> J[Add Cohere Rerank]
    I --> J
    J --> K[Deploy & Monitor]

See Also

rag · retrieval-augmented-generation · llamaindex · langchain · vector-database · embeddings · how-to