A step-by-step debugging guide for Retrieval-Augmented Generation pipelines that return irrelevant, incomplete, or incorrect answers. Based on real production failure analysis and benchmarking data.
When RAG produces bad results, the instinct is to blame the LLM. In practice, the problem is almost always upstream of generation — in chunking, embedding, retrieval, or context assembly. A 2026 taxonomy from Layer 6 AI identified 7 distinct error categories in production RAG systems, most occurring before the LLM ever sees the query.
Symptoms: Retrieved chunks contain the answer buried in irrelevant text. LLM ignores the relevant part or gets confused by contradictory information in the same chunk.
Why it happens: Large chunks (>1024 tokens) dilute the embedding vector. The embedding represents a blend of all topics in the chunk, making it weakly similar to any specific query.
Fix: Target 256-512 tokens per chunk. Use semantic chunking that respects document structure (headings, paragraphs) rather than fixed token counts.
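A minimal sketch of the structure-aware approach: pack whole paragraphs into chunks near the target size instead of cutting at fixed offsets. The function name and the word-count-based token estimate are illustrative, not from any library.

```python
def chunk_by_structure(text: str, target_tokens: int = 400, max_tokens: int = 512):
    """Split text on paragraph boundaries, packing paragraphs into chunks
    that stay near the target size instead of cutting at fixed offsets."""
    approx = lambda s: int(len(s.split()) * 1.3)  # rough token estimate
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        p_tokens = approx(para)
        # Start a new chunk if adding this paragraph would exceed the cap
        if current and current_tokens + p_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += p_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because paragraph boundaries are respected, no sentence is ever split mid-thought; the trade-off is that chunk sizes vary around the target rather than being exact.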
Symptoms: Retrieved chunks are relevant but lack enough context for the LLM to synthesize a complete answer. Answers are fragmented or miss important caveats.
Why it happens: Small chunks (<128 tokens) capture individual sentences but lose surrounding context. The embedding is precise but the content is insufficient.
Fix: Add overlap (50-100 tokens) between chunks. Use parent-child retrieval: retrieve the small chunk for precision, but pass the parent section for context.
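The parent-child pattern can be sketched in a few lines; here keyword overlap stands in for vector search, and all names are illustrative. LangChain's ParentDocumentRetriever implements the same idea against a real vector store.

```python
def build_parent_child_index(sections: dict, child_size: int = 40):
    """Split each parent section into small child chunks that remember
    their parent, so retrieval can match narrowly but return broadly."""
    children = []
    for parent_id, text in sections.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[i:i + child_size]),
            })
    return children

def retrieve_with_parent(query: str, children: list, sections: dict):
    """Score children by keyword overlap (a stand-in for vector search),
    then return the full parent section of the best-matching child."""
    q_terms = set(query.lower().split())
    best = max(children, key=lambda c: len(q_terms & set(c["text"].lower().split())))
    return sections[best["parent_id"]]
```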
Symptoms: Retrieval returns semantically similar but contextually wrong documents. Query about “database indexing” retrieves “search engine indexing.”
Why it happens: General-purpose embeddings may not capture domain-specific semantics. A mismatch between query and document encoders (e.g., documents indexed with one embedding model and queries embedded with another) can also distort similarity scores.
Fix: Benchmark multiple embedding models on your actual queries. Consider domain-specific fine-tuning. Use the MTEB leaderboard to select models.
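One way to run such a benchmark is to measure recall@k on a handful of labeled query-to-document pairs. The sketch below is model-agnostic: it takes any `embed` callable returning a vector, so the same harness can compare candidate models side by side (the function name is illustrative).

```python
import numpy as np

def recall_at_k(embed, queries, docs, relevant, k=3):
    """Embed queries and docs with one model and measure how often the
    known-relevant doc index lands in the top-k by cosine similarity."""
    D = np.array([embed(d) for d in docs], dtype=float)
    D /= np.linalg.norm(D, axis=1, keepdims=True)  # normalize doc vectors
    hits = 0
    for q, rel_idx in zip(queries, relevant):
        v = np.asarray(embed(q), dtype=float)
        v /= np.linalg.norm(v)
        top = np.argsort(D @ v)[::-1][:k]  # top-k by cosine similarity
        hits += int(rel_idx in top)
    return hits / len(queries)
```

Call it once per candidate model (e.g., passing each model's `encode` method as `embed`) and keep the model with the highest recall on your own queries.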
Symptoms: Relevant documents exist in top-20 but not in top-5 passed to the LLM. Answer quality varies unpredictably between similar queries.
Why it happens: Vector similarity is a rough proxy for relevance. Without reranking, the ordering is based purely on embedding distance, which misses nuanced relevance signals.
Fix: Add a cross-encoder reranker (e.g., Cohere Rerank, bge-reranker) between retrieval and generation. Retrieve top-20, rerank to top-5.
Symptoms: LLM answers correctly when the relevant chunk is first or last in context, but fails when it's in the middle. Performance degrades as you add more context chunks.
Why it happens: Research shows LLMs attend more to the beginning and end of their context window, paying less attention to middle sections (Liu et al., 2024).
Fix: Place the most relevant chunk first. Limit context to 3-5 chunks. Use reciprocal rank fusion to ensure the best chunk is always prominent.
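Reciprocal rank fusion takes only a few lines: each document scores the sum of 1/(k + rank) across the ranked lists it appears in, so a document ranked well by any retriever rises to the top. `k=60` is the constant used in the original RRF formulation.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60):
    """Fuse several ranked lists of doc IDs into one ordering.
    Each doc accumulates 1/(k + rank) from every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```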
Symptoms: RAG retrieves outdated versions of documents. Wrong department's policies returned. Answers mix information from incompatible sources.
Why it happens: Pure semantic search has no concept of recency, access control, or source categorization.
Fix: Add metadata fields (date, source, category, version) to every chunk. Apply pre-retrieval filters before vector search.
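A minimal sketch of pre-retrieval filtering, assuming chunks are dicts carrying a `metadata` field (the field names and function signature are illustrative): metadata narrows the candidate set first, so vector similarity only ranks chunks that are even eligible.

```python
from datetime import date

def prefilter(chunks: list, department: str = None, min_date: date = None):
    """Drop chunks that fail metadata constraints before vector search,
    so semantic similarity never ranks ineligible documents."""
    out = []
    for c in chunks:
        meta = c["metadata"]
        if department and meta.get("department") != department:
            continue
        if min_date and meta.get("date", date.min) < min_date:
            continue
        out.append(c)
    return out
```

Most vector stores expose the same idea natively as a metadata filter argument on the search call, which is faster than filtering in application code.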
Symptoms: New documents aren't found. Answers reflect old information even after documents are updated. RAG contradicts what users know to be current.
Why it happens: Embeddings are computed at index time. If documents change without re-indexing, the vector store serves stale data.
Fix: Implement incremental indexing triggered by document updates. Add freshness scoring. Monitor index age vs. source document timestamps.
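A simple staleness check compares when each document was last indexed against when its source was last modified (the function and parameter names are illustrative); anything modified after indexing, or never indexed, goes into the re-indexing queue.

```python
from datetime import datetime, timedelta

def find_stale_entries(index_times: dict, source_times: dict,
                       max_lag: timedelta = timedelta(hours=24)):
    """Flag docs whose source changed more than max_lag after the last
    index pass, plus docs missing from the index entirely."""
    stale = []
    for doc_id, modified in source_times.items():
        indexed = index_times.get(doc_id)
        if indexed is None or modified - indexed > max_lag:
            stale.append(doc_id)
    return sorted(stale)
```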
Feed known-good context directly to the LLM, bypassing retrieval. If the answer is correct, your problem is retrieval. If it's still wrong, your problem is generation/prompting.
# Quick isolation test

def test_generation_directly(llm, question, known_good_context):
    """Bypass retrieval to test if LLM can answer with perfect context."""
    prompt = (
        f"Answer based only on this context:\n{known_good_context}\n\n"
        f"Question: {question}"
    )
    response = llm.invoke(prompt)
    print(f"Direct context answer: {response.content}")
    return response.content


def test_retrieval_quality(retriever, question, expected_doc_ids):
    """Check if retrieval returns expected documents."""
    docs = retriever.invoke(question)
    retrieved_ids = [doc.metadata.get("id") for doc in docs]
    recall = len(set(retrieved_ids) & set(expected_doc_ids)) / len(expected_doc_ids)
    print(f"Retrieval recall: {recall:.0%}")
    for i, doc in enumerate(docs):
        print(f"  [{i+1}] {doc.metadata.get('source', 'unknown')}: {doc.page_content[:100]}...")
    return recall
def analyze_chunks(chunks: list, query: str):
    """Analyze chunk statistics to identify sizing issues."""
    import tiktoken
    enc = tiktoken.encoding_for_model("gpt-4o")
    stats = []
    for i, chunk in enumerate(chunks):
        tokens = len(enc.encode(chunk.page_content))
        stats.append({"index": i, "tokens": tokens, "preview": chunk.page_content[:80]})
    avg_tokens = sum(s["tokens"] for s in stats) / len(stats)
    min_tokens = min(s["tokens"] for s in stats)
    max_tokens = max(s["tokens"] for s in stats)
    print(f"Chunk stats: avg={avg_tokens:.0f}, min={min_tokens}, max={max_tokens} tokens")
    if avg_tokens > 1024:
        print("WARNING: Chunks too large - embedding quality degrades above 512 tokens")
    elif avg_tokens < 100:
        print("WARNING: Chunks too small - insufficient context per chunk")
    else:
        print("OK: Chunk sizes in reasonable range")
    return stats
RAGAS (Retrieval Augmented Generation Assessment) provides automated metrics for RAG quality without requiring ground-truth labels for every question.
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset from your RAG pipeline traces
eval_data = {
    "question": [
        "What is the refund policy for enterprise plans?",
        "How do I configure SSO with SAML?",
    ],
    "answer": [  # Actual answers from your RAG pipeline
        "Enterprise plans have a 30-day refund window...",
        "To configure SSO, navigate to Settings > Security...",
    ],
    "contexts": [  # Retrieved chunks for each question
        ["Enterprise refund policy: 30-day window for annual plans..."],
        ["SSO Configuration Guide: Go to Settings > Security > SAML..."],
    ],
    "ground_truth": [  # Human-verified correct answers (needed for context_recall)
        "Enterprise customers can request refunds within 30 days...",
        "SSO with SAML is configured under Settings > Security...",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# Output: {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#          'context_precision': 0.85, 'context_recall': 0.90}

# Interpret results:
# - faithfulness < 0.8      -> LLM is hallucinating beyond retrieved context
# - answer_relevancy < 0.8  -> Answers don't address the question
# - context_precision < 0.8 -> Retrieved chunks contain too much noise
# - context_recall < 0.8    -> Retrieval is missing relevant documents
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Wrap your base retriever with a reranker
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,  # Return top 5 after reranking
)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)

# Now retrieval quality is significantly better
docs = retriever.invoke("What is the refund policy?")
Run this to find the optimal chunk size for your specific data:
from ragas import evaluate
from ragas.metrics import context_precision, context_recall
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from datasets import Dataset

def benchmark_chunk_sizes(documents, test_questions, ground_truths, embedding_model):
    """Test different chunk sizes and measure retrieval quality."""
    results = {}
    for chunk_size in [128, 256, 512, 768, 1024, 1536]:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=int(chunk_size * 0.15),  # 15% overlap
        )
        chunks = splitter.split_documents(documents)
        print(f"Chunk size {chunk_size}: {len(chunks)} chunks created")

        # Build temp vectorstore and retriever
        temp_store = Chroma.from_documents(chunks, embedding_model)
        retriever = temp_store.as_retriever(search_kwargs={"k": 5})

        # Retrieve for each question
        contexts = []
        for q in test_questions:
            docs = retriever.invoke(q)
            contexts.append([d.page_content for d in docs])

        # Evaluate with RAGAS
        eval_dataset = Dataset.from_dict({
            "question": test_questions,
            "contexts": contexts,
            "ground_truth": ground_truths,
            "answer": [""] * len(test_questions),
        })
        score = evaluate(eval_dataset, metrics=[context_precision, context_recall])
        results[chunk_size] = score
        print(f"  Precision: {score['context_precision']:.3f}, Recall: {score['context_recall']:.3f}")
    return results