Retrieval-Augmented Generation (RAG) is an AI architecture that enhances large language models by retrieving relevant external documents and incorporating them into the model's input prompt to generate more accurate, grounded responses. 1) Rather than relying solely on the model's training data, RAG lets the AI look things up before answering, reducing hallucinations and enabling access to current or proprietary information.
Large language models have two fundamental limitations. First, their knowledge has a cutoff date, so they cannot reference events or information after training. Second, they have no access to private or proprietary data such as internal company documents, personal notes, or specialized research. 2)
RAG addresses both limitations by letting the model retrieve and reference external information at query time, producing answers that are current, factual, and traceable to specific sources.
RAG operates through a three-phase pipeline:
1. Ingestion (Indexing)
Documents are processed, split into smaller chunks, converted into vector embeddings using an embedding model, and stored in a vector database for efficient retrieval. 3)
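A minimal sketch of the ingestion phase, using fixed-size word-window chunking with overlap and a toy bag-of-words "embedding" in place of a real embedding model. The function names (`chunk_text`, `embed`) are illustrative, not from any particular library:

```python
# Ingestion sketch: chunk a document, "embed" each chunk, build an index.
from collections import Counter

def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word-window chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(chunk: str) -> Counter:
    """Toy embedding: a sparse bag-of-words vector.
    A real system would call a trained embedding model here."""
    return Counter(chunk.lower().split())

# The "vector database" is just a list of (chunk, vector) pairs in this sketch.
index = [(c, embed(c)) for c in chunk_text("some long document " * 40)]
```

Overlap between consecutive chunks preserves context that would otherwise be cut at chunk boundaries, which is why most chunking strategies include it.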
2. Retrieval
When a user asks a question, the query is converted into a vector embedding and matched against the indexed documents using similarity search such as cosine similarity. The most relevant chunks are retrieved, often using hybrid methods combining keyword and semantic search with reranking. 4)
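The similarity search at the heart of retrieval can be sketched in a few lines. The two-dimensional vectors below are illustrative stand-ins for real embeddings, and a production system would query a vector database rather than scan a list:

```python
# Retrieval sketch: rank indexed chunks by cosine similarity to the query.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec: list[float], index, top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query vector."""
    scored = sorted(index,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]

index = [("Paris is the capital of France", [0.9, 0.1]),
         ("Photosynthesis occurs in chloroplasts", [0.1, 0.9])]
top = retrieve([0.8, 0.2], index, top_k=1)  # the France chunk ranks first
```

A hybrid setup would merge these semantic scores with keyword (e.g. BM25) scores before reranking, rather than using cosine similarity alone.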
3. Generation
The retrieved context is injected into the LLM prompt alongside the user question. The model synthesizes an evidence-based response, grounded in the retrieved documents, and can cite its sources. 5)
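The generation phase reduces to assembling a prompt that places the retrieved chunks ahead of the user's question. The template below and the `call_llm` placeholder are illustrative; any chat-completion API would slot in at the final step:

```python
# Generation sketch: inject retrieved chunks into the prompt, numbered so the
# model can cite them as [n].
def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return ("Answer using only the context below, citing sources as [n].\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

prompt = build_prompt("What is the capital of France?",
                      ["Paris is the capital of France."])
# The prompt is then sent to the model, e.g. response = call_llm(prompt)
```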
RAG implementations are commonly grouped into three levels of sophistication:

| Level | Description |
|---|---|
| Naive RAG | Basic retrieval and generation without advanced optimizations |
| Advanced RAG | Incorporates hybrid search, reranking, query expansion, and optimized chunking strategies |
| Agentic RAG | Uses AI agents for multi-step reasoning, routing, and self-correction |
RAG also has known limitations, each with established mitigations:

| Limitation | Description | Mitigation |
|---|---|---|
| Retrieval quality | Irrelevant chunks lead to factual errors | Hybrid search, reranking, better embeddings |
| Context window limits | Excessive content causes truncation or dilution | Optimal chunking, reducing top-k results |
| Data freshness | Stale indexes produce outdated responses | Automated refresh triggers |
| Latency | Retrieval adds delay in real-time applications | Semantic caching, efficient indexing |
| Data quality | Depends on source relevance and accuracy | Quality indexing, governance layers |
The quality of a RAG system depends more on the retrieval pipeline (chunking, embeddings, and reranking) than on the LLM itself. 10)
RAG and fine-tuning are often compared as alternatives for adapting a model:

| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge update | Dynamic via index refreshes, no retraining | Static, requires retraining for updates |
| Cost | Lower, uses off-the-shelf LLMs | Higher, needs domain data and compute |
| Customization | External data injection, preserves general capabilities | Deep domain adaptation but risks catastrophic forgetting |
| Hallucination reduction | Grounds in evidence with citations | Improves via examples but does not link references |
| Privacy | Handles private data at retrieval time | Involves training on sensitive data |
Most production systems use a hybrid approach: RAG for factual grounding and fine-tuning for tone and domain-specific behavior. 12)