RAGAS: RAG Evaluation Framework

RAGAS (Retrieval-Augmented Generation Assessment) is an open-source Python framework for evaluating RAG pipelines, with over 9,000 GitHub stars. Introduced by Shahul Es et al. in their 2023 paper "RAGAS: Automated Evaluation of Retrieval Augmented Generation", it provides largely reference-free metrics that assess both retrieval quality and generation quality without requiring human-annotated ground truth for most metrics (context recall is the exception: it compares the retrieved context against a reference answer). It has become one of the most widely used tools for measuring RAG system performance in production.

The framework addresses a critical gap in the AI ecosystem: while building a RAG prototype is easy, measuring whether it actually works well is hard. RAGAS replaces subjective “vibe checks” with quantitative metrics that can run in CI pipelines, enabling systematic optimization of each pipeline stage independently.
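Because the metrics are plain numbers, a CI job can fail the build when they regress. A minimal sketch of such a gate; the threshold values, the `check_scores` helper, and the shape of the `scores` dict are illustrative assumptions, not part of RAGAS:

```python
# Minimal CI gate: fail the build if any RAGAS score drops below its threshold.
# Thresholds and the scores-dict shape are illustrative assumptions.

THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.80,
}

def check_scores(scores: dict) -> list[str]:
    """Return the names of metrics that fall below their threshold."""
    return [m for m, t in THRESHOLDS.items() if scores.get(m, 0.0) < t]

failures = check_scores({
    "faithfulness": 0.95,
    "answer_relevancy": 0.92,
    "context_precision": 0.88,
    "context_recall": 0.75,  # below the 0.80 gate
})
print(failures)  # ['context_recall']
```

In a CI pipeline, a non-empty `failures` list would exit non-zero and block the merge.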

Core Metrics

RAGAS evaluates RAG systems across four primary dimensions, covering both the retrieval and generation stages:

Generation Metrics

- Faithfulness: the fraction of claims in the generated answer that are supported by the retrieved context. An LLM judge extracts the answer's claims and verifies each one against the context.
- Answer Relevancy: how directly the answer addresses the question. Artificial questions are generated from the answer and compared to the original question via embedding cosine similarity.

Retrieval Metrics

- Context Precision: whether the relevant chunks in the retrieved context are ranked above the irrelevant ones.
- Context Recall: whether the retrieved context contains the information needed to produce the ground-truth answer.

Code Example

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset
 
# Prepare evaluation dataset with questions, contexts, answers, and ground truth
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
    ],
    "answer": [
        "The capital of France is Paris, located on the Seine River.",
        "Photosynthesis converts CO2 and water into glucose using sunlight.",
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France, on the Seine."],
        ["Photosynthesis is the process by which plants convert light energy "
         "into chemical energy, transforming CO2 and water into glucose and oxygen."],
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy to chemical energy in plants.",
    ],
}
 
dataset = Dataset.from_dict(eval_data)
 
# Run evaluation with all four core metrics
# (uses an LLM judge; the default backend requires an OpenAI API key)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
 
# Print per-metric scores
print(results)
# Example output (scores vary with the judge model and data):
# {'faithfulness': 0.95, 'answer_relevancy': 0.92,
#  'context_precision': 0.88, 'context_recall': 0.90}
 
# Convert to pandas for detailed analysis
df = results.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy"]])

Evaluation Pipeline

%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#E74C3C"}}}%%
graph TD
    A[RAG Pipeline Output] --> B[RAGAS Evaluation]
    B --> C[Faithfulness Check]
    B --> D[Answer Relevancy Check]
    B --> E[Context Precision Check]
    B --> F[Context Recall Check]
    C --> G[LLM Judge: Extract Claims]
    G --> H[Verify Claims vs Context]
    D --> I[Generate Artificial Questions]
    I --> J[Cosine Similarity to Original]
    E --> K[Rank Retrieved Chunks]
    K --> L[Precision Score]
    F --> M[Compare Context to Ground Truth]
    M --> N[Recall Score]
    H --> O[Aggregate Scores]
    J --> O
    L --> O
    N --> O
    O --> P[RAGAS Score Report]
    P --> Q[CI Pipeline Integration]
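The scoring logic in the diagram can be sketched in plain Python. The claim verdicts and embedding vectors below are stand-ins for what the LLM judge and embedding model would actually produce:

```python
import math

# Faithfulness: fraction of answer claims the LLM judge marks as
# supported by the retrieved context (verdicts here are stand-ins).
claim_verdicts = [True, True, True, False]  # 3 of 4 claims supported
faithfulness = sum(claim_verdicts) / len(claim_verdicts)
print(round(faithfulness, 2))  # 0.75

# Answer relevancy: mean cosine similarity between the original question's
# embedding and embeddings of questions regenerated from the answer.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

original_q = [0.9, 0.1, 0.4]  # stand-in embedding of the user's question
regenerated = [  # stand-in embeddings of questions generated from the answer
    [0.8, 0.2, 0.5],
    [0.85, 0.15, 0.4],
    [0.7, 0.3, 0.6],
]
answer_relevancy = sum(cosine(original_q, q) for q in regenerated) / len(regenerated)
print(round(answer_relevancy, 2))  # close to 1.0 when regenerated questions match
```

In the real framework both steps run through an LLM and an embedding model; this sketch only shows how the per-item judgments aggregate into the final scores.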

Synthetic Test Data Generation

Beyond evaluation, RAGAS includes a test set generator that builds question/ground-truth pairs directly from your documents, avoiding costly manual annotation:
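The generator controls the mix of question types (simple, reasoning, multi-context) through a configurable distribution. A self-contained sketch of that sampling idea; the type names match RAGAS's evolution types, but the `sample_question_types` helper is a hypothetical stand-in for the real document- and LLM-driven generation:

```python
import random

# Stand-in for RAGAS's evolution-based test generation: question types
# are drawn according to a configurable distribution, and each drawn type
# would then drive how a seed question is "evolved".
DISTRIBUTION = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}

def sample_question_types(n: int, rng: random.Random) -> list[str]:
    """Draw n question types according to DISTRIBUTION."""
    types = list(DISTRIBUTION)
    weights = [DISTRIBUTION[t] for t in types]
    return rng.choices(types, weights=weights, k=n)

rng = random.Random(42)  # seeded for reproducible test plans
plan = sample_question_types(8, rng)
print(plan)  # a mix of 'simple', 'reasoning', and 'multi_context'
```

In RAGAS itself, this distribution is passed to the test set generator along with your documents, which then produces full question/context/ground-truth rows ready for evaluate().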

Key Features

- Generates question/ground-truth pairs directly from source documents.
- "Evolves" simple seed questions into harder reasoning and multi-context variants.
- Lets you control the distribution of question types in the generated test set.
