RAGAS: RAG Evaluation Framework

RAGAS (Retrieval-Augmented Generation Assessment) is an open-source Python framework for evaluating RAG pipelines, with over 9,000 GitHub stars. Introduced by Shahul Es et al. in their 2023 paper "RAGAS: Automated Evaluation of Retrieval Augmented Generation", it provides largely reference-free metrics that assess both retrieval quality and generation quality without requiring human-annotated ground truth for most metrics (context recall is the exception: it compares the retrieved context against a reference answer). It has become one of the most widely used tools for measuring RAG system performance in production.

The framework addresses a critical gap in the AI ecosystem: while building a RAG prototype is easy, measuring whether it actually works well is hard. RAGAS replaces subjective “vibe checks” with quantitative metrics that can run in CI pipelines, enabling systematic optimization of each pipeline stage independently.
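Because the metrics are plain numbers, a CI job can fail the build when they regress. A minimal sketch of such a gate; the threshold values, the `check_scores` helper, and the shape of the `scores` dict are illustrative assumptions, not part of RAGAS:

```python
# Minimal CI gate: fail the build if any RAGAS score drops below its threshold.
# Thresholds and the scores-dict shape are illustrative assumptions.

THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.80,
}

def check_scores(scores: dict) -> list[str]:
    """Return the names of metrics that fall below their threshold."""
    return [m for m, t in THRESHOLDS.items() if scores.get(m, 0.0) < t]

failures = check_scores({
    "faithfulness": 0.95,
    "answer_relevancy": 0.92,
    "context_precision": 0.88,
    "context_recall": 0.75,  # below the 0.80 gate
})
print(failures)  # ['context_recall']
```

In a CI pipeline, a non-empty `failures` list would exit non-zero and block the merge.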

Core Metrics

RAGAS evaluates RAG systems across four primary dimensions, covering both the retrieval and generation stages:

Generation Metrics

- Faithfulness: the fraction of claims in the generated answer that are supported by the retrieved context. An LLM judge extracts the answer's claims and verifies each one against the context.
- Answer Relevancy: how directly the answer addresses the question. Artificial questions are generated from the answer and compared to the original question via embedding cosine similarity.

Retrieval Metrics

- Context Precision: whether the relevant chunks in the retrieved context are ranked above the irrelevant ones.
- Context Recall: whether the retrieved context contains the information needed to produce the ground-truth answer.

Code Example

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset
 
# Prepare evaluation dataset with questions, contexts, answers, and ground truth
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
    ],
    "answer": [
        "The capital of France is Paris, located on the Seine River.",
        "Photosynthesis converts CO2 and water into glucose using sunlight.",
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France, on the Seine."],
        ["Photosynthesis is the process by which plants convert light energy "
         "into chemical energy, transforming CO2 and water into glucose and oxygen."],
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy to chemical energy in plants.",
    ],
}
 
dataset = Dataset.from_dict(eval_data)
 
# Run evaluation with all four core metrics
# (uses an LLM judge; the default backend requires an OpenAI API key)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
 
# Print per-metric scores
print(results)
# Example output (scores vary with the judge model and data):
# {'faithfulness': 0.95, 'answer_relevancy': 0.92,
#  'context_precision': 0.88, 'context_recall': 0.90}
 
# Convert to pandas for detailed analysis
df = results.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy"]])

Evaluation Pipeline

%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#E74C3C"}}}%%
graph TD
    A[RAG Pipeline Output] --> B[RAGAS Evaluation]
    B --> C[Faithfulness Check]
    B --> D[Answer Relevancy Check]
    B --> E[Context Precision Check]
    B --> F[Context Recall Check]
    C --> G[LLM Judge: Extract Claims]
    G --> H[Verify Claims vs Context]
    D --> I[Generate Artificial Questions]
    I --> J[Cosine Similarity to Original]
    E --> K[Rank Retrieved Chunks]
    K --> L[Precision Score]
    F --> M[Compare Context to Ground Truth]
    M --> N[Recall Score]
    H --> O[Aggregate Scores]
    J --> O
    L --> O
    N --> O
    O --> P[RAGAS Score Report]
    P --> Q[CI Pipeline Integration]
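The scoring logic in the diagram can be sketched in plain Python. The claim verdicts and embedding vectors below are stand-ins for what the LLM judge and embedding model would actually produce:

```python
import math

# Faithfulness: fraction of answer claims the LLM judge marks as
# supported by the retrieved context (verdicts here are stand-ins).
claim_verdicts = [True, True, True, False]  # 3 of 4 claims supported
faithfulness = sum(claim_verdicts) / len(claim_verdicts)
print(round(faithfulness, 2))  # 0.75

# Answer relevancy: mean cosine similarity between the original question's
# embedding and embeddings of questions regenerated from the answer.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

original_q = [0.9, 0.1, 0.4]  # stand-in embedding of the user's question
regenerated = [  # stand-in embeddings of questions generated from the answer
    [0.8, 0.2, 0.5],
    [0.85, 0.15, 0.4],
    [0.7, 0.3, 0.6],
]
answer_relevancy = sum(cosine(original_q, q) for q in regenerated) / len(regenerated)
print(round(answer_relevancy, 2))  # close to 1.0 when regenerated questions match
```

In the real framework both steps run through an LLM and an embedding model; this sketch only shows how the per-item judgments aggregate into the final scores.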

Synthetic Test Data Generation

Beyond evaluation, RAGAS includes a test set generator that builds question/ground-truth pairs directly from your documents, avoiding costly manual annotation:
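The generator controls the mix of question types (simple, reasoning, multi-context) through a configurable distribution. A self-contained sketch of that sampling idea; the type names match RAGAS's evolution types, but the `sample_question_types` helper is a hypothetical stand-in for the real document- and LLM-driven generation:

```python
import random

# Stand-in for RAGAS's evolution-based test generation: question types
# are drawn according to a configurable distribution, and each drawn type
# would then drive how a seed question is "evolved".
DISTRIBUTION = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}

def sample_question_types(n: int, rng: random.Random) -> list[str]:
    """Draw n question types according to DISTRIBUTION."""
    types = list(DISTRIBUTION)
    weights = [DISTRIBUTION[t] for t in types]
    return rng.choices(types, weights=weights, k=n)

rng = random.Random(42)  # seeded for reproducible test plans
plan = sample_question_types(8, rng)
print(plan)  # a mix of 'simple', 'reasoning', and 'multi_context'
```

In RAGAS itself, this distribution is passed to the test set generator along with your documents, which then produces full question/context/ground-truth rows ready for evaluate().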

Key Features

- Generates question/ground-truth pairs directly from source documents.
- "Evolves" simple seed questions into harder reasoning and multi-context variants.
- Lets you control the distribution of question types in the generated test set.
