RAGAS (Retrieval-Augmented Generation Assessment) is an open-source Python framework for evaluating RAG pipelines, with over 9,000 GitHub stars. Introduced by Shahul Es et al. in their 2023 paper, RAGAS provides largely reference-free metrics that assess both retrieval quality and generation quality without requiring extensive human annotation (of its core metrics, only context recall needs a ground-truth reference). It has become the de facto standard for measuring RAG system performance in production.
The framework addresses a critical gap in the AI ecosystem: while building a RAG prototype is easy, measuring whether it actually works well is hard. RAGAS replaces subjective “vibe checks” with quantitative metrics that can run in CI pipelines, enabling systematic optimization of each pipeline stage independently.
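To make the CI idea concrete, here is a minimal sketch of a quality gate that fails a build when any metric drops below a chosen threshold. The threshold values and the `check_gate` helper are illustrative, not part of the RAGAS API:

```python
# Minimal CI quality gate: fail the build if any RAGAS metric
# falls below its threshold. Thresholds here are illustrative.
THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.80,
}

def check_gate(scores: dict) -> list:
    """Return the names of metrics that fall below their threshold."""
    return [m for m, t in THRESHOLDS.items() if scores.get(m, 0.0) < t]

# Scores as produced by a RAGAS evaluation run (values made up for the example)
failures = check_gate({"faithfulness": 0.95, "answer_relevancy": 0.82,
                       "context_precision": 0.88, "context_recall": 0.90})
print(failures)  # ['answer_relevancy'] -> a CI job would exit non-zero here
```

In practice the `scores` dict would come from the `evaluate()` result, and the CI step would `sys.exit(1)` when `failures` is non-empty.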
RAGAS evaluates RAG systems across four primary dimensions, covering both the retrieval and generation stages:
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset with questions, contexts, answers, and ground truth
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
    ],
    "answer": [
        "The capital of France is Paris, located on the Seine River.",
        "Photosynthesis converts CO2 and water into glucose using sunlight.",
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France, on the Seine."],
        ["Photosynthesis is the process by which plants convert light energy "
         "into chemical energy, transforming CO2 and water into glucose and oxygen."],
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy to chemical energy in plants.",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation with all four core metrics
# (calls an LLM judge under the hood, so an API key must be configured)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

# Print per-metric scores
print(results)
# Example output (exact scores vary by judge model and run):
# {'faithfulness': 0.95, 'answer_relevancy': 0.92,
#  'context_precision': 0.88, 'context_recall': 0.90}

# Convert to pandas for detailed per-row analysis
df = results.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy"]])
```
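To build intuition for what the faithfulness score measures: it is, roughly, the fraction of claims extracted from the answer that the retrieved context supports. A toy sketch of that ratio follows; in real RAGAS an LLM judge extracts the claims and produces the verdicts, which are simply given directly here:

```python
def faithfulness_score(verdicts: list) -> float:
    """Fraction of answer claims judged as supported by the retrieved context.

    In RAGAS an LLM judge extracts claims from the answer and checks each
    one against the context; here the boolean verdicts are supplied directly.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Three of four extracted claims are supported by the context
print(faithfulness_score([True, True, True, False]))  # 0.75
```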
```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#E74C3C"}}}%%
graph TD
    A[RAG Pipeline Output] --> B[RAGAS Evaluation]
    B --> C[Faithfulness Check]
    B --> D[Answer Relevancy Check]
    B --> E[Context Precision Check]
    B --> F[Context Recall Check]
    C --> G[LLM Judge: Extract Claims]
    G --> H[Verify Claims vs Context]
    D --> I[Generate Artificial Questions]
    I --> J[Cosine Similarity to Original]
    E --> K[Rank Retrieved Chunks]
    K --> L[Precision Score]
    F --> M[Compare Context to Ground Truth]
    M --> N[Recall Score]
    H --> O[Aggregate Scores]
    J --> O
    L --> O
    N --> O
    O --> P[RAGAS Score Report]
    P --> Q[CI Pipeline Integration]
```
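The answer-relevancy branch of the diagram can be sketched as follows: questions are regenerated from the answer, embedded, and compared to the original question's embedding by cosine similarity, then averaged. The embedding vectors below are stand-ins for a real embedding model:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def answer_relevancy_score(q_emb: list, gen_q_embs: list) -> float:
    """Mean cosine similarity between the original question's embedding and
    the embeddings of questions regenerated from the answer."""
    return sum(cosine(q_emb, e) for e in gen_q_embs) / len(gen_q_embs)

q = [1.0, 0.0]                          # original question embedding (toy)
generated = [[1.0, 0.0], [0.0, 1.0]]    # one identical, one orthogonal
print(answer_relevancy_score(q, generated))  # 0.5
```

An answer that drifts off-topic yields regenerated questions dissimilar to the original, pulling the mean similarity down.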
Beyond scoring, RAGAS includes a synthetic test data generator that produces evaluation questions and references directly from your documents, avoiding manual annotation: