
====== DeepEval ======

DeepEval is an open-source LLM evaluation framework by Confident AI that brings unit-test style testing to AI applications. With over 14,000 stars on GitHub, it integrates with Pytest to let developers write test cases for LLM outputs — catching regressions, validating quality, and measuring metrics like faithfulness, relevancy, hallucination, and toxicity in CI/CD pipelines.

DeepEval treats LLM interactions as testable units, mirroring the rigor of software engineering testing practices. Each interaction becomes a test case with inputs, outputs, and measurable assertions — enabling teams to ship AI features with the same confidence they ship traditional code.

===== How Unit-Test Style Evaluation Works =====

DeepEval models each LLM interaction as a test case (LLMTestCase or ConversationalTestCase), similar to a unit test in traditional software development. Each test case contains an input, the actual output from your LLM application, optional expected output, and retrieval context for RAG systems.

Metrics are applied to test cases with configurable thresholds. If the metric score falls below the threshold, the test fails — just like a failing assertion in Pytest. This integrates directly into CI/CD pipelines to catch regressions on every push.
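The threshold mechanics can be sketched in plain Python (this is an illustrative model of the pass/fail semantics, not DeepEval's actual implementation — `MetricResult` and `assert_test_sketch` are hypothetical names):

```python
# Illustrative sketch: how a per-metric threshold turns a score
# into a pass/fail result, mirroring assert_test's behavior.
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    score: float      # metrics report scores in the 0.0-1.0 range
    threshold: float  # configurable per metric

    @property
    def passed(self) -> bool:
        # A test case passes a metric when score >= threshold
        return self.score >= self.threshold


def assert_test_sketch(results: list[MetricResult]) -> None:
    # Any single failing metric fails the whole test case,
    # like a failing assertion in Pytest
    failures = [r for r in results if not r.passed]
    if failures:
        raise AssertionError(
            ", ".join(
                f"{r.name}: {r.score:.2f} < {r.threshold}" for r in failures
            )
        )


results = [
    MetricResult("AnswerRelevancy", score=0.82, threshold=0.7),
    MetricResult("Faithfulness", score=0.65, threshold=0.8),
]
try:
    assert_test_sketch(results)
except AssertionError as e:
    print(f"FAILED: {e}")  # Faithfulness misses its 0.8 threshold
```

Because a failed metric raises like any other assertion, a CI runner needs no special handling: the Pytest exit code alone marks the build red.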

===== Key Features =====

===== Installation and Usage =====

# Install DeepEval
# pip install deepeval
 
from deepeval import assert_test, evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval
)
 
# Define a test case for a RAG application
def test_rag_response():
    # your_rag_app is a placeholder for your application's generation call
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output=your_rag_app("What is the refund policy?"),
        expected_output="Full refund within 30 days of purchase",
        retrieval_context=[
            "Our refund policy allows full refunds within 30 days."
        ]
    )
 
    # Apply metrics with thresholds
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
 
    assert_test(test_case, [relevancy, faithfulness])
 
# Custom metric with G-Eval (LLM-as-judge)
correctness = GEval(
    name="Correctness",
    criteria="The response should be factually correct and complete",
    evaluation_steps=[
        "Check if the response contains accurate information",
        "Verify all key points from expected output are covered",
        "Ensure no fabricated information is present"
    ],
    threshold=0.8
)
 
# Batch evaluation with datasets
from deepeval.dataset import EvaluationDataset
 
dataset = EvaluationDataset(test_cases=[
    LLMTestCase(input="Q1", actual_output="A1"),
    LLMTestCase(input="Q2", actual_output="A2"),
])
# Metrics defined inside test_rag_response are local to that function,
# so instantiate fresh ones here at module scope
results = evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric(threshold=0.7), correctness],
)

===== Architecture =====

%%{init: {'theme': 'dark'}}%%
graph TB
    Dev([Developer]) -->|Write Tests| Tests[Pytest Test Suite]
    Tests -->|LLMTestCase| TC[Test Cases]
    TC -->|input + actual_output| App[Your LLM App]
    TC -->|expected_output| Golden[Golden Dataset]
    TC -->|retrieval_context| RAG[RAG Pipeline]
    Tests -->|Metrics| Metrics{Metric Engine}
    Metrics -->|Pre-built| PreBuilt[Relevancy / Faithfulness / Hallucination]
    Metrics -->|Custom| GEval[G-Eval LLM Judge]
    Metrics -->|Score + Reason| Results[Test Results]
    Results -->|Pass / Fail| CI[CI/CD Pipeline]
    Results -->|Dashboard| Cloud[Confident AI Platform]
    CI -->|On Push| GHA[GitHub Actions]
    GHA -->|Regression Alert| Team([Development Team])
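The GitHub Actions leg of the diagram can be wired up with a workflow along these lines — a minimal sketch, not a canonical configuration: the file path, job name, and `OPENAI_API_KEY` secret name are assumptions, and `deepeval test run` is DeepEval's Pytest-wrapping CLI:

```yaml
# .github/workflows/llm-tests.yml  (hypothetical path and names)
name: LLM Evaluations
on: [push]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install deepeval
      # Runs the Pytest suite; any metric below threshold fails the job
      - run: deepeval test run tests/test_llm.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Because failing metrics surface as ordinary test failures, no extra gating logic is needed: a regression on any push blocks the pipeline and alerts the team.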

===== Available Metrics =====

^ Category ^ Metrics ^ Description ^
| RAG-Specific | Contextual Recall, Precision, Relevancy, Faithfulness | Evaluate retrieval and generation quality |
| General | Answer Relevancy, Summarization | Overall response quality |
| Safety | Hallucination, Bias, Toxicity | Content safety checks |
| Custom | G-Eval, RAGAS | LLM-as-judge with custom criteria |
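The Custom row's G-Eval metric works by prompting a judge LLM with the criteria and evaluation steps, then normalizing the judge's rating into a 0–1 score. A simplified sketch of that loop, with a stubbed judge in place of a real LLM call (`g_eval_sketch` is a hypothetical name; the real metric also does probability-weighted scoring):

```python
# Simplified LLM-as-judge sketch. `judge` stands in for a real LLM
# call: it receives a prompt built from the criteria and steps and
# returns a 1-10 rating as text.
def g_eval_sketch(criteria, steps, test_input, actual_output, judge):
    prompt = (
        f"Criteria: {criteria}\n"
        + "\n".join(f"- {s}" for s in steps)
        + f"\nInput: {test_input}\nOutput: {actual_output}\n"
        "Rate 1-10:"
    )
    raw = judge(prompt)       # e.g. returns "8"
    return int(raw) / 10.0    # normalize to a 0-1 score


score = g_eval_sketch(
    "Response should be factually correct and complete",
    ["Check accuracy", "Check completeness"],
    "What is the refund policy?",
    "Full refund within 30 days.",
    judge=lambda prompt: "8",  # stubbed judge for illustration
)
print(score >= 0.8)  # True: this score passes a 0.8 threshold
```

The normalized score is then compared against the metric's threshold exactly like the pre-built metrics, so custom criteria plug into the same pass/fail machinery.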

