
====== DeepEval ======

DeepEval is an open-source LLM evaluation framework by Confident AI that brings unit-test style testing to AI applications. With over 14,000 stars on GitHub, it integrates with Pytest to let developers write test cases for LLM outputs — catching regressions, validating quality, and measuring metrics like faithfulness, relevancy, hallucination, and toxicity in CI/CD pipelines.

DeepEval treats LLM interactions as testable units, mirroring the rigor of software engineering testing practices. Each interaction becomes a test case with inputs, outputs, and measurable assertions — enabling teams to ship AI features with the same confidence they ship traditional code.

===== How Unit-Test Style Evaluation Works =====

DeepEval models each LLM interaction as a test case (LLMTestCase or ConversationalTestCase), similar to a unit test in traditional software development. Each test case contains an input, the actual output from your LLM application, optional expected output, and retrieval context for RAG systems.

Metrics are applied to test cases with configurable thresholds. If the metric score falls below the threshold, the test fails — just like a failing assertion in Pytest. This integrates directly into CI/CD pipelines to catch regressions on every push.
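The threshold mechanics can be sketched in plain Python (this is an illustrative model of the pass/fail semantics, not DeepEval's actual implementation — `MetricResult` and `assert_test_sketch` are hypothetical names):

```python
# Illustrative sketch: how a per-metric threshold turns a score
# into a pass/fail result, mirroring assert_test's behavior.
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    score: float      # metrics report scores in the 0.0-1.0 range
    threshold: float  # configurable per metric

    @property
    def passed(self) -> bool:
        # A test case passes a metric when score >= threshold
        return self.score >= self.threshold


def assert_test_sketch(results: list[MetricResult]) -> None:
    # Any single failing metric fails the whole test case,
    # like a failing assertion in Pytest
    failures = [r for r in results if not r.passed]
    if failures:
        raise AssertionError(
            ", ".join(
                f"{r.name}: {r.score:.2f} < {r.threshold}" for r in failures
            )
        )


results = [
    MetricResult("AnswerRelevancy", score=0.82, threshold=0.7),
    MetricResult("Faithfulness", score=0.65, threshold=0.8),
]
try:
    assert_test_sketch(results)
except AssertionError as e:
    print(f"FAILED: {e}")  # Faithfulness misses its 0.8 threshold
```

Because a failed metric raises like any other assertion, a CI runner needs no special handling: the Pytest exit code alone marks the build red.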

===== Key Features =====

===== Installation and Usage =====

# Install DeepEval
# pip install deepeval
 
from deepeval import assert_test, evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval
)
 
# Define a test case for a RAG application
def test_rag_response():
    # your_rag_app is a placeholder for your application's generation call
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output=your_rag_app("What is the refund policy?"),
        expected_output="Full refund within 30 days of purchase",
        retrieval_context=[
            "Our refund policy allows full refunds within 30 days."
        ]
    )
 
    # Apply metrics with thresholds
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
 
    assert_test(test_case, [relevancy, faithfulness])
 
# Custom metric with G-Eval (LLM-as-judge)
correctness = GEval(
    name="Correctness",
    criteria="The response should be factually correct and complete",
    evaluation_steps=[
        "Check if the response contains accurate information",
        "Verify all key points from expected output are covered",
        "Ensure no fabricated information is present"
    ],
    threshold=0.8
)
 
# Batch evaluation with datasets
from deepeval.dataset import EvaluationDataset
 
dataset = EvaluationDataset(test_cases=[
    LLMTestCase(input="Q1", actual_output="A1"),
    LLMTestCase(input="Q2", actual_output="A2"),
])
# Metrics defined inside test_rag_response are local to that function,
# so instantiate fresh ones here at module scope
results = evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric(threshold=0.7), correctness],
)

===== Architecture =====

%%{init: {'theme': 'dark'}}%%
graph TB
    Dev([Developer]) -->|Write Tests| Tests[Pytest Test Suite]
    Tests -->|LLMTestCase| TC[Test Cases]
    TC -->|input + actual_output| App[Your LLM App]
    TC -->|expected_output| Golden[Golden Dataset]
    TC -->|retrieval_context| RAG[RAG Pipeline]
    Tests -->|Metrics| Metrics{Metric Engine}
    Metrics -->|Pre-built| PreBuilt[Relevancy / Faithfulness / Hallucination]
    Metrics -->|Custom| GEval[G-Eval LLM Judge]
    Metrics -->|Score + Reason| Results[Test Results]
    Results -->|Pass / Fail| CI[CI/CD Pipeline]
    Results -->|Dashboard| Cloud[Confident AI Platform]
    CI -->|On Push| GHA[GitHub Actions]
    GHA -->|Regression Alert| Team([Development Team])
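The GitHub Actions leg of the diagram can be wired up with a workflow along these lines — a minimal sketch, not a canonical configuration: the file path, job name, and `OPENAI_API_KEY` secret name are assumptions, and `deepeval test run` is DeepEval's Pytest-wrapping CLI:

```yaml
# .github/workflows/llm-tests.yml  (hypothetical path and names)
name: LLM Evaluations
on: [push]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install deepeval
      # Runs the Pytest suite; any metric below threshold fails the job
      - run: deepeval test run tests/test_llm.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Because failing metrics surface as ordinary test failures, no extra gating logic is needed: a regression on any push blocks the pipeline and alerts the team.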

===== Available Metrics =====

^ Category ^ Metrics ^ Description ^
| RAG-Specific | Contextual Recall, Precision, Relevancy, Faithfulness | Evaluate retrieval and generation quality |
| General | Answer Relevancy, Summarization | Overall response quality |
| Safety | Hallucination, Bias, Toxicity | Content safety checks |
| Custom | G-Eval, RAGAS | LLM-as-judge with custom criteria |
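The Custom row's G-Eval metric works by prompting a judge LLM with the criteria and evaluation steps, then normalizing the judge's rating into a 0–1 score. A simplified sketch of that loop, with a stubbed judge in place of a real LLM call (`g_eval_sketch` is a hypothetical name; the real metric also does probability-weighted scoring):

```python
# Simplified LLM-as-judge sketch. `judge` stands in for a real LLM
# call: it receives a prompt built from the criteria and steps and
# returns a 1-10 rating as text.
def g_eval_sketch(criteria, steps, test_input, actual_output, judge):
    prompt = (
        f"Criteria: {criteria}\n"
        + "\n".join(f"- {s}" for s in steps)
        + f"\nInput: {test_input}\nOutput: {actual_output}\n"
        "Rate 1-10:"
    )
    raw = judge(prompt)       # e.g. returns "8"
    return int(raw) / 10.0    # normalize to a 0-1 score


score = g_eval_sketch(
    "Response should be factually correct and complete",
    ["Check accuracy", "Check completeness"],
    "What is the refund policy?",
    "Full refund within 30 days.",
    judge=lambda prompt: "8",  # stubbed judge for illustration
)
print(score >= 0.8)  # True: this score passes a 0.8 threshold
```

The normalized score is then compared against the metric's threshold exactly like the pre-built metrics, so custom criteria plug into the same pass/fail machinery.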

