====== DeepEval ======(([[https://deepeval.com/docs/evaluation-introduction|Documentation — Introduction]]))(([[https://deepeval.com|Official Website]]))
**DeepEval** is an open-source LLM evaluation framework by **Confident AI** that brings unit-test-style evaluation to AI applications.(([[https://www.confident-ai.com|Confident AI Platform]])) With over **14,000 stars** on GitHub, it integrates with Pytest to let developers write test cases for LLM outputs — catching regressions, validating quality, and measuring metrics like faithfulness, relevancy, hallucination, and toxicity in CI/CD pipelines.(([[https://github.com/confident-ai/deepeval|GitHub Repository]]))
DeepEval treats LLM interactions as testable units, mirroring the rigor of software engineering testing practices. Each interaction becomes a test case with inputs, outputs, and measurable assertions — enabling teams to ship AI features with the same confidence they ship traditional code.
===== How Unit-Test Style Evaluation Works =====
DeepEval models each LLM interaction as a **test case** (''LLMTestCase'' or ''ConversationalTestCase''), similar to a unit test in traditional software development. Each test case contains an input, the actual output from your LLM application, optional expected output, and retrieval context for RAG systems.
**Metrics** are applied to test cases with configurable thresholds. If the metric score falls below the threshold, the test fails — just like a failing assertion in Pytest. This integrates directly into CI/CD pipelines to catch regressions on every push.
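The pass/fail mechanic can be illustrated with a small standalone sketch. This is a toy re-implementation for illustration only, not DeepEval's actual code; ''ToyTestCase'' and ''toy_assert_test'' are hypothetical names:

<code python>
from dataclasses import dataclass

@dataclass
class ToyTestCase:
    """Minimal stand-in for DeepEval's LLMTestCase (illustrative only)."""
    input: str
    actual_output: str

def toy_assert_test(score: float, threshold: float) -> None:
    """Mimics the core idea of assert_test: fail when score < threshold."""
    if score < threshold:
        raise AssertionError(
            f"Metric score {score:.2f} is below threshold {threshold:.2f}"
        )

case = ToyTestCase(input="What is the refund policy?",
                   actual_output="Full refund within 30 days.")
toy_assert_test(score=0.85, threshold=0.7)  # above threshold: no exception
</code>

Because a failing metric simply raises an ''AssertionError'', any test runner (and any CI job running it) treats a low-scoring LLM output exactly like a failing unit test.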
===== Key Features =====
* **Pytest integration** — Write LLM tests with familiar ''@pytest.mark.parametrize'' and ''assert_test''(([[https://deepeval.com/docs/evaluation-unit-testing-in-ci-cd|CI/CD Integration Guide]]))
* **Pre-built metrics** — Faithfulness, relevancy, hallucination, toxicity, bias, and more
* **Custom metrics (G-Eval)** — LLM-as-judge with custom criteria and evaluation steps
* **Dataset management** — Golden datasets with expected inputs/outputs for batch testing
* **Conversation testing** — Multi-turn evaluation via ''ConversationalTestCase''
* **CI/CD ready** — Run on every push to detect regressions automatically
* **Cloud collaboration** — Optional Confident AI platform for team-wide testing
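The golden-dataset idea behind the dataset-management feature is just inputs paired with expected outputs, versioned alongside your code. A minimal sketch independent of DeepEval's own ''EvaluationDataset'' API (the JSON schema here is an illustrative assumption):

<code python>
import json
import os
import tempfile

# A golden dataset: inputs paired with expected outputs (illustrative schema).
goldens = [
    {"input": "What is the refund policy?",
     "expected_output": "Full refund within 30 days of purchase"},
    {"input": "How do I reset my password?",
     "expected_output": "Use the 'Forgot password' link on the login page"},
]

def load_goldens(path: str) -> list:
    """Load a golden dataset from a JSON file."""
    with open(path) as f:
        return json.load(f)

# Round-trip through a temporary file to show the save/load cycle.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(goldens, f)
    path = f.name
loaded = load_goldens(path)
os.unlink(path)
</code>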
===== Installation and Usage =====
<code python>
# Install DeepEval:
#   pip install deepeval

from deepeval import assert_test, evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    GEval,
)  # HallucinationMetric, BiasMetric, ToxicityMetric etc. are also available
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def test_rag_response():
    """Unit-test style check for a RAG application."""
    test_case = LLMTestCase(
        input="What is the refund policy?",
        # your_rag_app is a placeholder for your application's entry point
        actual_output=your_rag_app("What is the refund policy?"),
        expected_output="Full refund within 30 days of purchase",
        retrieval_context=[
            "Our refund policy allows full refunds within 30 days."
        ],
    )

    # Apply metrics with thresholds; the test fails if any score falls below
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
    assert_test(test_case, [relevancy, faithfulness])


# Custom metric with G-Eval (LLM-as-judge): supply explicit evaluation_steps,
# or alternatively a single free-form `criteria` string
correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check if the response contains accurate information",
        "Verify all key points from expected output are covered",
        "Ensure no fabricated information is present",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.8,
)

# Batch evaluation with datasets
relevancy = AnswerRelevancyMetric(threshold=0.7)
dataset = EvaluationDataset(test_cases=[
    LLMTestCase(input="Q1", actual_output="A1"),
    LLMTestCase(input="Q2", actual_output="A2"),
])
results = evaluate(test_cases=dataset.test_cases,
                   metrics=[relevancy, correctness])
</code>
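Multi-turn testing via ''ConversationalTestCase'' follows the same shape: a conversation is a sequence of turns, and metrics score assistant turns in context. A toy sketch of per-turn scoring — the judge here is a trivial keyword-overlap stub standing in for an LLM judge, and all names are illustrative, not DeepEval's API:

<code python>
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

def stub_relevancy_judge(question: str, answer: str) -> float:
    """Trivial stand-in for an LLM judge: keyword overlap in [0, 1]."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / max(len(q_words), 1)

def score_conversation(turns: list) -> list:
    """Score each assistant turn against the preceding user turn."""
    scores = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.role == "user" and cur.role == "assistant":
            scores.append(stub_relevancy_judge(prev.content, cur.content))
    return scores

conversation = [
    Turn("user", "what is the refund policy"),
    Turn("assistant", "the refund policy allows full refunds within 30 days"),
]
scores = score_conversation(conversation)
</code>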
===== Architecture =====
<mermaid>
%%{init: {'theme': 'dark'}}%%
graph TB
    Dev([Developer]) -->|Write Tests| Tests[Pytest Test Suite]
    Tests -->|LLMTestCase| TC[Test Cases]
    TC -->|input + actual_output| App[Your LLM App]
    TC -->|expected_output| Golden[Golden Dataset]
    TC -->|retrieval_context| RAG[RAG Pipeline]
    Tests -->|Metrics| Metrics{Metric Engine}
    Metrics -->|Pre-built| PreBuilt[Relevancy / Faithfulness / Hallucination]
    Metrics -->|Custom| GEval[G-Eval LLM Judge]
    Metrics -->|Score + Reason| Results[Test Results]
    Results -->|Pass / Fail| CI[CI/CD Pipeline]
    Results -->|Dashboard| Cloud[Confident AI Platform]
    CI -->|On Push| GHA[GitHub Actions]
    GHA -->|Regression Alert| Team([Development Team])
</mermaid>
===== Available Metrics =====
^ Category ^ Metrics ^ Description ^
| RAG-Specific | Contextual Recall, Contextual Precision, Contextual Relevancy, Faithfulness | Evaluate retrieval and generation quality |
| General | Answer Relevancy, Summarization | Overall response quality |
| Safety | Hallucination, Bias, Toxicity | Content safety checks |
| Custom | G-Eval, RAGAS | LLM-as-judge with custom criteria |
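At its core, a G-Eval-style custom metric asks a judge model to score each evaluation step, then aggregates the step scores into one 0–1 score checked against the threshold. A toy sketch with a stub judge — real G-Eval prompts an LLM; ''toy_geval'' and the constant-score judge are illustrative assumptions:

<code python>
def toy_geval(steps: list, judge, threshold: float = 0.8):
    """Aggregate per-step judge scores into one 0-1 metric score."""
    step_scores = [judge(step) for step in steps]
    score = sum(step_scores) / len(step_scores)
    return score, score >= threshold

# Stub judge: pretend every step scores 0.9 (a real judge queries an LLM).
score, passed = toy_geval(
    steps=["Check accuracy", "Verify coverage", "Ensure no fabrication"],
    judge=lambda step: 0.9,
)
</code>

The same aggregate-and-threshold pattern is what lets a fuzzy, LLM-judged criterion behave like a deterministic pass/fail assertion in CI.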
===== See Also =====
* [[promptfoo|Promptfoo — LLM Evaluation and Red Teaming]]
* [[arize_phoenix|Arize Phoenix — AI Observability]]
* [[guidance|Guidance — Structured Generation Language]]
===== References =====