====== DeepEval ======(([[https://deepeval.com/docs/evaluation-introduction|Documentation — Introduction]]))(([[https://deepeval.com|Official Website]]))
**DeepEval** is an open-source LLM evaluation framework by **Confident AI** that brings unit-test-style evaluation to AI applications.(([[https://www.confident-ai.com|Confident AI Platform]])) With over **14,000 stars** on GitHub, it integrates with Pytest to let developers write test cases for LLM outputs — catching regressions, validating quality, and measuring metrics like faithfulness, relevancy, hallucination, and toxicity in CI/CD pipelines.(([[https://github.com/confident-ai/deepeval|GitHub Repository]]))
DeepEval treats LLM interactions as testable units, mirroring the rigor of software engineering testing practices. Each interaction becomes a test case with inputs, outputs, and measurable assertions — enabling teams to ship AI features with the same confidence they ship traditional code.
===== How Unit-Test Style Evaluation Works =====
DeepEval models each LLM interaction as a **test case** (''LLMTestCase'' or ''ConversationalTestCase''), similar to a unit test in traditional software development. Each test case contains an input, the actual output from your LLM application, optional expected output, and retrieval context for RAG systems.
**Metrics** are applied to test cases with configurable thresholds. If the metric score falls below the threshold, the test fails — just like a failing assertion in Pytest. This integrates directly into CI/CD pipelines to catch regressions on every push.
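The pass/fail mechanic can be illustrated with a small standalone sketch. This is a toy re-implementation for illustration only, not DeepEval's actual code; ''ToyTestCase'' and ''toy_assert_test'' are hypothetical names:

<code python>
from dataclasses import dataclass

@dataclass
class ToyTestCase:
    """Minimal stand-in for DeepEval's LLMTestCase (illustrative only)."""
    input: str
    actual_output: str

def toy_assert_test(score: float, threshold: float) -> None:
    """Mimics the core idea of assert_test: fail when score < threshold."""
    if score < threshold:
        raise AssertionError(
            f"Metric score {score:.2f} is below threshold {threshold:.2f}"
        )

case = ToyTestCase(input="What is the refund policy?",
                   actual_output="Full refund within 30 days.")
toy_assert_test(score=0.85, threshold=0.7)  # above threshold: no exception
</code>

Because a failing metric simply raises an ''AssertionError'', any test runner (and any CI job running it) treats a low-scoring LLM output exactly like a failing unit test.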
===== Key Features =====
* **Pytest integration** — Write LLM tests with familiar ''@pytest.mark.parametrize'' and ''assert_test''(([[https://deepeval.com/docs/evaluation-unit-testing-in-ci-cd|CI/CD Integration Guide]]))
* **Pre-built metrics** — Faithfulness, relevancy, hallucination, toxicity, bias, and more
* **Custom metrics (G-Eval)** — LLM-as-judge with custom criteria and evaluation steps
* **Dataset management** — Golden datasets with expected inputs/outputs for batch testing
* **Conversation testing** — Multi-turn evaluation via ''ConversationalTestCase''
* **CI/CD ready** — Run on every push to detect regressions automatically
* **Cloud collaboration** — Optional Confident AI platform for team-wide testing
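The golden-dataset idea behind the dataset-management feature is just inputs paired with expected outputs, versioned alongside your code. A minimal sketch independent of DeepEval's own ''EvaluationDataset'' API (the JSON schema here is an illustrative assumption):

<code python>
import json
import os
import tempfile

# A golden dataset: inputs paired with expected outputs (illustrative schema).
goldens = [
    {"input": "What is the refund policy?",
     "expected_output": "Full refund within 30 days of purchase"},
    {"input": "How do I reset my password?",
     "expected_output": "Use the 'Forgot password' link on the login page"},
]

def load_goldens(path: str) -> list:
    """Load a golden dataset from a JSON file."""
    with open(path) as f:
        return json.load(f)

# Round-trip through a temporary file to show the save/load cycle.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(goldens, f)
    path = f.name
loaded = load_goldens(path)
os.unlink(path)
</code>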
===== Installation and Usage =====
<code python>
# Install DeepEval:
#   pip install deepeval

from deepeval import assert_test, evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    GEval,
)  # HallucinationMetric, BiasMetric, ToxicityMetric etc. are also available
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def test_rag_response():
    """Unit-test style check for a RAG application."""
    test_case = LLMTestCase(
        input="What is the refund policy?",
        # your_rag_app is a placeholder for your application's entry point
        actual_output=your_rag_app("What is the refund policy?"),
        expected_output="Full refund within 30 days of purchase",
        retrieval_context=[
            "Our refund policy allows full refunds within 30 days."
        ],
    )

    # Apply metrics with thresholds; the test fails if any score falls below
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
    assert_test(test_case, [relevancy, faithfulness])


# Custom metric with G-Eval (LLM-as-judge): supply explicit evaluation_steps,
# or alternatively a single free-form `criteria` string
correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check if the response contains accurate information",
        "Verify all key points from expected output are covered",
        "Ensure no fabricated information is present",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.8,
)

# Batch evaluation with datasets
relevancy = AnswerRelevancyMetric(threshold=0.7)
dataset = EvaluationDataset(test_cases=[
    LLMTestCase(input="Q1", actual_output="A1"),
    LLMTestCase(input="Q2", actual_output="A2"),
])
results = evaluate(test_cases=dataset.test_cases,
                   metrics=[relevancy, correctness])
</code>
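Multi-turn testing via ''ConversationalTestCase'' follows the same shape: a conversation is a sequence of turns, and metrics score assistant turns in context. A toy sketch of per-turn scoring — the judge here is a trivial keyword-overlap stub standing in for an LLM judge, and all names are illustrative, not DeepEval's API:

<code python>
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

def stub_relevancy_judge(question: str, answer: str) -> float:
    """Trivial stand-in for an LLM judge: keyword overlap in [0, 1]."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / max(len(q_words), 1)

def score_conversation(turns: list) -> list:
    """Score each assistant turn against the preceding user turn."""
    scores = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.role == "user" and cur.role == "assistant":
            scores.append(stub_relevancy_judge(prev.content, cur.content))
    return scores

conversation = [
    Turn("user", "what is the refund policy"),
    Turn("assistant", "the refund policy allows full refunds within 30 days"),
]
scores = score_conversation(conversation)
</code>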
===== Architecture =====
<mermaid>
%%{init: {'theme': 'dark'}}%%
graph TB
    Dev([Developer]) -->|Write Tests| Tests[Pytest Test Suite]
    Tests -->|LLMTestCase| TC[Test Cases]
    TC -->|input + actual_output| App[Your LLM App]
    TC -->|expected_output| Golden[Golden Dataset]
    TC -->|retrieval_context| RAG[RAG Pipeline]
    Tests -->|Metrics| Metrics{Metric Engine}
    Metrics -->|Pre-built| PreBuilt[Relevancy / Faithfulness / Hallucination]
    Metrics -->|Custom| GEval[G-Eval LLM Judge]
    Metrics -->|Score + Reason| Results[Test Results]
    Results -->|Pass / Fail| CI[CI/CD Pipeline]
    Results -->|Dashboard| Cloud[Confident AI Platform]
    CI -->|On Push| GHA[GitHub Actions]
    GHA -->|Regression Alert| Team([Development Team])
</mermaid>
===== Available Metrics =====
^ Category ^ Metrics ^ Description ^
| RAG-Specific | Contextual Recall, Contextual Precision, Contextual Relevancy, Faithfulness | Evaluate retrieval and generation quality |
| General | Answer Relevancy, Summarization | Overall response quality |
| Safety | Hallucination, Bias, Toxicity | Content safety checks |
| Custom | G-Eval, RAGAS | LLM-as-judge with custom criteria |
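At its core, a G-Eval-style custom metric asks a judge model to score each evaluation step, then aggregates the step scores into one 0–1 score checked against the threshold. A toy sketch with a stub judge — real G-Eval prompts an LLM; ''toy_geval'' and the constant-score judge are illustrative assumptions:

<code python>
def toy_geval(steps: list, judge, threshold: float = 0.8):
    """Aggregate per-step judge scores into one 0-1 metric score."""
    step_scores = [judge(step) for step in steps]
    score = sum(step_scores) / len(step_scores)
    return score, score >= threshold

# Stub judge: pretend every step scores 0.9 (a real judge queries an LLM).
score, passed = toy_geval(
    steps=["Check accuracy", "Verify coverage", "Ensure no fabrication"],
    judge=lambda step: 0.9,
)
</code>

The same aggregate-and-threshold pattern is what lets a fuzzy, LLM-judged criterion behave like a deterministic pass/fail assertion in CI.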
===== See Also =====
* [[promptfoo|Promptfoo — LLM Evaluation and Red Teaming]]
* [[arize_phoenix|Arize Phoenix — AI Observability]]
* [[guidance|Guidance — Structured Generation Language]]
===== References =====