Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
DeepEval is an open-source LLM evaluation framework by Confident AI that brings unit-test-style testing to AI applications. With over 14,000 stars on GitHub, it integrates with Pytest so developers can write test cases for LLM outputs, catching regressions, validating quality, and measuring metrics such as faithfulness, relevancy, hallucination, and toxicity in CI/CD pipelines.
DeepEval treats LLM interactions as testable units, mirroring the rigor of software engineering testing practices. Each interaction becomes a test case with inputs, outputs, and measurable assertions — enabling teams to ship AI features with the same confidence they ship traditional code.
DeepEval models each LLM interaction as a test case (LLMTestCase or ConversationalTestCase), similar to a unit test in traditional software development. Each test case contains an input, the actual output from your LLM application, optional expected output, and retrieval context for RAG systems.
Metrics are applied to test cases with configurable thresholds. If the metric score falls below the threshold, the test fails — just like a failing assertion in Pytest. This integrates directly into CI/CD pipelines to catch regressions on every push.
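The threshold logic can be sketched in plain Python. This is an illustrative simplification of the pass/fail pattern, not DeepEval's internals; the metric names and scores are hypothetical:

```python
# Simplified sketch of the threshold pass/fail pattern used by
# metric-based LLM tests (illustrative only, not DeepEval's implementation).

def metric_passes(score: float, threshold: float) -> bool:
    """A metric passes when its score meets or exceeds its threshold."""
    return score >= threshold

def assert_all_pass(scores: dict[str, float], thresholds: dict[str, float]) -> None:
    """Raise AssertionError (a Pytest failure) if any metric falls short."""
    failures = [
        f"{name}: {scores[name]:.2f} < {thresholds[name]:.2f}"
        for name in thresholds
        if not metric_passes(scores[name], thresholds[name])
    ]
    assert not failures, "Failing metrics: " + ", ".join(failures)

# Example: relevancy passes (0.82 >= 0.7) but faithfulness fails (0.74 < 0.8),
# so the whole test case fails, exactly like a failing Pytest assertion.
scores = {"relevancy": 0.82, "faithfulness": 0.74}
thresholds = {"relevancy": 0.7, "faithfulness": 0.8}
```

Because the failure surfaces as a plain `AssertionError`, any CI runner that executes Pytest can gate merges on it without extra plumbing.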
```python
# Install DeepEval:
#   pip install deepeval

from deepeval import assert_test, evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval,
)
from deepeval.test_case import LLMTestCase

# Apply metrics with configurable thresholds
relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.8)

# Define a test case for a RAG application
def test_rag_response():
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output=your_rag_app("What is the refund policy?"),
        expected_output="Full refund within 30 days of purchase",
        retrieval_context=[
            "Our refund policy allows full refunds within 30 days."
        ],
    )
    assert_test(test_case, [relevancy, faithfulness])

# Custom metric with G-Eval (LLM-as-judge)
correctness = GEval(
    name="Correctness",
    criteria="The response should be factually correct and complete",
    evaluation_steps=[
        "Check if the response contains accurate information",
        "Verify all key points from expected output are covered",
        "Ensure no fabricated information is present",
    ],
    threshold=0.8,
)

# Batch evaluation with datasets
dataset = EvaluationDataset(test_cases=[
    LLMTestCase(input="Q1", actual_output="A1"),
    LLMTestCase(input="Q2", actual_output="A2"),
])
results = evaluate(dataset, [relevancy, correctness])
```
```mermaid
%%{init: {'theme': 'dark'}}%%
graph TB
    Dev([Developer]) -->|Write Tests| Tests[Pytest Test Suite]
    Tests -->|LLMTestCase| TC[Test Cases]
    TC -->|input + actual_output| App[Your LLM App]
    TC -->|expected_output| Golden[Golden Dataset]
    TC -->|retrieval_context| RAG[RAG Pipeline]
    Tests -->|Metrics| Metrics{Metric Engine}
    Metrics -->|Pre-built| PreBuilt[Relevancy / Faithfulness / Hallucination]
    Metrics -->|Custom| GEval[G-Eval LLM Judge]
    Metrics -->|Score + Reason| Results[Test Results]
    Results -->|Pass / Fail| CI[CI/CD Pipeline]
    Results -->|Dashboard| Cloud[Confident AI Platform]
    CI -->|On Push| GHA[GitHub Actions]
    GHA -->|Regression Alert| Team([Development Team])
```
| Category | Metrics | Description |
|---|---|---|
| RAG-Specific | Contextual Recall, Precision, Relevancy, Faithfulness | Evaluate retrieval and generation quality |
| General | Answer Relevancy, Summarization | Overall response quality |
| Safety | Hallucination, Bias, Toxicity | Content safety checks |
| Custom | G-Eval, RAGAS | LLM-as-judge with custom criteria |
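For intuition about what a faithfulness score measures, here is a deliberately naive approximation: the fraction of output sentences whose content words all appear in the retrieval context. This keyword-overlap heuristic is purely illustrative and nothing like DeepEval's `FaithfulnessMetric`, which uses an LLM judge to extract and verify claims:

```python
# Naive faithfulness-style score: fraction of output sentences whose
# content words all appear in the retrieval context. Illustrative only;
# DeepEval's FaithfulnessMetric judges claims with an LLM instead.
import re

STOPWORDS = {"the", "a", "an", "of", "is", "are", "within", "our", "to"}

def content_words(text: str) -> set[str]:
    """Lowercase alphabetic tokens minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def naive_faithfulness(output: str, context: list[str]) -> float:
    """Share of output sentences fully supported by context vocabulary."""
    ctx_words = content_words(" ".join(context))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s]
    if not sentences:
        return 0.0
    supported = sum(1 for s in sentences if content_words(s) <= ctx_words)
    return supported / len(sentences)

context = ["Our refund policy allows full refunds within 30 days."]
print(naive_faithfulness("Full refunds within 30 days.", context))       # fully supported
print(naive_faithfulness("Refunds take 90 days. Shipping is free.", context))
```

A real metric must handle paraphrase, entailment, and partial claims, which is exactly why production frameworks reach for an LLM judge rather than word overlap.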