====== DeepEval ======

**DeepEval** is an open-source LLM evaluation framework by **Confident AI** that brings unit-test style testing to AI applications.(([[https://deepeval.com/docs/evaluation-introduction|Documentation — Introduction]]))(([[https://deepeval.com|Official Website]]))(([[https://www.confident-ai.com|Confident AI Platform]])) With over **14,000 stars** on GitHub, it integrates with Pytest to let developers write test cases for LLM outputs — catching regressions, validating quality, and measuring metrics like faithfulness, relevancy, hallucination, and toxicity in CI/CD pipelines.(([[https://github.com/confident-ai/deepeval|GitHub Repository]]))

DeepEval treats LLM interactions as testable units, mirroring the rigor of software engineering testing practices. Each interaction becomes a test case with inputs, outputs, and measurable assertions — enabling teams to ship AI features with the same confidence they ship traditional code.

===== How Unit-Test Style Evaluation Works =====

DeepEval models each LLM interaction as a **test case** (''LLMTestCase'' or ''ConversationalTestCase''), similar to a unit test in traditional software development. Each test case contains an input, the actual output from your LLM application, an optional expected output, and the retrieval context for RAG systems.

**Metrics** are applied to test cases with configurable thresholds. If a metric score falls below its threshold, the test fails — just like a failing assertion in Pytest. This integrates directly into CI/CD pipelines to catch regressions on every push.
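The threshold semantics described above can be sketched in plain Python. This is an illustrative model only; ''SimpleTestCase'' and ''check'' are hypothetical names used to show the pass/fail rule, not DeepEval's actual internals:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SimpleTestCase:
    # Mirrors the core fields of a test case: the input, the
    # application's actual output, and an optional expected output.
    input: str
    actual_output: str
    expected_output: Optional[str] = None

def check(score: float, threshold: float) -> bool:
    """A metric passes only when its score meets the threshold,
    the same way an assertion passes or fails in a unit test."""
    return score >= threshold

# A score of 0.65 against a 0.7 threshold fails the test case;
# in a Pytest run this would surface as a failed assertion.
tc = SimpleTestCase(input="What is the refund policy?",
                    actual_output="Refunds are available for 30 days.")
print(check(0.65, threshold=0.7))  # False -> test fails
print(check(0.80, threshold=0.7))  # True  -> test passes
```

Because the check is a plain boolean, a CI pipeline only needs to run the test suite and inspect the exit code to detect regressions.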
===== Key Features =====

  * **Pytest integration** — Write LLM tests with familiar ''@pytest.mark.parametrize'' and ''assert_test''(([[https://deepeval.com/docs/evaluation-unit-testing-in-ci-cd|CI/CD Integration Guide]]))
  * **Pre-built metrics** — Faithfulness, relevancy, hallucination, toxicity, bias, and more
  * **Custom metrics (G-Eval)** — LLM-as-judge with custom criteria and evaluation steps
  * **Dataset management** — Golden datasets with expected inputs/outputs for batch testing
  * **Conversation testing** — Multi-turn evaluation via ''ConversationalTestCase''
  * **CI/CD ready** — Run on every push to detect regressions automatically
  * **Cloud collaboration** — Optional Confident AI platform for team-wide testing

===== Installation and Usage =====

<code python>
# Install DeepEval:
#   pip install deepeval

from deepeval import assert_test, evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval,
)

# Metrics with thresholds: a test fails when a score drops below its threshold
relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.8)

# Define a test case for a RAG application
# (your_rag_app is the application under test)
def test_rag_response():
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output=your_rag_app("What is the refund policy?"),
        expected_output="Full refund within 30 days of purchase",
        retrieval_context=[
            "Our refund policy allows full refunds within 30 days."
        ],
    )
    assert_test(test_case, [relevancy, faithfulness])

# Custom metric with G-Eval (LLM-as-judge)
correctness = GEval(
    name="Correctness",
    criteria="The response should be factually correct and complete",
    evaluation_steps=[
        "Check if the response contains accurate information",
        "Verify all key points from expected output are covered",
        "Ensure no fabricated information is present",
    ],
    threshold=0.8,
)

# Batch evaluation with datasets
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(test_cases=[
    LLMTestCase(input="Q1", actual_output="A1"),
    LLMTestCase(input="Q2", actual_output="A2"),
])
results = evaluate(dataset, [relevancy, correctness])
</code>

===== Architecture =====

<code mermaid>
%%{init: {'theme': 'dark'}}%%
graph TB
    Dev([Developer]) -->|Write Tests| Tests[Pytest Test Suite]
    Tests -->|LLMTestCase| TC[Test Cases]
    TC -->|input + actual_output| App[Your LLM App]
    TC -->|expected_output| Golden[Golden Dataset]
    TC -->|retrieval_context| RAG[RAG Pipeline]
    Tests -->|Metrics| Metrics{Metric Engine}
    Metrics -->|Pre-built| PreBuilt[Relevancy / Faithfulness / Hallucination]
    Metrics -->|Custom| GEval[G-Eval LLM Judge]
    Metrics -->|Score + Reason| Results[Test Results]
    Results -->|Pass / Fail| CI[CI/CD Pipeline]
    Results -->|Dashboard| Cloud[Confident AI Platform]
    CI -->|On Push| GHA[GitHub Actions]
    GHA -->|Regression Alert| Team([Development Team])
</code>

===== Available Metrics =====

^ Category ^ Metrics ^ Description ^
| RAG-specific | Contextual Recall, Contextual Precision, Contextual Relevancy, Faithfulness | Evaluate retrieval and generation quality |
| General | Answer Relevancy, Summarization | Overall response quality |
| Safety | Hallucination, Bias, Toxicity | Content safety checks |
| Custom | G-Eval, RAGAS | LLM-as-judge with custom criteria |

===== See Also =====

  * [[promptfoo|Promptfoo — LLM Evaluation and Red Teaming]]
  * [[arize_phoenix|Arize Phoenix — AI Observability]]
  * [[guidance|Guidance — Structured Generation Language]]

===== References =====