AI Agent Knowledge Base

A shared knowledge base for AI agents

How to Evaluate an Agent

A practical guide to measuring AI agent quality. Covers key metrics, industry benchmarks, evaluation frameworks (RAGAS, DeepEval), and testing strategies from unit tests to A/B testing in production.

Evaluation Framework Overview

graph LR
    subgraph Metrics
        A[Task Success Rate]
        B[Tool Call Accuracy]
        C[Reasoning Quality]
        D[Latency]
        E[Cost per Task]
    end
    subgraph Methods
        F[Automated Evals]
        G[Benchmark Suites]
        H[Human Review]
        I[A/B Testing]
    end
    subgraph Tools
        J[RAGAS]
        K[DeepEval]
        L[Langfuse]
        M[Custom Framework]
    end
    A --> F
    B --> F
    C --> H
    D --> F
    E --> F
    F --> J
    F --> K
    G --> J
    H --> L
    I --> M

Key Metrics

Metric | What It Measures | Target | Red Flag | How to Compute
Task Success Rate | % of tasks fully completed | >87% | <72% | completed_tasks / total_tasks
Tool Call Accuracy | Correct tool selection + parameters | >95% | <80% | correct_calls / total_calls
Reasoning Quality | Faithfulness, minimal hallucination | Hallucination <3% | >10% | LLM-as-judge or human review
Latency | Time per response or full task | <4s per response | >10s | Measure end-to-end wall time
Cost per Task | Tokens + API calls + compute | Track and optimize | Unbounded growth | sum(tokens * price) per task
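
The formulas in the "How to Compute" column map directly onto logged run records. A minimal sketch, assuming each agent run is logged as a dict with completed, tool_calls, and tokens_used fields (these names are illustrative, not from any framework):

# Illustrative sketch: compute the key metrics from logged run records.
# The run-record shape (completed, tool_calls, tokens_used) is an assumption.
runs = [
    {"completed": True,  "tool_calls": [{"correct": True}, {"correct": True}], "tokens_used": 1200},
    {"completed": False, "tool_calls": [{"correct": False}], "tokens_used": 800},
    {"completed": True,  "tool_calls": [{"correct": True}], "tokens_used": 950},
]

PRICE_PER_TOKEN = 0.00001  # blended input/output price; adjust for your model

task_success_rate = sum(r["completed"] for r in runs) / len(runs)

all_calls = [c for r in runs for c in r["tool_calls"]]
tool_call_accuracy = sum(c["correct"] for c in all_calls) / len(all_calls)

cost_per_task = sum(r["tokens_used"] for r in runs) * PRICE_PER_TOKEN / len(runs)

print(f"Task Success Rate:  {task_success_rate:.1%}")
print(f"Tool Call Accuracy: {tool_call_accuracy:.1%}")
print(f"Cost per Task:      ${cost_per_task:.4f}")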

Benchmarks: Which One for What

Benchmark | Focus Area | What It Tests | When to Use
SWE-bench | Coding agents | Resolve real GitHub issues end-to-end | Evaluating code generation/editing agents
GAIA | General agents | Real-world multi-step reasoning with tools | General-purpose agent evaluation
AgentBench | Multi-domain | Tool use, planning, persistence across domains | Broad agent capability assessment
WebArena | Web automation | Navigate real websites, complete tasks | Browser/web interaction agents
BrowseComp | Web research | Find specific information across the web | Research and information retrieval agents

Benchmark Decision Guide

graph TD
    A[What does your agent do?] --> B{Primary task?}
    B -->|Write/edit code| C[SWE-bench]
    B -->|Browse the web| D[WebArena]
    B -->|General reasoning + tools| E[GAIA]
    B -->|Multiple domains| F[AgentBench]
    B -->|Research tasks| G[BrowseComp]
    C --> H[Also add custom evals for your codebase]
    D --> H
    E --> H
    F --> H
    G --> H

Evaluating with RAGAS

RAGAS is a widely used framework for evaluating RAG pipelines. It measures faithfulness (is the answer grounded in the retrieved context?), answer relevancy, context precision, and context recall.

pip install ragas datasets
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
import os
 
os.environ["OPENAI_API_KEY"] = "your-key"
 
# Prepare evaluation dataset
# Each row: question, contexts (retrieved), answer (agent output), ground_truth
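# Note: these column names follow the classic RAGAS dataset schema; newer RAGAS
# releases rename them (user_input, retrieved_contexts, response, reference),
# so check the version you have installed.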
eval_data = {
    "question": [
        "What is retrieval augmented generation?",
        "How does vector search work?",
        "What embedding models are best for RAG?"
    ],
    "contexts": [
        ["RAG combines retrieval from a knowledge base with LLM generation to produce grounded answers."],
        ["Vector search uses approximate nearest neighbor algorithms like HNSW to find similar embeddings."],
        ["Popular embedding models include OpenAI text-embedding-3, Cohere embed-v3, and open-source BGE."]
    ],
    "answer": [
        "RAG is a technique that retrieves relevant documents and uses them as context for an LLM to generate answers.",
        "Vector search converts text to embeddings and finds the closest vectors using algorithms like HNSW.",
        "The best embedding models for RAG include OpenAI's text-embedding-3-small and Cohere's embed-v3."
    ],
    "ground_truth": [
        "RAG retrieves relevant documents from a knowledge base and provides them as context to an LLM for generation.",
        "Vector search embeds text as vectors and uses ANN algorithms to find similar items efficiently.",
        "Top embedding models include OpenAI text-embedding-3, Cohere embed-v3, and BGE for open-source."
    ]
}
 
dataset = Dataset.from_dict(eval_data)
 
# Run evaluation
scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
 
print("RAGAS Scores:")
print(f"  Faithfulness:      {scores['faithfulness']:.3f}")
print(f"  Answer Relevancy:  {scores['answer_relevancy']:.3f}")
print(f"  Context Precision: {scores['context_precision']:.3f}")
print(f"  Context Recall:    {scores['context_recall']:.3f}")
 
# Targets: faithfulness > 0.9, relevancy > 0.85, precision > 0.8

Evaluating with DeepEval

DeepEval provides unit-test-style metrics for LLM outputs, including answer relevancy, faithfulness, hallucination detection, and tool-call correctness.

pip install deepeval
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    # ToolCorrectnessMetric is also available for tool-using agents; it needs
    # tools_called / expected_tools on each test case, so it is omitted here.
)
from deepeval.test_case import LLMTestCase
import os
 
os.environ["OPENAI_API_KEY"] = "your-key"
 
# Define test cases
# HallucinationMetric scores against `context`, while FaithfulnessMetric reads
# `retrieval_context`, so both fields are provided on each test case.
test_cases = [
    LLMTestCase(
        input="What is the weather in NYC?",
        actual_output="The weather in NYC is currently 72F and sunny.",
        retrieval_context=["NYC weather: 72F, sunny, humidity 45%"],
        context=["NYC weather: 72F, sunny, humidity 45%"],
        expected_output="NYC is 72F and sunny."
    ),
    LLMTestCase(
        input="Search for Python tutorials",
        actual_output='<tool_call name="web_search">{"query": "Python tutorials"}</tool_call>',
        retrieval_context=["Available tools: web_search(query: str)"],
        context=["Available tools: web_search(query: str)"],
        expected_output="Should call web_search with relevant query."
    ),
    LLMTestCase(
        input="What is quantum computing?",
        actual_output="Quantum computing uses qubits that can be in superposition, enabling parallel computation.",
        retrieval_context=["Quantum computers use qubits. Qubits leverage superposition and entanglement."],
        context=["Quantum computers use qubits. Qubits leverage superposition and entanglement."],
        expected_output="Quantum computing leverages qubits in superposition for parallel processing."
    )
]
 
# Define metrics with thresholds
metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    FaithfulnessMetric(threshold=0.9),
    HallucinationMetric(threshold=0.1),  # Lower is better
]
 
# Run evaluation (deepeval also prints a per-test summary to the console)
results = evaluate(test_cases=test_cases, metrics=metrics)

# Inspect per-test scores and pass/fail. Exact attribute names vary across
# deepeval versions; recent releases expose them via results.test_results.
for result in results.test_results:
    print(f"Test: {result.input[:50]}...")
    for metric_data in result.metrics_data:
        print(f"  {metric_data.name}: {metric_data.score:.2f} ({'PASS' if metric_data.success else 'FAIL'})")

Building a Custom Evaluation Framework

For agent-specific needs, build a lightweight eval harness.

import time
import json
from dataclasses import dataclass, field
from typing import Callable
 
@dataclass
class EvalCase:
    input: str
    expected_output: str = ""
    expected_tools: list = field(default_factory=list)
    max_steps: int = 10
    max_latency_seconds: float = 10.0
 
@dataclass
class EvalResult:
    case: EvalCase
    actual_output: str
    tools_called: list
    steps: int
    latency: float
    tokens_used: int
    success: bool
    cost: float
 
def evaluate_agent(
    agent_fn: Callable,
    cases: list[EvalCase],
    cost_per_token: float = 0.00001
) -> list[EvalResult]:
    results = []
    for case in cases:
        start = time.time()
        output = agent_fn(case.input)
        latency = time.time() - start
 
        # Extract metrics from output (adapt to your agent's return format)
        steps = output.get("steps", 0)
        within_budget = latency <= case.max_latency_seconds and steps <= case.max_steps
        result = EvalResult(
            case=case,
            actual_output=output.get("response", ""),
            tools_called=output.get("tools_called", []),
            steps=steps,
            latency=latency,
            tokens_used=output.get("tokens_used", 0),
            success=_check_success(case, output) and within_budget,
            cost=output.get("tokens_used", 0) * cost_per_token
        )
        results.append(result)
    return results
 
def _check_success(case: EvalCase, output: dict) -> bool:
    # Check tool accuracy
    if case.expected_tools:
        actual_tools = [t["name"] for t in output.get("tools_called", [])]
        if set(case.expected_tools) != set(actual_tools):
            return False
    # Check output similarity (simple substring check; use embeddings for production)
    if case.expected_output and case.expected_output.lower() not in output.get("response", "").lower():
        return False
    return True
 
def print_report(results: list[EvalResult]):
    total = len(results)
    successes = sum(1 for r in results if r.success)
    avg_latency = sum(r.latency for r in results) / total
    total_cost = sum(r.cost for r in results)
    avg_steps = sum(r.steps for r in results) / total
 
    print(f"=== Agent Evaluation Report ===")
    print(f"Task Success Rate: {successes}/{total} ({successes/total*100:.1f}%)")
    print(f"Average Latency:   {avg_latency:.2f}s")
    print(f"Average Steps:     {avg_steps:.1f}")
    print(f"Total Cost:        ${total_cost:.4f}")
    print(f"Cost per Task:     ${total_cost/total:.4f}")
 
# Usage
cases = [
    EvalCase(
        input="What is the weather in NYC?",
        expected_tools=["get_weather"],
        expected_output="72",
        max_latency_seconds=5.0
    ),
    EvalCase(
        input="Calculate 15% tip on $85",
        expected_tools=["calculator"],
        expected_output="12.75",
        max_latency_seconds=3.0
    )
]
 
# results = evaluate_agent(my_agent, cases)
# print_report(results)
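
The harness only requires that the agent function return a dict with the keys it reads: response, tools_called, steps, and tokens_used. A purely illustrative stub (my_agent here is a stand-in, not a real agent) shows that shape; with something like it in place, you can uncomment the two calls above and run the harness end-to-end.

# Purely illustrative stub showing the return shape evaluate_agent() expects.
# A real agent would call an LLM and actual tools here.
def my_agent(user_input: str) -> dict:
    if "weather" in user_input.lower():
        return {
            "response": "It is currently 72F and sunny in NYC.",
            "tools_called": [{"name": "get_weather", "args": {"city": "NYC"}}],
            "steps": 2,
            "tokens_used": 350,
        }
    return {
        "response": "A 15% tip on $85 is $12.75.",
        "tools_called": [{"name": "calculator", "args": {"expression": "85 * 0.15"}}],
        "steps": 2,
        "tokens_used": 280,
    }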

A/B Testing Agents

Compare two agent versions on identical tasks with statistical rigor.

import numpy as np
from scipy import stats
from dataclasses import dataclass
 
@dataclass
class ABTestResult:
    agent_a_success_rate: float
    agent_b_success_rate: float
    p_value: float
    significant: bool
    winner: str
 
def ab_test_agents(
    agent_a_fn,
    agent_b_fn,
    test_cases: list[dict],
    significance_level: float = 0.05
) -> ABTestResult:
    a_results = []
    b_results = []
 
    for case in test_cases:
        a_output = agent_a_fn(case["input"])
        b_output = agent_b_fn(case["input"])
 
        a_success = case["expected"] in a_output.get("response", "")
        b_success = case["expected"] in b_output.get("response", "")
 
        a_results.append(int(a_success))
        b_results.append(int(b_success))
 
    a_rate = np.mean(a_results)
    b_rate = np.mean(b_results)
 
    # Two-proportion z-test on the pooled success rate
    n = len(test_cases)
    count_a, count_b = sum(a_results), sum(b_results)
    p_pool = (count_a + count_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z_stat = (a_rate - b_rate) / se if se > 0 else 0.0
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
 
    winner = "A" if a_rate > b_rate else "B" if b_rate > a_rate else "tie"
 
    return ABTestResult(
        agent_a_success_rate=a_rate,
        agent_b_success_rate=b_rate,
        p_value=p_value,
        significant=p_value < significance_level,
        winner=winner if p_value < significance_level else "no significant difference"
    )
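
A usage sketch with two stand-in agent functions (agent_a_fn and agent_b_fn are illustrative; in practice they wrap two real agent versions). With binary pass/fail outcomes, detecting modest quality differences typically takes at least a few hundred shared test cases.

# Illustrative only: wire in two real agent versions in practice.
def agent_a_fn(question):  # stand-in for version A
    return {"response": "Paris is the capital of France."}

def agent_b_fn(question):  # stand-in for version B
    return {"response": "The capital of France is Paris."}

test_cases = [{"input": "What is the capital of France?", "expected": "Paris"}] * 200

result = ab_test_agents(agent_a_fn, agent_b_fn, test_cases)
print(f"A: {result.agent_a_success_rate:.1%}  B: {result.agent_b_success_rate:.1%}")
print(f"p-value: {result.p_value:.3f}  winner: {result.winner}")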

Human-in-the-Loop Evaluation

Automate 80% of evaluation; have humans review the subjective 20%.

Process:

  1. Run automated metrics (RAGAS, DeepEval) on all outputs
  2. Flag outputs with low automated scores (e.g., faithfulness < 0.7)
  3. Route flagged outputs to human reviewers (a minimal triage sketch follows this list)
  4. Score on 1-5 scale for quality, helpfulness, safety
  5. Target inter-annotator agreement > 0.8 Cohen's Kappa
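
A minimal sketch of steps 2-3, assuming each scored output is a plain dict with an id, the agent's answer, and per-metric scores (the field names and thresholds are illustrative, not from any particular tool):

# Illustrative triage: split automated results into auto-accepted outputs and
# a human-review queue based on metric thresholds. Field names are assumptions.
REVIEW_THRESHOLDS = {"faithfulness": 0.7, "answer_relevancy": 0.7}

def triage_for_review(scored_outputs: list[dict]) -> tuple[list[dict], list[dict]]:
    auto_accepted, needs_review = [], []
    for item in scored_outputs:
        low_scores = {
            metric: item["scores"][metric]
            for metric, threshold in REVIEW_THRESHOLDS.items()
            if item["scores"].get(metric, 1.0) < threshold
        }
        if low_scores:
            needs_review.append({**item, "flagged_for": low_scores})
        else:
            auto_accepted.append(item)
    return auto_accepted, needs_review

scored_outputs = [
    {"id": 1, "answer": "RAG grounds LLM answers in retrieved documents.",
     "scores": {"faithfulness": 0.95, "answer_relevancy": 0.9}},
    {"id": 2, "answer": "Vector search always returns exact matches.",
     "scores": {"faithfulness": 0.55, "answer_relevancy": 0.8}},
]

accepted, review_queue = triage_for_review(scored_outputs)
print(f"Auto-accepted: {len(accepted)}, sent to human review: {len(review_queue)}")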

Tools for human review:

  • Langfuse — Open-source LLM observability with annotation workflows
  • Label Studio — General annotation platform adaptable to LLM outputs
  • Argilla — Purpose-built for AI feedback and evaluation

Evaluation Strategy Decision Guide

graph TD
    A[What to evaluate?] --> B{RAG pipeline?}
    B -->|Yes| C[Use RAGAS]
    B -->|No| D{Agent with tools?}
    D -->|Yes| E[DeepEval + Custom harness]
    D -->|No| F{LLM quality only?}
    F -->|Yes| G[DeepEval metrics]
    C --> H{Need to compare versions?}
    E --> H
    G --> H
    H -->|Yes| I[A/B test framework]
    H -->|No| J[Continuous monitoring]
    I --> K[Add human review for edge cases]
    J --> K
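
For the "Continuous monitoring" branch, even a rolling success-rate check over recent production runs catches regressions early. A minimal, framework-free sketch (the window size and alert threshold are illustrative):

from collections import deque

# Illustrative rolling monitor: track the last N task outcomes and flag a
# regression when the rolling success rate drops below a threshold.
class RollingSuccessMonitor:
    def __init__(self, window: int = 100, alert_below: float = 0.85):
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, success: bool) -> None:
        self.outcomes.append(success)
        rate = sum(self.outcomes) / len(self.outcomes)
        if len(self.outcomes) == self.outcomes.maxlen and rate < self.alert_below:
            # Replace with your alerting channel (log, pager, dashboard).
            print(f"ALERT: rolling success rate {rate:.1%} below {self.alert_below:.0%}")

monitor = RollingSuccessMonitor(window=50, alert_below=0.85)
# monitor.record(task_succeeded)  # call after each completed production task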

See Also

evaluation · metrics · benchmarks · ragas · deepeval · testing · how-to
