====== How to Evaluate an Agent ======
A practical guide to measuring AI agent quality. Covers key metrics, industry benchmarks, evaluation frameworks (RAGAS, DeepEval), and testing strategies from unit tests to A/B testing in production.
===== Evaluation Framework Overview =====
graph LR
subgraph Metrics
A[Task Success Rate]
B[Tool Call Accuracy]
C[Reasoning Quality]
D[Latency]
E[Cost per Task]
end
subgraph Methods
F[Automated Evals]
G[Benchmark Suites]
H[Human Review]
I[A/B Testing]
end
subgraph Tools
J[RAGAS]
K[DeepEval]
L[Langfuse]
M[Custom Framework]
end
A --> F
B --> F
C --> H
D --> F
E --> F
F --> J
F --> K
G --> J
H --> L
I --> M
===== Key Metrics =====
^ Metric ^ What It Measures ^ Target ^ Red Flag ^ How to Compute ^
| **Task Success Rate** | % of tasks fully completed | >87% | <72% | completed_tasks / total_tasks |
| **Tool Call Accuracy** | Correct tool selection + parameters | >95% | <80% | correct_calls / total_calls |
| **Reasoning Quality** | Faithfulness, minimal hallucination | Hallucination <3% | >10% | LLM-as-judge or human review |
| **Latency** | Time per response or full task | <4s per response | >10s | Measure end-to-end wall time |
| **Cost per Task** | Tokens + API calls + compute | Track and optimize | Unbounded growth | sum(tokens * price) per task |
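The formulas in the table can be computed directly from a per-task log. A minimal sketch; the record fields and the blended token price here are illustrative, not from any particular framework:

```python
# Hypothetical task log; field names are assumptions for illustration.
tasks = [
    {"completed": True,  "correct_tool_calls": 3, "tool_calls": 3, "latency_s": 2.1, "tokens": 1200},
    {"completed": True,  "correct_tool_calls": 2, "tool_calls": 3, "latency_s": 3.8, "tokens": 2400},
    {"completed": False, "correct_tool_calls": 1, "tool_calls": 2, "latency_s": 9.5, "tokens": 5100},
]

PRICE_PER_TOKEN = 0.00001  # assumed blended input/output rate

# completed_tasks / total_tasks
success_rate = sum(t["completed"] for t in tasks) / len(tasks)
# correct_calls / total_calls, pooled across tasks
tool_accuracy = sum(t["correct_tool_calls"] for t in tasks) / sum(t["tool_calls"] for t in tasks)
avg_latency = sum(t["latency_s"] for t in tasks) / len(tasks)
# sum(tokens * price) per task
cost_per_task = sum(t["tokens"] * PRICE_PER_TOKEN for t in tasks) / len(tasks)

print(f"Task Success Rate:  {success_rate:.0%}")   # 2/3 -> 67%
print(f"Tool Call Accuracy: {tool_accuracy:.0%}")  # 6/8 -> 75%
print(f"Avg Latency:        {avg_latency:.1f}s")
print(f"Cost per Task:      ${cost_per_task:.4f}")
```

Reasoning quality is the one row that has no closed-form formula; it needs an LLM-as-judge or human review pass, as the frameworks below provide.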
===== Benchmarks: Which One for What =====
^ Benchmark ^ Focus Area ^ What It Tests ^ When to Use ^
| **SWE-bench** | Coding agents | Resolve real GitHub issues end-to-end | Evaluating code generation/editing agents |
| **GAIA** | General agents | Real-world multi-step reasoning with tools | General-purpose agent evaluation |
| **AgentBench** | Multi-domain | Tool use, planning, persistence across domains | Broad agent capability assessment |
| **WebArena** | Web automation | Navigate real websites, complete tasks | Browser/web interaction agents |
| **BrowseComp** | Web research | Find specific information across the web | Research and information retrieval agents |
==== Benchmark Decision Guide ====
graph TD
A[What does your agent do?] --> B{Primary task?}
B -->|Write/edit code| C[SWE-bench]
B -->|Browse the web| D[WebArena]
B -->|General reasoning + tools| E[GAIA]
B -->|Multiple domains| F[AgentBench]
B -->|Research tasks| G[BrowseComp]
C --> H[Also add custom evals for your codebase]
D --> H
E --> H
F --> H
G --> H
===== Evaluating with RAGAS =====
RAGAS is a widely used framework for evaluating RAG pipelines. It measures faithfulness (is the answer grounded in the retrieved context?), answer relevancy, and context precision/recall.
pip install ragas datasets

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
import os

os.environ["OPENAI_API_KEY"] = "your-key"

# Prepare evaluation dataset
# Each row: question, contexts (retrieved), answer (agent output), ground_truth
eval_data = {
    "question": [
        "What is retrieval augmented generation?",
        "How does vector search work?",
        "What embedding models are best for RAG?"
    ],
    "contexts": [
        ["RAG combines retrieval from a knowledge base with LLM generation to produce grounded answers."],
        ["Vector search uses approximate nearest neighbor algorithms like HNSW to find similar embeddings."],
        ["Popular embedding models include OpenAI text-embedding-3, Cohere embed-v3, and open-source BGE."]
    ],
    "answer": [
        "RAG is a technique that retrieves relevant documents and uses them as context for an LLM to generate answers.",
        "Vector search converts text to embeddings and finds the closest vectors using algorithms like HNSW.",
        "The best embedding models for RAG include OpenAI's text-embedding-3-small and Cohere's embed-v3."
    ],
    "ground_truth": [
        "RAG retrieves relevant documents from a knowledge base and provides them as context to an LLM for generation.",
        "Vector search embeds text as vectors and uses ANN algorithms to find similar items efficiently.",
        "Top embedding models include OpenAI text-embedding-3, Cohere embed-v3, and BGE for open-source."
    ]
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print("RAGAS Scores:")
print(f"  Faithfulness:      {scores['faithfulness']:.3f}")
print(f"  Answer Relevancy:  {scores['answer_relevancy']:.3f}")
print(f"  Context Precision: {scores['context_precision']:.3f}")
print(f"  Context Recall:    {scores['context_recall']:.3f}")

# Targets: faithfulness > 0.9, relevancy > 0.85, precision > 0.8
===== Evaluating with DeepEval =====
DeepEval provides metrics for LLM outputs including reasoning quality, hallucination detection, and tool call accuracy.
pip install deepeval

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase
import os

os.environ["OPENAI_API_KEY"] = "your-key"

# Define test cases. FaithfulnessMetric reads retrieval_context;
# HallucinationMetric reads context, so both fields are set here.
test_cases = [
    LLMTestCase(
        input="What is the weather in NYC?",
        actual_output="The weather in NYC is currently 72F and sunny.",
        retrieval_context=["NYC weather: 72F, sunny, humidity 45%"],
        context=["NYC weather: 72F, sunny, humidity 45%"],
        expected_output="NYC is 72F and sunny."
    ),
    LLMTestCase(
        input="Search for Python tutorials",
        actual_output='{"query": "Python tutorials"}',
        retrieval_context=["Available tools: web_search(query: str)"],
        context=["Available tools: web_search(query: str)"],
        expected_output="Should call web_search with relevant query."
    ),
    LLMTestCase(
        input="What is quantum computing?",
        actual_output="Quantum computing uses qubits that can be in superposition, enabling parallel computation.",
        retrieval_context=["Quantum computers use qubits. Qubits leverage superposition and entanglement."],
        context=["Quantum computers use qubits. Qubits leverage superposition and entanglement."],
        expected_output="Quantum computing leverages qubits in superposition for parallel processing."
    )
]

# Define metrics with thresholds
# (ToolCorrectnessMetric is also available, but requires tools_called and
# expected_tools to be set on each test case)
metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    FaithfulnessMetric(threshold=0.9),
    HallucinationMetric(threshold=0.1),  # Lower is better
]

# Run evaluation
results = evaluate(test_cases=test_cases, metrics=metrics)

# Per-test scores and pass/fail (attribute names may vary between DeepEval versions)
for test_result in results.test_results:
    print(f"Test: {test_result.input[:50]}...")
    for metric_data in test_result.metrics_data:
        status = "PASS" if metric_data.success else "FAIL"
        print(f"  {metric_data.name}: {metric_data.score:.2f} ({status})")
===== Building a Custom Evaluation Framework =====
For agent-specific needs, build a lightweight eval harness.
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected_output: str = ""
    expected_tools: list = field(default_factory=list)
    max_steps: int = 10
    max_latency_seconds: float = 10.0

@dataclass
class EvalResult:
    case: EvalCase
    actual_output: str
    tools_called: list
    steps: int
    latency: float
    tokens_used: int
    success: bool
    cost: float

def evaluate_agent(
    agent_fn: Callable,
    cases: list[EvalCase],
    cost_per_token: float = 0.00001
) -> list[EvalResult]:
    results = []
    for case in cases:
        start = time.time()
        output = agent_fn(case.input)
        latency = time.time() - start
        # Extract metrics from output (adapt to your agent's return format)
        result = EvalResult(
            case=case,
            actual_output=output.get("response", ""),
            tools_called=output.get("tools_called", []),
            steps=output.get("steps", 0),
            latency=latency,
            tokens_used=output.get("tokens_used", 0),
            # Enforce the per-case step and latency budgets alongside correctness
            success=(
                _check_success(case, output)
                and latency <= case.max_latency_seconds
                and output.get("steps", 0) <= case.max_steps
            ),
            cost=output.get("tokens_used", 0) * cost_per_token
        )
        results.append(result)
    return results

def _check_success(case: EvalCase, output: dict) -> bool:
    # Check tool accuracy
    if case.expected_tools:
        actual_tools = [t["name"] for t in output.get("tools_called", [])]
        if set(case.expected_tools) != set(actual_tools):
            return False
    # Check output similarity (simple substring check; use embeddings for production)
    if case.expected_output and case.expected_output.lower() not in output.get("response", "").lower():
        return False
    return True

def print_report(results: list[EvalResult]):
    total = len(results)
    successes = sum(1 for r in results if r.success)
    avg_latency = sum(r.latency for r in results) / total
    total_cost = sum(r.cost for r in results)
    avg_steps = sum(r.steps for r in results) / total
    print("=== Agent Evaluation Report ===")
    print(f"Task Success Rate: {successes}/{total} ({successes/total*100:.1f}%)")
    print(f"Average Latency: {avg_latency:.2f}s")
    print(f"Average Steps: {avg_steps:.1f}")
    print(f"Total Cost: ${total_cost:.4f}")
    print(f"Cost per Task: ${total_cost/total:.4f}")

# Usage
cases = [
    EvalCase(
        input="What is the weather in NYC?",
        expected_tools=["get_weather"],
        expected_output="72",
        max_latency_seconds=5.0
    ),
    EvalCase(
        input="Calculate 15% tip on $85",
        expected_tools=["calculator"],
        expected_output="12.75",
        max_latency_seconds=3.0
    )
]
# results = evaluate_agent(my_agent, cases)
# print_report(results)
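The substring check in ''_check_success'' is brittle: a paraphrased but correct answer fails it. Embedding cosine similarity is the production-grade fix; as a dependency-free middle ground, a fuzzy string match from the standard library already catches near-misses. A sketch (the ''output_matches'' helper and its threshold are illustrative, not part of the harness above):

```python
# Dependency-free fuzzy matching via difflib as an upgrade over raw substring
# checks. (Embedding similarity is stronger; this is a stdlib sketch.)
from difflib import SequenceMatcher

def output_matches(expected: str, actual: str, threshold: float = 0.6) -> bool:
    """True if expected appears in actual, or the strings are similar enough."""
    if expected.lower() in actual.lower():
        return True  # keep the cheap substring fast-path
    ratio = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
    return ratio >= threshold

print(output_matches("72F and sunny", "The weather is 72F and sunny today."))  # True
print(output_matches("rainy", "72F and sunny"))  # False
```

Tune the threshold against a handful of known-good and known-bad outputs before trusting it in the success check.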
===== A/B Testing Agents =====
Compare two agent versions on identical tasks with statistical rigor.
import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class ABTestResult:
    agent_a_success_rate: float
    agent_b_success_rate: float
    p_value: float
    significant: bool
    winner: str

def ab_test_agents(
    agent_a_fn,
    agent_b_fn,
    test_cases: list[dict],
    significance_level: float = 0.05
) -> ABTestResult:
    a_results = []
    b_results = []
    for case in test_cases:
        a_output = agent_a_fn(case["input"])
        b_output = agent_b_fn(case["input"])
        a_success = case["expected"] in a_output.get("response", "")
        b_success = case["expected"] in b_output.get("response", "")
        a_results.append(int(a_success))
        b_results.append(int(b_success))
    a_rate = float(np.mean(a_results))
    b_rate = float(np.mean(b_results))
    # Two-proportion z-test with pooled variance; both agents see the same n cases
    n = len(test_cases)
    p_pool = (sum(a_results) + sum(b_results)) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    p_value = 1.0 if se == 0 else 2 * (1 - stats.norm.cdf(abs(a_rate - b_rate) / se))
    winner = "A" if a_rate > b_rate else "B" if b_rate > a_rate else "tie"
    return ABTestResult(
        agent_a_success_rate=a_rate,
        agent_b_success_rate=b_rate,
        p_value=p_value,
        significant=p_value < significance_level,
        winner=winner if p_value < significance_level else "no significant difference"
    )
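To build intuition for how many cases the z-test needs, it helps to run the arithmetic standalone on invented success counts (the 24/30 vs 15/30 numbers below are made up for illustration):

```python
# Standalone two-proportion z-test on simulated per-case outcomes.
import numpy as np
from scipy import stats

a_results = [1] * 24 + [0] * 6   # agent A: 24/30 tasks succeeded
b_results = [1] * 15 + [0] * 15  # agent B: 15/30 tasks succeeded
n = 30

p_a, p_b = np.mean(a_results), np.mean(b_results)
p_pool = (sum(a_results) + sum(b_results)) / (2 * n)   # pooled success rate
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))          # pooled standard error
z = (p_a - p_b) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))             # two-sided

print(f"z = {z:.2f}, p = {p_value:.4f}")  # significant at alpha = 0.05
```

A 30-point gap on 30 cases clears the 0.05 bar; a 5-point gap on the same sample size would not, which is why small eval suites so often report "no significant difference".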
===== Human-in-the-Loop Evaluation =====
Automate 80% of evaluation; have humans review the subjective 20%.
**Process:**
- Run automated metrics (RAGAS, DeepEval) on all outputs
- Flag outputs with low confidence scores (e.g., faithfulness < 0.7)
- Route flagged outputs to human reviewers
- Score on 1-5 scale for quality, helpfulness, safety
- Target inter-annotator agreement above 0.8 (Cohen's kappa)
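The agreement target in the last step can be checked with a few lines of stdlib Python; this is a minimal two-annotator Cohen's kappa, and the reviewer scores below are made up:

```python
# Minimal Cohen's kappa for two annotators (pure stdlib).
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Observed agreement between two annotators, corrected for chance."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Two reviewers scoring 8 outputs on the 1-5 quality scale (invented labels)
r1 = [5, 4, 4, 3, 5, 2, 4, 5]
r2 = [5, 4, 3, 3, 5, 2, 4, 4]
print(f"kappa = {cohens_kappa(r1, r2):.2f}")  # 0.65 -> below the 0.8 target
```

A kappa below target usually means the rubric is ambiguous; tighten the scoring guidelines before adding more reviewers.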
**Tools for human review:**
* **Langfuse** — Open-source LLM observability with annotation workflows
* **Label Studio** — General annotation platform adaptable to LLM outputs
* **Argilla** — Purpose-built for AI feedback and evaluation
===== Evaluation Strategy Decision Guide =====
graph TD
A[What to evaluate?] --> B{RAG pipeline?}
B -->|Yes| C[Use RAGAS]
B -->|No| D{Agent with tools?}
D -->|Yes| E[DeepEval + Custom harness]
D -->|No| F{LLM quality only?}
F -->|Yes| G[DeepEval metrics]
C --> H{Need to compare versions?}
E --> H
G --> H
H -->|Yes| I[A/B test framework]
H -->|No| J[Continuous monitoring]
I --> K[Add human review for edge cases]
J --> K
===== See Also =====
* [[how_to_build_a_rag_pipeline|How to Build a RAG Pipeline]]
* [[how_to_deploy_an_agent|How to Deploy an Agent]]
* [[how_to_add_memory_to_an_agent|How to Add Memory to an Agent]]
{{tag>evaluation metrics benchmarks ragas deepeval testing how-to}}