====== How to Evaluate an Agent ======
A practical guide to measuring AI agent quality. Covers key metrics, industry benchmarks, evaluation frameworks (RAGAS, DeepEval), and testing strategies from unit tests to A/B testing in production.
===== Evaluation Framework Overview =====
graph LR
subgraph Metrics
A[Task Success Rate]
B[Tool Call Accuracy]
C[Reasoning Quality]
D[Latency]
E[Cost per Task]
end
subgraph Methods
F[Automated Evals]
G[Benchmark Suites]
H[Human Review]
I[A/B Testing]
end
subgraph Tools
J[RAGAS]
K[DeepEval]
L[Langfuse]
M[Custom Framework]
end
A --> F
B --> F
C --> H
D --> F
E --> F
F --> J
F --> K
G --> J
H --> L
I --> M
===== Key Metrics =====
^ Metric ^ What It Measures ^ Target ^ Red Flag ^ How to Compute ^
| **Task Success Rate** | % of tasks fully completed | >87% | <72% | completed_tasks / total_tasks |
| **Tool Call Accuracy** | Correct tool selection + parameters | >95% | <80% | correct_calls / total_calls |
| **Reasoning Quality** | Faithfulness, minimal hallucination | Hallucination <3% | >10% | LLM-as-judge or human review |
| **Latency** | Time per response or full task | <4s per response | >10s | Measure end-to-end wall time |
| **Cost per Task** | Tokens + API calls + compute | Track and optimize | Unbounded growth | sum(tokens * price) per task |
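The formulas in the table can be computed directly from a per-task log. A minimal sketch; the record fields and the blended token price here are illustrative, not from any particular framework:

```python
# Hypothetical task log; field names are assumptions for illustration.
tasks = [
    {"completed": True,  "correct_tool_calls": 3, "tool_calls": 3, "latency_s": 2.1, "tokens": 1200},
    {"completed": True,  "correct_tool_calls": 2, "tool_calls": 3, "latency_s": 3.8, "tokens": 2400},
    {"completed": False, "correct_tool_calls": 1, "tool_calls": 2, "latency_s": 9.5, "tokens": 5100},
]

PRICE_PER_TOKEN = 0.00001  # assumed blended input/output rate

# completed_tasks / total_tasks
success_rate = sum(t["completed"] for t in tasks) / len(tasks)
# correct_calls / total_calls, pooled across tasks
tool_accuracy = sum(t["correct_tool_calls"] for t in tasks) / sum(t["tool_calls"] for t in tasks)
avg_latency = sum(t["latency_s"] for t in tasks) / len(tasks)
# sum(tokens * price) per task
cost_per_task = sum(t["tokens"] * PRICE_PER_TOKEN for t in tasks) / len(tasks)

print(f"Task Success Rate:  {success_rate:.0%}")   # 2/3 -> 67%
print(f"Tool Call Accuracy: {tool_accuracy:.0%}")  # 6/8 -> 75%
print(f"Avg Latency:        {avg_latency:.1f}s")
print(f"Cost per Task:      ${cost_per_task:.4f}")
```

Reasoning quality is the one row that has no closed-form formula; it needs an LLM-as-judge or human review pass, as the frameworks below provide.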
===== Benchmarks: Which One for What =====
^ Benchmark ^ Focus Area ^ What It Tests ^ When to Use ^
| **SWE-bench** | Coding agents | Resolve real GitHub issues end-to-end | Evaluating code generation/editing agents |
| **GAIA** | General agents | Real-world multi-step reasoning with tools | General-purpose agent evaluation |
| **AgentBench** | Multi-domain | Tool use, planning, persistence across domains | Broad agent capability assessment |
| **WebArena** | Web automation | Navigate real websites, complete tasks | Browser/web interaction agents |
| **BrowseComp** | Web research | Find specific information across the web | Research and information retrieval agents |
==== Benchmark Decision Guide ====
graph TD
A[What does your agent do?] --> B{Primary task?}
B -->|Write/edit code| C[SWE-bench]
B -->|Browse the web| D[WebArena]
B -->|General reasoning + tools| E[GAIA]
B -->|Multiple domains| F[AgentBench]
B -->|Research tasks| G[BrowseComp]
C --> H[Also add custom evals for your codebase]
D --> H
E --> H
F --> H
G --> H
===== Evaluating with RAGAS =====
RAGAS is a widely used framework for evaluating RAG pipelines. It measures faithfulness (is the answer grounded in the retrieved context?), answer relevancy, and context precision/recall.
pip install ragas datasets

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
import os

os.environ["OPENAI_API_KEY"] = "your-key"

# Prepare evaluation dataset
# Each row: question, contexts (retrieved), answer (agent output), ground_truth
eval_data = {
    "question": [
        "What is retrieval augmented generation?",
        "How does vector search work?",
        "What embedding models are best for RAG?"
    ],
    "contexts": [
        ["RAG combines retrieval from a knowledge base with LLM generation to produce grounded answers."],
        ["Vector search uses approximate nearest neighbor algorithms like HNSW to find similar embeddings."],
        ["Popular embedding models include OpenAI text-embedding-3, Cohere embed-v3, and open-source BGE."]
    ],
    "answer": [
        "RAG is a technique that retrieves relevant documents and uses them as context for an LLM to generate answers.",
        "Vector search converts text to embeddings and finds the closest vectors using algorithms like HNSW.",
        "The best embedding models for RAG include OpenAI's text-embedding-3-small and Cohere's embed-v3."
    ],
    "ground_truth": [
        "RAG retrieves relevant documents from a knowledge base and provides them as context to an LLM for generation.",
        "Vector search embeds text as vectors and uses ANN algorithms to find similar items efficiently.",
        "Top embedding models include OpenAI text-embedding-3, Cohere embed-v3, and BGE for open-source."
    ]
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print("RAGAS Scores:")
print(f"  Faithfulness:      {scores['faithfulness']:.3f}")
print(f"  Answer Relevancy:  {scores['answer_relevancy']:.3f}")
print(f"  Context Precision: {scores['context_precision']:.3f}")
print(f"  Context Recall:    {scores['context_recall']:.3f}")

# Targets: faithfulness > 0.9, relevancy > 0.85, precision > 0.8
===== Evaluating with DeepEval =====
DeepEval provides metrics for LLM outputs including reasoning quality, hallucination detection, and tool call accuracy.
pip install deepeval

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase
import os

os.environ["OPENAI_API_KEY"] = "your-key"

# Define test cases. FaithfulnessMetric reads retrieval_context;
# HallucinationMetric reads context, so both fields are set here.
test_cases = [
    LLMTestCase(
        input="What is the weather in NYC?",
        actual_output="The weather in NYC is currently 72F and sunny.",
        retrieval_context=["NYC weather: 72F, sunny, humidity 45%"],
        context=["NYC weather: 72F, sunny, humidity 45%"],
        expected_output="NYC is 72F and sunny."
    ),
    LLMTestCase(
        input="Search for Python tutorials",
        actual_output='{"query": "Python tutorials"}',
        retrieval_context=["Available tools: web_search(query: str)"],
        context=["Available tools: web_search(query: str)"],
        expected_output="Should call web_search with relevant query."
    ),
    LLMTestCase(
        input="What is quantum computing?",
        actual_output="Quantum computing uses qubits that can be in superposition, enabling parallel computation.",
        retrieval_context=["Quantum computers use qubits. Qubits leverage superposition and entanglement."],
        context=["Quantum computers use qubits. Qubits leverage superposition and entanglement."],
        expected_output="Quantum computing leverages qubits in superposition for parallel processing."
    )
]

# Define metrics with thresholds
# (ToolCorrectnessMetric is also available, but requires tools_called and
# expected_tools to be set on each test case)
metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    FaithfulnessMetric(threshold=0.9),
    HallucinationMetric(threshold=0.1),  # Lower is better
]

# Run evaluation
results = evaluate(test_cases=test_cases, metrics=metrics)

# Per-test scores and pass/fail (attribute names may vary between DeepEval versions)
for test_result in results.test_results:
    print(f"Test: {test_result.input[:50]}...")
    for metric_data in test_result.metrics_data:
        status = "PASS" if metric_data.success else "FAIL"
        print(f"  {metric_data.name}: {metric_data.score:.2f} ({status})")
===== Building a Custom Evaluation Framework =====
For agent-specific needs, build a lightweight eval harness.
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected_output: str = ""
    expected_tools: list = field(default_factory=list)
    max_steps: int = 10
    max_latency_seconds: float = 10.0

@dataclass
class EvalResult:
    case: EvalCase
    actual_output: str
    tools_called: list
    steps: int
    latency: float
    tokens_used: int
    success: bool
    cost: float

def evaluate_agent(
    agent_fn: Callable,
    cases: list[EvalCase],
    cost_per_token: float = 0.00001
) -> list[EvalResult]:
    results = []
    for case in cases:
        start = time.time()
        output = agent_fn(case.input)
        latency = time.time() - start
        # Extract metrics from output (adapt to your agent's return format)
        result = EvalResult(
            case=case,
            actual_output=output.get("response", ""),
            tools_called=output.get("tools_called", []),
            steps=output.get("steps", 0),
            latency=latency,
            tokens_used=output.get("tokens_used", 0),
            # Enforce the per-case step and latency budgets alongside correctness
            success=(
                _check_success(case, output)
                and latency <= case.max_latency_seconds
                and output.get("steps", 0) <= case.max_steps
            ),
            cost=output.get("tokens_used", 0) * cost_per_token
        )
        results.append(result)
    return results

def _check_success(case: EvalCase, output: dict) -> bool:
    # Check tool accuracy
    if case.expected_tools:
        actual_tools = [t["name"] for t in output.get("tools_called", [])]
        if set(case.expected_tools) != set(actual_tools):
            return False
    # Check output similarity (simple substring check; use embeddings for production)
    if case.expected_output and case.expected_output.lower() not in output.get("response", "").lower():
        return False
    return True

def print_report(results: list[EvalResult]):
    total = len(results)
    successes = sum(1 for r in results if r.success)
    avg_latency = sum(r.latency for r in results) / total
    total_cost = sum(r.cost for r in results)
    avg_steps = sum(r.steps for r in results) / total
    print("=== Agent Evaluation Report ===")
    print(f"Task Success Rate: {successes}/{total} ({successes/total*100:.1f}%)")
    print(f"Average Latency: {avg_latency:.2f}s")
    print(f"Average Steps: {avg_steps:.1f}")
    print(f"Total Cost: ${total_cost:.4f}")
    print(f"Cost per Task: ${total_cost/total:.4f}")

# Usage
cases = [
    EvalCase(
        input="What is the weather in NYC?",
        expected_tools=["get_weather"],
        expected_output="72",
        max_latency_seconds=5.0
    ),
    EvalCase(
        input="Calculate 15% tip on $85",
        expected_tools=["calculator"],
        expected_output="12.75",
        max_latency_seconds=3.0
    )
]
# results = evaluate_agent(my_agent, cases)
# print_report(results)
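The substring check in ''_check_success'' is brittle: a paraphrased but correct answer fails it. Embedding cosine similarity is the production-grade fix; as a dependency-free middle ground, a fuzzy string match from the standard library already catches near-misses. A sketch (the ''output_matches'' helper and its threshold are illustrative, not part of the harness above):

```python
# Dependency-free fuzzy matching via difflib as an upgrade over raw substring
# checks. (Embedding similarity is stronger; this is a stdlib sketch.)
from difflib import SequenceMatcher

def output_matches(expected: str, actual: str, threshold: float = 0.6) -> bool:
    """True if expected appears in actual, or the strings are similar enough."""
    if expected.lower() in actual.lower():
        return True  # keep the cheap substring fast-path
    ratio = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
    return ratio >= threshold

print(output_matches("72F and sunny", "The weather is 72F and sunny today."))  # True
print(output_matches("rainy", "72F and sunny"))  # False
```

Tune the threshold against a handful of known-good and known-bad outputs before trusting it in the success check.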
===== A/B Testing Agents =====
Compare two agent versions on identical tasks with statistical rigor.
import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class ABTestResult:
    agent_a_success_rate: float
    agent_b_success_rate: float
    p_value: float
    significant: bool
    winner: str

def ab_test_agents(
    agent_a_fn,
    agent_b_fn,
    test_cases: list[dict],
    significance_level: float = 0.05
) -> ABTestResult:
    a_results = []
    b_results = []
    for case in test_cases:
        a_output = agent_a_fn(case["input"])
        b_output = agent_b_fn(case["input"])
        a_success = case["expected"] in a_output.get("response", "")
        b_success = case["expected"] in b_output.get("response", "")
        a_results.append(int(a_success))
        b_results.append(int(b_success))
    a_rate = float(np.mean(a_results))
    b_rate = float(np.mean(b_results))
    # Two-proportion z-test with pooled variance; both agents see the same n cases
    n = len(test_cases)
    p_pool = (sum(a_results) + sum(b_results)) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    p_value = 1.0 if se == 0 else 2 * (1 - stats.norm.cdf(abs(a_rate - b_rate) / se))
    winner = "A" if a_rate > b_rate else "B" if b_rate > a_rate else "tie"
    return ABTestResult(
        agent_a_success_rate=a_rate,
        agent_b_success_rate=b_rate,
        p_value=p_value,
        significant=p_value < significance_level,
        winner=winner if p_value < significance_level else "no significant difference"
    )
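To build intuition for how many cases the z-test needs, it helps to run the arithmetic standalone on invented success counts (the 24/30 vs 15/30 numbers below are made up for illustration):

```python
# Standalone two-proportion z-test on simulated per-case outcomes.
import numpy as np
from scipy import stats

a_results = [1] * 24 + [0] * 6   # agent A: 24/30 tasks succeeded
b_results = [1] * 15 + [0] * 15  # agent B: 15/30 tasks succeeded
n = 30

p_a, p_b = np.mean(a_results), np.mean(b_results)
p_pool = (sum(a_results) + sum(b_results)) / (2 * n)   # pooled success rate
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))          # pooled standard error
z = (p_a - p_b) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))             # two-sided

print(f"z = {z:.2f}, p = {p_value:.4f}")  # significant at alpha = 0.05
```

A 30-point gap on 30 cases clears the 0.05 bar; a 5-point gap on the same sample size would not, which is why small eval suites so often report "no significant difference".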
===== Human-in-the-Loop Evaluation =====
Automate 80% of evaluation; have humans review the subjective 20%.
**Process:**
- Run automated metrics (RAGAS, DeepEval) on all outputs
- Flag outputs with low confidence scores (e.g., faithfulness < 0.7)
- Route flagged outputs to human reviewers
- Score on 1-5 scale for quality, helpfulness, safety
- Target inter-annotator agreement above 0.8 (Cohen's kappa)
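The agreement target in the last step can be checked with a few lines of stdlib Python; this is a minimal two-annotator Cohen's kappa, and the reviewer scores below are made up:

```python
# Minimal Cohen's kappa for two annotators (pure stdlib).
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Observed agreement between two annotators, corrected for chance."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Two reviewers scoring 8 outputs on the 1-5 quality scale (invented labels)
r1 = [5, 4, 4, 3, 5, 2, 4, 5]
r2 = [5, 4, 3, 3, 5, 2, 4, 4]
print(f"kappa = {cohens_kappa(r1, r2):.2f}")  # 0.65 -> below the 0.8 target
```

A kappa below target usually means the rubric is ambiguous; tighten the scoring guidelines before adding more reviewers.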
**Tools for human review:**
* **Langfuse** — Open-source LLM observability with annotation workflows
* **Label Studio** — General annotation platform adaptable to LLM outputs
* **Argilla** — Purpose-built for AI feedback and evaluation
===== Evaluation Strategy Decision Guide =====
graph TD
A[What to evaluate?] --> B{RAG pipeline?}
B -->|Yes| C[Use RAGAS]
B -->|No| D{Agent with tools?}
D -->|Yes| E[DeepEval + Custom harness]
D -->|No| F{LLM quality only?}
F -->|Yes| G[DeepEval metrics]
C --> H{Need to compare versions?}
E --> H
G --> H
H -->|Yes| I[A/B test framework]
H -->|No| J[Continuous monitoring]
I --> K[Add human review for edge cases]
J --> K
===== See Also =====
* [[how_to_build_a_rag_pipeline|How to Build a RAG Pipeline]]
* [[how_to_deploy_an_agent|How to Deploy an Agent]]
* [[how_to_add_memory_to_an_agent|How to Add Memory to an Agent]]
{{tag>evaluation metrics benchmarks ragas deepeval testing how-to}}