A practical guide to measuring AI agent quality. Covers key metrics, industry benchmarks, evaluation frameworks (RAGAS, DeepEval), and testing strategies from unit tests to A/B testing in production.
| Metric | What It Measures | Target | Red Flag | How to Compute |
|---|---|---|---|---|
| Task Success Rate | % of tasks fully completed | >87% | <72% | completed_tasks / total_tasks |
| Tool Call Accuracy | Correct tool selection + parameters | >95% | <80% | correct_calls / total_calls |
| Reasoning Quality | Faithfulness, minimal hallucination | Hallucination <3% | >10% | LLM-as-judge or human review |
| Latency | Time per response or full task | <4s per response | >10s | Measure end-to-end wall time |
| Cost per Task | Tokens + API calls + compute | Track and optimize | Unbounded growth | sum(tokens * price) per task |
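The formulas in the "How to Compute" column reduce to simple aggregation over logged task records. A minimal sketch, assuming a hypothetical `TaskRecord` log format (the field names are illustrative, not from any specific framework):

```python
from dataclasses import dataclass

# Hypothetical per-task log record; field names are illustrative.
@dataclass
class TaskRecord:
    completed: bool
    correct_tool_calls: int
    total_tool_calls: int
    latency_seconds: float
    tokens: int

def summarize(records: list[TaskRecord], price_per_token: float = 0.00001) -> dict:
    n = len(records)
    total_calls = sum(r.total_tool_calls for r in records)
    return {
        # completed_tasks / total_tasks
        "task_success_rate": sum(r.completed for r in records) / n,
        # correct_calls / total_calls
        "tool_call_accuracy": sum(r.correct_tool_calls for r in records) / total_calls,
        # average end-to-end wall time
        "avg_latency_s": sum(r.latency_seconds for r in records) / n,
        # sum(tokens * price) per task
        "cost_per_task": sum(r.tokens for r in records) * price_per_token / n,
    }

records = [
    TaskRecord(True, 3, 3, 2.1, 1200),
    TaskRecord(True, 2, 3, 3.4, 1500),
    TaskRecord(False, 1, 2, 9.8, 4000),
]
print(summarize(records))
```

Reasoning quality is the exception: it needs an LLM-as-judge or human scores rather than a counter.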
| Benchmark | Focus Area | What It Tests | When to Use |
|---|---|---|---|
| SWE-bench | Coding agents | Resolve real GitHub issues end-to-end | Evaluating code generation/editing agents |
| GAIA | General agents | Real-world multi-step reasoning with tools | General-purpose agent evaluation |
| AgentBench | Multi-domain | Tool use, planning, persistence across domains | Broad agent capability assessment |
| WebArena | Web automation | Navigate real websites, complete tasks | Browser/web interaction agents |
| BrowseComp | Web research | Find specific information across the web | Research and information retrieval agents |
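Whichever benchmark you pick, a run usually reduces to a per-task pass/fail vector, and a point-estimate success rate on a few dozen tasks is noisy. A stdlib-only sketch (the outcome data is hypothetical) of reporting the rate with a percentile bootstrap confidence interval:

```python
import random

def bootstrap_ci(passes: list[int], n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float, float]:
    """Success rate with a percentile-bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(passes)
    rate = sum(passes) / n
    # Resample the pass/fail vector with replacement and recompute the rate
    rates = sorted(sum(rng.choices(passes, k=n)) / n for _ in range(n_boot))
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return rate, lo, hi

# Hypothetical per-task outcomes from one benchmark run (1 = resolved, 0 = failed)
passes = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
rate, lo, hi = bootstrap_ci(passes)
print(f"success rate {rate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

A wide interval is a signal to run more tasks before claiming an improvement.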
RAGAS is the standard for evaluating RAG pipelines. It measures faithfulness (is the answer grounded in context?), answer relevancy, and context precision.
```bash
pip install ragas datasets
```

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
import os

os.environ["OPENAI_API_KEY"] = "your-key"

# Prepare evaluation dataset
# Each row: question, contexts (retrieved), answer (agent output), ground_truth
eval_data = {
    "question": [
        "What is retrieval augmented generation?",
        "How does vector search work?",
        "What embedding models are best for RAG?"
    ],
    "contexts": [
        ["RAG combines retrieval from a knowledge base with LLM generation to produce grounded answers."],
        ["Vector search uses approximate nearest neighbor algorithms like HNSW to find similar embeddings."],
        ["Popular embedding models include OpenAI text-embedding-3, Cohere embed-v3, and open-source BGE."]
    ],
    "answer": [
        "RAG is a technique that retrieves relevant documents and uses them as context for an LLM to generate answers.",
        "Vector search converts text to embeddings and finds the closest vectors using algorithms like HNSW.",
        "The best embedding models for RAG include OpenAI's text-embedding-3-small and Cohere's embed-v3."
    ],
    "ground_truth": [
        "RAG retrieves relevant documents from a knowledge base and provides them as context to an LLM for generation.",
        "Vector search embeds text as vectors and uses ANN algorithms to find similar items efficiently.",
        "Top embedding models include OpenAI text-embedding-3, Cohere embed-v3, and BGE for open-source."
    ]
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print("RAGAS Scores:")
print(f"  Faithfulness:      {scores['faithfulness']:.3f}")
print(f"  Answer Relevancy:  {scores['answer_relevancy']:.3f}")
print(f"  Context Precision: {scores['context_precision']:.3f}")
print(f"  Context Recall:    {scores['context_recall']:.3f}")

# Targets: faithfulness > 0.9, relevancy > 0.85, precision > 0.8
```
DeepEval provides metrics for LLM outputs including reasoning quality, hallucination detection, and tool call accuracy.
```bash
pip install deepeval
```

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    ToolCorrectnessMetric
)
from deepeval.test_case import LLMTestCase
import os

os.environ["OPENAI_API_KEY"] = "your-key"

# Define test cases
test_cases = [
    LLMTestCase(
        input="What is the weather in NYC?",
        actual_output="The weather in NYC is currently 72F and sunny.",
        retrieval_context=["NYC weather: 72F, sunny, humidity 45%"],
        expected_output="NYC is 72F and sunny."
    ),
    LLMTestCase(
        input="Search for Python tutorials",
        actual_output='<tool_call name="web_search">{"query": "Python tutorials"}</tool_call>',
        retrieval_context=["Available tools: web_search(query: str)"],
        expected_output="Should call web_search with relevant query."
    ),
    LLMTestCase(
        input="What is quantum computing?",
        actual_output="Quantum computing uses qubits that can be in superposition, enabling parallel computation.",
        retrieval_context=["Quantum computers use qubits. Qubits leverage superposition and entanglement."],
        expected_output="Quantum computing leverages qubits in superposition for parallel processing."
    )
]

# Define metrics with thresholds
metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    FaithfulnessMetric(threshold=0.9),
    HallucinationMetric(threshold=0.1),  # Lower is better
]

# Run evaluation
results = evaluate(test_cases=test_cases, metrics=metrics)

# Results show per-test scores and pass/fail
for result in results:
    print(f"Test: {result.input[:50]}...")
    for metric_result in result.metrics:
        print(f"  {metric_result.name}: {metric_result.score:.2f} "
              f"({'PASS' if metric_result.success else 'FAIL'})")
```
For agent-specific needs, build a lightweight eval harness.
```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected_output: str = ""
    expected_tools: list = field(default_factory=list)
    max_steps: int = 10
    max_latency_seconds: float = 10.0

@dataclass
class EvalResult:
    case: EvalCase
    actual_output: str
    tools_called: list
    steps: int
    latency: float
    tokens_used: int
    success: bool
    cost: float

def evaluate_agent(
    agent_fn: Callable,
    cases: list[EvalCase],
    cost_per_token: float = 0.00001
) -> list[EvalResult]:
    results = []
    for case in cases:
        start = time.time()
        output = agent_fn(case.input)
        latency = time.time() - start
        # Extract metrics from output (adapt to your agent's return format)
        result = EvalResult(
            case=case,
            actual_output=output.get("response", ""),
            tools_called=output.get("tools_called", []),
            steps=output.get("steps", 0),
            latency=latency,
            tokens_used=output.get("tokens_used", 0),
            success=_check_success(case, output, latency),
            cost=output.get("tokens_used", 0) * cost_per_token
        )
        results.append(result)
    return results

def _check_success(case: EvalCase, output: dict, latency: float) -> bool:
    # Check latency and step budgets
    if latency > case.max_latency_seconds:
        return False
    if output.get("steps", 0) > case.max_steps:
        return False
    # Check tool accuracy
    if case.expected_tools:
        actual_tools = [t["name"] for t in output.get("tools_called", [])]
        if set(case.expected_tools) != set(actual_tools):
            return False
    # Check output similarity (simple substring check; use embeddings for production)
    if case.expected_output and case.expected_output.lower() not in output.get("response", "").lower():
        return False
    return True

def print_report(results: list[EvalResult]):
    total = len(results)
    successes = sum(1 for r in results if r.success)
    avg_latency = sum(r.latency for r in results) / total
    total_cost = sum(r.cost for r in results)
    avg_steps = sum(r.steps for r in results) / total
    print("=== Agent Evaluation Report ===")
    print(f"Task Success Rate: {successes}/{total} ({successes/total*100:.1f}%)")
    print(f"Average Latency: {avg_latency:.2f}s")
    print(f"Average Steps: {avg_steps:.1f}")
    print(f"Total Cost: ${total_cost:.4f}")
    print(f"Cost per Task: ${total_cost/total:.4f}")

# Usage
cases = [
    EvalCase(
        input="What is the weather in NYC?",
        expected_tools=["get_weather"],
        expected_output="72",
        max_latency_seconds=5.0
    ),
    EvalCase(
        input="Calculate 15% tip on $85",
        expected_tools=["calculator"],
        expected_output="12.75",
        max_latency_seconds=3.0
    )
]
# results = evaluate_agent(my_agent, cases)
# print_report(results)
```
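A harness like this earns its keep when it runs in CI, so that a quality regression fails the build like any other broken test. A minimal pytest-style sketch, with a stubbed `fake_agent` standing in for a real agent callable (both the stub and the cases are hypothetical):

```python
# Minimal pytest-style gate: each eval case becomes an assertion, so a
# quality regression fails CI. fake_agent stands in for a real agent.
def fake_agent(prompt: str) -> dict:
    canned = {
        "What is the weather in NYC?": {
            "response": "It is 72F and sunny in NYC.",
            "tools_called": [{"name": "get_weather"}],
        },
        "Calculate 15% tip on $85": {
            "response": "A 15% tip on $85 is $12.75.",
            "tools_called": [{"name": "calculator"}],
        },
    }
    return canned[prompt]

CASES = [
    ("What is the weather in NYC?", ["get_weather"], "72"),
    ("Calculate 15% tip on $85", ["calculator"], "12.75"),
]

def test_agent_cases():
    for prompt, expected_tools, expected_substring in CASES:
        out = fake_agent(prompt)
        assert [t["name"] for t in out["tools_called"]] == expected_tools
        assert expected_substring in out["response"]

test_agent_cases()  # pytest would collect this; run directly here for illustration
```

For non-deterministic agents, gate on an aggregate success rate over several runs rather than per-case assertions.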
Compare two agent versions on identical tasks with statistical rigor.
```python
import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class ABTestResult:
    agent_a_success_rate: float
    agent_b_success_rate: float
    p_value: float
    significant: bool
    winner: str

def ab_test_agents(
    agent_a_fn,
    agent_b_fn,
    test_cases: list[dict],
    significance_level: float = 0.05
) -> ABTestResult:
    a_results = []
    b_results = []
    for case in test_cases:
        a_output = agent_a_fn(case["input"])
        b_output = agent_b_fn(case["input"])
        a_results.append(int(case["expected"] in a_output.get("response", "")))
        b_results.append(int(case["expected"] in b_output.get("response", "")))

    a_rate = np.mean(a_results)
    b_rate = np.mean(b_results)
    n = len(test_cases)

    # Two-proportion z-test with pooled variance
    pooled = (sum(a_results) + sum(b_results)) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        p_value = 1.0  # both agents passed (or failed) every task
    else:
        z = (a_rate - b_rate) / se
        p_value = 2 * stats.norm.sf(abs(z))

    winner = "A" if a_rate > b_rate else "B" if b_rate > a_rate else "tie"
    return ABTestResult(
        agent_a_success_rate=a_rate,
        agent_b_success_rate=b_rate,
        p_value=p_value,
        significant=p_value < significance_level,
        winner=winner if p_value < significance_level else "no significant difference"
    )
```
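Whether the test can reach significance depends mostly on sample size. A stdlib-only sketch of the standard two-proportion sample-size formula (using `statistics.NormalDist` for the normal quantiles) shows how many tasks each arm needs:

```python
from math import ceil
from statistics import NormalDist

def samples_per_arm(p1: float, p2: float, alpha: float = 0.05,
                    power: float = 0.8) -> int:
    """Approximate tasks per agent to detect success rates p1 vs p2
    with a two-sided two-proportion z-test at the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(((z_alpha + z_beta) ** 2 * variance) / (p1 - p2) ** 2)

# Detecting an 85% -> 90% improvement takes far more tasks than most eval sets have
print(samples_per_arm(0.85, 0.90))
```

The takeaway: small eval sets can only confirm large quality jumps; subtle improvements need hundreds of tasks per arm.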
Automate 80% of evaluation; have humans review the subjective 20%.
Process:
Tools for human review:
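Whatever review tooling you use, the routing logic is the same: auto-score everything, then queue all automatic failures plus the low-confidence and spot-check cases for humans. A minimal sketch, assuming a hypothetical results format with `id`, `auto_pass`, and `judge_confidence` fields:

```python
import random

def select_for_human_review(results: list[dict], sample_rate: float = 0.2,
                            seed: int = 0) -> list[dict]:
    """Queue auto-failures, low-confidence passes, and a random spot check.
    `results` is a hypothetical list of {"id", "auto_pass", "judge_confidence"}."""
    rng = random.Random(seed)
    failures = [r for r in results if not r["auto_pass"]]
    uncertain = [r for r in results if r["auto_pass"] and r["judge_confidence"] < 0.7]
    confident = [r for r in results if r["auto_pass"] and r["judge_confidence"] >= 0.7]
    # Spot-check a sample of confident passes to catch judge blind spots
    spot_check = rng.sample(confident, k=max(1, int(len(confident) * sample_rate))) if confident else []
    return failures + uncertain + spot_check

results = [
    {"id": i, "auto_pass": i % 5 != 0, "judge_confidence": 0.5 if i % 3 == 0 else 0.9}
    for i in range(10)
]
queue = select_for_human_review(results)
print(len(queue))  # 2 failures + 3 low-confidence passes + 1 spot check = 6
```

Tuning the confidence threshold and spot-check rate trades review cost against the risk of a drifting automated judge.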