Monitoring AI agents in production requires observability into non-deterministic, multi-step workflows. Unlike traditional software where inputs map predictably to outputs, agents make autonomous decisions, call tools, and chain reasoning steps – any of which can fail silently. This guide covers the observability stack, key metrics, tools, and implementation patterns.
Agent observability extends traditional APM (Application Performance Monitoring) with LLM-specific concepts:
A trace captures the entire lifecycle of an agent task – from the initial user message through all reasoning steps, tool calls, and the final response. Each trace has a unique ID and contains multiple spans.
Spans represent individual operations within a trace, such as a single LLM call, a tool execution, or a reasoning step.
Metrics are aggregated quantitative data computed from spans: p50/p95 latency, error rates, token throughput, and cost per request.
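The trace/span/metric relationship can be sketched with a minimal data model. This is a pure-Python illustration with hypothetical names, not any particular SDK's API:

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Span:
    name: str            # e.g. "llm-call", "tool:search"
    latency_ms: float
    tokens: int = 0
    error: bool = False

@dataclass
class Trace:
    spans: list = field(default_factory=list)
    trace_id: str = field(default_factory=lambda: str(uuid4()))

def p95_latency(traces):
    """p95 of total latency per trace, computed across traces."""
    totals = sorted(sum(s.latency_ms for s in t.spans) for t in traces)
    return totals[int(0.95 * (len(totals) - 1))]

def error_rate(traces):
    """Fraction of traces containing at least one failed span."""
    failed = sum(any(s.error for s in t.spans) for t in traces)
    return failed / len(traces)
```

In production these aggregates come from your observability backend, but the shape is the same: metrics are rollups over spans, spans roll up into traces.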
Track metrics across four categories:
| Category | Metrics | Why It Matters |
|---|---|---|
| Correctness | Faithfulness score, hallucination rate, answer relevancy, role adherence | Detects when the agent gives wrong or fabricated answers |
| Efficiency | p95 latency, steps to completion, token efficiency, tool call count | Identifies bottlenecks and runaway loops |
| Safety | Toxicity rate, PII leak rate, prompt injection attempts, guardrail trigger rate | Catches harmful outputs before they reach users |
| Business | Task completion rate, cost per session, user satisfaction score | Connects agent performance to business outcomes |
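As a sketch, the business-category metrics above can be computed from per-session records. Field names here are assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class Session:
    completed: bool    # did the agent finish the task?
    cost_usd: float    # summed LLM + tool spend for the session
    satisfaction: int  # e.g. 1-5 survey score, 0 if not rated

def task_completion_rate(sessions):
    return sum(s.completed for s in sessions) / len(sessions)

def cost_per_session(sessions):
    return sum(s.cost_usd for s in sessions) / len(sessions)

sessions = [
    Session(True, 0.04, 5),
    Session(True, 0.09, 4),
    Session(False, 0.21, 1),  # runaway loop: failed AND expensive
]
print(f"completion={task_completion_rate(sessions):.2f} "
      f"cost=${cost_per_session(sessions):.3f}")
```

Note how the failed session is also the most expensive one; correlating efficiency and business metrics surfaces exactly these runaway cases.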
Agents are inherently distributed: they call external LLM APIs, tool endpoints, databases, and sometimes other agents. Use OpenTelemetry-based instrumentation to capture this:
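OpenTelemetry formalizes this as one span per operation, linked by a shared trace ID. The following pure-Python sketch mimics that pattern without the SDK; in practice you would use `opentelemetry-sdk` or OpenLLMetry, and the `SPANS` list would be an exporter sending data to a collector:

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # stand-in for an OpenTelemetry exporter/collector

@contextmanager
def span(name, trace_id, **attributes):
    """Record one operation (LLM call, tool call, DB query) as a span."""
    record = {"name": name, "trace_id": trace_id, "attrs": attributes}
    start = time.perf_counter()
    try:
        yield record
        record["status"] = "ok"
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)

trace_id = str(uuid.uuid4())
with span("llm-call", trace_id, model="gpt-4o"):
    pass  # call the LLM API here
with span("tool:search", trace_id, query="..."):
    pass  # call the tool endpoint here
```

Because every span carries the same `trace_id`, the backend can reassemble the full multi-step, cross-service workflow even when the LLM, tools, and databases live behind different network boundaries.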
Start instrumenting from day one – retrofitting tracing into a production agent is significantly harder.
LLM costs can spike unexpectedly. Track cost per request, cost per session, and token throughput.
Set hard budget limits per user, per task, and per day. Kill agent runs that exceed token budgets.
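A minimal sketch of such a budget guard follows. The limit values and class names are illustrative, and in production the per-user daily counter would live in shared storage (e.g. Redis) rather than in process memory:

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised to kill an agent run that blew its token budget."""

class TokenBudget:
    def __init__(self, per_task=50_000, per_user_day=500_000):
        # illustrative defaults, not recommendations
        self.per_task = per_task
        self.per_user_day = per_user_day
        self.task_used = 0          # one TokenBudget per agent run
        self.user_day_used = {}     # user_id -> tokens used today

    def charge(self, user_id, tokens):
        """Record token usage; raise if any budget is exceeded."""
        self.task_used += tokens
        day = self.user_day_used.get(user_id, 0) + tokens
        self.user_day_used[user_id] = day
        if self.task_used > self.per_task:
            raise TokenBudgetExceeded(f"task budget {self.per_task} exceeded")
        if day > self.per_user_day:
            raise TokenBudgetExceeded(f"daily budget for {user_id} exceeded")

# inside the agent loop, after each LLM response:
#   budget.charge(user_id, response_token_count)
# and let TokenBudgetExceeded abort the run
```

Raising an exception (rather than just logging) guarantees a looping agent cannot keep spending after the budget is gone.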
Configure proactive alerts for error-rate spikes, latency regressions, cost anomalies, and guardrail triggers.
Use a combination of statistical anomaly detection and hard threshold rules. Escalate to human review for flagged traces.
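A minimal sketch of that combination – a hard ceiling plus a z-score anomaly check over a recent window. The threshold values are illustrative:

```python
import statistics

def should_alert(history, value, hard_max, z_thresh=3.0):
    """Alert if value breaks a hard ceiling, or deviates more than
    z_thresh standard deviations from recent history."""
    if value > hard_max:          # hard threshold rule
        return True
    if len(history) < 2:          # not enough data for statistics
        return False
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:                  # flat history: any change is anomalous
        return value != mean
    return abs(value - mean) / std > z_thresh
```

A metric that trips `should_alert` would page on the hard-threshold branch and route to human review on the anomaly branch, matching the escalation policy above.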
| Platform | Type | Key Strengths | Best For |
|---|---|---|---|
| LangSmith | Commercial | End-to-end tracing, evaluations, datasets | LangChain/LangGraph ecosystems |
| Arize Phoenix | Open-source | Trace visualization, LLM eval frameworks | Teams wanting self-hosted observability |
| LangFuse | Open-source | Cost tracking, prompt management, alerting | Budget-conscious production monitoring |
| OpenLLMetry | Open-source | OpenTelemetry for LLMs, distributed traces | Teams already using OpenTelemetry |
| Helicone | Commercial | Real-time cost monitoring, provider-agnostic | Cost-focused monitoring |
For LangChain-based agents, add tracing with minimal code:
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"

# All LangChain/LangGraph calls are now automatically traced
```
For custom agents, wrap each step with the `@traceable` decorator:

```python
from langsmith import traceable

@traceable
def my_agent_step(input_text):
    # LLM call, tool execution, etc.
    result = ...  # your agent logic here
    return result
```
With Langfuse, create traces and spans explicitly:

```python
from langfuse import Langfuse

langfuse = Langfuse(public_key="pk-...", secret_key="sk-...")

trace = langfuse.trace(name="agent-task")
span = trace.span(name="llm-call", input={"prompt": "..."})
# ... execute LLM call ...
span.end(output={"response": "...", "tokens": 150})
```
Build dashboards that show: