How to Monitor Agents

Monitoring AI agents in production requires observability into non-deterministic, multi-step workflows. Unlike traditional software where inputs map predictably to outputs, agents make autonomous decisions, call tools, and chain reasoning steps – any of which can fail silently. This guide covers the observability stack, key metrics, tools, and implementation patterns.

Observability Fundamentals

Agent observability extends traditional APM (Application Performance Monitoring) with LLM-specific concepts:

Traces

A trace captures the entire lifecycle of an agent task – from the initial user message through all reasoning steps, tool calls, and the final response. Each trace has a unique ID and contains multiple spans.

Spans

Spans represent individual operations within a trace: an LLM call, a tool execution, a retrieval query, or a guardrail check. Spans nest, so a single reasoning step can contain the tool calls it triggered.

Metrics

Aggregated quantitative data computed from spans: p50/p95 latency, error rates, token throughput, and cost per request.

Key Metrics

Track metrics across four categories:

| Category | Metrics | Why It Matters |
| --- | --- | --- |
| Correctness | Faithfulness score, hallucination rate, answer relevancy, role adherence | Detects when the agent gives wrong or fabricated answers |
| Efficiency | p95 latency, steps to completion, token efficiency, tool call count | Identifies bottlenecks and runaway loops |
| Safety | Toxicity rate, PII leak rate, prompt injection attempts, guardrail trigger rate | Catches harmful outputs before they reach users |
| Business | Task completion rate, cost per session, user satisfaction score | Connects agent performance to business outcomes |


Distributed Tracing

Agents are inherently distributed: they call external LLM APIs, tool endpoints, databases, and sometimes other agents. Use OpenTelemetry-based instrumentation to capture this:
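In production you would attach an OpenTelemetry SDK to do this for you. As a dependency-free illustration of the trace/span structure such instrumentation records, here is a hand-rolled sketch; the span names and attributes are illustrative, not from any SDK.

```python
import time
import uuid
from contextlib import contextmanager

class Trace:
    def __init__(self, name):
        self.trace_id = uuid.uuid4().hex
        self.name = name
        self.spans = []   # finished spans, in completion order
        self._stack = []  # currently open spans, for parent/child links

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "name": name,
            "parent": self._stack[-1]["name"] if self._stack else None,
            "attributes": attributes,
            "start": time.monotonic(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            record["duration_ms"] = (time.monotonic() - record["start"]) * 1000
            self._stack.pop()
            self.spans.append(record)

trace = Trace("agent-task")
with trace.span("agent.task"):
    with trace.span("llm.call", model="gpt-4o"):   # hypothetical model name
        pass  # stand-in for the real LLM request
    with trace.span("tool.call", tool="search"):
        pass  # stand-in for the real tool execution

# Each child span records its parent, so the call tree can be reconstructed.
print([(s["name"], s["parent"]) for s in trace.spans])
```

OpenTelemetry gives you the same nesting via `tracer.start_as_current_span`, plus context propagation across the external services the agent calls.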

Start instrumenting from day one – retrofitting tracing into a production agent is significantly harder.

Cost Tracking

LLM costs can spike unexpectedly. Track:

- Input and output tokens per request, broken down by model
- Cost per session, per user, and per day
- Cost per completed task, so expensive retries and runaway loops stand out

Set hard budget limits per user, per task, and per day. Kill agent runs that exceed token budgets.
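A hard token budget can be enforced with a small kill switch wrapped around every LLM call. This is a minimal sketch; the `TokenBudgetExceeded` name and the budget numbers are illustrative, not from any SDK.

```python
class TokenBudgetExceeded(RuntimeError):
    pass

class BudgetTracker:
    def __init__(self, max_tokens_per_run: int):
        self.max_tokens = max_tokens_per_run
        self.used = 0

    def record(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            # Kill the run rather than letting a loop burn tokens indefinitely.
            raise TokenBudgetExceeded(
                f"run used {self.used} tokens, budget is {self.max_tokens}"
            )

budget = BudgetTracker(max_tokens_per_run=1000)
budget.record(400)      # fine
budget.record(500)      # fine, 900 total
try:
    budget.record(300)  # 1200 total: exceeds budget, run is killed
except TokenBudgetExceeded as exc:
    print(f"stopped: {exc}")
```

The same pattern extends to per-user and per-day budgets by keying a tracker on user ID or date.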

Alerting

Configure proactive alerts for:

- p95 latency exceeding its baseline
- Spikes in error rate or guardrail trigger rate
- Cost anomalies, such as a sudden jump in tokens per session
- Drops in task completion rate

Use a combination of statistical anomaly detection and hard threshold rules. Escalate to human review for flagged traces.
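The combination of a hard threshold rule with statistical anomaly detection can be sketched as a z-score check against a rolling baseline; the limits used here are illustrative.

```python
from statistics import mean, stdev

def should_alert(history: list[float], current: float,
                 hard_limit: float, z_limit: float = 3.0) -> bool:
    if current > hard_limit:          # hard threshold rule
        return True
    if len(history) >= 2:
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (current - mu) / sigma > z_limit:
            return True               # statistical anomaly vs. the baseline
    return False

# Hypothetical p95 latency samples in ms from recent time windows.
latency_history = [210, 195, 230, 205, 220, 198, 215]
print(should_alert(latency_history, current=2400, hard_limit=2000))  # True: exceeds hard limit
print(should_alert(latency_history, current=600, hard_limit=2000))   # True: statistical anomaly
```

Hard limits catch absolute disasters; the z-score catches regressions that are large relative to normal behavior but still under the hard limit.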

Tools and Platforms

| Platform | Type | Key Strengths | Best For |
| --- | --- | --- | --- |
| LangSmith | Commercial | End-to-end tracing, evaluations, datasets | LangChain/LangGraph ecosystems |
| Arize Phoenix | Open-source | Trace visualization, LLM eval frameworks | Teams wanting self-hosted observability |
| Langfuse | Open-source | Cost tracking, prompt management, alerting | Budget-conscious production monitoring |
| OpenLLMetry | Open-source | OpenTelemetry for LLMs, distributed traces | Teams already using OpenTelemetry |
| Helicone | Commercial | Real-time cost monitoring, provider-agnostic | Cost-focused monitoring |

LangSmith Integration

For LangChain-based agents, add tracing with minimal code:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"

# All LangChain/LangGraph calls are now automatically traced

For custom agents, use the @traceable decorator, which records the function's inputs and outputs as a span:

from langsmith import traceable

@traceable
def my_agent_step(input_text):
    # LLM call, tool execution, etc. -- replace this stand-in with real agent logic
    result = input_text.upper()
    return result

Langfuse Integration

from langfuse import Langfuse

langfuse = Langfuse(public_key="pk-...", secret_key="sk-...")
trace = langfuse.trace(name="agent-task")
span = trace.span(name="llm-call", input={"prompt": "..."})
# ... execute LLM call ...
span.end(output={"response": "...", "tokens": 150})
langfuse.flush()  # ensure buffered events are sent before the process exits


Dashboards

Build dashboards that show:

- Latency percentiles (p50/p95) and error rates over time
- Token usage and cost per session, broken down by model
- Task completion rate and guardrail trigger rate
- Recent flagged traces queued for human review

Best Practices

- Instrument from day one; retrofitting tracing into a production agent is significantly harder
- Set hard budget limits per user, per task, and per day, and kill runs that exceed them
- Combine statistical anomaly detection with hard threshold rules for alerting
- Escalate flagged traces to human review
