Monitoring AI agents in production requires observability into non-deterministic, multi-step workflows. Unlike traditional software where inputs map predictably to outputs, agents make autonomous decisions, call tools, and chain reasoning steps – any of which can fail silently. This guide covers the observability stack, key metrics, tools, and implementation patterns.
Agent observability extends traditional APM (Application Performance Monitoring) with LLM-specific concepts:
A trace captures the entire lifecycle of an agent task – from the initial user message through all reasoning steps, tool calls, and the final response. Each trace has a unique ID and contains multiple spans.
Spans represent individual operations within a trace, such as a single LLM call, a tool execution, or a reasoning step.
Metrics are aggregated quantitative data computed from spans: p50/p95 latency, error rates, token throughput, and cost per request.
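The trace/span/metric relationship can be sketched with a minimal data model. This is a pure-Python illustration with hypothetical names, not any particular SDK's API:

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Span:
    name: str            # e.g. "llm-call", "tool:search"
    latency_ms: float
    tokens: int = 0
    error: bool = False

@dataclass
class Trace:
    spans: list = field(default_factory=list)
    trace_id: str = field(default_factory=lambda: str(uuid4()))

def p95_latency(traces):
    """p95 of total latency per trace, computed across traces."""
    totals = sorted(sum(s.latency_ms for s in t.spans) for t in traces)
    return totals[int(0.95 * (len(totals) - 1))]

def error_rate(traces):
    """Fraction of traces containing at least one failed span."""
    failed = sum(any(s.error for s in t.spans) for t in traces)
    return failed / len(traces)
```

In production these aggregates come from your observability backend, but the shape is the same: metrics are rollups over spans, spans roll up into traces.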
Track metrics across four categories:
| Category | Metrics | Why It Matters |
|---|---|---|
| Correctness | Faithfulness score, hallucination rate, answer relevancy, role adherence | Detects when the agent gives wrong or fabricated answers |
| Efficiency | p95 latency, steps to completion, token efficiency, tool call count | Identifies bottlenecks and runaway loops |
| Safety | Toxicity rate, PII leak rate, prompt injection attempts, guardrail trigger rate | Catches harmful outputs before they reach users |
| Business | Task completion rate, cost per session, user satisfaction score | Connects agent performance to business outcomes |
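As a sketch, the business-category metrics above can be computed from per-session records. Field names here are assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class Session:
    completed: bool    # did the agent finish the task?
    cost_usd: float    # summed LLM + tool spend for the session
    satisfaction: int  # e.g. 1-5 survey score, 0 if not rated

def task_completion_rate(sessions):
    return sum(s.completed for s in sessions) / len(sessions)

def cost_per_session(sessions):
    return sum(s.cost_usd for s in sessions) / len(sessions)

sessions = [
    Session(True, 0.04, 5),
    Session(True, 0.09, 4),
    Session(False, 0.21, 1),  # runaway loop: failed AND expensive
]
print(f"completion={task_completion_rate(sessions):.2f} "
      f"cost=${cost_per_session(sessions):.3f}")
```

Note how the failed session is also the most expensive one; correlating efficiency and business metrics surfaces exactly these runaway cases.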
Agents are inherently distributed: they call external LLM APIs, tool endpoints, databases, and sometimes other agents. Use OpenTelemetry-based instrumentation to capture this:
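OpenTelemetry formalizes this as one span per operation, linked by a shared trace ID. The following pure-Python sketch mimics that pattern without the SDK; in practice you would use `opentelemetry-sdk` or OpenLLMetry, and the `SPANS` list would be an exporter sending data to a collector:

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # stand-in for an OpenTelemetry exporter/collector

@contextmanager
def span(name, trace_id, **attributes):
    """Record one operation (LLM call, tool call, DB query) as a span."""
    record = {"name": name, "trace_id": trace_id, "attrs": attributes}
    start = time.perf_counter()
    try:
        yield record
        record["status"] = "ok"
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)

trace_id = str(uuid.uuid4())
with span("llm-call", trace_id, model="gpt-4o"):
    pass  # call the LLM API here
with span("tool:search", trace_id, query="..."):
    pass  # call the tool endpoint here
```

Because every span carries the same `trace_id`, the backend can reassemble the full multi-step, cross-service workflow even when the LLM, tools, and databases live behind different network boundaries.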
Start instrumenting from day one – retrofitting tracing into a production agent is significantly harder.
LLM costs can spike unexpectedly. Track cost per request, cost per session, and token throughput.
Set hard budget limits per user, per task, and per day. Kill agent runs that exceed token budgets.
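A minimal sketch of such a budget guard follows. The limit values and class names are illustrative, and in production the per-user daily counter would live in shared storage (e.g. Redis) rather than in process memory:

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised to kill an agent run that blew its token budget."""

class TokenBudget:
    def __init__(self, per_task=50_000, per_user_day=500_000):
        # illustrative defaults, not recommendations
        self.per_task = per_task
        self.per_user_day = per_user_day
        self.task_used = 0          # one TokenBudget per agent run
        self.user_day_used = {}     # user_id -> tokens used today

    def charge(self, user_id, tokens):
        """Record token usage; raise if any budget is exceeded."""
        self.task_used += tokens
        day = self.user_day_used.get(user_id, 0) + tokens
        self.user_day_used[user_id] = day
        if self.task_used > self.per_task:
            raise TokenBudgetExceeded(f"task budget {self.per_task} exceeded")
        if day > self.per_user_day:
            raise TokenBudgetExceeded(f"daily budget for {user_id} exceeded")

# inside the agent loop, after each LLM response:
#   budget.charge(user_id, response_token_count)
# and let TokenBudgetExceeded abort the run
```

Raising an exception (rather than just logging) guarantees a looping agent cannot keep spending after the budget is gone.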
Configure proactive alerts for error-rate spikes, latency regressions, cost anomalies, and guardrail triggers.
Use a combination of statistical anomaly detection and hard threshold rules. Escalate to human review for flagged traces.
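A minimal sketch of that combination – a hard ceiling plus a z-score anomaly check over a recent window. The threshold values are illustrative:

```python
import statistics

def should_alert(history, value, hard_max, z_thresh=3.0):
    """Alert if value breaks a hard ceiling, or deviates more than
    z_thresh standard deviations from recent history."""
    if value > hard_max:          # hard threshold rule
        return True
    if len(history) < 2:          # not enough data for statistics
        return False
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:                  # flat history: any change is anomalous
        return value != mean
    return abs(value - mean) / std > z_thresh
```

A metric that trips `should_alert` would page on the hard-threshold branch and route to human review on the anomaly branch, matching the escalation policy above.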
| Platform | Type | Key Strengths | Best For |
|---|---|---|---|
| LangSmith | Commercial | End-to-end tracing, evaluations, datasets | LangChain/LangGraph ecosystems |
| Arize Phoenix | Open-source | Trace visualization, LLM eval frameworks | Teams wanting self-hosted observability |
| LangFuse | Open-source | Cost tracking, prompt management, alerting | Budget-conscious production monitoring |
| OpenLLMetry | Open-source | OpenTelemetry for LLMs, distributed traces | Teams already using OpenTelemetry |
| Helicone | Commercial | Real-time cost monitoring, provider-agnostic | Cost-focused monitoring |
For LangChain-based agents, add tracing with minimal code:
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"

# All LangChain/LangGraph calls are now automatically traced
```
For custom agents, wrap each step with the `@traceable` decorator:

```python
from langsmith import traceable

@traceable
def my_agent_step(input_text):
    # LLM call, tool execution, etc.
    result = ...  # your agent logic here
    return result
```
With Langfuse, create traces and spans explicitly:

```python
from langfuse import Langfuse

langfuse = Langfuse(public_key="pk-...", secret_key="sk-...")

trace = langfuse.trace(name="agent-task")
span = trace.span(name="llm-call", input={"prompt": "..."})
# ... execute LLM call ...
span.end(output={"response": "...", "tokens": 150})
```
Build dashboards that show: