====== How to Monitor Agents ======

Monitoring AI agents in production requires observability into non-deterministic, multi-step workflows. Unlike traditional software, where inputs map predictably to outputs, agents make autonomous decisions, call tools, and chain reasoning steps -- any of which can fail silently. This guide covers the observability stack, key metrics, tools, and implementation patterns.

===== Observability Fundamentals =====

Agent observability extends traditional APM (Application Performance Monitoring) with LLM-specific concepts:

=== Traces ===

A trace captures the entire lifecycle of an agent task -- from the initial user message through all reasoning steps, tool calls, and the final response. Each trace has a unique ID and contains multiple spans.

=== Spans ===

Spans represent individual operations within a trace:

  * LLM inference calls (prompt in, completion out, tokens used)
  * Tool/function executions (inputs, outputs, latency)
  * Retrieval operations (query, results, relevance scores)
  * Decision points (which branch the agent took and why)

=== Metrics ===

Aggregated quantitative data computed from spans: p50/p95 latency, error rates, token throughput, and cost per request.
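To make the trace/span/metric hierarchy concrete, here is a minimal sketch in plain Python, with one metric (p95 span latency) aggregated from raw spans. The data model and field names are illustrative assumptions, not the schema of any particular tracing library:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One operation inside a trace (illustrative fields only)."""
    name: str            # e.g. "llm-call", "tool:search"
    latency_ms: float
    tokens: int = 0
    error: bool = False

@dataclass
class Trace:
    """One agent task: a unique ID plus its spans."""
    spans: list = field(default_factory=list)
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def p95_latency(traces):
    """Aggregate a metric from spans: nearest-rank 95th percentile latency."""
    latencies = sorted(s.latency_ms for t in traces for s in t.spans)
    idx = max(0, int(0.95 * len(latencies)) - 1)
    return latencies[idx]
```

In a real deployment these records would be emitted by instrumentation and aggregated by an observability backend rather than computed in-process.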
((Source: [[https://noveum.ai/en/blog/how-to-monitor-ai-agents-in-production|Noveum - Monitor AI Agents in Production]]))

===== Key Metrics =====

Track metrics across four categories:

^ Category ^ Metrics ^ Why It Matters ^
| Correctness | Faithfulness score, hallucination rate, answer relevancy, role adherence | Detects when the agent gives wrong or fabricated answers |
| Efficiency | p95 latency, steps to completion, token efficiency, tool call count | Identifies bottlenecks and runaway loops |
| Safety | Toxicity rate, PII leak rate, prompt injection attempts, guardrail trigger rate | Catches harmful outputs before they reach users |
| Business | Task completion rate, cost per session, user satisfaction score | Connects agent performance to business outcomes |

((Source: [[https://noveum.ai/en/blog/how-to-monitor-ai-agents-in-production|Noveum - Monitor AI Agents in Production]]))

===== Distributed Tracing =====

Agents are inherently distributed: they call external LLM APIs, tool endpoints, databases, and sometimes other agents. Use OpenTelemetry-based instrumentation to capture this:

  * Instrument every LLM call with input prompt, output text, model name, token counts, and latency
  * Instrument tool executions with function name, parameters, result, and duration
  * Propagate trace context across async boundaries and external service calls
  * Visualize traces as timelines to identify which step is the bottleneck

Start instrumenting from day one -- retrofitting tracing into a production agent is significantly harder. ((Source: [[https://www.getmaxim.ai/articles/top-5-ai-evaluation-tools-in-2025-comprehensive-comparison-for-production-ready-llm-and-agentic-systems-2/|Maxim - AI Evaluation Tools]]))

===== Cost Tracking =====

LLM costs can spike unexpectedly.
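The instrumentation pattern can be sketched without any dependencies -- in production you would use the OpenTelemetry SDK instead, and the decorator, span fields, and in-memory `SPANS` sink here are all illustrative. `contextvars` is used because it propagates trace context through nested calls and `asyncio` tasks automatically:

```python
import contextvars
import functools
import time
import uuid

# Trace context propagated across nested (and async) call boundaries.
_current_trace = contextvars.ContextVar("trace_id", default=None)
SPANS = []  # in production, export to a collector instead of a list

def instrument(span_name):
    """Record a span (trace ID, name, duration) for each decorated call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Join the caller's trace if one exists, else start a new one.
            trace_id = _current_trace.get() or str(uuid.uuid4())
            token = _current_trace.set(trace_id)
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                SPANS.append({
                    "trace_id": trace_id,
                    "name": span_name,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                })
                _current_trace.reset(token)
        return wrapper
    return decorator

@instrument("tool:lookup")
def lookup(query):
    return f"results for {query}"

@instrument("agent-task")
def run_agent(message):
    return lookup(message)

run_agent("weather in Paris")
# Both spans now share one trace_id, so a timeline view can group them.
```

Because the tool span and the agent span carry the same trace ID, a backend can reassemble them into a single timeline and show which step dominates latency.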
Track:

  * **Token usage per session** -- input and output tokens separately (pricing differs)
  * **Cost per task** -- total spending attributed to each completed task
  * **Cost by model** -- break down spending across different models if routing is used
  * **Cost anomalies** -- alert when daily cost exceeds 2x the rolling average
  * **Token efficiency** -- tokens consumed relative to task complexity (are simple tasks burning too many tokens?)

Set hard budget limits per user, per task, and per day. Kill agent runs that exceed their token budgets.

((Source: [[https://noveum.ai/en/blog/how-to-monitor-ai-agents-in-production|Noveum - Monitor AI Agents in Production]]))

===== Alerting =====

Configure proactive alerts for:

  * **Accuracy drops** -- faithfulness score falls below a threshold (e.g., <7/10)
  * **Latency spikes** -- p95 response time exceeds the SLA (e.g., >3 seconds)
  * **Error rate increases** -- tool failures or LLM errors exceed the baseline
  * **Cost overruns** -- daily spend exceeds budget by a defined margin
  * **Safety triggers** -- toxicity, PII leaks, or injection attempts detected

Use a combination of statistical anomaly detection and hard threshold rules. Escalate flagged traces to human review.
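The anomaly rule (2x the rolling average) and the hard token budget can be sketched together; the `CostGuard` class, window size, and method names below are hypothetical, not taken from any specific tool:

```python
from collections import deque

class CostGuard:
    """Sketch: flag daily-cost anomalies against a rolling average and
    kill agent runs that exceed a hard token budget (illustrative only)."""

    def __init__(self, token_budget, window_days=7, anomaly_factor=2.0):
        self.token_budget = token_budget
        self.anomaly_factor = anomaly_factor
        self.daily_costs = deque(maxlen=window_days)  # rolling window

    def record_day(self, cost):
        """Return True if today's cost exceeds 2x the rolling average."""
        anomaly = (
            len(self.daily_costs) > 0
            and cost > self.anomaly_factor * (sum(self.daily_costs) / len(self.daily_costs))
        )
        self.daily_costs.append(cost)
        return anomaly

    def check_run(self, tokens_used):
        """Raise to kill an agent run that blew its token budget."""
        if tokens_used > self.token_budget:
            raise RuntimeError(
                f"token budget exceeded: {tokens_used} > {self.token_budget}"
            )

guard = CostGuard(token_budget=50_000)
for day_cost in [10.0, 12.0, 11.0]:
    guard.record_day(day_cost)
print(guard.record_day(40.0))  # prints True: 40 > 2x the ~11 average
```

In practice the daily figures would come from aggregated span data, and `check_run` would be invoked inside the agent loop after each LLM call.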
===== Tools and Platforms =====

^ Platform ^ Type ^ Key Strengths ^ Best For ^
| LangSmith | Commercial | End-to-end tracing, evaluations, datasets | LangChain/LangGraph ecosystems |
| Arize Phoenix | Open-source | Trace visualization, LLM eval frameworks | Teams wanting self-hosted observability |
| Langfuse | Open-source | Cost tracking, prompt management, alerting | Budget-conscious production monitoring |
| OpenLLMetry | Open-source | OpenTelemetry for LLMs, distributed traces | Teams already using OpenTelemetry |
| Helicone | Commercial | Real-time cost monitoring, provider-agnostic | Cost-focused monitoring |

=== LangSmith Integration ===

For LangChain-based agents, enable tracing with minimal code:

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
# All LangChain/LangGraph calls are now automatically traced
```

For custom agents, use the decorator:

```python
from langsmith import traceable

@traceable
def my_agent_step(input_text):
    result = ...  # LLM call, tool execution, etc.
    return result
```

=== Langfuse Integration ===

```python
from langfuse import Langfuse

langfuse = Langfuse(public_key="pk-...", secret_key="sk-...")
trace = langfuse.trace(name="agent-task")
span = trace.span(name="llm-call", input={"prompt": "..."})
# ... execute LLM call ...
span.end(output={"response": "...", "tokens": 150})
```

((Source: [[https://www.getmaxim.ai/articles/top-5-ai-evaluation-tools-in-2025-comprehensive-comparison-for-production-ready-llm-and-agentic-systems-2/|Maxim - AI Evaluation Tools]]))

===== Dashboards =====

Build dashboards that show:

  * **Overview** -- task completion rate, average latency, daily cost, error count
  * **Trace explorer** -- drill into individual traces to debug failures
  * **Cost trends** -- daily/weekly spending with forecasting
  * **Quality scores** -- faithfulness and relevancy scores over time
  * **Alerting history** -- triggered alerts and their resolution status

===== Best Practices =====

  * **Define success metrics before launch** -- latency SLA, accuracy threshold, cost budget
  * **Instrument from day one** -- do not wait until production to add observability
  * **Review traces regularly** -- sample 5-10% of traces weekly for manual quality review
  * **Version your evaluations** -- as the agent evolves, update evaluation criteria accordingly
  * **Use production data for improvement** -- curate high-quality traces as fine-tuning or few-shot data
  * **Canary deployments** -- roll out changes to a small user cohort first, monitor, then expand
  * **Integrate with existing SIEM** -- feed agent logs into security monitoring (Splunk, Datadog) for audit

===== See Also =====

  * [[how_to_implement_guardrails|How to Implement Guardrails]]
  * [[how_to_create_an_agent|How to Create an Agent]]
  * [[how_to_build_an_ai_assistant|How to Build an AI Assistant]]

===== References =====