====== Agent Observability ======

Agent observability is the practice of monitoring, tracing, and analyzing AI agent behavior in **production environments**. It encompasses distributed tracing of execution paths, real-time metrics (latency, token usage, costs, errors), and behavioral analysis to ensure agents operate reliably at scale. As of 2026, 89% of organizations deploying agents use observability tooling.

This page covers production monitoring. For development-time debugging, see [[agent_debugging]].

===== Overview =====

AI agents in production present unique observability challenges compared to traditional software:

  * **Non-deterministic outputs** - the same input can produce different execution paths and results
  * **Multi-step workflows** - a single user request may trigger dozens of tool calls, LLM invocations, and decision branches
  * **Cost unpredictability** - token usage varies dramatically with task complexity and agent reasoning depth
  * **Cascading failures** - an error in one tool call can propagate through the entire agent chain
  * **Behavioral drift** - agent behavior can shift subtly over time as underlying models update

===== Core Observability Pillars =====

=== Distributed Tracing ===

Captures the full execution path from user input to final response, including every tool call, LLM invocation, decision point, and nested span in multi-agent workflows. Trace trees allow engineers to inspect inputs, outputs, timing, and costs at each step.

=== Latency Monitoring ===

Tracks response times, time-to-first-token, per-step duration, and workflow bottlenecks. Real-time dashboards and alerts flag regressions before they impact users.

=== Cost Tracking ===

Monitors token usage, model costs per request, and efficiency across workflows. Optimization features include prompt caching, multi-provider routing, and cost-per-outcome analysis.
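Per-request cost tracking typically reduces to multiplying input and output token counts by the provider's per-token price. A minimal sketch of the idea; the price table and the ''estimate_cost'' helper are illustrative assumptions, not any platform's API or real pricing:

<code python>
# Illustrative per-1M-token prices (assumed values, not a real price sheet).
PRICES_PER_MTOK = {
    "model-large": {"input": 3.00, "output": 15.00},
    "model-small": {"input": 0.25, "output": 1.25},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: 12k prompt tokens + 800 completion tokens on the large model.
cost = estimate_cost("model-large", 12_000, 800)
print(f"${cost:.4f}")  # (12000*3 + 800*15) / 1M = $0.0480
</code>

Attaching a value like this to the root span (as ''agent.cost_usd'' below) is what lets dashboards aggregate cost per workflow, per user, or per outcome.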
=== Behavioral Analysis ===

Validates tool-usage patterns, step sequences, loops, and drift using trajectory monitors, cluster analysis, and LLM-based evaluations. Detects when agents deviate from expected behavior patterns.

=== Quality Evaluation ===

Pre- and post-deployment checks against golden datasets, anomaly detection, safety blocks, and continuous scoring of production data ensure output quality remains consistent.

===== OpenTelemetry for Agents =====

**OpenTelemetry** provides vendor-agnostic, standards-based tracing that serves as the foundation for agent observability. It enables framework-independent instrumentation in hybrid setups where agents span multiple services and providers.

<code python>
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize OpenTelemetry for agent tracing
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.orchestrator")

class ObservableAgent:
    def __init__(self, model, tools, max_steps=10):
        self.model = model
        self.tools = tools
        self.max_steps = max_steps

    async def run(self, user_input):
        with tracer.start_as_current_span("agent.run") as root_span:
            root_span.set_attribute("agent.input", user_input)
            root_span.set_attribute("agent.model", self.model.name)

            messages = [{"role": "user", "content": user_input}]
            total_tokens = 0

            for step in range(self.max_steps):
                with tracer.start_as_current_span(f"agent.step.{step}"):
                    # Track the LLM call in its own nested span
                    with tracer.start_as_current_span("llm.generate") as llm_span:
                        response = await self.model.generate(messages)
                        llm_span.set_attribute("llm.tokens.input", response.input_tokens)
                        llm_span.set_attribute("llm.tokens.output", response.output_tokens)
                        total_tokens += response.total_tokens

                    if response.tool_calls:
                        # One span per tool invocation
                        for tc in response.tool_calls:
                            with tracer.start_as_current_span(f"tool.{tc.name}") as tool_span:
                                tool_span.set_attribute("tool.name", tc.name)
                                tool_span.set_attribute("tool.args", str(tc.args))
                                result = await self.tools.execute(tc)
                                tool_span.set_attribute("tool.success", result.success)
                    else:
                        break

            root_span.set_attribute("agent.total_tokens", total_tokens)
            root_span.set_attribute("agent.steps", step + 1)
            root_span.set_attribute("agent.cost_usd", self._estimate_cost(total_tokens))
</code>

===== Key Platforms =====

^ Platform ^ Key Strengths ^ Overhead ^
| **LangSmith** | Comprehensive tracing, latency/token/cost breakdowns, evaluations | ~0% |
| **Arize Phoenix** | OpenTelemetry-native, drift detection, cluster analysis | Low |
| **Langfuse** | Trace dashboards, environment filtering, cost management | 12-15% |
| **Braintrust** | Nested multi-agent traces, auto-test conversion, scorers | Low |
| **Monte Carlo** | Trajectory monitors, behavioral regression detection | Varies |
| **Galileo** | Cost/latency/quality tracking, safety checks, tool graphs | Low |
| **AgentOps** | Session replays, multi-agent tracing | Moderate |
| **Helicone** | Proxy-based cost optimization, multi-provider routing | Minimal |

===== Production Best Practices =====

  * **Instrument at the span level** - create spans for every LLM call, tool invocation, and decision point
  * **Track cost per outcome** - measure not just total cost but cost efficiency relative to task success
  * **Set latency budgets** - define acceptable response times and alert on breaches
  * **Monitor behavioral consistency** - detect when agent tool-use patterns shift unexpectedly
  * **Evaluate continuously** - score production outputs against golden datasets in real time
  * **Alert on anomalies** - cost spikes, error-rate increases, or detected loops should trigger immediate notification
  * **Retain traces for debugging** - store full traces for post-incident analysis and improvement

===== Alerting Strategies =====

Production agent monitoring should include alerts for:

  * **Cost overruns** - per-request or per-hour cost exceeds thresholds
  * **Error cascades** - tool failures exceeding baseline rates
  * **Latency degradation** - response-time percentiles (p50, p95, p99) increasing
  * **Loop detection** - agent executing repetitive actions without progress
  * **Quality drops** - automated evaluation scores declining below thresholds

===== References =====

  * [[https://www.getmaxim.ai/articles/top-5-ai-agent-observability-platforms-in-2026/|Maxim - Top AI Agent Observability Platforms 2026]]
  * [[https://www.braintrust.dev/articles/best-ai-agent-observability-tools-2026|Braintrust - Best Agent Observability Tools 2026]]
  * [[https://www.montecarlodata.com/blog-agent-observability-announcement-features/|Monte Carlo - Agent Observability]]
  * [[https://aimultiple.com/agentic-monitoring|AIMultiple - Agentic Monitoring Comparison]]
  * [[https://langwatch.ai/blog/4-best-tools-for-monitoring-llm-agentapplications-in-2026|LangWatch - Monitoring LLM Agent Applications]]

===== See Also =====

  * [[agent_debugging]] - Development-time debugging of agent systems
  * [[agent_identity_and_authentication]] - How agents prove identity in production
  * [[nist_ai_agent_standards]] - Standards for secure agent deployment
  * [[gaia_benchmark]] - Benchmark measuring agent task-completion reliability