====== Agent Debugging ======

Agent debugging and observability encompasses the tools, patterns, and practices for understanding, monitoring, and troubleshooting AI agent behavior in development and production. Unlike traditional software, agent systems are non-deterministic, execute multi-step reasoning chains, and invoke external tools dynamically, so they require specialized infrastructure to trace decisions, measure quality, and detect failures.

===== Why Agent Debugging Differs =====

Traditional monitoring is inadequate for AI agents because:

  * **Multi-step complexity:** A single user request may trigger 15+ LLM calls across multiple chains, tools, and models
  * **Non-determinism:** The same input can produce different outputs, making failures hard to reproduce
  * **Quality beyond uptime:** Success requires measuring accuracy, hallucination rate, task completion, and alignment, not just availability
  * **Dynamic tool use:** Agents select and invoke tools at runtime, so you must trace which tools were called, what they returned, and how their outputs shaped subsequent decisions

===== Core Observability Capabilities =====

==== Distributed Tracing ====

Captures the complete execution path from user input through tool invocations to final response. Each step records inputs, outputs, latency, token usage, and cost.

==== Agent Decision Graphs ====

Visualize the agent's internal state machine, showing step by step how agents, tools, and components interact. This makes debugging far faster than reading raw logs.

==== Automated Evaluation ====

In-production quality assessment using custom rules, deterministic evaluators, statistical checks, and LLM-as-a-judge approaches for continuous quality monitoring.
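As a concrete illustration of a deterministic evaluator, the sketch below flags answer sentences that share no content words with the retrieved context, a rough lexical proxy for groundedness. ''groundedness_check'' and its heuristic are hypothetical, not a standard API; production platforms use far richer semantic checks.

<code python>
def groundedness_check(answer: str, context_chunks: list[str]) -> dict:
    """Flag answer sentences whose content words never appear in the context."""
    context_text = " ".join(context_chunks).lower()
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    unsupported = []
    for sentence in sentences:
        # Only consider words longer than 4 chars as "content words".
        words = [w for w in sentence.lower().split() if len(w) > 4]
        if words and not any(w in context_text for w in words):
            unsupported.append(sentence)
    return {
        "total_sentences": len(sentences),
        "unsupported": unsupported,
        "pass": len(unsupported) == 0,
    }

result = groundedness_check(
    "The API limit is 500 requests. Pricing starts at nine dollars.",
    ["The API rate limit is 500 requests per minute."],
)
# The pricing claim is not supported by the context, so the check fails.
</code>

A check like this can run on every production response and feed the hallucination-rate metric discussed below, with borderline cases escalated to an LLM-as-a-judge evaluator.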
===== Major Platforms =====

^ Platform ^ Key Strength ^ Integration ^ Pricing ^
| [[https://smith.langchain.com|LangSmith]] | Execution timeline, custom evaluators | Native LangChain, OpenTelemetry | Free tier + paid |
| [[https://phoenix.arize.com|Arize Phoenix]] | Advanced analytics, drift detection, cluster analysis | OpenTelemetry, any framework | Open-source + enterprise |
| [[https://wandb.ai|Weights & Biases]] | Experiment tracking, model monitoring | Framework-agnostic | Free tier + paid |
| [[https://www.braintrust.dev|Braintrust]] | Prompt playground, regression testing | Any LLM provider | Free tier + paid |
| [[https://www.traceloop.com|OpenLLMetry]] | OpenTelemetry-native LLM tracing | Vendor-agnostic, open-source | Open-source |

===== Tracing Example =====

<code python>
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

# Set up tracing for agent observability; swap ConsoleSpanExporter for
# an OTLP exporter to ship spans to your observability backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.tracer")

def traced_agent_step(step_name):
    """Decorator factory: wrap an agent step in a tracing span."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(step_name) as span:
                span.set_attribute("agent.step", step_name)
                span.set_attribute("agent.input", str(args))
                try:
                    result = func(*args, **kwargs)
                    span.set_attribute("agent.output", str(result)[:1000])
                    span.set_attribute("agent.status", "success")
                    return result
                except Exception as e:
                    span.set_attribute("agent.status", "error")
                    span.set_attribute("agent.error", str(e))
                    raise
        return wrapper
    return decorator

# vector_store and llm are assumed to be configured elsewhere.
@traced_agent_step("retrieve_context")
def retrieve(query):
    return vector_store.similarity_search(query, k=5)

@traced_agent_step("generate_response")
def generate(context, query):
    return llm.invoke(f"Context: {context}\nQuery: {query}")

@traced_agent_step("agent_loop")
def agent(query):
    context = retrieve(query)
    return generate(context, query)
</code>

===== Logging Patterns =====

Effective agent observability follows these patterns:

  * **Structured logging:** JSON logs with consistent fields: ''step'', ''tool'', ''input'', ''output'', ''latency_ms'', ''token_count'', ''cost''
  * **Trace correlation:** Link all steps in a single agent run with a shared trace ID
  * **Production-to-test pipeline:** Promote production failures to versioned test datasets for regression prevention
  * **Cost accounting:** Per-step token and cost tracking for optimization and margin management
  * **Natural language search:** Query traces with descriptions like "agent hallucinated tool arguments" instead of complex SQL

===== Key Metrics =====

^ Metric ^ What It Measures ^ Target ^
| End-to-end latency | Total time from request to response | <2s for interactive agents |
| Token usage per request | Total tokens consumed across all LLM calls | Minimize for cost efficiency |
| Task completion rate | Percentage of tasks successfully completed | >90% for production |
| Hallucination rate | Responses containing unsupported claims | <5% |
| Tool call success rate | Percentage of tool invocations that succeed | >95% |
| Cost per request | Total API costs for a single agent interaction | Track trend over time |

===== OpenTelemetry for LLMs =====

OpenTelemetry has become the standard for vendor-agnostic agent observability. Projects like [[https://github.com/traceloop/openllmetry|OpenLLMetry]] extend OpenTelemetry with LLM-specific semantic conventions, enabling teams to switch platforms without re-instrumenting code.
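The structured-logging and trace-correlation patterns described above can be sketched with the standard library alone. The field names mirror the list above; ''log_step'' is a hypothetical helper, not a platform API, and in practice each record would be shipped to a log aggregator rather than just emitted locally.

<code python>
import json
import logging
import time
import uuid

logger = logging.getLogger("agent")

def log_step(trace_id, step, tool, latency_ms, token_count, cost):
    """Emit one JSON log record; all steps of a run share trace_id."""
    record = {
        "trace_id": trace_id,
        "step": step,
        "tool": tool,
        "latency_ms": round(latency_ms, 1),
        "token_count": token_count,
        "cost": cost,
        "ts": time.time(),
    }
    logger.info(json.dumps(record))
    return record

# One trace ID generated per agent run and threaded through every step,
# so a log query on trace_id reconstructs the full execution path.
trace_id = str(uuid.uuid4())
r1 = log_step(trace_id, "retrieve_context", "vector_search", 42.0, 0, 0.0)
r2 = log_step(trace_id, "generate_response", "llm", 850.0, 1200, 0.0024)
</code>

Because every record is valid JSON with a fixed schema, per-step cost accounting reduces to summing ''cost'' grouped by ''trace_id'' in whatever log backend you use.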
===== References =====

  * [[https://arize.com/blog/best-ai-observability-tools-for-autonomous-agents-in-2026/|Arize - Best AI Observability Tools 2026]]
  * [[https://www.getmaxim.ai/articles/top-5-ai-agent-observability-platforms-in-2026/|Maxim - Top AI Agent Observability Platforms]]
  * [[https://breyta.ai/blog/best-ai-agent-observability-tools|Breyta - Best AI Agent Observability Tools]]

===== See Also =====

  * [[agent_orchestration]] – Tracing multi-agent orchestration flows
  * [[agent_safety]] – Detecting safety issues through observability
  * [[agent_frameworks]] – Framework-level debugging support
  * [[prompt_engineering]] – Iterating on prompts with observability data