====== Agent Debugging ======

Agent debugging and observability encompasses the tools, patterns, and practices for understanding, monitoring, and troubleshooting AI agent behavior in development and production. Unlike traditional software, agent systems are non-deterministic, execute multi-step reasoning chains, and invoke external tools dynamically, so they require specialized infrastructure to trace decisions, measure quality, and detect failures.

===== Why Agent Debugging Differs =====

Traditional monitoring is inadequate for AI agents because:

  * **Multi-step complexity:** A single user request may trigger 15+ LLM calls across multiple chains, tools, and models
  * **Non-determinism:** The same input can produce different outputs, making failures hard to reproduce
  * **Quality beyond uptime:** Success requires measuring accuracy, hallucination rate, task completion, and alignment, not just availability
  * **Dynamic tool use:** Agents select and invoke tools at runtime, so you must trace which tools were called, what they returned, and how their outputs shaped subsequent decisions

===== Core Observability Capabilities =====

==== Distributed Tracing ====

Captures the complete execution path from user input through tool invocations to final response. Each step records inputs, outputs, latency, token usage, and cost.

==== Agent Decision Graphs ====

Visualize the agent's internal state machine, showing step by step how agents, tools, and components interact. This makes debugging far faster than reading raw logs.

==== Automated Evaluation ====

In-production quality assessment using custom rules, deterministic evaluators, statistical checks, and LLM-as-a-judge approaches for continuous quality monitoring.
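As a concrete illustration of a deterministic evaluator, the sketch below flags answer sentences that share no content words with the retrieved context, a rough lexical proxy for groundedness. ''groundedness_check'' and its heuristic are hypothetical, not a standard API; production platforms use far richer semantic checks.

<code python>
def groundedness_check(answer: str, context_chunks: list[str]) -> dict:
    """Flag answer sentences whose content words never appear in the context."""
    context_text = " ".join(context_chunks).lower()
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    unsupported = []
    for sentence in sentences:
        # Only consider words longer than 4 chars as "content words".
        words = [w for w in sentence.lower().split() if len(w) > 4]
        if words and not any(w in context_text for w in words):
            unsupported.append(sentence)
    return {
        "total_sentences": len(sentences),
        "unsupported": unsupported,
        "pass": len(unsupported) == 0,
    }

result = groundedness_check(
    "The API limit is 500 requests. Pricing starts at nine dollars.",
    ["The API rate limit is 500 requests per minute."],
)
# The pricing claim is not supported by the context, so the check fails.
</code>

A check like this can run on every production response and feed the hallucination-rate metric discussed below, with borderline cases escalated to an LLM-as-a-judge evaluator.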
===== Major Platforms =====

^ Platform ^ Key Strength ^ Integration ^ Pricing ^
| [[https://smith.langchain.com|LangSmith]] | Execution timeline, custom evaluators | Native LangChain, OpenTelemetry | Free tier + paid |
| [[https://phoenix.arize.com|Arize Phoenix]] | Advanced analytics, drift detection, cluster analysis | OpenTelemetry, any framework | Open-source + enterprise |
| [[https://wandb.ai|Weights & Biases]] | Experiment tracking, model monitoring | Framework-agnostic | Free tier + paid |
| [[https://www.braintrust.dev|Braintrust]] | Prompt playground, regression testing | Any LLM provider | Free tier + paid |
| [[https://www.traceloop.com|OpenLLMetry]] | OpenTelemetry-native LLM tracing | Vendor-agnostic, open-source | Open-source |

===== Tracing Example =====

<code python>
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

# Set up tracing for agent observability; swap ConsoleSpanExporter for
# an OTLP exporter to ship spans to your observability backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.tracer")

def traced_agent_step(step_name):
    """Decorator factory: wrap an agent step in a tracing span."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(step_name) as span:
                span.set_attribute("agent.step", step_name)
                span.set_attribute("agent.input", str(args))
                try:
                    result = func(*args, **kwargs)
                    span.set_attribute("agent.output", str(result)[:1000])
                    span.set_attribute("agent.status", "success")
                    return result
                except Exception as e:
                    span.set_attribute("agent.status", "error")
                    span.set_attribute("agent.error", str(e))
                    raise
        return wrapper
    return decorator

# vector_store and llm are assumed to be configured elsewhere.
@traced_agent_step("retrieve_context")
def retrieve(query):
    return vector_store.similarity_search(query, k=5)

@traced_agent_step("generate_response")
def generate(context, query):
    return llm.invoke(f"Context: {context}\nQuery: {query}")

@traced_agent_step("agent_loop")
def agent(query):
    context = retrieve(query)
    return generate(context, query)
</code>

===== Logging Patterns =====

Effective agent observability follows these patterns:

  * **Structured logging:** JSON logs with consistent fields: ''step'', ''tool'', ''input'', ''output'', ''latency_ms'', ''token_count'', ''cost''
  * **Trace correlation:** Link all steps in a single agent run with a shared trace ID
  * **Production-to-test pipeline:** Promote production failures to versioned test datasets for regression prevention
  * **Cost accounting:** Per-step token and cost tracking for optimization and margin management
  * **Natural language search:** Query traces with descriptions like "agent hallucinated tool arguments" instead of complex SQL

===== Key Metrics =====

^ Metric ^ What It Measures ^ Target ^
| End-to-end latency | Total time from request to response | <2s for interactive agents |
| Token usage per request | Total tokens consumed across all LLM calls | Minimize for cost efficiency |
| Task completion rate | Percentage of tasks successfully completed | >90% for production |
| Hallucination rate | Responses containing unsupported claims | <5% |
| Tool call success rate | Percentage of tool invocations that succeed | >95% |
| Cost per request | Total API costs for a single agent interaction | Track trend over time |

===== OpenTelemetry for LLMs =====

OpenTelemetry has become the standard for vendor-agnostic agent observability. Projects like [[https://github.com/traceloop/openllmetry|OpenLLMetry]] extend OpenTelemetry with LLM-specific semantic conventions, enabling teams to switch platforms without re-instrumenting code.
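The structured-logging and trace-correlation patterns described above can be sketched with the standard library alone. The field names mirror the list above; ''log_step'' is a hypothetical helper, not a platform API, and in practice each record would be shipped to a log aggregator rather than just emitted locally.

<code python>
import json
import logging
import time
import uuid

logger = logging.getLogger("agent")

def log_step(trace_id, step, tool, latency_ms, token_count, cost):
    """Emit one JSON log record; all steps of a run share trace_id."""
    record = {
        "trace_id": trace_id,
        "step": step,
        "tool": tool,
        "latency_ms": round(latency_ms, 1),
        "token_count": token_count,
        "cost": cost,
        "ts": time.time(),
    }
    logger.info(json.dumps(record))
    return record

# One trace ID generated per agent run and threaded through every step,
# so a log query on trace_id reconstructs the full execution path.
trace_id = str(uuid.uuid4())
r1 = log_step(trace_id, "retrieve_context", "vector_search", 42.0, 0, 0.0)
r2 = log_step(trace_id, "generate_response", "llm", 850.0, 1200, 0.0024)
</code>

Because every record is valid JSON with a fixed schema, per-step cost accounting reduces to summing ''cost'' grouped by ''trace_id'' in whatever log backend you use.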
===== References =====

  * [[https://arize.com/blog/best-ai-observability-tools-for-autonomous-agents-in-2026/|Arize - Best AI Observability Tools 2026]]
  * [[https://www.getmaxim.ai/articles/top-5-ai-agent-observability-platforms-in-2026/|Maxim - Top AI Agent Observability Platforms]]
  * [[https://breyta.ai/blog/best-ai-agent-observability-tools|Breyta - Best AI Agent Observability Tools]]

===== See Also =====

  * [[agent_orchestration]] – Tracing multi-agent orchestration flows
  * [[agent_safety]] – Detecting safety issues through observability
  * [[agent_frameworks]] – Framework-level debugging support
  * [[prompt_engineering]] – Iterating on prompts with observability data