Agent Debugging

Agent debugging and observability encompasses the tools, patterns, and practices for understanding, monitoring, and troubleshooting AI agent behavior in development and production. Unlike traditional software debugging, agent systems are non-deterministic, execute multi-step reasoning chains, and use external tools dynamically — requiring specialized infrastructure to trace decisions, measure quality, and detect failures.

Why Agent Debugging Differs

Traditional monitoring is fundamentally inadequate for AI agents because:

- Outputs are non-deterministic: the same input can produce different reasoning paths and responses across runs.
- Failures are often semantic (hallucinations, wrong tool selection, irrelevant answers) rather than exceptions, so they never surface in error logs.
- A single request fans out into a multi-step chain of LLM calls and tool invocations, so latency, cost, and errors must be attributed per step.
- Quality is graded rather than binary, and must be measured with evaluators instead of simple pass/fail checks.

Core Observability Capabilities

Distributed Tracing

Captures complete execution paths from user input through tool invocations to final response. Each step records inputs, outputs, latency, token usage, and costs.

Agent Decision Graphs

Visualizes the agent's internal state machine, showing how agents, tools, and components interact step by step. Following a decision graph is typically far faster than reconstructing the same behavior from raw logs.
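As a minimal sketch of how such a graph can be derived from recorded steps, the snippet below converts a list of step records into a Graphviz DOT string for rendering. The step list and its fields (`name`, `calls`) are illustrative assumptions, not a real platform's trace format.

```python
# Turn recorded agent steps into a Graphviz DOT string for visualization.
# Each record names a step and the components it invoked.
steps = [
    {"name": "agent_loop", "calls": ["retrieve_context", "generate_response"]},
    {"name": "retrieve_context", "calls": ["vector_store"]},
    {"name": "generate_response", "calls": ["llm"]},
]

def to_dot(steps):
    """Emit one DOT edge per caller -> callee relationship."""
    lines = ["digraph agent {"]
    for step in steps:
        for callee in step["calls"]:
            lines.append(f'  "{step["name"]}" -> "{callee}";')
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(steps)
```

The resulting string can be rendered with any Graphviz tool (e.g. `dot -Tpng`) to inspect the agent's call structure.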

Automated Evaluation

In-production quality assessments using custom rules, deterministic evaluators, statistical checks, and LLM-as-a-judge approaches for continuous quality monitoring.
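A deterministic evaluator can be sketched as a set of named predicates run against each agent response. The rule names and checks below are illustrative assumptions; real systems would add statistical checks and LLM-as-a-judge scoring on top.

```python
# Minimal deterministic evaluator: each rule is a named predicate over
# the agent's output; results aggregate into a pass/fail report.
def evaluate(response: str, rules: dict) -> dict:
    """Run each rule (name -> predicate) against the response."""
    return {name: check(response) for name, check in rules.items()}

rules = {
    "non_empty": lambda r: len(r.strip()) > 0,
    "under_length_limit": lambda r: len(r) <= 2000,
    "no_raw_stack_trace": lambda r: "Traceback (most recent call last)" not in r,
}

report = evaluate("The capital of France is Paris.", rules)
passed = all(report.values())
```

Because these rules are cheap and deterministic, they can run on every production request, with sampled responses escalated to more expensive LLM-as-a-judge evaluation.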

Major Platforms

| Platform | Key Strength | Integration | Pricing |
|---|---|---|---|
| LangSmith | Execution timeline, custom evaluators | Native LangChain, OpenTelemetry | Free tier + paid |
| Arize Phoenix | Advanced analytics, drift detection, cluster analysis | OpenTelemetry, any framework | Open-source + enterprise |
| Weights & Biases | Experiment tracking, model monitoring | Framework-agnostic | Free tier + paid |
| Braintrust | Prompt playground, regression testing | Any LLM provider | Free tier + paid |
| OpenLLMetry | OpenTelemetry-native LLM tracing | Vendor-agnostic, open-source | Open-source |

Tracing Example

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Set up tracing for agent observability; ConsoleSpanExporter prints spans
# to stdout (swap in an OTLP exporter to ship spans to a real backend)
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.tracer")
 
def traced_agent_step(step_name, func):
    """Wrap each agent step with tracing."""
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(step_name) as span:
            span.set_attribute("agent.step", step_name)
            span.set_attribute("agent.input", str(args))
            try:
                result = func(*args, **kwargs)
                span.set_attribute("agent.output", str(result)[:1000])
                span.set_attribute("agent.status", "success")
                return result
            except Exception as e:
                span.set_attribute("agent.status", "error")
                span.set_attribute("agent.error", str(e))
                raise
    return wrapper
 
@traced_agent_step("retrieve_context")
def retrieve(query):
    return vector_store.similarity_search(query, k=5)
 
@traced_agent_step("generate_response")
def generate(context, query):
    return llm.invoke(f"Context: {context}\nQuery: {query}")
 
@traced_agent_step("agent_loop")
def agent(query):
    context = retrieve(query)
    return generate(context, query)

Logging Patterns

Effective agent observability follows these patterns:

- Structured logging: emit machine-parseable (e.g. JSON) records rather than free-form strings.
- Correlation IDs: tag every log line and span with a per-request trace ID so multi-step runs can be reassembled.
- Payload truncation: cap logged prompts and responses to keep log volume bounded.
- Sensitive-data redaction: strip user PII and secrets before prompts and responses enter the logging pipeline.
- Sampling: trace every request in development, but sample in high-volume production.
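One core pattern, structured JSON logging tied to a correlation ID, can be sketched with only the standard library. The field names (`trace_id`, `step`, `status`) are illustrative conventions, not a fixed schema.

```python
import json
import time
import uuid

def log_event(trace_id, step, status, **fields):
    """Emit one structured log line tied to a trace/correlation ID."""
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id,
        "step": step,
        "status": status,
        **fields,  # arbitrary per-step fields, e.g. latency or doc counts
    }
    print(json.dumps(record))
    return record

# One trace ID per agent request lets a log aggregator reassemble the run.
trace_id = uuid.uuid4().hex
rec = log_event(trace_id, "retrieve_context", "success",
                latency_ms=42, doc_count=5)
```

Because every line is valid JSON and shares the request's `trace_id`, a log backend can filter, join, and aggregate agent steps without brittle string parsing.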

Key Metrics

| Metric | What It Measures | Target |
|---|---|---|
| End-to-end latency | Total time from request to response | <2s for interactive agents |
| Token usage per request | Total tokens consumed across all LLM calls | Minimize for cost efficiency |
| Task completion rate | Percentage of tasks successfully completed | >90% for production |
| Hallucination rate | Responses containing unsupported claims | <5% |
| Tool call success rate | Percentage of tool invocations that succeed | >95% |
| Cost per request | Total API costs for a single agent interaction | Track trend over time |
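Several of these metrics can be aggregated from per-call records attached to a request's trace. The record shape and the pricing constants below are illustrative assumptions, not any provider's actual rates.

```python
# Per-LLM-call records for one agent request (illustrative shape).
calls = [
    {"input_tokens": 900, "output_tokens": 150, "tool_ok": True},
    {"input_tokens": 1200, "output_tokens": 300, "tool_ok": True},
    {"input_tokens": 800, "output_tokens": 100, "tool_ok": False},
]

# Assumed prices, USD per 1K tokens -- substitute your provider's rates.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

def request_metrics(calls):
    """Aggregate token usage, cost, and tool success for one request."""
    total_in = sum(c["input_tokens"] for c in calls)
    total_out = sum(c["output_tokens"] for c in calls)
    cost = (total_in / 1000 * PRICE_PER_1K_INPUT
            + total_out / 1000 * PRICE_PER_1K_OUTPUT)
    tool_success = sum(c["tool_ok"] for c in calls) / len(calls)
    return {
        "tokens": total_in + total_out,
        "cost_usd": cost,
        "tool_success_rate": tool_success,
    }

metrics = request_metrics(calls)
```

Tracking these per-request aggregates over time surfaces cost regressions and tool-reliability drift before they become user-visible failures.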

OpenTelemetry for LLMs

OpenTelemetry has become the standard for vendor-agnostic agent observability. Projects like OpenLLMetry extend OpenTelemetry with LLM-specific semantic conventions, enabling teams to switch platforms without re-instrumenting code.
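The GenAI semantic conventions standardize attribute names such as `gen_ai.system`, `gen_ai.request.model`, and `gen_ai.usage.input_tokens`. As a sketch of what portable instrumentation records, the helper below builds such an attribute set; a plain dict stands in for real span attributes so the naming, not the SDK wiring, is the focus.

```python
# Build a span attribute set using OpenTelemetry GenAI semantic
# convention names, so any OTel-compatible backend can interpret it.
def llm_span_attributes(model, input_tokens, output_tokens):
    return {
        "gen_ai.system": "openai",  # provider name (example value)
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = llm_span_attributes("gpt-4o-mini", 812, 97)
```

In real instrumentation these keys would be passed to `span.set_attribute(...)`; because the names follow a shared convention, switching observability backends requires no re-instrumentation.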
