Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Agent debugging and observability encompasses the tools, patterns, and practices for understanding, monitoring, and troubleshooting AI agent behavior in development and production. Unlike traditional software debugging, agent systems are non-deterministic, execute multi-step reasoning chains, and use external tools dynamically — requiring specialized infrastructure to trace decisions, measure quality, and detect failures.
Traditional monitoring is fundamentally inadequate for AI agents because it cannot follow non-deterministic, multi-step execution. Agent observability platforms therefore add three core capabilities:

- **Tracing:** captures complete execution paths from user input through tool invocations to final response. Each step records inputs, outputs, latency, token usage, and costs.
- **Execution graph visualization:** shows the internal state machine of how agents, tools, and components interact step by step, making debugging far faster than reading raw logs.
- **Online evaluation:** in-production quality assessments using custom rules, deterministic evaluators, statistical checks, and LLM-as-a-judge approaches for continuous quality monitoring.
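Online evaluators need not call an LLM: a deterministic check can serve as a cheap first-pass quality gate. The sketch below uses a hypothetical token-overlap grounding heuristic; it is an illustration of a deterministic evaluator, not any platform's actual implementation:

```python
def grounding_score(response: str, context: str) -> float:
    """Fraction of response tokens that also appear in the retrieved context.
    A crude, deterministic proxy for 'is the answer supported by the context'."""
    resp_tokens = set(response.lower().split())
    ctx_tokens = set(context.lower().split())
    if not resp_tokens:
        return 0.0
    return len(resp_tokens & ctx_tokens) / len(resp_tokens)

def evaluate_step(response: str, context: str, threshold: float = 0.5) -> dict:
    """Deterministic evaluator: flag responses weakly supported by context."""
    score = grounding_score(response, context)
    return {"score": score, "flagged": score < threshold}
```

In production, such a check would run on every trace and emit its result as a span attribute or metric, with flagged responses routed to a more expensive LLM-as-a-judge pass or human review.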
| Platform | Key Strength | Integration | Pricing |
| --- | --- | --- | --- |
| LangSmith | Execution timeline, custom evaluators | Native LangChain, OpenTelemetry | Free tier + paid |
| Arize Phoenix | Advanced analytics, drift detection, cluster analysis | OpenTelemetry, any framework | Open-source + enterprise |
| Weights & Biases | Experiment tracking, model monitoring | Framework-agnostic | Free tier + paid |
| Braintrust | Prompt playground, regression testing | Any LLM provider | Free tier + paid |
| OpenLLMetry | OpenTelemetry-native LLM tracing | Vendor-agnostic, open-source | Open-source |
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Set up tracing for agent observability (console exporter shown;
# swap in an OTLP exporter to ship spans to an observability platform)
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.tracer")

def traced_agent_step(step_name):
    """Decorator factory: wrap each agent step in an OpenTelemetry span."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(step_name) as span:
                span.set_attribute("agent.step", step_name)
                span.set_attribute("agent.input", str(args))
                try:
                    result = func(*args, **kwargs)
                    span.set_attribute("agent.output", str(result)[:1000])
                    span.set_attribute("agent.status", "success")
                    return result
                except Exception as e:
                    span.set_attribute("agent.status", "error")
                    span.set_attribute("agent.error", str(e))
                    raise
        return wrapper
    return decorator

# vector_store and llm are assumed to be defined elsewhere
@traced_agent_step("retrieve_context")
def retrieve(query):
    return vector_store.similarity_search(query, k=5)

@traced_agent_step("generate_response")
def generate(context, query):
    return llm.invoke(f"Context: {context}\nQuery: {query}")

@traced_agent_step("agent_loop")
def agent(query):
    context = retrieve(query)
    return generate(context, query)
```
Effective agent observability follows these patterns: instrument every step with structured fields (`step`, `tool`, `input`, `output`, `latency_ms`, `token_count`, `cost`), then aggregate them into key metrics:

| Metric | What It Measures | Target |
| --- | --- | --- |
| End-to-end latency | Total time from request to response | <2s for interactive agents |
| Token usage per request | Total tokens consumed across all LLM calls | Minimize for cost efficiency |
| Task completion rate | Percentage of tasks successfully completed | >90% for production |
| Hallucination rate | Responses containing unsupported claims | <5% |
| Tool call success rate | Percentage of tool invocations that succeed | >95% |
| Cost per request | Total API costs for a single agent interaction | Track trend over time |
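Most of these metrics can be derived directly from per-step trace records. A minimal aggregation sketch, assuming a hypothetical record schema with `status`, `latency_ms`, `token_count`, and `cost_usd` fields:

```python
def summarize_traces(records: list[dict]) -> dict:
    """Aggregate per-step trace records for one request into key metrics.
    Assumes each record carries: status, latency_ms, token_count, cost_usd."""
    total = len(records)
    succeeded = sum(1 for r in records if r["status"] == "success")
    return {
        "tool_call_success_rate": succeeded / total if total else 0.0,
        "end_to_end_latency_ms": sum(r["latency_ms"] for r in records),
        "tokens_per_request": sum(r["token_count"] for r in records),
        "cost_per_request_usd": sum(r["cost_usd"] for r in records),
    }
```

Running this over every request and plotting the results over time is enough to track the cost and latency trends in the table above; quality metrics like hallucination rate still require an evaluator in the loop.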
OpenTelemetry has become the standard for vendor-agnostic agent observability. Projects like OpenLLMetry extend OpenTelemetry with LLM-specific semantic conventions, enabling teams to switch platforms without re-instrumenting code.