====== Tool Result Parsing ======
Reliably handling tool outputs in AI agents, including JSON parsing, error handling, type coercion, retry on malformed output, and structured extraction patterns.
===== Overview =====
A significant portion of AI agent failures in production stems not from flawed reasoning but from "tool argument rot" -- the generation of malformed JSON, missing fields, or incorrect data types when calling tools. OpenAI reported that enforcing strict JSON schema adherence raised schema compliance from under 40% to 100%. Some teams report up to 7x improvements in multi-step workflow accuracy after adopting schema validation.
The core principle: shift from best-effort parsing to deterministic validation. The agent either produces a perfectly formatted tool call or it fails fast, eliminating ambiguous "close enough" attempts that introduce instability.
===== Error Taxonomy =====
Tool output failures fall into distinct categories:
* **Malformed JSON** -- Truncated output, unquoted keys, trailing commas, single quotes instead of double quotes
* **Schema violations** -- Missing required fields, wrong types, extra properties
* **Partial responses** -- Valid JSON but incomplete data due to token limit truncation
* **LLM artifacts** -- Special tokens like ''<|call|>'' or ''<|endoftext|>'' appended to JSON
* **Encoding issues** -- Unicode characters (em dashes, emoji) corrupting JSON strings
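These categories can often be distinguished mechanically before any schema validation runs, which makes failure metrics and targeted repair possible. A minimal classifier sketch; the category names and artifact list are illustrative, not taken from any particular library:

```python
import json

# Special tokens that some models leak into output (illustrative list).
ARTIFACTS = ("<|call|>", "<|endoftext|>")

def classify_failure(raw: str) -> str:
    """Bucket a raw tool output into one of the failure categories above."""
    if any(tok in raw for tok in ARTIFACTS):
        return "llm_artifact"
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Truncation usually leaves the payload without its closing bracket
        if raw.rstrip() and not raw.rstrip().endswith(("}", "]")):
            return "truncated"
        return "malformed_json"
    if not isinstance(data, dict):
        return "schema_violation"
    return "parsed"
```

The heuristics are deliberately coarse (truncation detection by missing closing bracket, for instance) but are enough to drive the per-error-type metrics discussed later.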
===== Schema Validation =====
Define strict schemas for every tool call and validate responses against them. Use Pydantic (Python) or Zod (TypeScript) for runtime validation with type coercion.
```python
from pydantic import BaseModel, ValidationError, field_validator
from typing import Optional
import json
import re


class SearchResult(BaseModel):
    title: str
    url: str
    relevance_score: float
    snippet: Optional[str] = None

    @field_validator("relevance_score")
    @classmethod
    def score_in_range(cls, v):
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"relevance_score must be 0.0-1.0, got {v}")
        return v


class ToolOutput(BaseModel):
    tool_name: str
    success: bool
    results: list[SearchResult] = []
    error: Optional[str] = None


def parse_tool_output(raw: str) -> ToolOutput:
    """Parse and validate tool output with progressive fallback."""
    # Step 1: Strip LLM artifacts and Markdown fences
    cleaned = raw.strip()
    for artifact in ["<|call|>", "<|endoftext|>", "```json", "```"]:
        cleaned = cleaned.replace(artifact, "")
    # Step 2: Attempt a straight JSON parse
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        # Step 3: Fall back to repairing common malformations
        data = attempt_json_repair(cleaned)
    # Step 4: Validate against the schema
    return ToolOutput.model_validate(data)


def attempt_json_repair(raw: str) -> dict:
    """Best-effort fix for common JSON malformations."""
    text = raw.strip()
    # Single quotes -> double quotes (naive: can break apostrophes in values)
    text = text.replace("'", '"')
    # Drop trailing commas before closing brackets
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # Quote unquoted object keys
    text = re.sub(r"(\{|,)\s*(\w+)\s*:", r'\1 "\2":', text)
    return json.loads(text)
```
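The repair pass above can be exercised on a typical malformed payload. A sketch inlining the same three transformations; the input string is hypothetical:

```python
import json
import re

# A typical malformed payload: unquoted keys, single quotes, trailing comma.
raw = "{tool_name: 'search', success: true, results: [],}"

text = raw.replace("'", '"')                             # single -> double quotes
text = re.sub(r",\s*([}\]])", r"\1", text)               # drop trailing commas
text = re.sub(r"(\{|,)\s*(\w+)\s*:", r'\1 "\2":', text)  # quote bare keys
data = json.loads(text)
```

Each transformation is a narrow regex, so the pass is best-effort: inputs with apostrophes inside string values, for example, will still fail and should fall through to the self-correction layer described next.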
===== Self-Recovering Structured Output =====
When parsing fails, feed the error back to the LLM so it can self-correct. This is more effective than blind retries because the model receives specific feedback about what went wrong.
```python
from dataclasses import dataclass


@dataclass
class ParseAttempt:
    success: bool
    result: Optional[ToolOutput] = None
    error: Optional[str] = None


async def parse_with_self_correction(
    llm_client,
    messages: list[dict],
    tool_schema: dict,
    max_retries: int = 3,
) -> ToolOutput:
    """Parse tool output with LLM self-correction on failure."""
    for attempt in range(max_retries):
        response = await llm_client.chat(messages, tools=[tool_schema])
        if not response.tool_calls:
            # Model gave a text response instead of a tool call
            messages.append({"role": "assistant", "content": response.text})
            messages.append({
                "role": "user",
                "content": "Please use the tool to provide a structured response.",
            })
            continue
        for tool_call in response.tool_calls:
            try:
                return parse_tool_output(json.dumps(tool_call.arguments))
            except (json.JSONDecodeError, ValidationError) as e:
                # Feed the specific error back so the model can self-correct
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": f"Parse error: {e}. Fix the JSON and retry.",
                })
    raise RuntimeError(f"Failed to parse tool output after {max_retries} attempts")
```
===== Structured Extraction Patterns =====
Use XML tags or JSON schemas in prompts to guide LLMs toward parseable outputs.
Best practices:
* Define output schemas explicitly and reference them in prompts
* Use delimiters (''---BEGIN OUTPUT---'') to clearly mark structured sections
* Specify constraints: "Max 100 results, timeout 5s"
* Include example outputs in the prompt for complex schemas
* Prefer native tool calling over free-form JSON generation
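The practices above can be combined into a single prompt scaffold. A sketch, where the schema, delimiter strings, and `extract_delimited` helper are illustrative assumptions rather than a required convention:

```python
import json

# Explicit output schema, referenced verbatim in the prompt.
schema = {
    "type": "object",
    "required": ["tool_name", "success", "results"],
    "properties": {
        "tool_name": {"type": "string"},
        "success": {"type": "boolean"},
        "results": {"type": "array", "maxItems": 100},
    },
}

prompt = (
    "Respond with JSON matching this schema:\n"
    f"{json.dumps(schema, indent=2)}\n"
    "Constraints: max 100 results, timeout 5s.\n"
    "Wrap the JSON between ---BEGIN OUTPUT--- and ---END OUTPUT---.\n"
    "Example:\n---BEGIN OUTPUT---\n"
    '{"tool_name": "search", "success": true, "results": []}\n'
    "---END OUTPUT---"
)

def extract_delimited(text: str) -> dict:
    """Pull the JSON payload from between the delimiters."""
    start = text.index("---BEGIN OUTPUT---") + len("---BEGIN OUTPUT---")
    end = text.index("---END OUTPUT---")
    return json.loads(text[start:end])
```

Delimiters make extraction trivial even when the model surrounds the payload with commentary; native tool calling remains preferable when the provider supports it.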
===== Multi-Layer Defense =====
Production tool parsing should implement three layers:
- **Auto-repair** -- Fix common JSON issues (trailing tokens, unquoted keys, single quotes, trailing commas)
- **Error pipeline** -- When repair fails, send a descriptive error back to the model as a tool result so it can self-correct
- **Custom repair hook** -- Optional application-specific repair logic for known edge cases
This approach is used by the Mastra framework, which added JSON repair for malformed tool call arguments with these three layers.
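The three layers can be sketched as one parsing function. This is an illustration of the layering only, not Mastra's actual API; `repair_json`, `parse_arguments`, and the `custom_repair` hook are hypothetical names:

```python
import json
import re
from typing import Callable, Optional

def repair_json(text: str) -> str:
    """Layer 1: auto-repair common malformations."""
    text = re.sub(r",\s*([}\]])", r"\1", text)               # trailing commas
    text = re.sub(r"(\{|,)\s*(\w+)\s*:", r'\1 "\2":', text)  # unquoted keys
    return text

def parse_arguments(
    raw: str,
    custom_repair: Optional[Callable[[str], str]] = None,
) -> dict:
    # Layer 1: try the raw payload, then the auto-repaired payload
    for candidate in (raw, repair_json(raw)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    # Layer 3: optional application-specific repair hook
    if custom_repair is not None:
        try:
            return json.loads(custom_repair(raw))
        except json.JSONDecodeError:
            pass
    # Layer 2: surface a descriptive error; the caller sends it back
    # to the model as a tool result so the model can self-correct
    raise ValueError(f"Unrepairable tool arguments: {raw[:80]!r}")
```

Cheap deterministic repair runs first because it costs no tokens; the error pipeline is the expensive last resort.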
===== Type Coercion =====
Handle common type mismatches gracefully:
* String "123" -> int 123 (when schema expects integer)
* String "true" -> bool True
* String "null" -> None
* Single item -> wrapped in list (when schema expects array)
* Nested string JSON -> parsed recursively
Pydantic and Zod both support configurable coercion modes. Use ''strict=False'' during initial parsing, then validate the coerced result against business rules.
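For illustration, a hand-rolled version of these coercions; real projects would lean on Pydantic's lax mode or Zod's `z.coerce` where the library supports a given conversion, and the `coerce` helper below is an assumption, not a library API:

```python
import json
from typing import Any

def coerce(value: Any, target: type) -> Any:
    """Apply the common type coercions listed above."""
    if isinstance(value, str):
        s = value.strip()
        if s == "null":
            return None                      # string "null" -> None
        if target is bool and s.lower() in ("true", "false"):
            return s.lower() == "true"       # string "true" -> bool True
        if target is int and s.lstrip("-").isdigit():
            return int(s)                    # string "123" -> int 123
        if target in (dict, list):
            try:
                return json.loads(s)         # nested JSON-in-string
            except json.JSONDecodeError:
                pass
    if target is list and not isinstance(value, list):
        return [value]                       # wrap single item in a list
    return value
```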
===== Monitoring and Metrics =====
Track these metrics to measure parsing reliability:
* **Parse success rate** -- Percentage of tool calls that parse on first attempt
* **Retry count distribution** -- How often self-correction is needed
* **Failure by error type** -- Which malformations occur most frequently
* **Model-specific rates** -- Different LLMs have different failure modes
* **Schema-specific rates** -- Complex schemas fail more often
===== References =====
* [[https://micheallanham.substack.com/p/enhancing-ai-agent-reliability-with|Enhancing AI Agent Reliability with Structured Outputs and JSON Schema]]
* [[https://usebrainbits.com/blog/self-recovering-structured-outputs|AI Sucks at Structured Outputs: Self-Recovering Pattern]]
* [[https://github.com/mastra-ai/mastra/pull/12823|Mastra: JSON Repair for Malformed Tool Call Arguments]]
* [[https://explained.tines.com/en/articles/11644147-best-practices-for-the-ai-agent-action|Best Practices for the AI Agent Action]]
* [[https://dev.to/matt_frank_usa/building-multi-agent-ai-systems-architecture-patterns-and-best-practices-5cf|Building Multi-Agent AI Systems: Architecture Patterns]]
===== See Also =====
* [[agent_error_recovery|Agent Error Recovery]]
* [[agent_streaming|Agent Streaming]]
* [[agent_prompt_injection_defense|Agent Prompt Injection Defense]]