====== Tool Result Parsing ======

Reliably handling tool outputs in AI agents, including JSON parsing, error handling, type coercion, retry on malformed output, and structured extraction patterns.

===== Overview =====

A significant portion of AI agent failures in production stems not from flawed reasoning but from "tool argument rot" -- the generation of malformed JSON, missing fields, or incorrect data types when calling tools. OpenAI reported that enforcing strict JSON schema validation increased output compliance from under 40% to 100%, and teams have seen 7x improvements in multi-step workflow accuracy by adopting schema validation.

The core principle: shift from best-effort parsing to deterministic validation. The agent either produces a perfectly formatted tool call or it fails fast, eliminating ambiguous "close enough" attempts that introduce instability.

===== Error Taxonomy =====

Tool output failures fall into distinct categories:

  * **Malformed JSON** -- truncated output, unquoted keys, trailing commas, single quotes instead of double quotes
  * **Schema violations** -- missing required fields, wrong types, extra properties
  * **Partial responses** -- valid JSON but incomplete data due to token-limit truncation
  * **LLM artifacts** -- special tokens such as ''<|call|>'' or ''<|endoftext|>'' appended to the JSON
  * **Encoding issues** -- Unicode characters (em dashes, emoji) corrupting JSON strings

===== Schema Validation =====

Define a strict schema for every tool call and validate responses against it. Use Pydantic (Python) or Zod (TypeScript) for runtime validation with type coercion.
<code python>
import json
import re
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class SearchResult(BaseModel):
    title: str
    url: str
    relevance_score: float
    snippet: Optional[str] = None

    @field_validator("relevance_score")
    @classmethod
    def score_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"relevance_score must be 0.0-1.0, got {v}")
        return v


class ToolOutput(BaseModel):
    tool_name: str
    success: bool
    results: list[SearchResult] = []
    error: Optional[str] = None


def parse_tool_output(raw: str) -> ToolOutput:
    """Parse and validate tool output with progressive fallback."""
    # Step 1: Strip LLM artifacts (special tokens, Markdown fences)
    cleaned = raw.strip()
    for artifact in ["<|call|>", "<|endoftext|>", "```json", "```"]:
        cleaned = cleaned.replace(artifact, "")
    cleaned = cleaned.strip()

    # Step 2: Attempt a normal JSON parse
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        # Step 3: Fall back to repairing common malformations
        data = attempt_json_repair(cleaned)

    # Step 4: Validate against the schema (raises ValidationError on mismatch)
    return ToolOutput.model_validate(data)


def attempt_json_repair(raw: str) -> dict:
    """Attempt to fix common JSON malformations."""
    text = raw.strip()
    # Single quotes -> double quotes (a heuristic; can corrupt apostrophes
    # inside string values)
    text = text.replace("'", '"')
    # Remove trailing commas before closing brackets
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # Quote unquoted object keys
    text = re.sub(r"(\{|,)\s*(\w+)\s*:", r'\1 "\2":', text)
    return json.loads(text)
</code>

===== Self-Recovering Structured Output =====

When parsing fails, feed the error back to the LLM so it can self-correct. This is more effective than blind retries because the model receives specific feedback about what went wrong.
<code python>
import json
from dataclasses import dataclass
from typing import Optional

from pydantic import ValidationError

# Reuses ToolOutput and parse_tool_output from the previous example.


@dataclass
class ParseAttempt:
    """Optional record of a single parse attempt, useful for logging."""
    success: bool
    result: Optional[ToolOutput] = None
    error: Optional[str] = None


async def parse_with_self_correction(
    llm_client,
    messages: list[dict],
    tool_schema: dict,
    max_retries: int = 3,
) -> ToolOutput:
    """Parse tool output with LLM self-correction on failure."""
    for attempt in range(max_retries):
        response = await llm_client.chat(messages, tools=[tool_schema])

        if not response.tool_calls:
            # Model gave a text response instead of a tool call
            messages.append({"role": "assistant", "content": response.text})
            messages.append({
                "role": "user",
                "content": "Please use the tool to provide a structured response.",
            })
            continue

        for tool_call in response.tool_calls:
            # Arguments may arrive as a raw string or an already-parsed dict,
            # depending on the client library
            raw = (tool_call.arguments
                   if isinstance(tool_call.arguments, str)
                   else json.dumps(tool_call.arguments))
            try:
                return parse_tool_output(raw)
            except (json.JSONDecodeError, ValidationError) as e:
                # Most chat APIs require the assistant turn containing the
                # tool call to precede the tool result that answers it
                messages.append({"role": "assistant", "tool_calls": [tool_call]})
                # Feed the specific error back so the model can self-correct
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": f"Parse error: {e}. Fix the JSON and retry.",
                })

    raise RuntimeError(f"Failed to parse tool output after {max_retries} attempts")
</code>

===== Structured Extraction Patterns =====

Use XML tags or JSON schemas in prompts to guide LLMs toward parseable outputs.
Best practices:

  * Define output schemas explicitly and reference them in prompts
  * Use delimiters (''---BEGIN OUTPUT---'') to clearly mark structured sections
  * Specify constraints: "Max 100 results, timeout 5s"
  * Include example outputs in the prompt for complex schemas
  * Prefer native tool calling over free-form JSON generation

===== Multi-Layer Defense =====

Production tool parsing should implement three layers:

  - **Auto-repair** -- fix common JSON issues (trailing tokens, unquoted keys, single quotes, trailing commas)
  - **Error pipeline** -- when repair fails, send a descriptive error back to the model as a tool result so it can self-correct
  - **Custom repair hook** -- optional application-specific repair logic for known edge cases

This is the approach taken by the Mastra framework, whose JSON repair for malformed tool-call arguments implements exactly these three layers.

===== Type Coercion =====

Handle common type mismatches gracefully:

  * String "123" -> int 123 (when the schema expects an integer)
  * String "true" -> bool True
  * String "null" -> None
  * Single item -> wrapped in a list (when the schema expects an array)
  * Nested string JSON -> parsed recursively

Pydantic and Zod both support configurable coercion modes. Use ''strict=False'' during initial parsing, then validate the coerced result against business rules.
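The first and fourth coercions above can be sketched with Pydantic v2, whose default (lax) mode already coerces numeric and boolean strings; single-item wrapping needs a small ''mode="before"'' validator. The ''ToolArgs'' model and its fields are illustrative:

```python
from pydantic import BaseModel, field_validator


class ToolArgs(BaseModel):
    count: int        # lax mode coerces the string "123" -> 123
    verbose: bool     # lax mode coerces "true" -> True
    tags: list[str]   # a bare string gets wrapped by the validator below

    @field_validator("tags", mode="before")
    @classmethod
    def wrap_single_item(cls, v):
        # LLMs often emit a single item where the schema expects an array
        return v if isinstance(v, list) else [v]


args = ToolArgs.model_validate({"count": "123", "verbose": "true", "tags": "urgent"})
# args.count == 123, args.verbose is True, args.tags == ["urgent"]
```

Passing ''strict=True'' to ''model_validate'' disables these coercions and rejects the same payload, which is useful as a second pass once you want to enforce exact types.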
===== Monitoring and Metrics =====

Track these metrics to measure parsing reliability:

  * **Parse success rate** -- percentage of tool calls that parse on the first attempt
  * **Retry count distribution** -- how often self-correction is needed
  * **Failure by error type** -- which malformations occur most frequently
  * **Model-specific rates** -- different LLMs have different failure modes
  * **Schema-specific rates** -- complex schemas fail more often

===== References =====

  * [[https://micheallanham.substack.com/p/enhancing-ai-agent-reliability-with|Enhancing AI Agent Reliability with Structured Outputs and JSON Schema]]
  * [[https://usebrainbits.com/blog/self-recovering-structured-outputs|AI Sucks at Structured Outputs: Self-Recovering Pattern]]
  * [[https://github.com/mastra-ai/mastra/pull/12823|Mastra: JSON Repair for Malformed Tool Call Arguments]]
  * [[https://explained.tines.com/en/articles/11644147-best-practices-for-the-ai-agent-action|Best Practices for the AI Agent Action]]
  * [[https://dev.to/matt_frank_usa/building-multi-agent-ai-systems-architecture-patterns-and-best-practices-5cf|Building Multi-Agent AI Systems: Architecture Patterns]]

===== See Also =====

  * [[agent_error_recovery|Agent Error Recovery]]
  * [[agent_streaming|Agent Streaming]]
  * [[agent_prompt_injection_defense|Agent Prompt Injection Defense]]