====== Tool Result Parsing ======

Reliably handling tool outputs in AI agents, including JSON parsing, error handling, type coercion, retry on malformed output, and structured extraction patterns.

===== Overview =====

A significant portion of AI agent failures in production stems not from flawed reasoning but from "tool argument rot" -- the generation of malformed JSON, missing fields, or incorrect data types when calling tools. OpenAI reported that enforcing strict JSON schema validation increased output compliance from under 40% to 100%, and teams have seen 7x improvements in multi-step workflow accuracy by adopting schema validation.

The core principle: shift from best-effort parsing to deterministic validation. The agent either produces a perfectly formatted tool call or it fails fast, eliminating ambiguous "close enough" attempts that introduce instability.

===== Error Taxonomy =====

Tool output failures fall into distinct categories:

  * **Malformed JSON** -- truncated output, unquoted keys, trailing commas, single quotes instead of double quotes
  * **Schema violations** -- missing required fields, wrong types, extra properties
  * **Partial responses** -- valid JSON but incomplete data due to token-limit truncation
  * **LLM artifacts** -- special tokens such as ''<|call|>'' or ''<|endoftext|>'' appended to the JSON
  * **Encoding issues** -- Unicode characters (em dashes, emoji) corrupting JSON strings

===== Schema Validation =====

Define a strict schema for every tool call and validate responses against it. Use Pydantic (Python) or Zod (TypeScript) for runtime validation with type coercion.
<code python>
import json
import re
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class SearchResult(BaseModel):
    title: str
    url: str
    relevance_score: float
    snippet: Optional[str] = None

    @field_validator("relevance_score")
    @classmethod
    def score_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"relevance_score must be 0.0-1.0, got {v}")
        return v


class ToolOutput(BaseModel):
    tool_name: str
    success: bool
    results: list[SearchResult] = []
    error: Optional[str] = None


def parse_tool_output(raw: str) -> ToolOutput:
    """Parse and validate tool output with progressive fallback."""
    # Step 1: Strip LLM artifacts (special tokens, Markdown fences)
    cleaned = raw.strip()
    for artifact in ["<|call|>", "<|endoftext|>", "```json", "```"]:
        cleaned = cleaned.replace(artifact, "")
    cleaned = cleaned.strip()

    # Step 2: Attempt a normal JSON parse
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        # Step 3: Fall back to repairing common malformations
        data = attempt_json_repair(cleaned)

    # Step 4: Validate against the schema (raises ValidationError on mismatch)
    return ToolOutput.model_validate(data)


def attempt_json_repair(raw: str) -> dict:
    """Attempt to fix common JSON malformations."""
    text = raw.strip()
    # Single quotes -> double quotes (a heuristic; can corrupt apostrophes
    # inside string values)
    text = text.replace("'", '"')
    # Remove trailing commas before closing brackets
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # Quote unquoted object keys
    text = re.sub(r"(\{|,)\s*(\w+)\s*:", r'\1 "\2":', text)
    return json.loads(text)
</code>

===== Self-Recovering Structured Output =====

When parsing fails, feed the error back to the LLM so it can self-correct. This is more effective than blind retries because the model receives specific feedback about what went wrong.
<code python>
import json
from dataclasses import dataclass
from typing import Optional

from pydantic import ValidationError

# Reuses ToolOutput and parse_tool_output from the previous example.


@dataclass
class ParseAttempt:
    """Optional record of a single parse attempt, useful for logging."""
    success: bool
    result: Optional[ToolOutput] = None
    error: Optional[str] = None


async def parse_with_self_correction(
    llm_client,
    messages: list[dict],
    tool_schema: dict,
    max_retries: int = 3,
) -> ToolOutput:
    """Parse tool output with LLM self-correction on failure."""
    for attempt in range(max_retries):
        response = await llm_client.chat(messages, tools=[tool_schema])

        if not response.tool_calls:
            # Model gave a text response instead of a tool call
            messages.append({"role": "assistant", "content": response.text})
            messages.append({
                "role": "user",
                "content": "Please use the tool to provide a structured response.",
            })
            continue

        for tool_call in response.tool_calls:
            # Arguments may arrive as a raw string or an already-parsed dict,
            # depending on the client library
            raw = (tool_call.arguments
                   if isinstance(tool_call.arguments, str)
                   else json.dumps(tool_call.arguments))
            try:
                return parse_tool_output(raw)
            except (json.JSONDecodeError, ValidationError) as e:
                # Most chat APIs require the assistant turn containing the
                # tool call to precede the tool result that answers it
                messages.append({"role": "assistant", "tool_calls": [tool_call]})
                # Feed the specific error back so the model can self-correct
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": f"Parse error: {e}. Fix the JSON and retry.",
                })

    raise RuntimeError(f"Failed to parse tool output after {max_retries} attempts")
</code>

===== Structured Extraction Patterns =====

Use XML tags or JSON schemas in prompts to guide LLMs toward parseable outputs.
Best practices:

  * Define output schemas explicitly and reference them in prompts
  * Use delimiters (''---BEGIN OUTPUT---'') to clearly mark structured sections
  * Specify constraints: "Max 100 results, timeout 5s"
  * Include example outputs in the prompt for complex schemas
  * Prefer native tool calling over free-form JSON generation

===== Multi-Layer Defense =====

Production tool parsing should implement three layers:

  - **Auto-repair** -- fix common JSON issues (trailing tokens, unquoted keys, single quotes, trailing commas)
  - **Error pipeline** -- when repair fails, send a descriptive error back to the model as a tool result so it can self-correct
  - **Custom repair hook** -- optional application-specific repair logic for known edge cases

This is the approach taken by the Mastra framework, whose JSON repair for malformed tool-call arguments implements exactly these three layers.

===== Type Coercion =====

Handle common type mismatches gracefully:

  * String "123" -> int 123 (when the schema expects an integer)
  * String "true" -> bool True
  * String "null" -> None
  * Single item -> wrapped in a list (when the schema expects an array)
  * Nested string JSON -> parsed recursively

Pydantic and Zod both support configurable coercion modes. Use ''strict=False'' during initial parsing, then validate the coerced result against business rules.
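The first and fourth coercions above can be sketched with Pydantic v2, whose default (lax) mode already coerces numeric and boolean strings; single-item wrapping needs a small ''mode="before"'' validator. The ''ToolArgs'' model and its fields are illustrative:

```python
from pydantic import BaseModel, field_validator


class ToolArgs(BaseModel):
    count: int        # lax mode coerces the string "123" -> 123
    verbose: bool     # lax mode coerces "true" -> True
    tags: list[str]   # a bare string gets wrapped by the validator below

    @field_validator("tags", mode="before")
    @classmethod
    def wrap_single_item(cls, v):
        # LLMs often emit a single item where the schema expects an array
        return v if isinstance(v, list) else [v]


args = ToolArgs.model_validate({"count": "123", "verbose": "true", "tags": "urgent"})
# args.count == 123, args.verbose is True, args.tags == ["urgent"]
```

Passing ''strict=True'' to ''model_validate'' disables these coercions and rejects the same payload, which is useful as a second pass once you want to enforce exact types.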
===== Monitoring and Metrics =====

Track these metrics to measure parsing reliability:

  * **Parse success rate** -- percentage of tool calls that parse on the first attempt
  * **Retry count distribution** -- how often self-correction is needed
  * **Failure by error type** -- which malformations occur most frequently
  * **Model-specific rates** -- different LLMs have different failure modes
  * **Schema-specific rates** -- complex schemas fail more often

===== References =====

  * [[https://micheallanham.substack.com/p/enhancing-ai-agent-reliability-with|Enhancing AI Agent Reliability with Structured Outputs and JSON Schema]]
  * [[https://usebrainbits.com/blog/self-recovering-structured-outputs|AI Sucks at Structured Outputs: Self-Recovering Pattern]]
  * [[https://github.com/mastra-ai/mastra/pull/12823|Mastra: JSON Repair for Malformed Tool Call Arguments]]
  * [[https://explained.tines.com/en/articles/11644147-best-practices-for-the-ai-agent-action|Best Practices for the AI Agent Action]]
  * [[https://dev.to/matt_frank_usa/building-multi-agent-ai-systems-architecture-patterns-and-best-practices-5cf|Building Multi-Agent AI Systems: Architecture Patterns]]

===== See Also =====

  * [[agent_error_recovery|Agent Error Recovery]]
  * [[agent_streaming|Agent Streaming]]
  * [[agent_prompt_injection_defense|Agent Prompt Injection Defense]]