Reliably handling tool outputs in AI agents, including JSON parsing, error handling, type coercion, retry on malformed output, and structured extraction patterns.
A significant portion of AI agent failures in production stems not from flawed reasoning but from “tool argument rot”: malformed JSON, missing fields, or incorrect data types when calling tools. OpenAI reported that enforcing strict JSON schema validation increased output compliance from under 40% to 100%, and teams have seen 7x improvements in multi-step workflow accuracy by adopting schema validation.
The core principle: shift from best-effort parsing to deterministic validation. The agent either produces a perfectly formatted tool call or it fails fast, eliminating ambiguous “close enough” attempts that introduce instability.
Tool output failures fall into distinct categories:

- Malformed JSON: single quotes, trailing commas, unquoted keys
- Missing required fields
- Incorrect data types (e.g., a numeric score returned as a string)
- LLM artifacts such as <|call|> or <|endoftext|> appended to JSON

Define strict schemas for every tool call and validate responses against them. Use Pydantic (Python) or Zod (TypeScript) for runtime validation with type coercion.
````python
from pydantic import BaseModel, ValidationError, field_validator
from typing import Optional
import json
import re


class SearchResult(BaseModel):
    title: str
    url: str
    relevance_score: float
    snippet: Optional[str] = None

    @field_validator("relevance_score")
    @classmethod
    def score_in_range(cls, v):
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"relevance_score must be 0.0-1.0, got {v}")
        return v


class ToolOutput(BaseModel):
    tool_name: str
    success: bool
    results: list[SearchResult] = []
    error: Optional[str] = None


def parse_tool_output(raw: str) -> ToolOutput:
    """Parse and validate tool output with progressive fallback."""
    # Step 1: Clean LLM artifacts
    cleaned = raw.strip()
    for artifact in ["<|call|>", "<|endoftext|>", "```json", "```"]:
        cleaned = cleaned.replace(artifact, "")

    # Step 2: Attempt JSON parse
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        # Step 3: Try to repair common issues
        data = attempt_json_repair(cleaned)

    # Step 4: Validate against schema
    return ToolOutput.model_validate(data)


def attempt_json_repair(raw: str) -> dict:
    """Attempt to fix common JSON malformations."""
    text = raw.strip()
    # Fix single quotes -> double quotes
    text = text.replace("'", '"')
    # Fix trailing commas before closing brackets
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # Fix unquoted keys
    text = re.sub(r"(\{|,)\s*(\w+)\s*:", r'\1 "\2":', text)
    return json.loads(text)
````
When parsing fails, feed the error back to the LLM so it can self-correct. This is more effective than blind retries because the model receives specific feedback about what went wrong.
```python
import json
from dataclasses import dataclass
from typing import Optional

from pydantic import ValidationError


@dataclass
class ParseAttempt:
    success: bool
    result: Optional[ToolOutput] = None
    error: Optional[str] = None


async def parse_with_self_correction(
    llm_client,
    messages: list[dict],
    tool_schema: dict,
    max_retries: int = 3,
) -> ToolOutput:
    """Parse tool output with LLM self-correction on failure."""
    for attempt in range(max_retries):
        response = await llm_client.chat(messages, tools=[tool_schema])

        if not response.tool_calls:
            # Model gave a text response instead of a tool call
            messages.append({"role": "assistant", "content": response.text})
            messages.append({
                "role": "user",
                "content": "Please use the tool to provide a structured response.",
            })
            continue

        for tool_call in response.tool_calls:
            try:
                return parse_tool_output(json.dumps(tool_call.arguments))
            except (json.JSONDecodeError, ValidationError) as e:
                # Feed the specific error back to the model for self-correction
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": f"Parse error: {e}. Fix the JSON and retry.",
                })

    raise RuntimeError(f"Failed to parse tool output after {max_retries} attempts")
```
Use XML tags or JSON schemas in prompts to guide LLMs toward parseable outputs.
Best practices:

- Include the tool's JSON schema directly in the prompt so the model knows the exact expected shape
- Wrap structured sections in XML tags the parser can extract reliably
- Use explicit delimiters (e.g., —BEGIN OUTPUT—) to clearly mark structured sections

Production tool parsing should implement three layers: strict schema validation, automated repair of common malformations, and LLM self-correction driven by error feedback.
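One way to combine schema and delimiters in a prompt is sketched below; the exact wording and the `build_structured_prompt` helper are assumptions, not a fixed convention:

```python
import json

# Hypothetical schema for a search tool (mirrors the ToolOutput model above)
SEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "tool_name": {"type": "string"},
        "success": {"type": "boolean"},
        "results": {"type": "array"},
    },
    "required": ["tool_name", "success"],
}


def build_structured_prompt(task: str, schema: dict) -> str:
    """Build a prompt that pins the model to a schema and explicit delimiters."""
    return (
        f"{task}\n\n"
        "Respond with JSON only, placed between the delimiters below.\n"
        f"The JSON must match this schema:\n{json.dumps(schema, indent=2)}\n\n"
        "—BEGIN OUTPUT—\n"
        "<your JSON here>\n"
        "—END OUTPUT—"
    )


prompt = build_structured_prompt("Search for Pydantic docs", SEARCH_SCHEMA)
```

The delimiters also make extraction trivial: the parser can slice everything between the markers before attempting `json.loads`.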
This layered approach is used by the Mastra framework, which added JSON repair for malformed tool call arguments on top of these three layers.
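A dependency-free sketch of how the three layers compose is shown below; the function names and the `regenerate` callback are assumptions for illustration, and Mastra's internals differ:

```python
import json
import re


def repair(text: str) -> str:
    """Layer 2: fix common malformations (single quotes, trailing commas)."""
    text = text.strip().replace("'", '"')
    return re.sub(r",\s*([}\]])", r"\1", text)


def parse_layered(raw: str, regenerate, required: set, max_retries: int = 2) -> dict:
    """Layer 1: strict parse + validate; Layer 2: repair; Layer 3: retry with feedback."""
    attempt = raw
    for _ in range(max_retries + 1):
        for candidate in (attempt, repair(attempt)):
            try:
                data = json.loads(candidate)
                if not isinstance(data, dict):
                    raise ValueError("expected a JSON object")
                missing = required - data.keys()
                if missing:
                    raise ValueError(f"missing fields: {missing}")
                return data
            except (json.JSONDecodeError, ValueError) as e:
                error = str(e)
        # Layer 3: hand the error back so the model can fix its own output
        attempt = regenerate(attempt, error)
    raise RuntimeError("all parse layers exhausted")
```

In practice `regenerate` would be a call back into the LLM with the error message appended, as in the self-correction loop above.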
Handle common type mismatches gracefully:

- Numeric values returned as strings ("0.85" instead of 0.85)
- Booleans returned as strings ("true"/"false")
- A single object where a list is expected
- null for fields that are optional
Pydantic and Zod both support configurable coercion modes. Use strict=False during initial parsing, then validate the coerced result against business rules.
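For intuition, a minimal stdlib sketch of what lax-mode coercion does (the set of accepted boolean strings is an assumption; Pydantic's actual rules are richer):

```python
def coerce(value, target):
    """Best-effort coercion mirroring lax-mode validators."""
    if isinstance(value, target):
        return value
    # Numeric strings and ints widen to float
    if target is float and isinstance(value, (int, str)):
        return float(value)
    # Digit-only strings (with optional sign) narrow to int
    if target is int and isinstance(value, str) and value.strip().lstrip("-").isdigit():
        return int(value)
    # Common truthy/falsy strings map to bool
    if target is bool and isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "1", "yes"):
            return True
        if lowered in ("false", "0", "no"):
            return False
    raise TypeError(f"cannot coerce {value!r} to {target.__name__}")
```

Anything outside these safe conversions raises rather than silently guessing, consistent with the fail-fast principle above.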
Track these metrics to measure parsing reliability:

- First-attempt parse success rate
- Repair invocation rate (how often the JSON repair layer runs)
- Self-correction retries per tool call
- Validation failures broken down by tool
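A sketch of a counter for these signals follows; the metric names and the `ParseMetrics` class are assumptions, not a standard API:

```python
from dataclasses import dataclass, field


@dataclass
class ParseMetrics:
    """Counters for tool-output parsing reliability."""
    attempts: int = 0
    first_try_successes: int = 0
    repairs: int = 0          # times the repair layer ran
    retries: int = 0          # self-correction round trips
    failures_by_tool: dict = field(default_factory=dict)

    def record_failure(self, tool_name: str) -> None:
        self.failures_by_tool[tool_name] = self.failures_by_tool.get(tool_name, 0) + 1

    @property
    def first_try_rate(self) -> float:
        return self.first_try_successes / self.attempts if self.attempts else 0.0
```

A falling first-try rate with a rising repair rate usually points at prompt drift rather than parser bugs, which is why the two are worth tracking separately.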