====== Structured Outputs ======

Structured outputs refer to techniques and tools that constrain LLM generation to produce well-formed data in a specified format (JSON, XML, SQL, code, etc.) rather than free-form text. This capability is essential for integrating LLMs into software systems where downstream components require predictable, parseable responses.

===== Why Structured Outputs Matter =====

LLMs natively produce free-form text, but production applications need:

  * **Reliable parsing**: API responses must conform to schemas for programmatic consumption
  * **Type safety**: Fields must have correct types (strings, numbers, booleans, arrays)
  * **Completeness**: All required fields must be present
  * **Consistency**: Outputs must follow the same structure across invocations
  * **Integration**: Structured data connects LLMs to databases, APIs, UI components, and [[tool_utilization|tool pipelines]]

Without structured output guarantees, applications resort to brittle regex parsing, retry loops, and manual validation, all of which degrade reliability and increase latency.

===== Approaches to Structured Output =====

==== 1. Prompting-Based ====

The simplest approach: instruct the model to output a specific format via the prompt.

  * **Pros**: Works with any model, no special tooling required
  * **Cons**: No guarantees; models may include preamble text, miss fields, or produce malformed output
  * **Techniques**: Few-shot examples, explicit format instructions, "respond only with valid JSON"

==== 2. Function Calling / Tool Use ====

Model providers offer native function-calling interfaces where the model selects and populates structured function parameters:

  * **[[function_calling|OpenAI Function Calling]]** (2023+): Model outputs JSON arguments matching a function schema; extended in 2024-2025 with parallel function calls and strict mode
  * **[[anthropic|Anthropic]] Tool Use**: [[claude|Claude]] models output structured tool calls with typed parameters; supports complex nested schemas
  * **[[google_gemini|Google Gemini]] [[function_calling|Function Calling]]**: Similar structured invocation with grounding in Google Search and other tools

[[function_calling|Function calling]] has become the de facto standard for structured agent interactions, serving as the backbone of [[tool_utilization|tool utilization]] in modern agent frameworks.

==== 3. Constrained Decoding ====

Intervenes during token generation to mask invalid tokens, guaranteeing schema compliance:

  * **OpenAI Structured Outputs** (2024-2025): Uses a Context-Free Grammar (CFG) engine to enforce JSON Schema compliance at generation time. GPT-4o and GPT-5 achieve 100% schema compliance in strict mode with ~50% latency reduction vs. unconstrained generation with retries.(([[https://platform.openai.com/docs/guides/structured-outputs|OpenAI Structured Outputs Documentation. 2024.]]))
  * **SGLang** (2024-2025): High-performance serving framework with built-in constrained decoding for structured outputs
  * **[[vllm|vLLM]]**: Supports guided generation via [[outlines|Outlines]] integration
  * **[[llama_cpp|llama.cpp]]**: Grammar-based sampling that constrains generation to GBNF grammars; achieves top performance on JSON Schema Store benchmarks

**How it works**: At each token generation step, a finite-state automaton or pushdown automaton derived from the target schema masks logits for tokens that would violate the schema. This guarantees structural validity without post-processing.
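The masking step can be sketched with a toy example. Everything here is illustrative: the hand-written vocabulary and the tiny transition table for the schema ''{"name": <string>}'' stand in for a real tokenizer and a compiled schema automaton.

```python
import math

# Toy vocabulary standing in for a real tokenizer's token set.
VOCAB = ['{', '}', '"name"', ':', '"Alice"', 'hello', '42']

def allowed_next_tokens(generated: list[str]) -> set[str]:
    """Hand-written automaton for the schema {"name": <string>}.
    A real engine compiles this from a JSON Schema or grammar."""
    if not generated:
        return {'{'}
    transitions = {
        '{': {'"name"'},
        '"name"': {':'},
        ':': {'"Alice"'},   # any string token in a real automaton
        '"Alice"': {'}'},
        '}': set(),         # accepting state: nothing may follow
    }
    return transitions[generated[-1]]

def mask_logits(logits: list[float], generated: list[str]) -> list[float]:
    # Tokens that would violate the schema get logit -inf,
    # so they can never be sampled regardless of model preference.
    allowed = allowed_next_tokens(generated)
    return [l if tok in allowed else -math.inf
            for tok, l in zip(VOCAB, logits)]

def greedy_generate() -> str:
    generated: list[str] = []
    while True:
        # Stand-in for model logits; note that 'hello' scores highest
        # but the automaton always masks it out.
        fake_logits = [0.5, 0.1, 0.3, 0.2, 0.4, 0.9, 0.8]
        masked = mask_logits(fake_logits, generated)
        if all(l == -math.inf for l in masked):
            break  # accepting state reached, no legal continuation
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        generated.append(VOCAB[best])
    return ''.join(generated)

print(greedy_generate())  # a valid instance of the schema, never free-form text
```

Because invalid tokens are removed before sampling rather than checked afterwards, the output is structurally valid by construction, which is why this approach needs no retries or post-processing.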
The following example uses [[openai|OpenAI]]'s native structured output with ''response_format'' to guarantee a valid JSON response matching a Pydantic schema:

<code python>
# OpenAI Structured Outputs with response_format and Pydantic
from openai import OpenAI
from pydantic import BaseModel

class MovieReview(BaseModel):
    title: str
    rating: float
    pros: list[str]
    cons: list[str]
    recommended: bool

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract a structured movie review."},
        {"role": "user", "content": "Dune Part Two was visually stunning with great acting. "
                                    "The pacing dragged in the middle. 8.5/10, highly recommended."},
    ],
    response_format=MovieReview,
)

review = completion.choices[0].message.parsed
print(f"{review.title}: {review.rating}/10 - Recommended: {review.recommended}")
print(f"Pros: {review.pros}")
</code>

==== 4. Grammar-Based Generation ====

Libraries that compile schemas into generation grammars:

  * **[[outlines|Outlines]]** ([[https://github.com/dottxt-ai/outlines|dottxt]], 2023-2025): Python library that compiles JSON Schema, regex, or CFG into efficient token masks; works with any HuggingFace model
  * **Guidance**(([[https://github.com/guidance-ai/guidance|Guidance: A Guidance Language for Controlling LLMs. Microsoft, 2023.]])) (Microsoft, 2023-2025): Interleaves generation with programmatic control flow; highest empirical coverage across benchmarks per JSONSchemaBench (2025)
  * **LMQL** (2023): SQL-like query language for LLMs with type constraints and scripted prompting
  * **XGrammar** (2024): High-performance grammar-based constrained decoding engine

==== 5. Post-Processing Transformation ====

**SLOT (Structured LLM Output Transformer)** (EMNLP Industry 2025): A model-agnostic approach using a lightweight fine-tuned model to transform unstructured LLM output into schema-compliant structured data. A fine-tuned Mistral-7B achieves 99.5% schema accuracy and 94.0% content similarity, and even compact models like Llama-3.2-1B can match larger proprietary models.

===== Libraries and Frameworks =====

==== Instructor ====

  * **[[github|GitHub]]**: [[https://github.com/jxnl/instructor]](([[https://github.com/jxnl/instructor|Instructor: Structured LLM Outputs. jxnl, 2023.]]))
  * Built on Pydantic for schema definition and validation
  * Supports [[openai|OpenAI]], [[anthropic|Anthropic]], [[google|Google]], Mistral, and open-source models
  * Automatic retries with validation error feedback

==== LangChain ====

  * **[[github|GitHub]]**: [[https://github.com/langchain-ai/langchain]]
  * ''.with_structured_output()'' method for any supported model
  * PydanticOutputParser and StructuredOutputParser
  * Integration with [[tool_utilization|tool pipelines]]

==== BAML ====

  * **Website**: [[https://www.baml.dev/]]
  * Domain-specific language for defining LLM functions with typed inputs/outputs
  * Compiler generates type-safe client code
  * Built-in testing and validation

==== Marvin ====

  * **[[github|GitHub]]**: [[https://github.com/PrefectHQ/marvin]]
  * Lightweight Python library for structured extraction and classification
  * Uses Pydantic models as output schemas

==== Magentic ====

  * **[[github|GitHub]]**: [[https://github.com/jackmpcollins/magentic]]
  * Decorator-based interface: annotate functions with return types and get structured outputs
  * Works with Pydantic models for complex structured outputs

==== LlamaIndex ====

  * **Website**: [[https://www.llamaindex.ai/]]
  * Structured output support via Pydantic programs and output parsers
  * Deep integration with RAG pipelines

==== Pydantic ====

  * **Website**: [[https://docs.pydantic.dev/]]
  * The de facto standard for schema definition in Python-based structured output tools
  * JSON Schema generation used by most libraries above

===== Evaluation and Benchmarks =====

  * **JSONSchemaBench** (2025): Systematic benchmark evaluating constrained decoding across efficiency, coverage, and quality. Reveals that coverage drops on complex schemas (nested objects, conditional fields) even for leading frameworks.
  * **SLOTBench** (EMNLP 2025): Evaluates post-processing approaches on schema accuracy and content fidelity across diverse domains.
  * Key finding: Supervised fine-tuning combined with constrained decoding produces the best results; neither alone is sufficient for complex schemas.

===== Best Practices =====

  - **Use native structured output modes** when available ([[openai|OpenAI]] strict mode, [[anthropic|Anthropic]] tool use) for highest reliability
  - **Define schemas with Pydantic** for type safety, validation, and automatic JSON Schema generation
  - **Include descriptions in schema fields** to guide model generation with semantic context
  - **Use constrained decoding** for [[open_weight_models|open-weight models]] to guarantee compliance
  - **Implement retry with feedback**: On validation failure, pass the error back to the model for correction (the approach [[instructor_framework|Instructor]] uses)
  - **Keep schemas simple**: Deeply nested or highly conditional schemas reduce reliability across all approaches
  - **Test with JSONSchemaBench** or similar benchmarks to evaluate reliability before production deployment

===== See Also =====

  * [[structured_output|Structured Output Generation]]
  * [[structured_extraction|Structured Extraction]]
  * [[file_based_output_generation|File-Based Output Generation]]
  * [[video_metadata_extraction|Structured Video Metadata Extraction]]
  * [[tool_use_protocol|Structured Tool-Use Protocol]]
===== References =====