Toolformer

Toolformer is a research approach introduced by¹⁾ (Meta AI) in February 2023 that trains language models to autonomously decide when and how to call external tools by generating API calls inline within text sequences. The model learns tool usage in a self-supervised manner, requiring only a handful of demonstrations per API, no explicit human annotations of when tools should be used. Toolformer demonstrated that smaller models augmented with tools can match or exceed the performance of much larger models.

Paper: Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, February 2023
Authors: Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom
Venue: NeurIPS 2023
Organization: Meta AI Research

graph LR T[Text Dataset] --> S[Sample API Calls] S --> X[Execute Calls] X --> F[Filter by <a href='/perplexity_ai' class='wikilink2' title='perplexity_ai' rel='nofollow' data-wiki-id='perplexity_ai'>Perplexity]</a> F --> FT[Fine-Tune Model] FT --> I[Inference with Tools] style T fill:#69f,stroke:#333 style I fill:#6f6,stroke:#333

Self-Supervised Training Approach

Toolformer's key innovation is its training methodology:

API Call Sampling: Given a dataset of text, the model samples potential positions where API calls could be inserted, generating candidate calls with appropriate arguments
Execution: Each candidate API call is actually executed against the real tool
Filtering: Only API calls that reduce perplexity on subsequent tokens are retained, meaning only calls that genuinely help predict future text survive
Fine-tuning: The model is fine-tuned on the filtered dataset, learning to generate API call tokens naturally within text sequences

This approach means the model learns when a tool is useful (not just how to use it), without requiring human-labeled training data specifying tool usage points.

Perplexity Filtering Formula

The core filtering criterion compares the perplexity of tokens following a potential API call position, with and without the tool result. A candidate API call $c$ with result $r$ at position $i$ in text $x$ is retained if:

$$L_i(c, r) - L_i(\emptyset) \geq \tau$$

where $L_i$ denotes the cross-entropy loss over subsequent tokens:

$$L_i(c, r) = -\sum_{j=i}^{n} \log p_\theta(x_j \mid x_{<i}, c, r, x_{i:j-1})$$

and $L_i(\emptyset)$ is the loss without any API call. The threshold $\tau$ controls how much a tool call must help to be kept. Only calls where the tool result sufficiently reduces the loss on future tokens are included in training, ensuring the model learns to invoke tools precisely when they provide useful information.

Python Example: Inline API Call Generation

import re
from [[openai|openai]] import [[openai|OpenAI]]
 
client = [[openai|OpenAI]]()
 
# Simulate the Toolformer pattern: model generates text with inline API calls
# API calls appear as [ToolName(args) -> result] tokens in the output
 
TOOL_IMPLEMENTATIONS = {
    "Calculator": lambda expr: str(eval(expr)),
    "Search": lambda query: f"Python 3.12 was released on October 2, 2023.",
    "Calendar": lambda: "Today is 2025-03-24, Monday.",
}
 
def execute_inline_calls(text: str) -> str:
    """Parse and execute Toolformer-style inline API calls in generated text."""
    pattern = r"\[(\w+)\(([[^]]))*)\)\]"
    def replacer(match):
        tool_name, args = match.group(1), match.group(2)
        if tool_name in TOOL_IMPLEMENTATIONS:
            func = TOOL_IMPLEMENTATIONStool_name
            result = func(args) if args else func()
            return f"[{tool_name}({args}) -> {result}]"
        return match.group(0)
    return re.sub(pattern, replacer, text)
 
def toolformer_generate(prompt: str) -> str:
    """Generate text that may contain inline tool calls, then execute them."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": (
            "You can insert tool calls inline using this syntax: [ToolName(args)]\n"
            "Available tools: [Calculator(expr)], [Search(query)], [Calendar()]\n"
            "Insert them naturally where they help answer accurately."
        )}, {"role": "user", "content": prompt}],
    )
    raw_output = resp.choices[0].message.content
    print(f"Raw: {raw_output}")
    # Execute any inline API calls and substitute results
    resolved = execute_inline_calls(raw_output)
    print(f"Resolved: {resolved}")
    return resolved
 
toolformer_generate("What is 347 * 23, and when was the latest Python released?")

Tools Incorporated

Toolformer demonstrated integration with five types of tools:

Calculator: Arithmetic operations for precise mathematical computation
Q&A System: Question-answering module for factual knowledge retrieval
Search Engine: Web search for current information (Wikipedia-based)
Translation System: Machine translation between languages
Calendar: Date and time lookups

API calls are represented as special tokens within the text sequence: [Calculator(3+5) → 8], allowing the model to seamlessly interleave tool use with generation.

Key Results

Substantially improved zero-shot performance across downstream tasks²⁾
Often competitive with much larger models (e.g., GPT-3 175B) while using a 6.7B parameter model
Did not sacrifice core language modeling abilities, the model retains general text generation quality
Demonstrated that tool augmentation is a viable alternative to simply scaling model size

Influence on Later Work

Toolformer established several principles that shaped subsequent tool-augmented AI research:

Self-supervised tool learning is viable, models can discover when tools help without explicit supervision
Inline API calls as a generation pattern influenced how modern models represent tool use
Perplexity-based filtering showed how to automatically curate tool-use training data
Directly influenced the design of OpenAI Function Calling, MCP, and provider tool-use APIs
The “Augmented Language Models” survey³⁾ from the same Meta AI team contextualized Toolformer within the broader TALM paradigm

Limitations

Training requires executing API calls at scale, which is computationally expensive
Limited to tools with simple text-in/text-out interfaces
The perplexity filter may miss tools useful for tasks not well-represented in training data
No support for multi-turn tool interactions or complex tool chains

References

¹⁾

Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, February 2023

²⁾

Schick et al. arXiv:2302.04761

³⁾

Mialon, G. et al. "Augmented Language Models: a Survey." arXiv:2302.07842, 2023.

AI Agent Knowledge Base

Sidebar

Table of Contents

Toolformer

Self-Supervised Training Approach

Perplexity Filtering Formula

Python Example: Inline API Call Generation

Tools Incorporated

Key Results

Influence on Later Work

Limitations

See Also

References

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Toolformer

Self-Supervised Training Approach

Perplexity Filtering Formula

Python Example: Inline API Call Generation

Tools Incorporated

Key Results

Influence on Later Work

Limitations

See Also

References

Page Tools