====== Toolformer ======
**Toolformer** is a research approach introduced by [[meta|Meta]] AI in February 2023(([[https://arxiv.org/abs/2302.04761|Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, February 2023]])) that trains language models to decide autonomously when and how to call external tools by generating API calls inline within text sequences. The model learns tool usage in a self-supervised manner, requiring only a handful of demonstrations per API and no human annotations of when tools should be used. Toolformer demonstrated that smaller models augmented with tools can match or exceed the performance of much larger models.
* **Paper:** [[https://arxiv.org/abs/2302.04761|Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, February 2023]]
* **Authors:** Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom
* **Venue:** NeurIPS 2023
* **Organization:** [[meta|Meta]] AI Research
graph LR
T[Text Dataset] --> S[Sample API Calls]
S --> X[Execute Calls]
X --> F[Filter by Perplexity]
F --> FT[Fine-Tune Model]
FT --> I[Inference with Tools]
style T fill:#69f,stroke:#333
style I fill:#6f6,stroke:#333
===== Self-Supervised Training Approach =====
Toolformer's key innovation is its training methodology:
- **API Call Sampling:** Given a dataset of text, the model samples potential positions where API calls could be inserted, generating candidate calls with appropriate arguments
- **Execution:** Each candidate API call is actually executed against the real tool
- **Filtering:** Only API calls that **reduce perplexity** on subsequent tokens are retained, meaning only calls that genuinely help predict future text survive
- **Fine-tuning:** The model is fine-tuned on the filtered dataset, learning to generate API call tokens naturally within text sequences
This approach means the model learns //when// a tool is useful (not just //how// to use it), without requiring human-labeled training data specifying tool usage points.
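The four steps above can be sketched end to end. This is a toy illustration, not the paper's implementation: ''score_loss'' stands in for a real language model's cross-entropy over the continuation, and all names and values here are invented.

```python
# Toy sketch of the Toolformer data pipeline. score_loss stands in for a real
# LM's cross-entropy over the continuation; here it simply rewards prefixes
# that already contain the continuation's first token (as an executed tool
# result would).

def score_loss(prefix: str, continuation: str) -> float:
    answer = continuation.split()[0]
    return 1.0 if answer in prefix else 5.0

def execute(call: str) -> str:
    # Stand-in executor for Calculator(...) calls only (eval is fine for a toy).
    expr = call[len("Calculator("):-1]
    return str(eval(expr))

def filter_candidates(text_prefix, continuation, candidates, tau=1.0):
    """Keep calls whose executed result lowers the continuation loss by >= tau."""
    kept = []
    base = score_loss(text_prefix, continuation)         # loss without any call
    for call in candidates:
        result = execute(call)
        augmented = f"{text_prefix} [{call} -> {result}]"
        with_call = score_loss(augmented, continuation)  # loss with call + result
        if base - with_call >= tau:
            kept.append((call, result))
    return kept

kept = filter_candidates(
    text_prefix="The total cost is",
    continuation="1598 dollars.",
    candidates=["Calculator(747+851)", "Calculator(747-851)"],
)
print(kept)  # only the call whose result helps predict the continuation survives
```

The surviving ''(call, result)'' pairs would then be spliced back into the text as training examples for fine-tuning.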
===== Perplexity Filtering Formula =====
The core filtering criterion compares the perplexity of the tokens following a potential API call position, with and without the tool result. A candidate API call $c$ with result $r$ at position $i$ in text $x$ is retained if:
$$L_i(\emptyset) - L_i(c, r) \geq \tau$$
where $L_i$ denotes the cross-entropy loss over subsequent tokens:
$$L_i(c, r) = -\sum_{j=i}^{n} \log p_\theta(x_j \mid c, r, x_{1:j-1})$$
and $L_i(\emptyset)$ is the same loss computed with no API call inserted.
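As a toy numeric instance of the criterion (values invented for illustration), take $\tau = 1$ and suppose an executed Calculator call changes the continuation loss as follows:

$$L_i(\emptyset) = 4.2, \qquad L_i(c, r) = 2.9, \qquad L_i(\emptyset) - L_i(c, r) = 1.3 \geq \tau$$

so the candidate call is retained; a call whose result left the loss unchanged would score $0 < \tau$ and be discarded.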
===== Example: Inline API Calls =====
The inline-call pattern can be simulated by prompting a chat model to emit bracketed calls and resolving them in post-processing (a sketch; the fine-tuned Toolformer emits such tokens natively):
<code python>
import re

from openai import OpenAI

client = OpenAI()

# Simulate the Toolformer pattern: the model generates text with inline API
# calls that appear as [ToolName(args)] tokens in the output.
TOOL_IMPLEMENTATIONS = {
    "Calculator": lambda expr: str(eval(expr)),
    "Search": lambda query: "Python 3.12 was released on October 2, 2023.",
    "Calendar": lambda: "Today is 2025-03-24, Monday.",
}

def execute_inline_calls(text: str) -> str:
    """Parse and execute Toolformer-style inline API calls in generated text."""
    pattern = r"\[(\w+)\(([^)]*)\)\]"

    def replacer(match):
        tool_name, args = match.group(1), match.group(2)
        if tool_name in TOOL_IMPLEMENTATIONS:
            func = TOOL_IMPLEMENTATIONS[tool_name]
            result = func(args) if args else func()
            return f"[{tool_name}({args}) -> {result}]"
        return match.group(0)

    return re.sub(pattern, replacer, text)

def toolformer_generate(prompt: str) -> str:
    """Generate text that may contain inline tool calls, then execute them."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": (
            "You can insert tool calls inline using this syntax: [ToolName(args)]\n"
            "Available tools: [Calculator(expr)], [Search(query)], [Calendar()]\n"
            "Insert them naturally where they help answer accurately."
        )}, {"role": "user", "content": prompt}],
    )
    raw_output = resp.choices[0].message.content
    print(f"Raw: {raw_output}")
    # Execute any inline API calls and substitute their results
    resolved = execute_inline_calls(raw_output)
    print(f"Resolved: {resolved}")
    return resolved

toolformer_generate("What is 347 * 23, and when was the latest Python released?")
</code>
===== Tools Incorporated =====
Toolformer demonstrated integration with five types of tools:
* **Calculator:** Arithmetic operations for precise mathematical computation
* **Q&A System:** Question-answering module for factual knowledge retrieval
* **Search Engine:** Web search for current information (Wikipedia-based)
* **Translation System:** Machine translation between languages
* **Calendar:** Date and time lookups
API calls are represented as special tokens within the text sequence: ''[Calculator(3+5) -> 8]'', allowing the model to seamlessly interleave tool use with generation.
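Assuming the bracketed token format above, completed calls can be recovered from generated text with a small regex (a sketch; the pattern and function name are illustrative):

```python
import re

# Matches completed Toolformer-style tokens: [Tool(args) -> result]
TOKEN = re.compile(r"\[(\w+)\((.*?)\)\s*->\s*([^\]]*)\]")

def parse_calls(text: str):
    """Extract (tool, args, result) triples from generated text."""
    return [m.groups() for m in TOKEN.finditer(text)]

print(parse_calls("The sum is [Calculator(3+5) -> 8], exactly."))
# [('Calculator', '3+5', '8')]
```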
===== Key Results =====
* Substantially improved zero-shot performance across downstream tasks(([[https://arxiv.org/abs/2302.04761|Schick et al. arXiv:2302.04761]]))
* Often competitive with much larger models (e.g., GPT-3 175B) while using a 6.7B parameter model
* **Did not sacrifice** core language modeling abilities: the model retains general text generation quality
* Demonstrated that tool augmentation is a viable alternative to simply scaling model size
===== Influence on Later Work =====
Toolformer established several principles that shaped subsequent tool-augmented AI research:
* **Self-supervised tool learning** is viable: models can discover when tools help without explicit supervision
* **Inline API calls** as a generation pattern influenced how modern models represent tool use
* **Perplexity-based filtering** showed how to automatically curate tool-use training data
* Anticipated the design of [[function_calling|OpenAI Function Calling]], [[anthropic_context_protocol|MCP]], and other provider tool-use APIs
* The "Augmented Language Models" survey(([[https://arxiv.org/abs/2302.07842|Mialon, G. et al. "Augmented Language Models: a Survey." arXiv:2302.07842, 2023.]])) from the same [[meta|Meta]] AI team contextualized Toolformer within the broader tool-augmented language model (TALM) paradigm
===== Limitations =====
* Training requires executing API calls at scale, which is computationally expensive
* Limited to tools with simple text-in/text-out interfaces
* The perplexity filter may miss tools useful for tasks not well-represented in training data
* No support for multi-turn tool interactions or complex tool chains
===== See Also =====
* [[tool_augmented_language_models|Tool-Augmented Language Models]]
* [[toolllm|ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs]]
* [[tool_search_mechanism|Tool Search Mechanism]]
* [[tool_use|Tool Use for LLM Agents]]
* [[llm_tool_makers|LATM: Large Language Models as Tool Makers]]
===== References =====