Tool-Augmented Language Models

Tool-Augmented Language Models (TALMs) extend the capabilities of large language models by enabling them to invoke external tools such as search engines, calculators, code interpreters, and APIs during the generation process¹⁾. Rather than relying solely on parametric knowledge stored in model weights, TALMs learn when and how to delegate sub-tasks to specialized tools, significantly improving factual accuracy, computational reliability, and access to current information. The TALM paradigm represents one of the most important developments in making LLMs practical for real-world applications.

Core Concept

Traditional LLMs generate text purely from learned parameters, which leads to well-known limitations:

Stale knowledge: Training data has a cutoff date
Hallucination: Models confabulate facts with high confidence
Poor computation: Arithmetic, logic, and precise reasoning are unreliable
No world interaction: Models cannot take actions or access external systems

TALMs address these by learning to generate tool calls (API invocations, function calls, code execution) as part of their output, then incorporating tool results back into generation. The key challenge is teaching models when tool use is beneficial, which tool to use, and how to formulate correct calls.

Key Papers and Approaches

Toolformer (Meta AI, 2023)

Toolformer²⁾ pioneered self-supervised tool learning: the model samples potential API calls, executes them, and retains only those that reduce perplexity on subsequent tokens. This eliminated the need for human-annotated tool-use data. Integrated tools: calculator, Q&A, search, translation, calendar.

Augmented Language Models Survey (Meta AI, 2023)

Mialon et al., 2023 provided the foundational survey categorizing augmented LMs into those enhanced with:

Reasoning: Decomposing complex tasks into subtasks (chain-of-thought, scratchpads)
Tool use: Calling external modules (code interpreters, search engines, calculators)
Acting: Taking actions in environments (web browsing, API calls, robotics)

The survey coined the term Augmented Language Models (ALMs) and established the theoretical framework connecting reasoning, tool use, and action.

MRKL Systems (AI21 Labs, 2022)

MRKL ³⁾ formalized the neuro-symbolic routing approach: an LLM router dispatches sub-tasks to neural and symbolic expert modules.

HuggingGPT / JARVIS (2023)

HuggingGPT⁴⁾ extended tool augmentation to orchestrating hundreds of specialized AI models via a four-stage planning pipeline.

Tool Learning Survey (Qu et al., 2025)

The comprehensive “Tool Learning with Large Language Models: A Survey”⁵⁾ synthesized the field into a unified four-stage framework:

Task Planning: Decompose complex requests into subtasks
Tool Selection: Choose appropriate tools from available options
Task Execution: Invoke tools with correct parameters
Response Generation: Integrate tool outputs into coherent responses

Learning Methodologies

Three main approaches for teaching models to use tools:

Tuning-Free Methods: Prompt-based approaches where tool use is guided by instructions and few-shot examples without model modification. Used by ReAct⁶⁾ and chain-of-thought with tools.

Supervised Fine-Tuning: Training on datasets of (input, tool_call, output) examples. Used by Toolformer, Lynx (from API-Bank), and Gorilla⁷⁾.

Reinforcement Learning: Training models to optimize tool-use policies through reward signals. Enables learning from execution feedback and optimizing for task completion rather than just matching training examples.

Modern Implementation

The TALM concept is realized in production through several mechanisms:

Function Calling: Provider-native structured tool invocation via JSON Schema
Model Context Protocol (MCP): Universal protocol connecting models to tool servers
Plugin Architectures: Modular tool loading and discovery systems
Agent Frameworks: LangChain, LlamaIndex, CrewAI providing tool-use abstractions

Code Example: Self-Supervised Tool Call Decision

from [[openai|openai]] import [[openai|OpenAI]]
import json
 
client = [[openai|OpenAI]]()
 
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a mathematical expression",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string", "description": "Math expression to evaluate"}},
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "Search query"}},
                "required": ["query"],
            },
        },
    },
]
 
 
def execute_tool(name: str, args: dict) -> str:
    if name == "calculator":
        try:
            return str(eval(args["expression"]))
        except Exception as e:
            return f"Error: {e}"
    if name == "web_search":
        return f"[Search results for '{args['query']}': Top result about {args['query']}]"
    return f"Unknown tool: {name}"
 
 
def talm_query(user_input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Use tools when they would improve accuracy."},
            {"role": "user", "content": user_input},
        ],
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = response.choices[0].message
 
    if not msg.tool_calls:
        return msg.content
 
    messages = [
        {"role": "system", "content": "Use tools when they would improve accuracy."},
        {"role": "user", "content": user_input},
        msg,
    ]
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        result = execute_tool(tc.function.name, args)
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
 
    final = client.chat.completions.create(model="gpt-4o", messages=messages)
    return final.choices[0].message.content

Evaluation

Tool-augmented capabilities are measured by:

API-Bank⁸⁾:* 73 APIs, 314 dialogues, three-level evaluation
ToolBench⁹⁾:* Large-scale real-world API evaluation (16,000+ APIs)
MINT: Multi-turn interactive tool-use scenarios
T-Eval: Fine-grained assessment of selection, parameter generation, and error handling

Related Pages

References

¹⁾

Mialon et al. "Augmented Language Models: a Survey." arXiv:2302.07842, 2023.

²⁾

Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, 2023.

³⁾

Karpas, E. et al. "MRKL Systems" arXiv:2205.00445, 2022.

⁴⁾

Shen, Y. et al. "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face." arXiv:2303.17580, 2023.

⁵⁾

Qu, C. et al. "Tool Learning with Large Language Models: A Survey." arXiv:2407.03364, 2024.

⁶⁾

Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629, 2022.

⁷⁾

Patil, S. et al. "Gorilla: Large Language Model Connected with Massive APIs." arXiv:2305.15334, 2023.

⁸⁾

Li, M. et al. "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." arXiv:2304.08244, 2023.

⁹⁾

Qin, Y. et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv:2307.16789, 2023.

AI Agent Knowledge Base

Sidebar

Table of Contents

Tool-Augmented Language Models

Core Concept

Key Papers and Approaches

Toolformer (Meta AI, 2023)

Augmented Language Models Survey (Meta AI, 2023)

MRKL Systems (AI21 Labs, 2022)

HuggingGPT / JARVIS (2023)

Tool Learning Survey (Qu et al., 2025)

Learning Methodologies

Modern Implementation

Code Example: Self-Supervised Tool Call Decision

Evaluation

Related Pages

See Also

References

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Tool-Augmented Language Models

Core Concept

Key Papers and Approaches

Toolformer (Meta AI, 2023)

Augmented Language Models Survey (Meta AI, 2023)

MRKL Systems (AI21 Labs, 2022)

HuggingGPT / JARVIS (2023)

Tool Learning Survey (Qu et al., 2025)

Learning Methodologies

Modern Implementation

Code Example: Self-Supervised Tool Call Decision

Evaluation

Related Pages

See Also

References

Page Tools