====== Tool-Augmented Language Models ======

**Tool-Augmented Language Models (TALMs)** extend the capabilities of large language models by enabling them to invoke external tools such as search engines, calculators, code interpreters, and APIs during the generation process(([[https://arxiv.org/abs/2302.07842|Mialon et al. "Augmented Language Models: a Survey." arXiv:2302.07842, 2023.]])). Rather than relying solely on parametric knowledge stored in [[modelweights|model weights]], TALMs learn when and how to delegate sub-tasks to specialized tools, significantly improving factual accuracy, computational reliability, and access to current information. The TALM paradigm is one of the most important developments in making LLMs practical for real-world applications.

===== Core Concept =====

Traditional LLMs generate text purely from learned parameters, which leads to well-known limitations:

  * **Stale knowledge:** Training data has a cutoff date
  * **Hallucination:** Models confabulate facts with high confidence
  * **Poor computation:** Arithmetic, logic, and precise reasoning are unreliable
  * **No world interaction:** Models cannot take actions or access external systems

TALMs address these limitations by learning to generate **tool calls** (API invocations, function calls, code execution) as part of their output, then incorporating tool results back into generation. The key challenge is teaching models //when// tool use is beneficial, //which// tool to use, and //how// to formulate correct calls.

===== Key Papers and Approaches =====

=== Toolformer (Meta AI, 2023) ===

[[toolformer|Toolformer]](([[https://arxiv.org/abs/2302.04761|Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, 2023.]])) pioneered **self-supervised tool learning**: the model samples potential API calls, executes them, and retains only those that reduce perplexity on subsequent tokens.
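This perplexity-based filtering can be illustrated with a toy scorer. The loss function below is a hypothetical stand-in for an LM's negative log-likelihood, not the paper's implementation; the point is the decision rule: keep a sampled call only if conditioning on its result makes the following tokens easier to predict.

<code python>
# Toy sketch of Toolformer-style self-supervised filtering. The scorer is an
# illustrative assumption standing in for an LM's loss over the continuation.

def continuation_loss(context: str, continuation: str) -> float:
    """Stand-in for negative log-likelihood: fraction of continuation
    tokens that do not already appear in the context."""
    tokens = continuation.split()
    misses = sum(1 for tok in tokens if tok not in context)
    return misses / max(len(tokens), 1)

def keep_api_call(prefix: str, api_result: str, continuation: str,
                  threshold: float = 0.1) -> bool:
    """Retain the sampled call only if it reduces loss on subsequent tokens."""
    loss_without = continuation_loss(prefix, continuation)
    augmented = f"{prefix} [Calculator -> {api_result}]"
    loss_with = continuation_loss(augmented, continuation)
    return loss_without - loss_with > threshold

# A result that makes the continuation predictable is kept:
print(keep_api_call("The total is", "1400", "1400 dollars"))      # True
# A result that does not help predict the continuation is discarded:
print(keep_api_call("The capital is", "1400", "Paris of course")) # False
</code>

With a real LM, the same comparison is made between the model's loss on the tokens after the call site with and without the executed API result inserted into the context.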
This eliminated the need for human-annotated tool-use data. Integrated tools included a calculator, a question-answering system, search, machine translation, and a calendar.

=== Augmented Language Models Survey (Meta AI, 2023) ===

[[https://arxiv.org/abs/2302.07842|Mialon et al., 2023]] provided the foundational survey categorizing augmented LMs into those enhanced with:

  * **Reasoning:** Decomposing complex tasks into subtasks (chain-of-thought, scratchpads)
  * **Tool use:** Calling external modules (code interpreters, search engines, calculators)
  * **Acting:** Taking actions in environments (web browsing, API calls, robotics)

The survey coined the term **Augmented Language Models (ALMs)** and established the theoretical framework connecting reasoning, tool use, and action.

=== MRKL Systems (AI21 Labs, 2022) ===

[[mrkl_systems|MRKL]](([[https://arxiv.org/abs/2205.00445|Karpas, E. et al. "MRKL Systems." arXiv:2205.00445, 2022.]])) formalized the **neuro-symbolic routing** approach: an LLM router dispatches sub-tasks to neural and symbolic expert modules.

=== HuggingGPT / JARVIS (2023) ===

[[hugginggpt|HuggingGPT]](([[https://arxiv.org/abs/2303.17580|Shen, Y. et al. "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face." arXiv:2303.17580, 2023.]])) extended tool augmentation to orchestrating hundreds of specialized AI models via a four-stage planning pipeline.

=== Tool Learning Survey (Qu et al., 2025) ===

The comprehensive survey "Tool Learning with Large Language Models: A Survey"(([[https://arxiv.org/abs/2407.03364|Qu, C. et al. "Tool Learning with Large Language Models: A Survey."
arXiv:2407.03364, 2024.]])) synthesized the field into a unified four-stage framework:

  - **Task Planning:** Decompose complex requests into subtasks
  - **Tool Selection:** Choose appropriate tools from the available options
  - **Task Execution:** Invoke tools with correct parameters
  - **Response Generation:** Integrate tool outputs into coherent responses

===== Learning Methodologies =====

There are three main approaches for teaching models to use tools:

**Tuning-Free Methods:** Prompt-based approaches where tool use is guided by instructions and few-shot examples without modifying the model. Used by [[react_framework|ReAct]](([[https://arxiv.org/abs/2210.03629|Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629, 2022.]])) and chain-of-thought prompting with tools.

**Supervised Fine-Tuning:** Training on datasets of (input, tool call, output) examples. Used by [[toolformer|Toolformer]], Lynx (from [[api_bank_benchmark|API-Bank]]), and Gorilla(([[https://arxiv.org/abs/2305.15334|Patil, S. et al. "Gorilla: Large Language Model Connected with Massive APIs." arXiv:2305.15334, 2023.]])).

**[[reinforcement_learning|Reinforcement Learning]]:** Training models to optimize tool-use policies through reward signals. This enables learning from execution feedback and optimizing for task completion rather than merely matching training examples.
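The four-stage framework can be sketched as a minimal pipeline. The tool registry, keyword-based selection, naive " and "-splitting planner, and stubbed execution below are illustrative assumptions, not a real system:

<code python>
# Minimal sketch of the four-stage tool-learning pipeline. All tools,
# keywords, and heuristics here are hypothetical, for illustration only.
from typing import List, Optional

TOOL_REGISTRY = {
    "calculator": {"keywords": {"sum", "multiply", "compute"}},
    "search": {"keywords": {"latest", "current", "who"}},
}

def plan(request: str) -> List[str]:
    """Stage 1, task planning: naive decomposition on ' and '."""
    return [part.strip() for part in request.split(" and ")]

def select_tool(subtask: str) -> Optional[str]:
    """Stage 2, tool selection: first tool whose keywords overlap the subtask."""
    words = set(subtask.lower().split())
    for name, spec in TOOL_REGISTRY.items():
        if spec["keywords"] & words:
            return name
    return None

def execute(tool: str, subtask: str) -> str:
    """Stage 3, task execution: stubbed tool invocation."""
    return f"[{tool} output for '{subtask}']"

def respond(request: str, results: List[str]) -> str:
    """Stage 4, response generation: integrate tool outputs into one answer."""
    return f"Answer to '{request}' based on: " + "; ".join(results)

request = "compute 2+2 and who is the current UN secretary-general"
results = [execute(tool, sub) for sub in plan(request)
           if (tool := select_tool(sub)) is not None]
print(respond(request, results))
</code>

In a real TALM each stage is performed by the LLM itself (planning and selection via generation, execution via parsed tool calls), but the control flow is the same.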
===== Modern Implementation =====

The TALM concept is realized in production through several mechanisms:

  * **[[function_calling|Function Calling]]:** Provider-native structured tool invocation via JSON Schema
  * **[[anthropic_context_protocol|Model Context Protocol (MCP)]]:** Universal protocol connecting models to tool servers
  * **[[tool_integration_patterns|Plugin Architectures]]:** [[modular|Modular]] tool loading and discovery systems
  * **Agent Frameworks:** [[langchain|LangChain]], [[llamaindex|LlamaIndex]], and [[crewai|CrewAI]] provide tool-use abstractions

===== Code Example: Automatic Tool Call Decision =====

The example below uses the [[openai|OpenAI]] chat completions API with ''tool_choice="auto"'', so the model itself decides whether a tool call would improve its answer:

<code python>
import json

from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a mathematical expression",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Math expression to evaluate",
                    }
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"],
            },
        },
    },
]


def execute_tool(name: str, args: dict) -> str:
    if name == "calculator":
        try:
            # Demo only: eval() is unsafe for untrusted expressions.
            return str(eval(args["expression"]))
        except Exception as e:
            return f"Error: {e}"
    if name == "web_search":
        return f"[Search results for '{args['query']}': Top result about {args['query']}]"
    return f"Unknown tool: {name}"


def talm_query(user_input: str) -> str:
    system_msg = {"role": "system", "content": "Use tools when they would improve accuracy."}
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[system_msg, {"role": "user", "content": user_input}],
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = response.choices[0].message

    # No tool calls: the model answered from parametric knowledge alone.
    if not msg.tool_calls:
        return msg.content

    # Execute each requested tool and feed the results back to the model.
    messages = [system_msg, {"role": "user", "content": user_input}, msg]
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        result = execute_tool(tc.function.name, args)
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})

    final = client.chat.completions.create(model="gpt-4o", messages=messages)
    return final.choices[0].message.content
</code>

===== Evaluation =====

Tool-augmented capabilities are measured by:

  * **[[api_bank_benchmark|API-Bank]]**(([[https://arxiv.org/abs/2304.08244|Li, M. et al. "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." arXiv:2304.08244, 2023.]])): 73 APIs, 314 dialogues, three-level evaluation
  * **ToolBench**(([[https://arxiv.org/abs/2307.16789|Qin, Y. et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv:2307.16789, 2023.]])): Large-scale real-world API evaluation (16,000+ APIs)
  * **MINT:** Multi-turn interactive tool-use scenarios
  * **T-Eval:** Fine-grained assessment of tool selection, parameter generation, and error handling

===== Related Pages =====

  * [[toolformer|Toolformer]]
  * [[mrkl_systems|MRKL Systems]]
  * [[hugginggpt|HuggingGPT]]
  * [[function_calling|OpenAI Function Calling]]
  * [[tool_integration_patterns|Tool Integration Patterns]]
  * [[tool_utilization|Tool Utilization]]
  * [[api_bank_benchmark|API-Bank Benchmark]]
  * [[react_framework|ReAct Prompting]]

===== See Also =====

  * [[toolllm|ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs]]
  * [[toolformer|Toolformer]]
  * [[llm_tool_makers|LATM: Large Language Models as Tool Makers]]
  * [[forge|Forge]]
  * [[tool_search_mechanism|Tool Search Mechanism]]

===== References =====