====== LATM: Large Language Models as Tool Makers ======
**LATM** (Large Language Models as Tool Makers) is a cost-efficient agent framework introduced by Cai et al. (2023) that implements a division of labor: a **strong LLM (GPT-4) creates reusable Python tools**, while a **weaker LLM (GPT-3.5) uses them** at inference time.(([[https://arxiv.org/abs/2305.17126|Cai et al. "Large Language Models as Tool Makers" (2023)]])) The paper demonstrates that this tool-making/tool-using paradigm achieves near-GPT-4 performance at a fraction of the cost by amortizing the expensive tool-creation step across many lightweight invocations.
[[https://arxiv.org/abs/2305.17126|arXiv:2305.17126]]
===== Tool-Making / Tool-Using Paradigm =====
LATM draws an analogy to human technological evolution: sophisticated tools are created once by skilled craftspeople, then used repeatedly by the general population. The framework separates the cognitive burden of tool creation from tool application.(([[https://arxiv.org/abs/2302.04761|Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023)]]))
The cost model motivates the approach. For $n$ problem instances, direct GPT-4 inference costs:
$$C_{\text{direct}} = n \cdot c_{\text{GPT-4}}$$
With LATM, the amortized cost becomes:
$$C_{\text{LATM}} = k \cdot c_{\text{GPT-4}} + n \cdot c_{\text{GPT-3.5}}$$
where $k$ is the small number of demonstrations used for tool creation. Since $c_{\text{GPT-3.5}} \ll c_{\text{GPT-4}}$ and $k \ll n$, LATM achieves substantial savings.
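The arithmetic can be made concrete with hypothetical per-call prices (illustrative numbers only, not actual API pricing):

<code python>
# Hypothetical per-call costs; the ratio, not the absolute values, matters.
c_gpt4 = 0.06    # cost per GPT-4 call (assumed)
c_gpt35 = 0.002  # cost per GPT-3.5 call (assumed)

n = 1000  # number of problem instances
k = 3     # demonstrations consumed during tool creation

cost_direct = n * c_gpt4              # every instance solved by GPT-4
cost_latm = k * c_gpt4 + n * c_gpt35  # tool made once, then reused cheaply

print(f"direct: ${cost_direct:.2f}")                      # $60.00
print(f"LATM:   ${cost_latm:.2f}")                        # $2.18
print(f"savings: {1 - cost_latm / cost_direct:.1%}")      # 96.4%
</code>

Because the $k \cdot c_{\text{GPT-4}}$ term is fixed, the savings grow as $n$ increases.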
===== Three-Phase Tool Making =====
==== Phase 1: Tool Proposing ====
GPT-4 receives $k$ task demonstrations (typically 3) and generates a generic, reusable Python function that solves the demonstrated problem pattern.
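A sketch of how such a proposing prompt might be assembled from demonstrations (the format below is illustrative; the paper's exact prompt wording differs):

<code python>
# Three input/output demonstrations of a toy task (hypothetical examples).
demonstrations = [
    ("Sort [3, 1, 2] in ascending order", "[1, 2, 3]"),
    ("Sort [9, 5] in ascending order", "[5, 9]"),
    ("Sort [4, 4, 1] in ascending order", "[1, 4, 4]"),
]

# Render the demonstrations and ask the tool maker for a generic function.
demo_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in demonstrations)
proposing_prompt = (
    "Here are three examples of a task:\n"
    f"{demo_text}\n\n"
    "Write a general Python function that solves any instance of this task. "
    "Return only the function definition."
)
print(proposing_prompt)
</code>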
==== Phase 2: Tool Verification ====
The proposed tool is tested on held-out validation examples. If it fails, GPT-4 iterates on the implementation until correctness is achieved.
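A minimal verification loop can be sketched as executing the candidate code in an isolated namespace and checking it against held-out input/output pairs (real deployments should sandbox untrusted generated code; function name and interface here are assumptions):

<code python>
def verify_tool(tool_code: str, func_name: str, validation_set) -> bool:
    """Return True if the generated function passes all held-out examples."""
    namespace = {}
    try:
        exec(tool_code, namespace)  # define the proposed function
        tool = namespace[func_name]
        return all(tool(x) == expected for x, expected in validation_set)
    except Exception:
        return False  # any crash during definition or execution is a failure

# A candidate tool proposed for a sorting task:
candidate = "def solve(xs):\n    return sorted(xs)"
print(verify_tool(candidate, "solve", [([3, 1, 2], [1, 2, 3])]))  # True
</code>

On failure, the error and failing examples would be fed back to GPT-4 for another proposal round.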
==== Phase 3: Tool Wrapping ====
The verified function is packaged with an API-friendly interface (docstring, type hints, usage examples) and cached in a tool repository for future use.
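The shape of a wrapped tool might look as follows: the verified logic plus type hints, a docstring, and usage examples that serve as in-context documentation for the tool user (illustrative; in LATM the wrapper itself is generated by GPT-4):

<code python>
def sort_numbers(xs: list) -> list:
    """Return the numbers in `xs` sorted in ascending order.

    Usage examples:
        >>> sort_numbers([3, 1, 2])
        [1, 2, 3]
        >>> sort_numbers([9, 5])
        [5, 9]
    """
    return sorted(xs)

print(sort_numbers([3, 1, 2]))  # [1, 2, 3]
</code>

The docstring and examples are what the weaker model actually sees: it only needs to learn to call the API, not to rederive the logic.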
===== Dispatcher =====
When a new task arrives, a lightweight **dispatcher** module first checks whether an existing cached tool applies; only if no tool fits is the tool maker invoked. This routes recurring task types to their cached tools and keeps redundant tool creation to a minimum.
===== System Architecture =====
<code>
graph TD
    A[New Task] --> B{Dispatcher}
    B -- Existing Tool Found --> C[Tool Repository Cache]
    B -- No Tool Available --> D[Tool Maker: GPT-4]
    D --> E[Phase 1: Propose Python Function]
    E --> F[Phase 2: Verify on Held-out Examples]
    F --> G{Passes Validation?}
    G -- No --> E
    G -- Yes --> H[Phase 3: Wrap as API]
    H --> C
    C --> I[Tool User: GPT-3.5]
    I --> J[Invoke Cached Tool via API]
    J --> K[Task Solution]
    L[k=3 Demonstrations] --> D
</code>
===== Code Example =====
A simplified sketch of the framework; the full implementation is available in the official repository.(([[https://github.com/ctlllll/LLM-ToolMaker|LATM GitHub Repository]]))
<code python>
# Simplified LATM framework
class LATM:
    def __init__(self, tool_maker_llm, tool_user_llm):
        self.maker = tool_maker_llm   # GPT-4
        self.user = tool_user_llm     # GPT-3.5
        self.tool_cache = {}

    def make_tool(self, demonstrations, task_type):
        # Phase 1: Propose
        prompt = (
            f"Given these examples:\n{demonstrations}\n"
            "Write a general Python function that solves this type of problem."
        )
        tool_code = self.maker.generate(prompt)

        # Phase 2: Verify on held-out examples
        validation_set = demonstrations[-1:]
        if not self._validate(tool_code, validation_set):
            tool_code = self._iterate_fix(tool_code, validation_set)

        # Phase 3: Wrap and cache
        wrapped_tool = self._wrap_as_api(tool_code, task_type)
        self.tool_cache[task_type] = wrapped_tool
        return wrapped_tool

    def dispatch(self, task):
        task_type = self.user.classify_task(task)
        if task_type not in self.tool_cache:
            demos = self._get_demonstrations(task_type)
            self.make_tool(demos, task_type)
        return self.tool_cache[task_type]

    def solve(self, task):
        tool = self.dispatch(task)
        prompt = f"Use this tool to solve the problem:\n{tool.api_doc}\n\nTask: {task}"
        return self.user.generate(prompt)
</code>
===== Key Results =====
* GPT-4 as tool maker + GPT-3.5 as tool user **matches GPT-4 end-to-end performance**(([[https://arxiv.org/abs/2305.17126|Cai et al. "Large Language Models as Tool Makers" (2023)]]))
* Significant cost reduction: tool-making cost is amortized across all task instances
* Evaluated on **Big-Bench tasks** including logical deduction (e.g., ordering objects from constraints)
* Tools generalize well across problem instances within the same task family
* Tool verification ensures correctness before deployment to the weaker model
* The paradigm extends to any strong/weak model pair(([[https://arxiv.org/abs/2307.16789|Qin et al. "ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs" (2023)]]))
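For the logical-deduction task family mentioned above, a tool of the kind LATM produces might solve object-ordering puzzles by brute-force permutation search (a sketch; the GPT-4-generated tool in the paper may differ in interface and strategy):

<code python>
from itertools import permutations

def deduce_order(objects, constraints):
    """Return an ordering of `objects` (left to right) such that every
    pair (a, b) in `constraints` has a placed left of b, or None if no
    ordering satisfies all constraints."""
    for perm in permutations(objects):
        pos = {obj: i for i, obj in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in constraints):
            return list(perm)
    return None

# "The blue book is left of the red book; the green book is left of the blue book."
print(deduce_order(
    ["red", "blue", "green"],
    [("blue", "red"), ("green", "blue")],
))  # ['green', 'blue', 'red']
</code>

Once verified, this single function answers every instance of the task family, so GPT-3.5 only has to translate each natural-language puzzle into a call.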
===== See Also =====
* [[toolllm|ToolLLM: Mastering 16,000+ Real-World APIs]]
* [[chemcrow|ChemCrow: LLM Agent with Chemistry Tools]]
* [[reasoning_via_planning|RAP: Reasoning via Planning]]
===== References =====