LATM: Large Language Models as Tool Makers

LATM (Large Language Models as Tool Makers) is a cost-efficient agent framework introduced by Cai et al. (2023) that implements a division of labor: a strong LLM (GPT-4) creates reusable Python tools, while a weaker LLM (GPT-3.5) uses them for inference.1) The paper (271 citations) demonstrates that this tool-making/tool-using paradigm achieves near-GPT-4 performance at a fraction of the cost by amortizing expensive tool creation across many lightweight invocations.

arXiv:2305.17126

Tool-Making / Tool-Using Paradigm

LATM draws an analogy to human technological evolution: sophisticated tools are created once by skilled craftspeople, then used repeatedly by the general population. The framework separates the cognitive burden of tool creation from tool application.2)

The cost model motivates the approach. For $n$ problem instances, direct GPT-4 inference costs:

$$C_{\text{direct}} = n \cdot c_{\text{GPT-4}}$$

With LATM, the amortized cost becomes:

$$C_{\text{LATM}} = k \cdot c_{\text{GPT-4}} + n \cdot c_{\text{GPT-3.5}}$$

where $k$ is the small number of demonstrations used for tool creation. Since $c_{\text{GPT-3.5}} \ll c_{\text{GPT-4}}$ and $k \ll n$, LATM achieves substantial savings.
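The arithmetic is easy to check with illustrative numbers. The per-call prices below are assumptions chosen only to make the comparison concrete, not the paper's actual API pricing:

```python
# Amortized-cost sketch with illustrative (assumed) per-call prices.
c_gpt4, c_gpt35 = 0.06, 0.002   # hypothetical dollars per call
n, k = 1000, 3                  # task instances; demonstrations for tool making

c_direct = n * c_gpt4             # every instance hits GPT-4
c_latm = k * c_gpt4 + n * c_gpt35 # one-time tool making + cheap tool use

savings = 1 - c_latm / c_direct   # fraction of cost saved by LATM
```

With these numbers, direct GPT-4 inference costs $60.00 while LATM costs $2.18, a saving of over 96%; the one-time tool-making cost becomes negligible as $n$ grows.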

Three-Phase Tool Making

Phase 1: Tool Proposing

GPT-4 receives $k$ task demonstrations (typically 3) and generates a generic, reusable Python function that solves the demonstrated problem pattern.
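As a concrete illustration, a proposed tool for a word-sorting task (one of the BigBench-style tasks LATM was evaluated on) might look like the following. This is a hypothetical example of what the tool maker could emit, not code taken from the paper:

```python
def sort_words(sentence: str) -> str:
    """Sort the words of a sentence alphabetically (case-insensitive).

    Hypothetical example of a generic, reusable function the tool maker
    might propose from a few word-sorting demonstrations.
    """
    words = sentence.split()
    return " ".join(sorted(words, key=str.lower))
```

The key property is generality: the function solves the demonstrated problem *pattern*, so it works on any new instance of the task without further LLM calls.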

Phase 2: Tool Verification

The proposed tool is tested on held-out validation examples. If it fails, GPT-4 iterates on the implementation until correctness is achieved.
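A minimal verification loop can be sketched as follows. The function name `solve` and the use of bare `exec()` are assumptions for illustration; a production system would sandbox the generated code:

```python
def verify_tool(tool_code: str, validation_set: list) -> bool:
    """Execute proposed tool code and check it on held-out (input, expected) pairs.

    Assumes the generated code defines a function named `solve`.
    Returns False on any exception or mismatch, signaling the tool
    maker to iterate on the implementation.
    """
    namespace = {}
    try:
        exec(tool_code, namespace)  # NOTE: sandbox this in a real system
        solve = namespace["solve"]
        return all(solve(x) == y for x, y in validation_set)
    except Exception:
        return False
```

A failed check feeds the error back to GPT-4 as part of the next proposal prompt.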

Phase 3: Tool Wrapping

The verified function is packaged with an API-friendly interface (docstring, type hints, usage examples) and cached in a tool repository for future use.
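The wrapping step can be sketched as bundling the verified code with a usage demonstration, so the tool user only needs to fill in arguments rather than understand the implementation. The exact wrapper format here is an assumption, not the paper's:

```python
def wrap_as_api(tool_code: str, usage_example: str) -> str:
    """Package verified tool code with a usage demonstration.

    Hypothetical wrapper: the returned string is what gets cached in the
    tool repository and later shown to the tool-user LLM.
    """
    return (
        "# Verified tool -- call it as shown in the usage example below.\n"
        f"{tool_code}\n\n"
        f"# Usage example:\n# {usage_example}\n"
    )
```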

Dispatcher

When a new task arrives, a lightweight dispatcher module determines whether an existing cached tool applies or if a new tool must be created. This routes tasks to the appropriate cached tool, minimizing redundant tool creation.

System Architecture

graph TD
    A[New Task] --> B{Dispatcher}
    B -- Existing Tool Found --> C[Tool Repository Cache]
    B -- No Tool Available --> D[Tool Maker: GPT-4]
    D --> E[Phase 1: Propose Python Function]
    E --> F[Phase 2: Verify on Held-out Examples]
    F --> G{Passes Validation?}
    G -- No --> E
    G -- Yes --> H[Phase 3: Wrap as API]
    H --> C
    C --> I[Tool User: GPT-3.5]
    I --> J[Invoke Cached Tool via API]
    J --> K[Task Solution]
    L[k=3 Demonstrations] --> D

Code Example3)

# Simplified LATM framework
class LATM:
    def __init__(self, tool_maker_llm, tool_user_llm):
        self.maker = tool_maker_llm    # GPT-4
        self.user = tool_user_llm      # GPT-3.5
        self.tool_cache = {}
 
    def make_tool(self, demonstrations, task_type):
        # Hold out the last demonstration for verification, so the tool
        # is not validated on examples it was prompted with
        train_demos, validation_set = demonstrations[:-1], demonstrations[-1:]

        # Phase 1: Propose
        prompt = (
            f"Given these examples:\n{train_demos}\n"
            "Write a general Python function that solves this type of problem."
        )
        tool_code = self.maker.generate(prompt)
 
        # Phase 2: Verify on held-out examples; iterate until the tool passes
        if not self._validate(tool_code, validation_set):
            tool_code = self._iterate_fix(tool_code, validation_set)
 
        # Phase 3: Wrap and cache
        wrapped_tool = self._wrap_as_api(tool_code, task_type)
        self.tool_cache[task_type] = wrapped_tool
        return wrapped_tool
 
    def dispatch(self, task):
        task_type = self.user.classify_task(task)
        if task_type not in self.tool_cache:
            demos = self._get_demonstrations(task_type)
            self.make_tool(demos, task_type)
        return self.tool_cache[task_type]
 
    def solve(self, task):
        tool = self.dispatch(task)
        prompt = f"Use this tool to solve the problem:\n{tool.api_doc}\n\nTask: {task}"
        return self.user.generate(prompt)

Key Results

See Also

References