====== LATM: Large Language Models as Tool Makers ======
**LATM** (Large Language Models as Tool Makers) is a cost-efficient agent framework introduced by Cai et al. (2023) that implements a division of labor: a **strong LLM (GPT-4) creates reusable Python tools**, while a **weaker LLM (GPT-3.5) uses them** at inference time.(([[https://arxiv.org/abs/2305.17126|Cai et al. "Large Language Models as Tool Makers" (2023)]])) The paper demonstrates that this tool-making/tool-using paradigm achieves near-GPT-4 performance at a fraction of the cost by amortizing the expensive tool-creation step across many lightweight invocations.
[[https://arxiv.org/abs/2305.17126|arXiv:2305.17126]]
===== Tool-Making / Tool-Using Paradigm =====
LATM draws an analogy to human technological evolution: sophisticated tools are created once by skilled craftspeople, then used repeatedly by the general population. The framework separates the cognitive burden of tool creation from tool application.(([[https://arxiv.org/abs/2302.04761|Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023)]]))
The cost model motivates the approach. For $n$ problem instances, direct GPT-4 inference costs:
$$C_{\text{direct}} = n \cdot c_{\text{GPT-4}}$$
With LATM, the amortized cost becomes:
$$C_{\text{LATM}} = k \cdot c_{\text{GPT-4}} + n \cdot c_{\text{GPT-3.5}}$$
where $k$ is the small number of demonstrations used for tool creation. Since $c_{\text{GPT-3.5}} \ll c_{\text{GPT-4}}$ and $k \ll n$, LATM achieves substantial savings.
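The arithmetic can be made concrete with hypothetical per-call prices (illustrative numbers only, not actual API pricing):

<code python>
# Hypothetical per-call costs; the ratio, not the absolute values, matters.
c_gpt4 = 0.06    # cost per GPT-4 call (assumed)
c_gpt35 = 0.002  # cost per GPT-3.5 call (assumed)

n = 1000  # number of problem instances
k = 3     # demonstrations consumed during tool creation

cost_direct = n * c_gpt4              # every instance solved by GPT-4
cost_latm = k * c_gpt4 + n * c_gpt35  # tool made once, then reused cheaply

print(f"direct: ${cost_direct:.2f}")                      # $60.00
print(f"LATM:   ${cost_latm:.2f}")                        # $2.18
print(f"savings: {1 - cost_latm / cost_direct:.1%}")      # 96.4%
</code>

Because the $k \cdot c_{\text{GPT-4}}$ term is fixed, the savings grow as $n$ increases.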
===== Three-Phase Tool Making =====
==== Phase 1: Tool Proposing ====
GPT-4 receives $k$ task demonstrations (typically 3) and generates a generic, reusable Python function that solves the demonstrated problem pattern.
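A sketch of how such a proposing prompt might be assembled from demonstrations (the format below is illustrative; the paper's exact prompt wording differs):

<code python>
# Three input/output demonstrations of a toy task (hypothetical examples).
demonstrations = [
    ("Sort [3, 1, 2] in ascending order", "[1, 2, 3]"),
    ("Sort [9, 5] in ascending order", "[5, 9]"),
    ("Sort [4, 4, 1] in ascending order", "[1, 4, 4]"),
]

# Render the demonstrations and ask the tool maker for a generic function.
demo_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in demonstrations)
proposing_prompt = (
    "Here are three examples of a task:\n"
    f"{demo_text}\n\n"
    "Write a general Python function that solves any instance of this task. "
    "Return only the function definition."
)
print(proposing_prompt)
</code>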
==== Phase 2: Tool Verification ====
The proposed tool is tested on held-out validation examples. If it fails, GPT-4 iterates on the implementation until correctness is achieved.
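A minimal verification loop can be sketched as executing the candidate code in an isolated namespace and checking it against held-out input/output pairs (real deployments should sandbox untrusted generated code; function name and interface here are assumptions):

<code python>
def verify_tool(tool_code: str, func_name: str, validation_set) -> bool:
    """Return True if the generated function passes all held-out examples."""
    namespace = {}
    try:
        exec(tool_code, namespace)  # define the proposed function
        tool = namespace[func_name]
        return all(tool(x) == expected for x, expected in validation_set)
    except Exception:
        return False  # any crash during definition or execution is a failure

# A candidate tool proposed for a sorting task:
candidate = "def solve(xs):\n    return sorted(xs)"
print(verify_tool(candidate, "solve", [([3, 1, 2], [1, 2, 3])]))  # True
</code>

On failure, the error and failing examples would be fed back to GPT-4 for another proposal round.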
==== Phase 3: Tool Wrapping ====
The verified function is packaged with an API-friendly interface (docstring, type hints, usage examples) and cached in a tool repository for future use.
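The shape of a wrapped tool might look as follows: the verified logic plus type hints, a docstring, and usage examples that serve as in-context documentation for the tool user (illustrative; in LATM the wrapper itself is generated by GPT-4):

<code python>
def sort_numbers(xs: list) -> list:
    """Return the numbers in `xs` sorted in ascending order.

    Usage examples:
        >>> sort_numbers([3, 1, 2])
        [1, 2, 3]
        >>> sort_numbers([9, 5])
        [5, 9]
    """
    return sorted(xs)

print(sort_numbers([3, 1, 2]))  # [1, 2, 3]
</code>

The docstring and examples are what the weaker model actually sees: it only needs to learn to call the API, not to rederive the logic.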
===== Dispatcher =====
When a new task arrives, a lightweight **dispatcher** module first checks whether an existing cached tool applies; only if no tool fits is the tool maker invoked. This routes recurring task types to their cached tools and keeps redundant tool creation to a minimum.
===== System Architecture =====
<code>
graph TD
    A[New Task] --> B{Dispatcher}
    B -- Existing Tool Found --> C[Tool Repository Cache]
    B -- No Tool Available --> D[Tool Maker: GPT-4]
    D --> E[Phase 1: Propose Python Function]
    E --> F[Phase 2: Verify on Held-out Examples]
    F --> G{Passes Validation?}
    G -- No --> E
    G -- Yes --> H[Phase 3: Wrap as API]
    H --> C
    C --> I[Tool User: GPT-3.5]
    I --> J[Invoke Cached Tool via API]
    J --> K[Task Solution]
    L[k=3 Demonstrations] --> D
</code>
===== Code Example =====
A simplified sketch of the framework; the full implementation is available in the official repository.(([[https://github.com/ctlllll/LLM-ToolMaker|LATM GitHub Repository]]))
<code python>
# Simplified LATM framework
class LATM:
    def __init__(self, tool_maker_llm, tool_user_llm):
        self.maker = tool_maker_llm   # GPT-4
        self.user = tool_user_llm     # GPT-3.5
        self.tool_cache = {}

    def make_tool(self, demonstrations, task_type):
        # Phase 1: Propose
        prompt = (
            f"Given these examples:\n{demonstrations}\n"
            "Write a general Python function that solves this type of problem."
        )
        tool_code = self.maker.generate(prompt)

        # Phase 2: Verify on held-out examples
        validation_set = demonstrations[-1:]
        if not self._validate(tool_code, validation_set):
            tool_code = self._iterate_fix(tool_code, validation_set)

        # Phase 3: Wrap and cache
        wrapped_tool = self._wrap_as_api(tool_code, task_type)
        self.tool_cache[task_type] = wrapped_tool
        return wrapped_tool

    def dispatch(self, task):
        task_type = self.user.classify_task(task)
        if task_type not in self.tool_cache:
            demos = self._get_demonstrations(task_type)
            self.make_tool(demos, task_type)
        return self.tool_cache[task_type]

    def solve(self, task):
        tool = self.dispatch(task)
        prompt = f"Use this tool to solve the problem:\n{tool.api_doc}\n\nTask: {task}"
        return self.user.generate(prompt)
</code>
===== Key Results =====
* GPT-4 as tool maker + GPT-3.5 as tool user **matches GPT-4 end-to-end performance**(([[https://arxiv.org/abs/2305.17126|Cai et al. "Large Language Models as Tool Makers" (2023)]]))
* Significant cost reduction: tool-making cost is amortized across all task instances
* Evaluated on **Big-Bench tasks** including logical deduction (e.g., ordering objects from constraints)
* Tools generalize well across problem instances within the same task family
* Tool verification ensures correctness before deployment to the weaker model
* The paradigm extends to any strong/weak model pair(([[https://arxiv.org/abs/2307.16789|Qin et al. "ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs" (2023)]]))
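For the logical-deduction task family mentioned above, a tool of the kind LATM produces might solve object-ordering puzzles by brute-force permutation search (a sketch; the GPT-4-generated tool in the paper may differ in interface and strategy):

<code python>
from itertools import permutations

def deduce_order(objects, constraints):
    """Return an ordering of `objects` (left to right) such that every
    pair (a, b) in `constraints` has a placed left of b, or None if no
    ordering satisfies all constraints."""
    for perm in permutations(objects):
        pos = {obj: i for i, obj in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in constraints):
            return list(perm)
    return None

# "The blue book is left of the red book; the green book is left of the blue book."
print(deduce_order(
    ["red", "blue", "green"],
    [("blue", "red"), ("green", "blue")],
))  # ['green', 'blue', 'red']
</code>

Once verified, this single function answers every instance of the task family, so GPT-3.5 only has to translate each natural-language puzzle into a call.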
===== See Also =====
* [[toolllm|ToolLLM: Mastering 16,000+ Real-World APIs]]
* [[chemcrow|ChemCrow: LLM Agent with Chemistry Tools]]
* [[reasoning_via_planning|RAP: Reasoning via Planning]]
===== References =====