====== LATM: Large Language Models as Tool Makers ======

**LATM** (Large Language Models as Tool Makers) is a cost-efficient agent framework introduced by Cai et al. (2023) that implements a division of labor: a **strong LLM (GPT-4) creates reusable Python tools**, while a **weaker LLM (GPT-3.5) uses them** for inference.(([[https://arxiv.org/abs/2305.17126|Cai et al. "Large Language Models as Tool Makers" (2023)]])) A widely cited work (**271 citations**), it demonstrates that this tool-making/tool-using paradigm achieves near-GPT-4 performance at a fraction of the cost by amortizing expensive tool creation across many lightweight invocations. [[https://arxiv.org/abs/2305.17126|arXiv:2305.17126]]

===== Tool-Making / Tool-Using Paradigm =====

LATM draws an analogy to human technological evolution: sophisticated tools are created once by skilled craftspeople, then used repeatedly by the general population. The framework separates the cognitive burden of tool creation from the routine work of tool application.(([[https://arxiv.org/abs/2302.04761|Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023)]]))

The cost model motivates the approach. For $n$ problem instances, direct GPT-4 inference costs:

$$C_{\text{direct}} = n \cdot c_{\text{GPT-4}}$$

With LATM, the amortized cost becomes:

$$C_{\text{LATM}} = k \cdot c_{\text{GPT-4}} + n \cdot c_{\text{GPT-3.5}}$$

where $k$ is the small number of demonstrations used for tool creation. Since $c_{\text{GPT-3.5}} \ll c_{\text{GPT-4}}$ and $k \ll n$, LATM achieves substantial savings as $n$ grows.

===== Three-Phase Tool Making =====

==== Phase 1: Tool Proposing ====

GPT-4 receives $k$ task demonstrations (typically 3) and generates a generic, reusable Python function that solves the demonstrated problem pattern.

==== Phase 2: Tool Verification ====

The proposed tool is tested on held-out validation examples. If it fails, GPT-4 iterates on the implementation until the tool passes verification.
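The Phase 2 verify-then-iterate loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the retry limit, the convention that the proposed tool defines a function named ''solve'', and the ''llm_fix'' repair callback are all assumptions made for the sketch.

<code python>
# Hedged sketch of the Phase 2 verification loop. `llm_fix` stands in for
# a call back to the tool-maker LLM asking it to repair a failing tool;
# the entry-point name `solve` is a convention assumed here, not LATM's.

def verify_tool(tool_code, validation_set, llm_fix, max_iters=3):
    """Run `tool_code` on held-out (input, expected) pairs; on failure,
    ask the tool maker to fix it, up to `max_iters` attempts."""
    for _ in range(max_iters):
        namespace = {}
        try:
            exec(tool_code, namespace)       # define the proposed function
            tool = namespace["solve"]        # assumed entry-point name
            if all(tool(x) == y for x, y in validation_set):
                return tool_code             # verified: safe to wrap and cache
        except Exception:
            pass                             # execution error counts as failure
        tool_code = llm_fix(tool_code, validation_set)  # Phase 2 iteration
    raise RuntimeError("tool failed verification after retries")

# Example with a (pretend) generated tool and a no-op fixer.
generated = "def solve(x):\n    return x * 2\n"
verified = verify_tool(generated, [(1, 2), (3, 6)], llm_fix=lambda c, v: c)
</code>

Only a tool that reproduces every held-out answer is returned for wrapping; anything else is sent back to the tool maker, which is what lets the weaker tool user trust the cached tools blindly.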
==== Phase 3: Tool Wrapping ====

The verified function is packaged with an API-friendly interface (docstring, type hints, usage examples) and cached in a tool repository for future use.

===== Dispatcher =====

When a new task arrives, a lightweight **dispatcher** module determines whether an existing cached tool applies or whether a new tool must be created, routing each task to the appropriate cached tool and minimizing redundant tool creation.

===== System Architecture =====

<code>
graph TD
    A[New Task] --> B{Dispatcher}
    B -- Existing Tool Found --> C[Tool Repository Cache]
    B -- No Tool Available --> D[Tool Maker: GPT-4]
    D --> E[Phase 1: Propose Python Function]
    E --> F[Phase 2: Verify on Held-out Examples]
    F --> G{Passes Validation?}
    G -- No --> E
    G -- Yes --> H[Phase 3: Wrap as API]
    H --> C
    C --> I[Tool User: GPT-3.5]
    I --> J[Invoke Cached Tool via API]
    J --> K[Task Solution]
    L[k=3 Demonstrations] --> D
</code>

===== Code Example =====

A simplified sketch of the framework:(([[https://github.com/ctlllll/LLM-ToolMaker|LATM GitHub Repository]]))

<code python>
# Simplified LATM framework (helper methods such as _validate,
# _iterate_fix, _wrap_as_api, and _get_demonstrations are elided)
class LATM:
    def __init__(self, tool_maker_llm, tool_user_llm):
        self.maker = tool_maker_llm   # strong model, e.g. GPT-4
        self.user = tool_user_llm     # weak model, e.g. GPT-3.5
        self.tool_cache = {}

    def make_tool(self, demonstrations, task_type):
        # Phase 1: Propose
        prompt = (
            f"Given these examples:\n{demonstrations}\n"
            "Write a general Python function that solves this type of problem."
        )
        tool_code = self.maker.generate(prompt)

        # Phase 2: Verify on held-out examples
        validation_set = demonstrations[-1:]
        if not self._validate(tool_code, validation_set):
            tool_code = self._iterate_fix(tool_code, validation_set)

        # Phase 3: Wrap and cache
        wrapped_tool = self._wrap_as_api(tool_code, task_type)
        self.tool_cache[task_type] = wrapped_tool
        return wrapped_tool

    def dispatch(self, task):
        task_type = self.user.classify_task(task)
        if task_type not in self.tool_cache:
            demos = self._get_demonstrations(task_type)
            self.make_tool(demos, task_type)
        return self.tool_cache[task_type]

    def solve(self, task):
        tool = self.dispatch(task)
        prompt = (
            f"Use this tool to solve the problem:\n{tool.api_doc}\n\n"
            f"Task: {task}"
        )
        return self.user.generate(prompt)
</code>

===== Key Results =====

  * GPT-4 as tool maker + GPT-3.5 as tool user **matches GPT-4 end-to-end performance**(([[https://arxiv.org/abs/2305.17126|Cai et al. "Large Language Models as Tool Makers" (2023)]]))
  * Significant cost reduction: the one-time tool-making cost is amortized across all task instances
  * Evaluated on **Big-Bench tasks**, including logical deduction (e.g., ordering objects from constraints)
  * Tools generalize well across problem instances within the same task family
  * Tool verification ensures correctness before deployment to the weaker model
  * The paradigm extends to any strong/weak model pair(([[https://arxiv.org/abs/2307.16789|Qin et al. "ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs" (2023)]]))

===== See Also =====

  * [[toolllm|ToolLLM: Mastering 16,000+ Real-World APIs]]
  * [[chemcrow|ChemCrow: LLM Agent with Chemistry Tools]]
  * [[reasoning_via_planning|RAP: Reasoning via Planning]]

===== References =====