====== API-Bank Benchmark ======

**API-Bank** is a comprehensive benchmark introduced by [[https://arxiv.org/abs/2304.08244|Li et al., 2023]] for evaluating the tool-use capabilities of large language models across a diverse set of real-world APIs. It provides both evaluation data and training data, making it one of the first complete benchmarks for tool-augmented LLM research. API-Bank addresses three key questions: how effective are current LLMs at using tools, how can tool use be improved, and what obstacles remain.((Li, M. et al. "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." [[https://arxiv.org/abs/2304.08244|arXiv:2304.08244]], 2023.))((Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." [[https://arxiv.org/abs/2302.04761|arXiv:2302.04761]], 2023.))((Qin, Y. et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." [[https://arxiv.org/abs/2307.16789|arXiv:2307.16789]], 2023.))

  * **Paper:** [[https://arxiv.org/abs/2304.08244|arXiv:2304.08244]] (April 2023)
  * **Venue:** EMNLP 2023
  * **Authors:** Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li
  * **Institution:** [[alibaba|Alibaba]] Group, HKUST, Peking University

===== Benchmark Components =====

=== Evaluation System ===

  * **73 API tools** spanning domains including search, weather, finance, social media, email, and calendar
  * **314 annotated tool-use dialogues** containing 753 API calls
  * **Runnable evaluation:** the APIs are actually executable, not merely simulated
  * Tests three core capabilities: **planning** (deciding which APIs to call), **retrieval** (finding the right API from the available set), and **calling** (generating correct parameters)

=== Training Dataset ===

  * **1,888 tool-use dialogues** constructed from **2,138 APIs** across **1,000 distinct domains**
  * Designed for fine-tuning models to improve [[tool_utilization|tool utilization]]
  * Used to train **Lynx**, a tool-augmented LLM initialized from Alpaca

===== Evaluation Dimensions =====

API-Bank tests LLMs across three levels of increasing difficulty:

**Level 1 - API Calling:** Can the model correctly call a single API given its documentation? Tests parameter extraction and formatting.

**Level 2 - API Retrieval + Calling:** Can the model identify the correct API from a set and call it properly? Tests tool selection from multiple options.

**Level 3 - Planning + Retrieval + Calling:** Can the model decompose a complex request into multiple API calls, retrieve the right APIs, and execute them in order? Tests multi-step tool-use reasoning.

===== Key Results =====

  * **GPT-4** excels at planning (Level 3) tasks, outperforming other models on complex multi-step tool use
  * **GPT-3.5** shows improved [[tool_utilization|tool utilization]] compared to GPT-3, indicating that scaling helps tool use
  * **Lynx** (fine-tuned from Alpaca) surpasses base Alpaca by **26+ points** on [[tool_utilization|tool utilization]] metrics, approaching GPT-3.5 effectiveness((Li, M. et al. "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." [[https://arxiv.org/abs/2304.08244|arXiv:2304.08244]], 2023.))
  * Significant room for improvement remains across all models, particularly in:
    * Correctly parsing API documentation
    * Handling ambiguous user requests
    * Managing multi-step API dependencies

===== Importance for Tool-Use Evaluation =====

API-Bank was among the first benchmarks to provide:

  * **Executable evaluation:** real API calls, not just string matching against expected outputs
  * **Multi-level assessment:** separate measurement of planning, retrieval, and execution capabilities
  * **Training data alongside evaluation:** enabling both assessment and improvement of tool use
  * **Domain diversity:** 1,000 domains covering real-world API usage patterns

Its design influenced subsequent tool-use benchmarks such as ToolBench, MINT, and T-Eval, and it informed later work on [[function_calling|function calling]] APIs, complementing [[toolformer|self-supervised tool learning]] approaches.

===== Related Pages =====

  * [[tool_augmented_language_models|Tool-Augmented Language Models]]
  * [[toolformer|Toolformer]]
  * [[tool_utilization|Tool Utilization]]
  * [[function_calling|OpenAI Function Calling]]
  * [[mrkl_systems|MRKL Systems]]

===== See Also =====

  * [[toolllm|ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs]]
  * [[swe_bench_verified|SWE-bench Verified]]

===== References =====
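
===== Example: Level-1 Call Checking (Sketch) =====

The Level-1 check described above (correct API name plus correct parameters) can be sketched in a few lines of Python. This is an illustrative simplification, not API-Bank's actual evaluator: the real benchmark executes the call and also compares API responses, and its call syntax differs. The ''parse_api_call'' and ''score_call'' helpers and the simplified ''Name(key='value')'' format here are assumptions made for this sketch.

```python
import re

def parse_api_call(text):
    """Parse a simplified API call string such as
    "GetWeather(city='Paris', unit='C')" into (name, {param: value}).
    Returns None if the text does not look like a call."""
    m = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text)
    if not m:
        return None
    name, arg_str = m.group(1), m.group(2)
    # Collect key='value' pairs; non-matching fragments are ignored.
    params = dict(re.findall(r"(\w+)\s*=\s*'([^']*)'", arg_str))
    return name, params

def score_call(predicted, expected_name, expected_params):
    """Level-1 style check: the generated call must name the right API
    and supply exactly the expected parameters."""
    parsed = parse_api_call(predicted)
    if parsed is None:
        return False
    name, params = parsed
    return name == expected_name and params == expected_params
```

A Level-2 or Level-3 evaluator would wrap this same check inside retrieval (was the right API chosen from the pool?) and planning (were the calls made in a workable order?) logic.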