====== API-Bank Benchmark ======

**API-Bank** is a comprehensive benchmark introduced by [[https://arxiv.org/abs/2304.08244|Li et al., 2023]] for evaluating the tool-use capabilities of large language models across a diverse set of real-world APIs. It provides both evaluation data and training data, making it one of the first complete benchmarks for tool-augmented LLM research. API-Bank addresses three key questions: how effective are current LLMs at using tools, how can tool use be improved, and what obstacles remain.((Li, M. et al. "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." [[https://arxiv.org/abs/2304.08244|arXiv:2304.08244]], 2023.))((Schick, T. et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." [[https://arxiv.org/abs/2302.04761|arXiv:2302.04761]], 2023.))((Qin, Y. et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." [[https://arxiv.org/abs/2307.16789|arXiv:2307.16789]], 2023.))

  * **Paper:** [[https://arxiv.org/abs/2304.08244|arXiv:2304.08244]] (April 2023)
  * **Venue:** EMNLP 2023
  * **Authors:** Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li
  * **Institution:** [[alibaba|Alibaba]] Group, HKUST, Peking University

===== Benchmark Components =====

=== Evaluation System ===

  * **73 API tools** spanning domains including search, weather, finance, social media, email, and calendar
  * **314 annotated tool-use dialogues** containing 753 API calls
  * **Runnable evaluation:** the APIs are actually executable, not merely simulated
  * Tests three core capabilities: **planning** (deciding which APIs to call), **retrieval** (finding the right API from the available set), and **calling** (generating correct parameters)

=== Training Dataset ===

  * **1,888 tool-use dialogues** constructed from **2,138 APIs** across **1,000 distinct domains**
  * Designed for fine-tuning models to improve [[tool_utilization|tool utilization]]
  * Used to train **Lynx**, a tool-augmented LLM initialized from Alpaca

===== Evaluation Dimensions =====

API-Bank tests LLMs across three levels of increasing difficulty:

**Level 1 - API Calling:** Can the model correctly call a single API given its documentation? Tests parameter extraction and formatting.

**Level 2 - API Retrieval + Calling:** Can the model identify the correct API from a set and call it properly? Tests tool selection from multiple options.

**Level 3 - Planning + Retrieval + Calling:** Can the model decompose a complex request into multiple API calls, retrieve the right APIs, and execute them in order? Tests multi-step tool-use reasoning.

===== Key Results =====

  * **GPT-4** excels at planning (Level 3) tasks, outperforming other models on complex multi-step tool use
  * **GPT-3.5** shows improved [[tool_utilization|tool utilization]] compared to GPT-3, indicating that scaling helps tool use
  * **Lynx** (fine-tuned from Alpaca) surpasses base Alpaca by **26+ points** on [[tool_utilization|tool utilization]] metrics, approaching GPT-3.5 effectiveness((Li, M. et al. "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs." [[https://arxiv.org/abs/2304.08244|arXiv:2304.08244]], 2023.))
  * Significant room for improvement remains across all models, particularly in:
    * Correctly parsing API documentation
    * Handling ambiguous user requests
    * Managing multi-step API dependencies

===== Importance for Tool-Use Evaluation =====

API-Bank was among the first benchmarks to provide:

  * **Executable evaluation:** real API calls, not just string matching against expected outputs
  * **Multi-level assessment:** separate measurement of planning, retrieval, and execution capabilities
  * **Training data alongside evaluation:** enabling both assessment and improvement of tool use
  * **Domain diversity:** 1,000 domains covering real-world API usage patterns

Its design influenced subsequent tool-use benchmarks such as ToolBench, MINT, and T-Eval, and it informed later work on [[function_calling|function calling]] APIs, complementing [[toolformer|self-supervised tool learning]] approaches.

===== Related Pages =====

  * [[tool_augmented_language_models|Tool-Augmented Language Models]]
  * [[toolformer|Toolformer]]
  * [[tool_utilization|Tool Utilization]]
  * [[function_calling|OpenAI Function Calling]]
  * [[mrkl_systems|MRKL Systems]]

===== See Also =====

  * [[toolllm|ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs]]
  * [[swe_bench_verified|SWE-bench Verified]]

===== References =====
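
===== Example: Level-1 Call Checking (Sketch) =====

The Level-1 check described above (correct API name plus correct parameters) can be sketched in a few lines of Python. This is an illustrative simplification, not API-Bank's actual evaluator: the real benchmark executes the call and also compares API responses, and its call syntax differs. The ''parse_api_call'' and ''score_call'' helpers and the simplified ''Name(key='value')'' format here are assumptions made for this sketch.

```python
import re

def parse_api_call(text):
    """Parse a simplified API call string such as
    "GetWeather(city='Paris', unit='C')" into (name, {param: value}).
    Returns None if the text does not look like a call."""
    m = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text)
    if not m:
        return None
    name, arg_str = m.group(1), m.group(2)
    # Collect key='value' pairs; non-matching fragments are ignored.
    params = dict(re.findall(r"(\w+)\s*=\s*'([^']*)'", arg_str))
    return name, params

def score_call(predicted, expected_name, expected_params):
    """Level-1 style check: the generated call must name the right API
    and supply exactly the expected parameters."""
    parsed = parse_api_call(predicted)
    if parsed is None:
        return False
    name, params = parsed
    return name == expected_name and params == expected_params
```

A Level-2 or Level-3 evaluator would wrap this same check inside retrieval (was the right API chosen from the pool?) and planning (were the calls made in a workable order?) logic.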