====== Gorilla: LLM Connected with Massive APIs ======
Gorilla is a large language model by Patil et al. (2023) specifically trained for accurate API calling through **retriever-aware training**. By conditioning the model on retrieved API documentation at both training and inference time, Gorilla surpasses GPT-4 on API call accuracy while dramatically reducing hallucinations — the fabrication of non-existent API endpoints or incorrect parameters. The accompanying **APIBench** benchmark provides a standardized evaluation for API-calling capabilities.
===== Retriever-Aware Training =====
Gorilla's core innovation is integrating a document retriever directly into the training pipeline rather than relying on the model's parametric memory:
* **Training data format**: Each example pairs a natural language query with the top-1 retrieved API documentation snippet, formatted as ''[Query] [Retrieved Docs] -> API Call''
* **Retriever**: Uses Contriever, a dense bi-encoder model that embeds queries and documents into 768-dimensional vectors for semantic matching
* **Joint optimization**: The retriever is fine-tuned alongside the LLM on API-specific queries using InfoNCE loss, achieving 95% top-1 retrieval accuracy
* **Base model**: Fine-tuned on LLaMA-7B using supervised instruction tuning on approximately 10K query-document pairs
This approach grounds the model's API calls in actual documentation, preventing the hallucination of non-existent endpoints that plagues general-purpose LLMs.
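A minimal sketch of how such a retriever-aware training example might be assembled (the helper function, prompt field names, and the Hugging Face doc string are illustrative, not the paper's actual code):

```python
def build_training_example(query: str, retrieved_doc: str, target_call: str) -> dict:
    """Pair a user query with its top-1 retrieved API doc and the gold API call.

    Follows the [Query] [Retrieved Docs] -> API Call format: the model is
    conditioned on the documentation, so the target call is grounded in it.
    """
    prompt = (
        f"### User Query: {query}\n"
        f"### Retrieved API Documentation:\n{retrieved_doc}\n"
        f"### API Call:\n"
    )
    return {"input": prompt, "target": target_call}

# Hypothetical example pair for illustration
example = build_training_example(
    query="Translate this sentence to German",
    retrieved_doc="transformers.pipeline(task='translation_en_to_de') - builds a translation pipeline",
    target_call="pipeline('translation_en_to_de')",
)
```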
===== APIBench Benchmark =====
APIBench is a comprehensive benchmark for evaluating end-to-end API calling:
* **Scale**: 1,645 API calls drawn from three machine-learning model hubs: Torch Hub, TensorFlow Hub, and Hugging Face
* **Splits**: Seen APIs (used during training) and unseen APIs (held out for generalization testing)
* **Metrics**: functional accuracy via AST sub-tree matching of the generated call against the reference APIs, plus hallucination rate (calls to APIs that do not exist in the dataset)
* **Test conditions**: zero-shot (no retriever) and retrieval-augmented settings using BM25, GPT-Index, and an oracle retriever
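A toy sketch of the accuracy criterion: a prediction counts as fully correct only when both the API name and the complete parameter set match the reference (the dict-based call representation here is a simplification for illustration, not the benchmark's format):

```python
def exact_match(pred: dict, ref: dict) -> bool:
    """Correct only if the API name and the full parameter set both match."""
    return pred.get("api") == ref.get("api") and pred.get("params") == ref.get("params")

ref  = {"api": "/forecast", "params": {"city": "San Francisco", "units": "metric"}}
good = {"api": "/forecast", "params": {"city": "San Francisco", "units": "metric"}}
bad  = {"api": "/forecast", "params": {"city": "San Francisco"}}  # missing "units"
```

Here `good` scores as correct while `bad` does not, since a single missing parameter breaks the match.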
===== Code Example =====
```python
# Gorilla-style API call generation with retrieval
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the Gorilla model (LLaMA fine-tuned for API calling)
model = AutoModelForCausalLM.from_pretrained("gorilla-llm/gorilla-7b-hf-v1")
tokenizer = AutoTokenizer.from_pretrained("gorilla-llm/gorilla-7b-hf-v1")

def generate_api_call(query: str, retrieved_doc: str) -> str:
    """Generate a grounded API call from query + retrieved documentation."""
    prompt = (
        f"### User Query: {query}\n"
        f"### Retrieved API Documentation:\n{retrieved_doc}\n"
        f"### API Call:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example: grounded API call generation
query = "Get the current weather forecast for San Francisco"
doc = "GET /forecast?city={city}&units={units} - Returns weather data"
api_call = generate_api_call(query, doc)
# Illustrative output: {"api_call": "GET /forecast?city=San Francisco&units=metric"}
```
===== Benchmark Results =====
^ Model ^ Seen APIs ^ Unseen APIs ^ Hallucination Rate ^
| Gorilla-7B | 94.2% | 87.5% | ~5% |
| GPT-4 (zero-shot) | ~75% | ~70% | ~20% |
| LLaMA-7B (vanilla) | ~45% | ~35% | ~40% |
Gorilla reduces API hallucinations by **85-90%** compared to general-purpose LLMs by grounding generation in retrieved documentation.
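The hallucination rate can be measured by checking each generated call against an inventory of APIs that actually exist; a minimal sketch (the endpoint inventory and calls below are made up for illustration):

```python
KNOWN_APIS = {"/forecast", "/current", "/alerts"}  # hypothetical API inventory

def hallucination_rate(calls: list[str]) -> float:
    """Fraction of generated calls naming an endpoint absent from the inventory."""
    fabricated = [c for c in calls if c.split("?")[0] not in KNOWN_APIS]
    return len(fabricated) / len(calls)

# "/weather_v9" does not exist, so one of the two calls is a hallucination
rate = hallucination_rate(["/forecast?city=SF", "/weather_v9?city=SF"])
```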
===== Handling API Version Changes =====
A critical advantage of Gorilla's architecture is **version-agnostic inference**:
* At inference time, the retriever fetches from an up-to-date documentation index (e.g., via FAISS)
* API specification changes are automatically reflected without model retraining
* Experiments swapping documentation versions mid-evaluation show Gorilla maintains 90%+ accuracy
* GPT-4 fails on version-changed APIs due to stale parametric knowledge from training data
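A minimal sketch of why retrieval makes inference version-agnostic: the documentation index is just data, so swapping in updated docs changes what the model is conditioned on without touching its weights. The toy token-overlap retriever and the example doc strings below are illustrative, not the paper's actual (dense) retriever:

```python
def retrieve(query: str, doc_index: list[str]) -> str:
    """Return the doc sharing the most tokens with the query (toy scorer)."""
    q = set(query.lower().split())
    return max(doc_index, key=lambda d: len(q & set(d.lower().split())))

# Two versions of the same documentation index (hypothetical API)
docs_v1 = ["GET /forecast?city={city} - weather forecast by city"]
docs_v2 = ["GET /v2/forecast?location={loc}&days={n} - weather forecast, v2 schema"]

query = "weather forecast for San Francisco"
old = retrieve(query, docs_v1)  # grounded in the v1 schema
new = retrieve(query, docs_v2)  # after the index swap: v2 schema, same model
```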
===== Architecture =====
The retriever-aware training objective combines language modeling loss with retrieval relevance:
$$\mathcal{L} = \mathcal{L}_{LM}(y \mid x, d^*) + \lambda\, \mathcal{L}_{ret}(d^* \mid x)$$
where x is the user query, d^* is the retrieved documentation, y is the target API call, and \lambda balances the retrieval and generation losses.
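The combined objective can be illustrated numerically. Assuming an InfoNCE-style contrastive term for the retrieval loss, as described in the Retriever-Aware Training section, a pure-Python sketch with toy similarity scores looks like this (all values are made up):

```python
import math

def info_nce(pos_sim: float, neg_sims: list[float], tau: float = 1.0) -> float:
    """Contrastive loss: -log softmax probability of the positive document."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    z = sum(math.exp(l) for l in logits)
    return -math.log(math.exp(pos_sim / tau) / z)

def combined_loss(lm_loss: float, pos_sim: float, neg_sims: list[float],
                  lam: float = 0.5) -> float:
    """L = L_LM + lambda * L_ret, matching the objective above."""
    return lm_loss + lam * info_nce(pos_sim, neg_sims)

# Toy numbers: the retrieved doc d* scores higher than two negatives
ret_loss = info_nce(1.0, [0.2, 0.1])
total = combined_loss(2.0, 1.0, [0.2, 0.1], lam=0.5)
```

When the positive document's similarity dominates the negatives, the retrieval term shrinks toward zero and the total loss is driven by the language-modeling term alone.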
^ Component ^ Details ^
| Base Model | LLaMA-7B (decoder-only transformer) |
| Retriever | Contriever bi-encoder, 768-dim embeddings |
| Input Format | ''[Query] [Retrieved Docs] -> Assistant: {"api_call": ...}'' |
| Context Length | 2K tokens (LLaMA) |
| Training Data | ~10K query-doc pairs + 162K API documents |
===== References =====
* [[https://arxiv.org/abs/2305.15334|Patil et al. "Gorilla: Large Language Model Connected with Massive APIs" (arXiv:2305.15334)]]
* [[https://github.com/ShishirPatil/gorilla|Gorilla GitHub Repository]]
* [[https://gorilla.cs.berkeley.edu/|Gorilla Project Website (UC Berkeley)]]
===== See Also =====
* [[swe_agent|SWE-agent — Agent-Computer Interface for tool use in software engineering]]
* [[agent_distillation|Agent Distillation — Compressing tool-using agents into smaller models]]
* [[voyager|Voyager — Code-generating embodied agent with skill library]]