Parallel Function Calling

Parallel function calling enables LLMs to generate and invoke multiple tool calls simultaneously in a single response, reducing end-to-end latency compared to sequential execution. When independent tool calls are needed (e.g., checking the weather and the time for the same city), parallel execution bounds total latency by the slowest call rather than the sum of all calls, and collapses several LLM round-trips into one.

The Sequential Bottleneck

Traditional function calling follows a strict serial pattern:

  1. LLM generates one tool call
  2. System executes the tool and returns result
  3. LLM generates next tool call (or final answer)
  4. Repeat for each needed tool

For N independent tool calls with latencies L_1, …, L_N, sequential execution costs L_1 + … + L_N seconds, while parallel execution costs only max(L_1, …, L_N) – the duration of the slowest call.
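The latency difference can be demonstrated with a small, self-contained sketch; the tool calls here are simulated with sleeps rather than real network requests:

```python
# Sequential vs. parallel execution of simulated tool calls.
# Latencies are illustrative stand-ins for real tool round-trips.
import asyncio
import time

async def call_tool(name: str, latency: float) -> str:
    await asyncio.sleep(latency)  # stand-in for network/tool latency
    return f"{name} done"

async def sequential(latencies):
    # One call at a time: total time is the sum of latencies
    for i, l in enumerate(latencies):
        await call_tool(f"tool_{i}", l)

async def parallel(latencies):
    # All calls at once: total time is the slowest latency
    await asyncio.gather(*(call_tool(f"tool_{i}", l)
                           for i, l in enumerate(latencies)))

def timed(coro) -> float:
    start = time.perf_counter()
    asyncio.run(coro)
    return time.perf_counter() - start

latencies = [0.05, 0.05, 0.05]
t_seq = timed(sequential(latencies))  # ~ sum(latencies)
t_par = timed(parallel(latencies))    # ~ max(latencies)
print(f"sequential: {t_seq:.3f}s, parallel: {t_par:.3f}s")
```

With three 0.05 s calls, the sequential run takes roughly 0.15 s while the parallel run takes roughly 0.05 s.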

How Providers Implement It

OpenAI

OpenAI's GPT-4 Turbo and GPT-4o support parallel tool calls, controlled by the parallel_tool_calls API parameter (on by default; set it to False to force at most one call per turn). The model outputs multiple tool_calls in a single response message:

# OpenAI parallel function calling
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [
    {"type": "function", "function": {
        "name": "get_weather",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}}},
    {"type": "function", "function": {
        "name": "get_time",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}}},
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Weather and time in Tokyo?"}],
    tools=tools,
    parallel_tool_calls=True,  # on by default; shown here for clarity
)

# The response carries multiple tool_calls in one assistant message
for tool_call in response.choices[0].message.tool_calls:
    # Dispatch each call (concurrently, if desired)
    print(f"{tool_call.function.name}({tool_call.function.arguments})")

Anthropic

Anthropic's Claude models support parallel tool use natively. When the model determines multiple independent tools are needed, it emits multiple tool_use blocks in a single response. The client executes all calls concurrently and returns all results in the next message.
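A minimal sketch of the client-side loop is shown below. The tool_use/tool_result block shapes follow Anthropic's Messages API, but the handlers and sample blocks are illustrative stand-ins so the pattern runs offline, without a real API call:

```python
# Client-side handling of Claude's parallel tool_use blocks.
# Block/result shapes follow Anthropic's Messages API; handlers and
# sample blocks are illustrative stand-ins, not real API traffic.
from concurrent.futures import ThreadPoolExecutor

def run_tool_uses(content_blocks, handlers):
    """Execute every tool_use block concurrently and build the
    tool_result content for the next user message."""
    tool_uses = [b for b in content_blocks if b["type"] == "tool_use"]
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(
            lambda b: handlers[b["name"]](**b["input"]), tool_uses))
    return [
        {"type": "tool_result", "tool_use_id": b["id"], "content": out}
        for b, out in zip(tool_uses, outputs)
    ]

# Two independent tool_use blocks emitted in one assistant turn
blocks = [
    {"type": "tool_use", "id": "toolu_1",
     "name": "get_weather", "input": {"city": "Tokyo"}},
    {"type": "tool_use", "id": "toolu_2",
     "name": "get_time", "input": {"city": "Tokyo"}},
]
handlers = {
    "get_weather": lambda city: f"Sunny in {city}",
    "get_time": lambda city: f"09:00 in {city}",
}
results = run_tool_uses(blocks, handlers)
# All results go back in ONE user message:
# {"role": "user", "content": results}
```

Returning every tool_result in a single user message is what lets the model consume all outcomes in one follow-up generation.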

NVIDIA NIM

Models served via NVIDIA NIM (e.g., Mistral-7B-Instruct) support parallel tool calls through the same OpenAI-compatible API pattern.

SimpleTool (arXiv:2603.00030)

SimpleTool investigates the design space of parallel function calling, analyzing how LLMs learn to emit multiple tool calls and the failure modes that arise.

Parallel Decoding Strategies

Several frameworks optimize how parallel tool calls are generated and orchestrated:

LLMCompiler (ICML 2024)

Models data relations (def-use chains) and control dependencies (mutual exclusion) between tool calls.
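The general idea of dependency-aware scheduling can be sketched as follows; the task graph and tools are illustrative, not LLMCompiler's implementation:

```python
# Generic dependency-aware scheduler in the spirit of a tool-call DAG:
# calls whose inputs are ready run concurrently, wave by wave.
from concurrent.futures import ThreadPoolExecutor

def execute_dag(tasks):
    """tasks: {name: (fn, [dependency names])}. A task starts once all
    of its dependencies have produced results; independent tasks in the
    same wave run concurrently."""
    results, remaining = {}, dict(tasks)
    with ThreadPoolExecutor() as pool:
        while remaining:
            # A "wave" = every task whose dependencies are all satisfied
            ready = [n for n, (_, deps) in remaining.items()
                     if all(d in results for d in deps)]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            futures = {n: pool.submit(remaining[n][0],
                                      *(results[d] for d in remaining[n][1]))
                       for n in ready}
            for n, f in futures.items():
                results[n] = f.result()
                del remaining[n]
    return results

# search_a and search_b are independent (first wave, in parallel);
# summarize consumes both results (second wave).
tasks = {
    "search_a": (lambda: "result A", []),
    "search_b": (lambda: "result B", []),
    "summarize": (lambda a, b: f"{a} + {b}", ["search_a", "search_b"]),
}
out = execute_dag(tasks)
```

The def-use chain (summarize uses the outputs of both searches) is exactly what forces the second wave; calls with no such chain between them are free to run together.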

LLMOrch (arXiv:2504.14872)

Extends LLMCompiler with processor load balancing.

LLM-Tool Compiler (arXiv:2405.17438)

Uses selective fusion to group similar tool operations at runtime.
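The fusion idea can be illustrated with a minimal grouping step; this is a sketch of the general technique, not the paper's implementation:

```python
# Illustrative selective fusion: calls that hit the same tool are
# grouped so each tool is invoked once with a batch of argument sets,
# amortizing per-call overhead.
from collections import defaultdict

def fuse_calls(tool_calls):
    """Group (tool_name, args) pairs by tool name."""
    batches = defaultdict(list)
    for name, args in tool_calls:
        batches[name].append(args)
    return dict(batches)

calls = [
    ("get_weather", {"city": "Tokyo"}),
    ("get_weather", {"city": "Paris"}),
    ("get_time", {"city": "Tokyo"}),
]
batches = fuse_calls(calls)
# get_weather now receives both cities in one batched invocation
```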

Batched Execution

In a single LLM generation, models output tool calls as an array. The execution layer:

  1. Receives the full array of tool calls
  2. Identifies independent calls (no data dependencies)
  3. Executes independent calls concurrently (thread pool, async I/O)
  4. Waits for all to complete
  5. Returns all results to the LLM in one message

This eliminates round-trips: instead of N sequential LLM-tool-LLM cycles, one cycle handles all N calls.
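The steps above can be sketched with a thread pool; the tool functions and call shapes here are local stand-ins in the style of an OpenAI tool-call array:

```python
# Batched execution of one generation's tool-call array.
# TOOLS holds local stand-ins for real tool implementations.
from concurrent.futures import ThreadPoolExecutor
import json

TOOLS = {
    "get_weather": lambda city: f"22C in {city}",
    "get_time": lambda city: f"09:00 in {city}",
}

def execute_batch(tool_calls):
    """Steps 1-5: receive the full array, run the (independent) calls
    concurrently, wait for all, and return one result list for the LLM."""
    with ThreadPoolExecutor() as pool:
        futures = [
            (call["id"],
             pool.submit(TOOLS[call["name"]], **json.loads(call["arguments"])))
            for call in tool_calls
        ]
        # f.result() blocks until that call finishes; iterating the list
        # therefore waits for the whole batch
        return [{"tool_call_id": cid, "role": "tool", "content": f.result()}
                for cid, f in futures]

batch = [
    {"id": "call_1", "name": "get_weather", "arguments": '{"city": "Tokyo"}'},
    {"id": "call_2", "name": "get_time", "arguments": '{"city": "Tokyo"}'},
]
results = execute_batch(batch)
```

All entries in `results` are then appended to the conversation at once, so the next LLM generation sees every tool outcome in a single step.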

Structured Extraction

LangChain and similar frameworks leverage parallel function calling for structured extraction – extracting multiple entity types from text in parallel, typically by exposing one extraction schema per entity type as a tool.
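The pattern can be sketched without any framework; here simple regex extractors stand in for model-driven schema extraction so the example runs offline:

```python
# Illustrative parallel extraction: one extractor per entity type,
# all run concurrently over the same text. The regex extractors are
# stand-ins for LLM-backed extraction tools.
import re
from concurrent.futures import ThreadPoolExecutor

EXTRACTORS = {
    "people": lambda text: re.findall(r"\b(?:Alice|Bob)\b", text),
    "dates": lambda text: re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
}

def extract_all(text):
    """Run every entity extractor over the text concurrently."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, text)
                   for name, fn in EXTRACTORS.items()}
        return {name: f.result() for name, f in futures.items()}

doc = "Alice met Bob on 2024-05-01."
entities = extract_all(doc)
```

With real parallel function calling, the model emits one tool call per entity type in a single response, and the extracted entities come back together in one round-trip.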

Challenges and Limitations

References

See Also