How to Speed Up Agents

Agent latency directly impacts user experience and throughput. Production systems achieve 50-80% latency reductions by combining parallel tool calls, optimized inference serving, streaming, and intelligent model selection. This guide covers every layer of the optimization stack with real benchmarks.1)

Why Agent Latency Matters

A typical agent loop involves multiple LLM calls, tool executions, and reasoning steps. A single query can trigger 3-8 sequential LLM calls, each taking 1-5 seconds. Without optimization, end-to-end response times reach 15-40 seconds – well beyond user tolerance thresholds.

The Agent Latency Stack

graph TB
    subgraph Application
        A1[Parallel Tool Calls]
        A2[Streaming Responses]
        A3[Speculative Execution]
    end
    subgraph Serving
        B1[vLLM / TGI / SGLang]
        B2[Continuous Batching]
        B3[KV Cache Reuse]
    end
    subgraph Model
        C1[Smaller Models for Subtasks]
        C2[Speculative Decoding]
        C3[Quantization]
    end
    subgraph Infrastructure
        D1[GPU Selection]
        D2[Edge Deployment]
        D3[Connection Pooling]
    end
    Application --> Serving
    Serving --> Model
    Model --> Infrastructure

Serving engine comparison: Comparative Analysis: vLLM vs HuggingFace TGI (https://arxiv.org/abs/2511.17593)

Technique 1: Parallel Tool Execution

The single biggest latency win for agents. Instead of executing tools sequentially, run independent calls concurrently.2)

Measured impact: >20% latency reduction (LLMCompiler benchmark), with gains scaling linearly with the number of independent tools.3)

import asyncio
import time
from typing import Any
 
class ParallelToolExecutor:
    def __init__(self, tools: dict):
        self.tools = tools
 
    async def execute_parallel(self, tool_calls: list[dict]) -> list[Any]:
        tasks = []
        for call in tool_calls:
            tool_fn = self.tools[call["name"]]
            tasks.append(asyncio.create_task(tool_fn(**call["args"])))
 
        start = time.monotonic()
        results = await asyncio.gather(*tasks, return_exceptions=True)
        elapsed = time.monotonic() - start
 
        print(f"Parallel execution: {elapsed:.2f}s")
        return results
 
# Example: 3 tools each taking 2s -> 2s parallel vs 6s sequential
async def search_web(query: str):
    await asyncio.sleep(2)
    return f"Results for: {query}"
 
async def query_database(sql: str):
    await asyncio.sleep(2)
    return f"DB results for: {sql}"
 
async def fetch_weather(city: str):
    await asyncio.sleep(2)
    return f"Weather in: {city}"
 
executor = ParallelToolExecutor({
    "search_web": search_web,
    "query_database": query_database,
    "fetch_weather": fetch_weather,
})

# All three execute in ~2s instead of ~6s (3x speedup)
async def main():
    results = await executor.execute_parallel([
        {"name": "search_web", "args": {"query": "agent latency"}},
        {"name": "query_database", "args": {"sql": "SELECT 1"}},
        {"name": "fetch_weather", "args": {"city": "Berlin"}},
    ])
    print(results)

asyncio.run(main())

Technique 2: Streaming Responses

Streaming dramatically reduces perceived latency – users see the first token in 200-500ms instead of waiting 3-10 seconds for the full response.
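A rough way to see the effect is to time the first chunk of a token stream separately from the total. The helper below is a minimal sketch (the function name and fake stream are illustrative); it consumes any iterable of text deltas, such as the content pieces yielded by an OpenAI-compatible streaming client.

```python
import time
from typing import Iterable


def stream_with_ttft(chunks: Iterable[str]) -> tuple[str, float, float]:
    """Consume a token stream; return (text, time-to-first-token, total time)."""
    start = time.monotonic()
    ttft = None
    parts: list[str] = []
    for piece in chunks:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        parts.append(piece)
    total = time.monotonic() - start
    return "".join(parts), ttft if ttft is not None else total, total


# Illustrative usage with a fake stream standing in for an LLM client:
text, ttft, total = stream_with_ttft(iter(["Hello", ", ", "world"]))
```

With a real client, TTFT stays in the hundreds of milliseconds while the total generation time is unchanged; the difference is entirely perceived latency.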

Technique 3: Optimized Inference Serving

Self-hosting with optimized serving engines delivers major throughput and latency gains.

vLLM vs TGI vs Naive PyTorch Benchmarks (A100 GPU, Llama 3.1 8B):4)

Engine            Throughput (tok/s)   TTFT (ms)   Key Feature
Naive PyTorch     15-20                800-1200    No optimization
HuggingFace TGI   35-45                300-500     Continuous batching
vLLM              55-65                200-400     PagedAttention + continuous batching
SGLang            60-70                180-350     RadixAttention + compiled graphs
TensorRT-LLM      70-90                150-300     Kernel fusion, NVIDIA-optimized

Source: MLPerf Inference Benchmark 2025, arXiv:2511.17593

Key optimizations in vLLM:

# vLLM server launch with optimizations
# python3 -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.1-8B-Instruct \
#     --enable-prefix-caching \
#     --max-num-seqs 256 \
#     --gpu-memory-utilization 0.90 \
#     --dtype auto \
#     --tensor-parallel-size 1
 
# Client usage - drop-in OpenAI compatible
import openai
 
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
 
# Streaming for lowest perceived latency
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Technique 4: Smaller Models for Subtasks

Not every agent step requires a frontier model. Route subtasks to smaller, faster models:

Task                    Recommended Model            Latency     Notes
Intent classification   Fine-tuned BERT / Haiku      10-50ms     Simple classification
Entity extraction       GPT-4o-mini / Gemini Flash   100-300ms   Structured output
Summarization           GPT-4o-mini                  200-500ms   Good enough quality
Complex reasoning       GPT-4o / Claude Sonnet       1-5s        Only when needed
Code generation         Claude Sonnet / GPT-4o       2-8s        Accuracy critical
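A tiered router along these lines can be sketched as a simple lookup table. The task names and model identifiers below are illustrative placeholders, not a fixed API; swap in whichever models you actually deploy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    model: str
    max_expected_latency_s: float


# Illustrative routing table mirroring the tiers above.
ROUTES: dict[str, ModelTier] = {
    "intent_classification": ModelTier("claude-haiku", 0.05),
    "entity_extraction": ModelTier("gpt-4o-mini", 0.3),
    "summarization": ModelTier("gpt-4o-mini", 0.5),
    "complex_reasoning": ModelTier("gpt-4o", 5.0),
    "code_generation": ModelTier("claude-sonnet", 8.0),
}


def route(task: str) -> ModelTier:
    """Pick the tier for a known task; default to the strongest model."""
    return ROUTES.get(task, ROUTES["complex_reasoning"])
```

Defaulting unknown tasks to the strongest tier trades a little latency for safety; the inverse default is reasonable when cost dominates.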

Technique 5: Speculative Decoding

Use a small draft model to predict tokens, then verify in batch with the large model. Achieves 2-3x speedup on autoregressive generation without quality loss.

How it works:

  1. Draft model generates N candidate tokens (fast, ~50ms)
  2. Target model verifies all N tokens in a single forward pass
  3. Accepted tokens are kept; rejected tokens trigger re-generation
  4. Net effect: multiple tokens per forward pass of the large model
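The accept/reject loop above can be illustrated with toy draft and target functions. This is a sketch of the control flow only: it uses exact-match greedy acceptance, whereas real implementations accept or reject via probability ratios and score all drafted positions in one batched forward pass.

```python
from typing import Callable, Sequence


def speculative_step(
    prefix: list[int],
    draft: Callable[[Sequence[int], int], list[int]],
    target_next: Callable[[Sequence[int]], int],
    n_draft: int = 4,
) -> list[int]:
    """One speculative-decoding step: draft N tokens, keep the verified run."""
    candidates = draft(prefix, n_draft)
    accepted: list[int] = []
    for tok in candidates:
        expected = target_next(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft token verified, keep it
        else:
            accepted.append(expected)  # first mismatch: take the target's token
            break
    else:
        # Every draft token accepted; the target contributes one bonus token.
        accepted.append(target_next(prefix + accepted))
    return prefix + accepted


# Toy models: both predict "previous token + 1", so all drafts are accepted
# and one step yields n_draft + 1 tokens from a single (simulated) target pass.
result = speculative_step(
    [0],
    draft=lambda seq, n: [seq[-1] + i + 1 for i in range(n)],
    target_next=lambda seq: seq[-1] + 1,
)
# result == [0, 1, 2, 3, 4, 5]
```

When the draft diverges early, only the verified run survives, which is why draft-model quality determines the realized speedup.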

Technique 6: KV Cache Optimization
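Reusing cached KV blocks across requests (e.g. vLLM's --enable-prefix-caching, shown in Technique 3) mostly comes down to prompt layout: keep the long static content byte-identical at the front of every request so the engine can reuse its cached prefix. A minimal sketch of cache-friendly prompt assembly, with illustrative field names:

```python
SYSTEM_PROMPT = "You are a helpful agent."   # static: cached once, reused
TOOL_SCHEMAS = '{"tools": []}'               # static: cached once, reused


def build_messages(history: list[dict], user_turn: str) -> list[dict]:
    """Put immutable content first so the engine's prefix cache can hit.

    Anything that varies per request (timestamps, request IDs) must NOT be
    interpolated into the shared prefix, or every request misses the cache.
    """
    return (
        [{"role": "system", "content": SYSTEM_PROMPT + "\n" + TOOL_SCHEMAS}]
        + history
        + [{"role": "user", "content": user_turn}]
    )
```

For multi-turn agents this compounds: each loop iteration re-sends the growing transcript, and a stable prefix means only the new tokens pay prefill cost.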

Technique 7: Batching Strategies

Strategy             Throughput Gain   Latency Impact                Use Case
Static batching      2-4x              Increases (waits for batch)   Offline processing
Continuous batching  3-5x              Minimal increase              Real-time serving
Dynamic batching     2-3x              Configurable max wait         Mixed workloads
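Dynamic batching's "configurable max wait" can be sketched as a collector that flushes either when the batch fills or when the oldest request has waited long enough. The class below is a toy illustration of the policy, not a serving-engine API.

```python
import asyncio
from typing import Any, Callable


class DynamicBatcher:
    def __init__(self, process: Callable[[list[Any]], list[Any]],
                 max_batch: int = 8, max_wait_s: float = 0.02):
        self.process = process        # stands in for one batched forward pass
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self) -> None:
        while True:
            item, fut = await self.queue.get()           # wait for first request
            batch = [(item, fut)]
            deadline = asyncio.get_running_loop().time() + self.max_wait_s
            # Collect more requests until the batch fills or the wait expires.
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = self.process([it for it, _ in batch])
            for (_, f), r in zip(batch, results):
                f.set_result(r)


async def demo() -> list[int]:
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs])
    runner = asyncio.create_task(batcher.run())
    out = await asyncio.gather(*(batcher.submit(i) for i in range(5)))
    runner.cancel()
    return out

# asyncio.run(demo()) -> [0, 2, 4, 6, 8]
```

The max_wait_s knob is the latency/throughput trade-off in one number: raise it and batches get fuller; lower it and individual requests return sooner.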

End-to-End Optimization Pipeline

graph LR
    A[User Query] --> B[Route to Model Tier]
    B --> C{Needs tools?}
    C -->|Yes| D[Plan Tool Calls]
    D --> E[Execute in Parallel]
    E --> F[Stream Results]
    C -->|No| F
    F --> G[Speculative Decode]
    G --> H[Stream to User]

Production Optimization Checklist

See Also

References