====== How to Speed Up Agents ======
Agent latency directly impacts user experience and throughput. Production systems achieve **50-80% latency reductions** by combining parallel tool calls, optimized inference serving, streaming, and intelligent model selection. This guide covers every layer of the optimization stack with real benchmarks.(([[https://blog.langchain.com/how-do-i-speed-up-my-agent/|How Do I Speed Up My Agent?]]))
===== Why Agent Latency Matters =====
A typical agent loop involves multiple LLM calls, tool executions, and reasoning steps. A single query can trigger 3-8 sequential LLM calls, each taking 1-5 seconds. Without optimization, end-to-end response times reach 15-40 seconds -- well beyond user tolerance thresholds.
===== The Agent Latency Stack =====
<code>
graph TB
    subgraph Application
        A1[Parallel Tool Calls]
        A2[Streaming Responses]
        A3[Speculative Execution]
    end
    subgraph Serving
        B1[vLLM / TGI / SGLang]
        B2[Continuous Batching]
        B3[KV Cache Reuse]
    end
    subgraph Model
        C1[Smaller Models for Subtasks]
        C2[Speculative Decoding]
        C3[Quantization]
    end
    subgraph Infrastructure
        D1[GPU Selection]
        D2[Edge Deployment]
        D3[Connection Pooling]
    end
    Application --> Serving
    Serving --> Model
    Model --> Infrastructure
</code>
The serving-layer engines (vLLM, TGI, SGLang) are benchmarked in detail below.(([[https://arxiv.org/abs/2511.17593|Comparative Analysis: vLLM vs HuggingFace TGI]]))
===== Technique 1: Parallel Tool Execution =====
Parallel tool execution is usually the single biggest latency win for agents: instead of running tools one after another, execute independent calls concurrently.(([[https://langcopilot.com/posts/2025-10-17-why-ai-agents-fail-latency-planning|Why AI Agents Fail: Latency]]))
**Measured impact:** >20% latency reduction (LLMCompiler benchmark), with gains scaling linearly with the number of independent tools.(([[https://georgian.io/reduce-llm-costs-and-latency-guide|Reduce LLM Costs and Latency Guide]]))
<code python>
import asyncio
import time
from typing import Any


class ParallelToolExecutor:
    def __init__(self, tools: dict):
        self.tools = tools

    async def execute_parallel(self, tool_calls: list[dict]) -> list[Any]:
        # Schedule every independent tool call as its own task, then await all.
        tasks = [
            asyncio.create_task(self.tools[call["name"]](**call["args"]))
            for call in tool_calls
        ]
        start = time.monotonic()
        results = await asyncio.gather(*tasks, return_exceptions=True)
        elapsed = time.monotonic() - start
        print(f"Parallel execution: {elapsed:.2f}s")
        return results


# Example: 3 tools each taking 2s -> ~2s in parallel vs ~6s sequential
async def search_web(query: str):
    await asyncio.sleep(2)
    return f"Results for: {query}"

async def query_database(sql: str):
    await asyncio.sleep(2)
    return f"DB results for: {sql}"

async def fetch_weather(city: str):
    await asyncio.sleep(2)
    return f"Weather in: {city}"

executor = ParallelToolExecutor({
    "search_web": search_web,
    "query_database": query_database,
    "fetch_weather": fetch_weather,
})

# All three execute in ~2s instead of ~6s (3x speedup)
asyncio.run(executor.execute_parallel([
    {"name": "search_web", "args": {"query": "vLLM benchmarks"}},
    {"name": "query_database", "args": {"sql": "SELECT 1"}},
    {"name": "fetch_weather", "args": {"city": "Berlin"}},
]))
</code>
===== Technique 2: Streaming Responses =====
Streaming dramatically reduces **perceived latency** -- users see the first token in 200-500ms instead of waiting 3-10 seconds for the full response.
* Time-to-first-token (TTFT): typically 200-500ms with streaming
* Without streaming: users wait for full generation (3-30s depending on output length)
* Intermediate step display (like Perplexity) further improves perceived speed
* Streaming does not reduce total generation time, only perceived wait
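The effect of streaming on perceived latency can be demonstrated with a simulated stream. This is a minimal sketch: `fake_token_stream` is a stand-in for a real LLM stream, with the timing parameters (300ms to first token, 50ms per token) chosen only for illustration.

```python
import asyncio
import time

async def fake_token_stream(n_tokens: int, ttft_s: float, per_token_s: float):
    """Simulated LLM stream: first token after ttft_s, then one per per_token_s."""
    await asyncio.sleep(ttft_s)
    yield "Hello"
    for i in range(n_tokens - 1):
        await asyncio.sleep(per_token_s)
        yield f" tok{i}"

async def measure(stream) -> tuple[float, float, int]:
    """Return (time-to-first-token, total time, token count) for a stream."""
    start = time.monotonic()
    first = None
    count = 0
    async for _ in stream:
        if first is None:
            first = time.monotonic() - start  # the wait the user actually perceives
        count += 1
    return first, time.monotonic() - start, count

ttft, total, count = asyncio.run(measure(fake_token_stream(20, 0.3, 0.05)))
print(f"TTFT: {ttft * 1000:.0f}ms, total: {total:.2f}s for {count} tokens")
```

The gap between TTFT and total time is exactly what streaming hides from the user: the interface becomes responsive at the first token, while generation continues in the background.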
===== Technique 3: Optimized Inference Serving =====
Self-hosting with optimized serving engines delivers major throughput and latency gains.
**vLLM vs TGI vs Naive PyTorch Benchmarks (A100 GPU, Llama 3.1 8B):**(([[https://vllm.readthedocs.io/|vLLM Documentation]]))
^ Engine ^ Throughput (tok/s) ^ TTFT (ms) ^ Key Feature ^
| Naive PyTorch | 15-20 | 800-1200 | No optimization |
| HuggingFace TGI | 35-45 | 300-500 | Continuous batching |
| vLLM | 55-65 | 200-400 | PagedAttention + continuous batching |
| SGLang | 60-70 | 180-350 | RadixAttention + compiled graphs |
| TensorRT-LLM | 70-90 | 150-300 | Kernel fusion, NVIDIA-optimized |
//Source: MLPerf Inference Benchmark 2025, arXiv:2511.17593//
**Key optimizations in vLLM:**
* **PagedAttention:** Manages KV cache like virtual memory pages, eliminating waste. Enables 2-4x more concurrent requests.
* **Continuous batching:** New requests join the batch without waiting for current batch to finish.
* **Prefix caching:** Reuses KV cache for shared prompt prefixes across requests.
<code bash>
# vLLM server launch with optimizations
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.90 \
    --dtype auto \
    --tensor-parallel-size 1
</code>
<code python>
# Client usage -- drop-in OpenAI compatible
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Streaming for lowest perceived latency
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
</code>
===== Technique 4: Smaller Models for Subtasks =====
Not every agent step requires a frontier model. Route subtasks to smaller, faster models:
^ Task ^ Recommended Model ^ Latency ^ Notes ^
| Intent classification | Fine-tuned BERT / Haiku | 10-50ms | Simple classification |
| Entity extraction | GPT-4o-mini / Gemini Flash | 100-300ms | Structured output |
| Summarization | GPT-4o-mini | 200-500ms | Good enough quality |
| Complex reasoning | GPT-4o / Claude Sonnet | 1-5s | Only when needed |
| Code generation | Claude Sonnet / GPT-4o | 2-8s | Accuracy critical |
===== Technique 5: Speculative Decoding =====
Use a small draft model to predict tokens, then verify in batch with the large model. Achieves **2-3x speedup** on autoregressive generation without quality loss.
How it works:
- Draft model generates N candidate tokens (fast, ~50ms)
- Target model verifies all N tokens in a single forward pass
- Accepted tokens are kept; at the first rejection, the target model's own token is substituted and drafting resumes from the corrected position
- Net effect: multiple tokens per forward pass of the large model
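The accept/verify loop above can be illustrated with deterministic toy stand-ins for the two models. This is only a sketch of the control flow: a real engine verifies all N draft tokens in one batched forward pass of the target model, whereas the loop below simulates that check token by token.

```python
def target_next(prefix: tuple) -> str:
    """Stand-in for the large target model: deterministic next token."""
    vocab = ["the", "cat", "sat", "on", "the", "mat"]
    return vocab[len(prefix) % len(vocab)]

def draft_propose(prefix: tuple, n: int) -> list:
    """Stand-in draft model: agrees with the target except every 3rd position."""
    out, p = [], tuple(prefix)
    for _ in range(n):
        tok = target_next(p)
        if (len(p) + 1) % 3 == 0:
            tok = "???"  # simulated draft mistake
        out.append(tok)
        p += (tok,)
    return out

def speculative_step(prefix: tuple, n: int = 4) -> list:
    """One decoding round: accept draft tokens while they match the target;
    on the first mismatch, emit the target's own token instead and stop.
    Every round therefore yields at least one token, often several."""
    accepted, p = [], tuple(prefix)
    for tok in draft_propose(p, n):
        verified = target_next(p)
        if tok != verified:
            accepted.append(verified)  # target's correction replaces the reject
            break
        accepted.append(tok)
        p += (tok,)
    return accepted

print(speculative_step(()))  # 3 tokens from a single round of the target model
```

The speedup comes from the ratio of accepted tokens to target-model rounds: here one round yields three tokens instead of one.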
===== Technique 6: KV Cache Optimization =====
* **Prefix caching:** When multiple requests share a system prompt, cache the KV states. Saves recomputation on every request.
* **KV cache quantization:** Compress cache to FP8 or INT8, reducing memory 2-4x and enabling more concurrent requests.
* **Paged KV cache:** vLLM allocates cache in pages instead of contiguous blocks, reducing waste from 60-80% to under 4%.
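The prefix-caching idea can be sketched with a toy cache keyed by a hash of the prompt prefix. Real engines like vLLM cache attention KV tensors per token block inside GPU memory; here the cached value is just a placeholder standing in for the expensive prefill result.

```python
import hashlib

class PrefixKVCache:
    """Toy prefix cache: shared system prompts are 'prefilled' only once."""
    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get_or_compute(self, prefix: str):
        k = self._key(prefix)
        if k in self.store:
            self.hits += 1
        else:
            self.misses += 1
            # In a real engine, the expensive prefill forward pass runs here.
            self.store[k] = f"kv_state({len(prefix)} chars)"
        return self.store[k]

cache = PrefixKVCache()
system_prompt = "You are a helpful agent with long tool instructions. " * 50
for user_msg in ["q1", "q2", "q3"]:
    cache.get_or_compute(system_prompt)  # prefill once, then cache hits
print(f"hits={cache.hits} misses={cache.misses}")
```

For agents, the shared prefix (system prompt plus tool definitions) is often the majority of every request, so the hit rate is typically high.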
===== Technique 7: Batching Strategies =====
^ Strategy ^ Throughput Gain ^ Latency Impact ^ Use Case ^
| Static batching | 2-4x | Increases (waits for batch) | Offline processing |
| Continuous batching | 3-5x | Minimal increase | Real-time serving |
| Dynamic batching | 2-3x | Configurable max wait | Mixed workloads |
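The dynamic-batching row can be sketched with a small asyncio batcher: collect requests until the batch is full or a configurable maximum wait elapses, then process them together. This is a toy, not a serving engine (vLLM's continuous batching operates at the token level inside the scheduler); the "model" here just stamps each prompt once per batch.

```python
import asyncio

class DynamicBatcher:
    """Toy dynamic batcher: latency is bounded by max_wait, throughput
    improves because the 'model' runs once per batch, not per request."""
    def __init__(self, max_batch: int = 8, max_wait: float = 0.05):
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batches_run = 0

    async def submit(self, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run_forever(self):
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            self.batches_run += 1
            for prompt, fut in batch:  # one batched "forward pass"
                fut.set_result(f"output for {prompt}")

async def main():
    b = DynamicBatcher(max_batch=4, max_wait=0.02)
    worker = asyncio.create_task(b.run_forever())
    results = await asyncio.gather(*(b.submit(f"p{i}") for i in range(8)))
    worker.cancel()
    return results, b.batches_run

results, batches = asyncio.run(main())
print(f"{len(results)} requests served in {batches} batches")
```

Tuning `max_wait` is the key trade-off: larger values fill batches fuller (more throughput), smaller values cap the extra latency any single request can accrue.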
===== End-to-End Optimization Pipeline =====
<code>
graph LR
    A[User Query] --> B[Route to Model Tier]
    B --> C{Needs tools?}
    C -->|Yes| D[Plan Tool Calls]
    D --> E[Execute in Parallel]
    E --> F[Stream Results]
    C -->|No| F
    F --> G[Speculative Decode]
    G --> H[Stream to User]
</code>
===== Production Optimization Checklist =====
* **Quick wins (under 1 day):** Enable streaming, set max_tokens, use smaller models for subtasks
* **Medium effort (1 week):** Implement parallel tool execution, add semantic caching
* **Infrastructure (2-4 weeks):** Deploy vLLM/SGLang, enable prefix caching, set up model routing
===== See Also =====
* [[how_to_reduce_token_costs|How to Reduce Token Costs]]
* [[caching_strategies_for_agents|Caching Strategies for Agents]]
* [[what_is_an_ai_agent|What is an AI Agent]]
===== References =====