====== How to Speed Up Agents ======

Agent latency directly impacts user experience and throughput. Production systems achieve **50-80% latency reductions** by combining parallel tool calls, optimized inference serving, streaming, and intelligent model selection. This guide covers every layer of the optimization stack with real benchmarks.(([[https://blog.langchain.com/how-do-i-speed-up-my-agent/|How Do I Speed Up My Agent?]]))

===== Why Agent Latency Matters =====

A typical agent loop involves multiple LLM calls, tool executions, and reasoning steps. A single query can trigger 3-8 sequential LLM calls, each taking 1-5 seconds. Without optimization, end-to-end response times reach 15-40 seconds -- well beyond user tolerance thresholds.

===== The Agent Latency Stack =====

<code>
graph TB
    subgraph Application
        A1[Parallel Tool Calls]
        A2[Streaming Responses]
        A3[Speculative Execution]
    end
    subgraph Serving
        B1[vLLM / TGI / SGLang]
        B2[Continuous Batching]
        B3[KV Cache Reuse]
    end
    subgraph Model
        C1[Smaller Models for Subtasks]
        C2[Speculative Decoding]
        C3[Quantization]
    end
    subgraph Infrastructure
        D1[GPU Selection]
        D2[Edge Deployment]
        D3[Connection Pooling]
    end
    Application --> Serving
    Serving --> Model
    Model --> Infrastructure
</code>

Each layer is covered in the techniques below; the serving-layer engines (vLLM, TGI, SGLang) are benchmarked in Technique 3.(([[https://arxiv.org/abs/2511.17593|Comparative Analysis: vLLM vs HuggingFace TGI]]))

===== Technique 1: Parallel Tool Execution =====

The single biggest latency win for agents.
Instead of executing tools sequentially, run independent calls concurrently.(([[https://langcopilot.com/posts/2025-10-17-why-ai-agents-fail-latency-planning|Why AI Agents Fail: Latency]]))

**Measured impact:** >20% latency reduction (LLMCompiler benchmark), with gains scaling linearly with the number of independent tools.(([[https://georgian.io/reduce-llm-costs-and-latency-guide|Reduce LLM Costs and Latency Guide]]))

<code python>
import asyncio
import time
from typing import Any


class ParallelToolExecutor:
    def __init__(self, tools: dict):
        self.tools = tools

    async def execute_parallel(self, tool_calls: list[dict]) -> list[Any]:
        # Schedule every independent tool call as its own task
        tasks = [
            asyncio.create_task(self.tools[call["name"]](**call["args"]))
            for call in tool_calls
        ]
        start = time.monotonic()
        results = await asyncio.gather(*tasks, return_exceptions=True)
        elapsed = time.monotonic() - start
        print(f"Parallel execution: {elapsed:.2f}s")
        return results


# Example: 3 tools each taking 2s -> ~2s in parallel vs ~6s sequential
async def search_web(query: str):
    await asyncio.sleep(2)
    return f"Results for: {query}"

async def query_database(sql: str):
    await asyncio.sleep(2)
    return f"DB results for: {sql}"

async def fetch_weather(city: str):
    await asyncio.sleep(2)
    return f"Weather in: {city}"

executor = ParallelToolExecutor({
    "search_web": search_web,
    "query_database": query_database,
    "fetch_weather": fetch_weather,
})

# All three execute in ~2s instead of ~6s (3x speedup)
results = asyncio.run(executor.execute_parallel([
    {"name": "search_web", "args": {"query": "agent latency"}},
    {"name": "query_database", "args": {"sql": "SELECT 1"}},
    {"name": "fetch_weather", "args": {"city": "Berlin"}},
]))
</code>

===== Technique 2: Streaming Responses =====

Streaming dramatically reduces **perceived latency** -- users see the first token in 200-500ms instead of waiting 3-10 seconds for the full response.
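The distinction between time-to-first-token and total generation time can be illustrated without any model at all. A minimal sketch, using a simulated token stream (the `fake_token_stream` generator is a hypothetical stand-in, not a real API):

```python
import asyncio
import time


async def fake_token_stream(tokens, delay=0.01):
    """Simulated streaming model output: yields one token at a time."""
    for tok in tokens:
        await asyncio.sleep(delay)
        yield tok


async def measure_stream():
    start = time.monotonic()
    ttft = None
    pieces = []
    async for tok in fake_token_stream(["Hello", " ", "world", "!"]):
        if ttft is None:
            # First token arrives after one delay; user sees output here
            ttft = time.monotonic() - start
        pieces.append(tok)
    # Full generation only finishes after every token's delay
    total = time.monotonic() - start
    return ttft, total, "".join(pieces)


ttft, total, text = asyncio.run(measure_stream())
print(f"TTFT: {ttft * 1000:.0f}ms, total: {total * 1000:.0f}ms -> {text!r}")
```

With a real API, the same measurement applies: start a timer before the request, record TTFT on the first delta, and total time when the stream closes.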
  * Time-to-first-token (TTFT): typically 200-500ms with streaming
  * Without streaming: users wait for the full generation (3-30s depending on output length)
  * Intermediate-step display (as in Perplexity) further improves perceived speed
  * Streaming does not reduce total generation time, only the perceived wait

===== Technique 3: Optimized Inference Serving =====

Self-hosting with optimized serving engines delivers major throughput and latency gains.

**vLLM vs TGI vs naive PyTorch benchmarks (A100 GPU, Llama 3.1 8B):**(([[https://vllm.readthedocs.io/|vLLM Documentation]]))

^ Engine ^ Throughput (tok/s) ^ TTFT (ms) ^ Key Feature ^
| Naive PyTorch | 15-20 | 800-1200 | No optimization |
| HuggingFace TGI | 35-45 | 300-500 | Continuous batching |
| vLLM | 55-65 | 200-400 | PagedAttention + continuous batching |
| SGLang | 60-70 | 180-350 | RadixAttention + compiled graphs |
| TensorRT-LLM | 70-90 | 150-300 | Kernel fusion, NVIDIA-optimized |

//Source: MLPerf Inference Benchmark 2025, arXiv:2511.17593//

**Key optimizations in vLLM:**

  * **PagedAttention:** manages the KV cache like virtual-memory pages, eliminating fragmentation waste and enabling 2-4x more concurrent requests.
  * **Continuous batching:** new requests join the running batch without waiting for the current batch to finish.
  * **Prefix caching:** reuses the KV cache for shared prompt prefixes across requests.
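The prefix-caching idea can be sketched in miniature: when two requests share a system-prompt prefix, the expensive prefill work for that prefix is done once and reused. This toy version uses a dict and a counter of "prefilled" tokens; it illustrates the accounting only, and none of the names correspond to real vLLM APIs:

```python
class ToyPrefixCache:
    """Illustrative only: real engines cache KV states at the
    attention-block level; here we just count avoided prefill work."""

    def __init__(self):
        self.cache = {}          # cached prefixes (as token tuples)
        self.prefill_tokens = 0  # total tokens we had to prefill

    def prefill(self, tokens: list[str]) -> None:
        # Find the longest already-cached prefix of this prompt
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.cache:
                best = n
                break
        # Only the uncached suffix costs prefill work
        self.prefill_tokens += len(tokens) - best
        # Cache every prefix so later prompts can reuse partial matches
        for n in range(1, len(tokens) + 1):
            self.cache[tuple(tokens[:n])] = True


cache = ToyPrefixCache()
system = ["You", "are", "a", "helpful", "agent", "."]
cache.prefill(system + ["What", "is", "TTFT", "?"])    # 10 tokens prefilled
cache.prefill(system + ["Summarize", "this", "."])     # only 3 more: prefix reused
print(cache.prefill_tokens)  # 13 tokens of prefill instead of 19
```

The saving grows with the length of the shared prefix, which is why long system prompts benefit most from `--enable-prefix-caching`.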
<code bash>
# vLLM server launch with optimizations
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.90 \
    --dtype auto \
    --tensor-parallel-size 1
</code>

<code python>
# Client usage -- drop-in OpenAI-compatible
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Streaming for the lowest perceived latency
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
</code>

===== Technique 4: Smaller Models for Subtasks =====

Not every agent step requires a frontier model. Route subtasks to smaller, faster models:

^ Task ^ Recommended Model ^ Latency ^ Notes ^
| Intent classification | Fine-tuned BERT / Haiku | 10-50ms | Simple classification |
| Entity extraction | GPT-4o-mini / Gemini Flash | 100-300ms | Structured output |
| Summarization | GPT-4o-mini | 200-500ms | Good-enough quality |
| Complex reasoning | GPT-4o / Claude Sonnet | 1-5s | Only when needed |
| Code generation | Claude Sonnet / GPT-4o | 2-8s | Accuracy critical |

===== Technique 5: Speculative Decoding =====

Use a small draft model to predict tokens, then verify them in batch with the large model. This achieves a **2-3x speedup** on autoregressive generation without quality loss.

How it works:
  - The draft model generates N candidate tokens (fast, ~50ms)
  - The target model verifies all N tokens in a single forward pass
  - Accepted tokens are kept; rejected tokens trigger re-generation
  - Net effect: multiple tokens per forward pass of the large model

===== Technique 6: KV Cache Optimization =====

  * **Prefix caching:** when multiple requests share a system prompt, cache the KV states. This saves recomputation on every request.
  * **KV cache quantization:** compress the cache to FP8 or INT8, reducing memory 2-4x and enabling more concurrent requests.
  * **Paged KV cache:** vLLM allocates the cache in pages instead of contiguous blocks, reducing waste from 60-80% to under 4%.

===== Technique 7: Batching Strategies =====

^ Strategy ^ Throughput Gain ^ Latency Impact ^ Use Case ^
| Static batching | 2-4x | Increases (waits for batch) | Offline processing |
| Continuous batching | 3-5x | Minimal increase | Real-time serving |
| Dynamic batching | 2-3x | Configurable max wait | Mixed workloads |

===== End-to-End Optimization Pipeline =====

<code>
graph LR
    A[User Query] --> B[Route to Model Tier]
    B --> C{Needs tools?}
    C -->|Yes| D[Plan Tool Calls]
    D --> E[Execute in Parallel]
    E --> F[Stream Results]
    C -->|No| F
    F --> G[Speculative Decode]
    G --> H[Stream to User]
</code>

===== Production Optimization Checklist =====

  * **Quick wins (under 1 day):** enable streaming, set max_tokens, use smaller models for subtasks
  * **Medium effort (1 week):** implement parallel tool execution, add semantic caching
  * **Infrastructure (2-4 weeks):** deploy vLLM/SGLang, enable prefix caching, set up model routing

===== See Also =====

  * [[how_to_reduce_token_costs|How to Reduce Token Costs]]
  * [[caching_strategies_for_agents|Caching Strategies for Agents]]
  * [[what_is_an_ai_agent|What is an AI Agent]]

===== References =====