====== How to Speed Up Agents ======

Agent latency directly impacts user experience and throughput. Production systems achieve **50-80% latency reductions** by combining parallel tool calls, optimized inference serving, streaming, and intelligent model selection. This guide covers every layer of the optimization stack with real benchmarks.(([[https://blog.langchain.com/how-do-i-speed-up-my-agent/|How Do I Speed Up My Agent?]]))

===== Why Agent Latency Matters =====

A typical agent loop involves multiple LLM calls, tool executions, and reasoning steps. A single query can trigger 3-8 sequential LLM calls, each taking 1-5 seconds. Without optimization, end-to-end response times reach 15-40 seconds -- well beyond user tolerance thresholds.

===== The Agent Latency Stack =====

<code>
graph TB
    subgraph Application
        A1[Parallel Tool Calls]
        A2[Streaming Responses]
        A3[Speculative Execution]
    end
    subgraph Serving
        B1[vLLM / TGI / SGLang]
        B2[Continuous Batching]
        B3[KV Cache Reuse]
    end
    subgraph Model
        C1[Smaller Models for Subtasks]
        C2[Speculative Decoding]
        C3[Quantization]
    end
    subgraph Infrastructure
        D1[GPU Selection]
        D2[Edge Deployment]
        D3[Connection Pooling]
    end
    Application --> Serving
    Serving --> Model
    Model --> Infrastructure
</code>

Each layer is covered in the techniques below; the serving-layer engines (vLLM, TGI, SGLang) are benchmarked in Technique 3.(([[https://arxiv.org/abs/2511.17593|Comparative Analysis: vLLM vs HuggingFace TGI]]))

===== Technique 1: Parallel Tool Execution =====

The single biggest latency win for agents.
Instead of executing tools sequentially, run independent calls concurrently.(([[https://langcopilot.com/posts/2025-10-17-why-ai-agents-fail-latency-planning|Why AI Agents Fail: Latency]]))

**Measured impact:** >20% latency reduction (LLMCompiler benchmark), with gains scaling linearly with the number of independent tools.(([[https://georgian.io/reduce-llm-costs-and-latency-guide|Reduce LLM Costs and Latency Guide]]))

<code python>
import asyncio
import time
from typing import Any


class ParallelToolExecutor:
    def __init__(self, tools: dict):
        self.tools = tools

    async def execute_parallel(self, tool_calls: list[dict]) -> list[Any]:
        # Schedule every independent tool call as its own task
        tasks = [
            asyncio.create_task(self.tools[call["name"]](**call["args"]))
            for call in tool_calls
        ]
        start = time.monotonic()
        results = await asyncio.gather(*tasks, return_exceptions=True)
        elapsed = time.monotonic() - start
        print(f"Parallel execution: {elapsed:.2f}s")
        return results


# Example: 3 tools each taking 2s -> ~2s in parallel vs ~6s sequential
async def search_web(query: str):
    await asyncio.sleep(2)
    return f"Results for: {query}"

async def query_database(sql: str):
    await asyncio.sleep(2)
    return f"DB results for: {sql}"

async def fetch_weather(city: str):
    await asyncio.sleep(2)
    return f"Weather in: {city}"

executor = ParallelToolExecutor({
    "search_web": search_web,
    "query_database": query_database,
    "fetch_weather": fetch_weather,
})

# All three execute in ~2s instead of ~6s (3x speedup)
results = asyncio.run(executor.execute_parallel([
    {"name": "search_web", "args": {"query": "agent latency"}},
    {"name": "query_database", "args": {"sql": "SELECT 1"}},
    {"name": "fetch_weather", "args": {"city": "Berlin"}},
]))
</code>

===== Technique 2: Streaming Responses =====

Streaming dramatically reduces **perceived latency** -- users see the first token in 200-500ms instead of waiting 3-10 seconds for the full response.
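The distinction between time-to-first-token and total generation time can be illustrated without any model at all. A minimal sketch, using a simulated token stream (the `fake_token_stream` generator is a hypothetical stand-in, not a real API):

```python
import asyncio
import time


async def fake_token_stream(tokens, delay=0.01):
    """Simulated streaming model output: yields one token at a time."""
    for tok in tokens:
        await asyncio.sleep(delay)
        yield tok


async def measure_stream():
    start = time.monotonic()
    ttft = None
    pieces = []
    async for tok in fake_token_stream(["Hello", " ", "world", "!"]):
        if ttft is None:
            # First token arrives after one delay; user sees output here
            ttft = time.monotonic() - start
        pieces.append(tok)
    # Full generation only finishes after every token's delay
    total = time.monotonic() - start
    return ttft, total, "".join(pieces)


ttft, total, text = asyncio.run(measure_stream())
print(f"TTFT: {ttft * 1000:.0f}ms, total: {total * 1000:.0f}ms -> {text!r}")
```

With a real API, the same measurement applies: start a timer before the request, record TTFT on the first delta, and total time when the stream closes.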
  * Time-to-first-token (TTFT): typically 200-500ms with streaming
  * Without streaming: users wait for the full generation (3-30s depending on output length)
  * Intermediate-step display (as in Perplexity) further improves perceived speed
  * Streaming does not reduce total generation time, only the perceived wait

===== Technique 3: Optimized Inference Serving =====

Self-hosting with optimized serving engines delivers major throughput and latency gains.

**vLLM vs TGI vs naive PyTorch benchmarks (A100 GPU, Llama 3.1 8B):**(([[https://vllm.readthedocs.io/|vLLM Documentation]]))

^ Engine ^ Throughput (tok/s) ^ TTFT (ms) ^ Key Feature ^
| Naive PyTorch | 15-20 | 800-1200 | No optimization |
| HuggingFace TGI | 35-45 | 300-500 | Continuous batching |
| vLLM | 55-65 | 200-400 | PagedAttention + continuous batching |
| SGLang | 60-70 | 180-350 | RadixAttention + compiled graphs |
| TensorRT-LLM | 70-90 | 150-300 | Kernel fusion, NVIDIA-optimized |

//Source: MLPerf Inference Benchmark 2025, arXiv:2511.17593//

**Key optimizations in vLLM:**

  * **PagedAttention:** manages the KV cache like virtual-memory pages, eliminating fragmentation waste and enabling 2-4x more concurrent requests.
  * **Continuous batching:** new requests join the running batch without waiting for the current batch to finish.
  * **Prefix caching:** reuses the KV cache for shared prompt prefixes across requests.
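The prefix-caching idea can be sketched in miniature: when two requests share a system-prompt prefix, the expensive prefill work for that prefix is done once and reused. This toy version uses a dict and a counter of "prefilled" tokens; it illustrates the accounting only, and none of the names correspond to real vLLM APIs:

```python
class ToyPrefixCache:
    """Illustrative only: real engines cache KV states at the
    attention-block level; here we just count avoided prefill work."""

    def __init__(self):
        self.cache = {}          # cached prefixes (as token tuples)
        self.prefill_tokens = 0  # total tokens we had to prefill

    def prefill(self, tokens: list[str]) -> None:
        # Find the longest already-cached prefix of this prompt
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.cache:
                best = n
                break
        # Only the uncached suffix costs prefill work
        self.prefill_tokens += len(tokens) - best
        # Cache every prefix so later prompts can reuse partial matches
        for n in range(1, len(tokens) + 1):
            self.cache[tuple(tokens[:n])] = True


cache = ToyPrefixCache()
system = ["You", "are", "a", "helpful", "agent", "."]
cache.prefill(system + ["What", "is", "TTFT", "?"])    # 10 tokens prefilled
cache.prefill(system + ["Summarize", "this", "."])     # only 3 more: prefix reused
print(cache.prefill_tokens)  # 13 tokens of prefill instead of 19
```

The saving grows with the length of the shared prefix, which is why long system prompts benefit most from `--enable-prefix-caching`.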
<code bash>
# vLLM server launch with optimizations
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.90 \
    --dtype auto \
    --tensor-parallel-size 1
</code>

<code python>
# Client usage -- drop-in OpenAI-compatible
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Streaming for the lowest perceived latency
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
</code>

===== Technique 4: Smaller Models for Subtasks =====

Not every agent step requires a frontier model. Route subtasks to smaller, faster models:

^ Task ^ Recommended Model ^ Latency ^ Notes ^
| Intent classification | Fine-tuned BERT / Haiku | 10-50ms | Simple classification |
| Entity extraction | GPT-4o-mini / Gemini Flash | 100-300ms | Structured output |
| Summarization | GPT-4o-mini | 200-500ms | Good-enough quality |
| Complex reasoning | GPT-4o / Claude Sonnet | 1-5s | Only when needed |
| Code generation | Claude Sonnet / GPT-4o | 2-8s | Accuracy critical |

===== Technique 5: Speculative Decoding =====

Use a small draft model to predict tokens, then verify them in batch with the large model. This achieves a **2-3x speedup** on autoregressive generation without quality loss.

How it works:
  - The draft model generates N candidate tokens (fast, ~50ms)
  - The target model verifies all N tokens in a single forward pass
  - Accepted tokens are kept; rejected tokens trigger re-generation
  - Net effect: multiple tokens per forward pass of the large model

===== Technique 6: KV Cache Optimization =====

  * **Prefix caching:** when multiple requests share a system prompt, cache the KV states. This saves recomputation on every request.
  * **KV cache quantization:** compress the cache to FP8 or INT8, reducing memory 2-4x and enabling more concurrent requests.
  * **Paged KV cache:** vLLM allocates the cache in pages instead of contiguous blocks, reducing waste from 60-80% to under 4%.

===== Technique 7: Batching Strategies =====

^ Strategy ^ Throughput Gain ^ Latency Impact ^ Use Case ^
| Static batching | 2-4x | Increases (waits for batch) | Offline processing |
| Continuous batching | 3-5x | Minimal increase | Real-time serving |
| Dynamic batching | 2-3x | Configurable max wait | Mixed workloads |

===== End-to-End Optimization Pipeline =====

<code>
graph LR
    A[User Query] --> B[Route to Model Tier]
    B --> C{Needs tools?}
    C -->|Yes| D[Plan Tool Calls]
    D --> E[Execute in Parallel]
    E --> F[Stream Results]
    C -->|No| F
    F --> G[Speculative Decode]
    G --> H[Stream to User]
</code>

===== Production Optimization Checklist =====

  * **Quick wins (under 1 day):** enable streaming, set max_tokens, use smaller models for subtasks
  * **Medium effort (1 week):** implement parallel tool execution, add semantic caching
  * **Infrastructure (2-4 weeks):** deploy vLLM/SGLang, enable prefix caching, set up model routing

===== See Also =====

  * [[how_to_reduce_token_costs|How to Reduce Token Costs]]
  * [[caching_strategies_for_agents|Caching Strategies for Agents]]
  * [[what_is_an_ai_agent|What is an AI Agent]]

===== References =====