====== How to Speed Up Agents ======
Agent latency directly impacts user experience and throughput. Production systems achieve **50-80% latency reductions** by combining parallel tool calls, optimized inference serving, streaming, and intelligent model selection. This guide covers every layer of the optimization stack with real benchmarks.(([[https://
===== Why Agent Latency Matters =====
end
subgraph Serving
  B1[vLLM / TGI / SGLang]
  B2[Continuous Batching]
  B3[KV Cache Reuse]
===== Technique 1: Parallel Tool Execution =====
The single biggest latency win for agents: instead of executing tools sequentially, dispatch independent tool calls concurrently.
**Measured impact:** >20% latency reduction (LLMCompiler benchmark), with gains scaling linearly with the number of independent tools.(([[https://
<code python>
Self-hosting with optimized serving engines delivers major throughput and latency gains.
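These gains show up most directly in time-to-first-token (TTFT). As a minimal sketch of how to measure it on your own deployment, the helper below times the first chunk of any token stream; the ''fake_stream'' generator is a hypothetical stand-in for a real streaming client (e.g. iterating over a server's streamed response) and is not part of the original benchmark setup.

```python
import time

def time_to_first_token(stream):
    """Seconds elapsed until the stream yields its first chunk (TTFT)."""
    start = time.perf_counter()
    for _ in stream:
        return time.perf_counter() - start
    return float("inf")  # the stream produced nothing

def fake_stream(startup_delay, n_tokens):
    """Stand-in for a real token stream: one startup delay, then tokens."""
    time.sleep(startup_delay)
    for i in range(n_tokens):
        yield f"tok{i}"

# Simulated server with ~50 ms time-to-first-token:
ttft = time_to_first_token(fake_stream(0.05, 10))
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Pointing the same helper at a real streamed response gives a quick, engine-agnostic way to compare serving setups.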
**vLLM vs TGI vs Naive PyTorch Benchmarks (A100 GPU, Llama 3.1 8B):**(([[https://
^ Engine ^ Throughput (tok/s) ^ TTFT (ms) ^ Key Feature ^
* **Medium effort (1 week):** Implement parallel tool execution, add semantic caching
* **Infrastructure (2-4 weeks):** Deploy vLLM/
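The semantic-caching quick win above can be sketched as follows: the cache returns a stored response whenever a new query is similar enough to one seen before, skipping the LLM call entirely. The bag-of-words ''embed'' function and the 0.8 threshold are toy assumptions for illustration; a production cache would use a real embedding model and a vector index.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; real caches use an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    """Serve a cached response when a query resembles a past one."""
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        query_vec = embed(query)
        best = max(self.entries, key=lambda e: cosine(query_vec, e[0]),
                   default=None)
        if best is not None and cosine(query_vec, best[0]) >= self.threshold:
            return best[1]  # cache hit: the LLM call is skipped entirely
        return None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is the capital of France", "Paris")
print(cache.get("what is the capital of France?"))  # near-duplicate: hit
print(cache.get("how do I speed up agents"))        # unrelated: None
```

The threshold trades hit rate against the risk of serving a stale or wrong answer, so it should be tuned per workload.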
| - | |||
| - | ===== References ===== | ||
| - | |||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | * [[https:// | ||
===== See Also =====
* [[what_is_an_ai_agent|What is an AI Agent]]
===== References =====