Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Speculative tool execution predicts and pre-executes likely future tool calls during LLM thinking time, hiding latency in the iterative LLM-tool loop. The key framework is PASTE (Pattern-Aware Speculative Tool Execution), introduced in arXiv:2603.18897, which reduces average task completion time by 48.5% and boosts tool throughput by 1.8x.
LLM agents follow a serial loop: the model thinks and emits a tool call, blocks while the tool executes, appends the result to its context, and then thinks again.
Each iteration pays both LLM inference latency and tool execution latency sequentially. For agents making dozens of tool calls per task (e.g., SWE-bench coding, web browsing), this compounds into significant wall-clock time.
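To make the compounding concrete, here is a toy cost model (all numbers hypothetical, not from the paper) comparing serial execution with speculation that hides tool latency inside LLM thinking time:

```python
def serial_time(n_calls, t_llm, t_tool):
    # Every iteration pays LLM inference and tool execution back to back.
    return n_calls * (t_llm + t_tool)

def speculative_time(n_calls, t_llm, t_tool, hit_rate):
    # A correctly speculated tool runs during the next LLM inference, so its
    # latency is hidden (up to the thinking time); misses still pay in full.
    hidden = min(t_llm, t_tool)
    return n_calls * (t_llm + t_tool) - n_calls * hit_rate * hidden

# 40 tool calls, 2 s of LLM inference and 1.5 s of tool latency each
print(serial_time(40, 2.0, 1.5))            # 140.0 seconds
print(speculative_time(40, 2.0, 1.5, 0.8))  # 92.0 seconds
```

Even with a modest 80% hit rate, wall-clock time drops by roughly a third in this model, which is why speculation pays off for tool-heavy tasks.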
PASTE exploits observed patterns in agent behavior to predict future tool calls. Each speculative pattern P = (C, T, f, p) associates a context condition with a predicted tool call, along with statistics on how often and how reliably the pattern fires.
The pattern predictor uses hash-map lookups on context keys for constant-time candidate generation, achieving high Top-1 accuracy and Top-3 recall.
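A minimal sketch of such a predictor, assuming the simplest possible context key (the last completed tool call) and frequency-ranked candidates; the actual PASTE key construction and ranking are more involved:

```python
from collections import defaultdict, Counter

class PatternDB:
    """Maps a context key (here: the previous tool call) to observed successors."""
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def record(self, prev_call, next_call):
        # Learn patterns online from completed agent traces.
        self.transitions[prev_call][next_call] += 1

    def predict(self, prev_call, k=3):
        # Constant-time hash-map lookup, then rank candidates by frequency.
        return [call for call, _ in self.transitions[prev_call].most_common(k)]

db = PatternDB()
db.record(("read_file", "a.py"), ("run_tests", "a"))
db.record(("read_file", "a.py"), ("run_tests", "a"))
db.record(("read_file", "a.py"), ("edit_file", "a.py"))
print(db.predict(("read_file", "a.py")))
# Most frequent successor first: [('run_tests', 'a'), ('edit_file', 'a.py')]
```

Because prediction is a single dictionary lookup, candidate generation adds negligible overhead to each agent step.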
```python
# Simplified speculative tool execution
class SpeculativeExecutor:
    def __init__(self, pattern_db, tool_registry):
        self.patterns = pattern_db
        self.tools = tool_registry
        self.cache = {}  # key: (tool_name, args_hash) -> result

    def on_tool_complete(self, context, result):
        """After a tool completes, speculatively start predicted next tools."""
        context.append(result.metadata)
        predictions = self.patterns.predict(context)
        for tool_name, args, priority in predictions:
            cache_key = (tool_name, hash(args))
            if cache_key not in self.cache:
                # Launch speculative execution in background
                self.cache[cache_key] = self.tools.execute_async(
                    tool_name, args, preemptible=True
                )

    def execute(self, tool_name, args):
        """Execute a tool, reusing speculative result if available."""
        cache_key = (tool_name, hash(args))
        if cache_key in self.cache:
            return self.cache[cache_key].promote()  # reuse or promote
        return self.tools.execute(tool_name, args)
```
PASTE maintains two execution queues sharing a result cache: a real queue for calls the LLM has actually issued, and a lower-priority speculative queue for predicted calls.
Opportunistic scheduling: Speculative jobs run greedily on slack resources. On contention, low-utility speculative jobs are preempted to prioritize real calls.
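The preemption rule can be sketched with a toy scheduler (structure and names are illustrative, not from the paper): real calls always outrank speculative ones, and an arriving real call evicts a running speculative job back onto the wait queue rather than discarding it:

```python
import heapq

REAL, SPEC = 0, 1  # lower value = higher priority

class SlackScheduler:
    def __init__(self, workers):
        self.workers = workers
        self.running = []  # jobs currently occupying a worker
        self.queue = []    # heap of waiting (priority, seq, job) entries
        self.seq = 0       # tie-breaker so the heap never compares job objects

    def submit(self, job, speculative):
        entry = (SPEC if speculative else REAL, self.seq, job)
        self.seq += 1
        if len(self.running) < self.workers:
            self.running.append(entry)  # slack available: run greedily
            return
        victims = [e for e in self.running if e[0] == SPEC]
        if entry[0] == REAL and victims:
            # Contention: preempt a speculative job to prioritize the real call.
            self.running.remove(victims[0])
            heapq.heappush(self.queue, victims[0])  # re-queued, not lost
            self.running.append(entry)
        else:
            heapq.heappush(self.queue, entry)

sched = SlackScheduler(workers=1)
sched.submit("speculative:grep", speculative=True)
sched.submit("real:run_tests", speculative=False)
print([e[2] for e in sched.running])  # ['real:run_tests']
print([e[2] for e in sched.queue])    # ['speculative:grep']
```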
Promotion and reuse: when the LLM issues a real tool call, the executor first checks the shared cache. A matching speculative job is promoted to real priority, reusing its in-flight work or finished result; on a miss, the call executes normally.
Speculate when: predictions are high-confidence, the candidate tool is side-effect-free, and slack resources are available.
Wait when: the tool would mutate external state, or confidence is too low to justify contending for resources.
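One illustrative gating policy (the tool list, threshold, and criteria are assumptions, not the paper's actual conditions):

```python
# Read-only tools assumed safe to run speculatively (illustrative list).
SIDE_EFFECT_FREE = {"read_file", "grep", "list_dir"}

def should_speculate(tool_name, confidence, slack_workers, threshold=0.6):
    """Gate speculation: side-effect-free tool, confident pattern, idle capacity."""
    return (
        tool_name in SIDE_EFFECT_FREE
        and confidence >= threshold
        and slack_workers > 0
    )

print(should_speculate("read_file", 0.9, slack_workers=2))   # True
print(should_speculate("write_file", 0.9, slack_workers=2))  # False: mutates state
print(should_speculate("grep", 0.3, slack_workers=2))        # False: low confidence
```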
Incorrect speculations are handled safely: a mispredicted job is preempted or allowed to finish, and its cached result is simply discarded, so the only cost is wasted compute. This is analogous to CPU branch prediction: speculate optimistically, discard on mispredict, with no impact on correctness.
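Discarding a misprediction can be sketched with Python futures; this assumes speculated tools are side-effect-free, so an unwanted result can be dropped without cleanup:

```python
from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor(max_workers=2)

def speculative_tool():
    time.sleep(0.05)
    return "unused result"

cache = {("grep", "foo"): pool.submit(speculative_tool)}

# The LLM actually issued a different call: the speculation was wrong.
mispredicted = cache.pop(("grep", "foo"))
if not mispredicted.cancel():  # cancel if it has not started yet
    mispredicted.result()      # else let it finish and drop the value
# Because the tool was side-effect-free, agent state is untouched.
print(("grep", "foo") in cache)  # False
pool.shutdown()
```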
Benchmarked on SWE-bench and MetaGPT agent workloads, PASTE achieves the gains cited above: a 48.5% reduction in average task completion time and a 1.8x improvement in tool throughput over serial execution.