Speculative Tool Execution

Speculative tool execution predicts and pre-executes likely future tool calls during LLM thinking time, hiding latency in the iterative LLM-tool loop. The key framework is PASTE (Pattern-Aware Speculative Tool Execution), introduced in arXiv:2603.18897, which reduces average task completion time by 48.5% and boosts tool throughput by 1.8x.

The Latency Problem

LLM agents follow a serial loop:

  1. LLM generates a tool call (or final answer)
  2. Tool executes and returns results
  3. Results feed back into context for next LLM generation
  4. Repeat

Each iteration pays both LLM inference latency and tool execution latency sequentially. For agents making dozens of tool calls per task (e.g., SWE-bench coding, web browsing), this compounds into significant wall-clock time.
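To make the compounding concrete, here is a toy back-of-the-envelope calculation (illustrative numbers, not taken from the paper): in the serial loop every step pays both latencies, while a speculation hit hides the tool latency inside LLM thinking time (idealized here as full overlap).

```python
def serial_time(steps, llm_s, tool_s):
    # Each iteration pays LLM inference + tool execution back-to-back
    return steps * (llm_s + tool_s)

def speculative_time(steps, llm_s, tool_s, hit_rate):
    # On a speculation hit the tool already ran during LLM thinking time,
    # so its latency is hidden (idealized as fully overlapped)
    expected_tool = tool_s * (1 - hit_rate)
    return steps * (llm_s + expected_tool)

# 30 tool calls, 2 s per LLM step, 3 s per tool call, 70% hit rate
print(serial_time(30, 2.0, 3.0))            # 150.0 seconds
print(speculative_time(30, 2.0, 3.0, 0.7))  # 87.0 seconds
```

Even this crude model shows why hiding tool latency dominates once an agent makes dozens of calls per task.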

Pattern-Aware Speculation

PASTE exploits recurring patterns in agent behavior to predict future tool calls, modeling each speculative pattern as a tuple P = (C, T, f, p).

The pattern predictor uses hash-map lookups on context keys for constant-time candidate generation, achieving high Top-1 accuracy and Top-3 recall.
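A minimal predictor along these lines might key a hash map on the most recent tool call and return the most frequent successors. The context-key choice (last tool name) and data shapes below are assumptions for illustration, not the paper's exact design:

```python
from collections import defaultdict, Counter

class PatternDB:
    def __init__(self):
        # context key (here: last tool name) -> Counter over next calls,
        # each call identified as (tool_name, canonical_args_repr)
        self.transitions = defaultdict(Counter)

    def record(self, prev_tool, next_tool, args_repr):
        # Update transition counts as the agent runs
        self.transitions[prev_tool][(next_tool, args_repr)] += 1

    def predict(self, prev_tool, k=3):
        # Constant-time hash lookup, then top-k most frequent successors
        return [call for call, _ in self.transitions[prev_tool].most_common(k)]

db = PatternDB()
db.record("read_file", "run_tests", "{}")
db.record("read_file", "run_tests", "{}")
db.record("read_file", "edit_file", "{'path': 'a.py'}")
print(db.predict("read_file", k=2))
```

Top-1 accuracy then corresponds to how often `predict(..., k=1)` matches the real next call, and Top-3 recall to hits anywhere in `predict(..., k=3)`.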

# Simplified speculative tool execution (illustrative sketch)
class SpeculativeExecutor:
    def __init__(self, pattern_db, tool_registry):
        self.patterns = pattern_db
        self.tools = tool_registry
        self.cache = {}  # (tool_name, args_key) -> speculative job or result

    @staticmethod
    def _key(tool_name, args):
        # Dicts are unhashable, so build a deterministic key instead of hash(args)
        return (tool_name, repr(sorted(args.items())))

    def on_tool_complete(self, context, result):
        """After a tool completes, speculatively start predicted next tools."""
        context.append(result.metadata)
        for tool_name, args, priority in self.patterns.predict(context):
            key = self._key(tool_name, args)
            if key not in self.cache:
                # Launch speculative execution in the background; preemptible
                # jobs can be evicted when real calls need the resources
                self.cache[key] = self.tools.execute_async(
                    tool_name, args, preemptible=True
                )

    def execute(self, tool_name, args):
        """Execute a real tool call, reusing a speculative result if available."""
        key = self._key(tool_name, args)
        if key in self.cache:
            # Finished job: reuse immediately. In-flight job: promote to
            # non-preemptible authoritative execution and wait for it.
            return self.cache.pop(key).promote()
        return self.tools.execute(tool_name, args)

Scheduling Architecture

PASTE maintains two execution queues that share a result cache: a queue for real tool calls issued by the LLM, and a speculative queue for predicted calls.

Opportunistic scheduling: Speculative jobs run greedily on slack resources. On contention, low-utility speculative jobs are preempted to prioritize real calls.
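The preemption rule can be sketched as a small admission function over a bounded worker pool. The job representation and utility field are assumptions for illustration:

```python
def admit_real_call(running, capacity, real_job):
    # running: list of speculative jobs, each {"utility": float, "preempted": bool}
    # Real calls always get a slot; on contention, preempt the
    # lowest-utility speculative job (opportunistic scheduling)
    if len(running) >= capacity:
        victim = min(running, key=lambda j: j["utility"])
        victim["preempted"] = True
        running.remove(victim)
    running.append(real_job)

pool = [{"utility": 0.2, "preempted": False},
        {"utility": 0.9, "preempted": False}]
admit_real_call(pool, 2, {"utility": float("inf"), "preempted": False})
print([j["utility"] for j in pool])  # [0.9, inf]
```

Real calls carry infinite utility here so they are never chosen as preemption victims.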

Promotion and reuse: When the LLM issues a real tool call:

  1. Check cache for matching speculative result
  2. If complete: reuse immediately (zero latency)
  3. If in-progress: promote to non-preemptible authoritative execution
  4. If no match: execute normally
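The three cache outcomes above can be sketched with a small future-like wrapper; `promote` and the preemption flag are modeled on the text, not a real API:

```python
import threading

class SpeculativeJob:
    def __init__(self):
        self.preemptible = True          # scheduler may evict this job
        self.done = threading.Event()
        self.result = None

    def finish(self, value):
        # Called by the worker when speculative execution completes
        self.result = value
        self.done.set()

    def promote(self):
        # A matching real call arrived: mark the job non-preemptible so the
        # scheduler can no longer evict it, then wait for (or reuse) the result
        self.preemptible = False
        self.done.wait()
        return self.result

job = SpeculativeJob()
job.finish("diff applied")   # speculation completed before the real call
print(job.promote())         # reuse immediately: prints "diff applied"
```

If the job is already done, `promote` returns at once (the zero-latency reuse case); otherwise the caller blocks only for the remaining execution time.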

When to Speculate

Speculate when:

  - Pattern confidence for the current context is high (strong Top-1/Top-3 predictions)
  - The predicted tool is read-only or otherwise safe to discard on mispredict
  - Slack resources are available, so speculation does not delay real calls

Wait when:

  - The tool has external side effects that cannot be rolled back
  - The context is novel or pattern confidence is low
  - Resources are contended and speculative jobs would be preempted anyway
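One plausible gate, assuming each pattern carries a confidence score and each tool declares whether it has side effects (both hypothetical fields):

```python
def should_speculate(pattern_conf, tool_has_side_effects,
                     idle_workers, conf_threshold=0.6):
    # Speculate only on confident predictions, side-effect-free tools,
    # and when there is slack capacity to burn
    if tool_has_side_effects:
        return False   # a misprediction could not be discarded safely
    if idle_workers == 0:
        return False   # no slack resources to run on
    return pattern_conf >= conf_threshold

print(should_speculate(0.8, False, 2))   # True
print(should_speculate(0.8, True, 2))    # False
```

The threshold value is a tunable assumption; in practice it would be set from observed precision of the pattern database.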

Rollback on Misprediction

Incorrect speculations are handled safely: speculative results live only in the cache and never touch authoritative agent state, so a mispredicted job is simply cancelled if still in flight, or its cached result discarded.

This is analogous to CPU branch prediction: speculate optimistically, discard on mispredict, with no correctness impact.
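Continuing the branch-prediction analogy, misprediction cleanup can be as simple as cancelling and dropping every cached job the real call did not claim (a sketch; the `cancel` method is an assumed interface):

```python
def rollback_unused(cache, claimed_key):
    # Discard speculative work the real call did not match:
    # cancel in-flight jobs and drop their cached entries
    discarded = 0
    for key in list(cache):
        if key != claimed_key:
            cache.pop(key).cancel()
            discarded += 1
    return discarded

class FakeJob:
    def __init__(self):
        self.cancelled = False
    def cancel(self):
        self.cancelled = True

cache = {"a": FakeJob(), "b": FakeJob(), "c": FakeJob()}
print(rollback_unused(cache, "b"))   # 2 jobs discarded; only "b" remains
```

Because nothing outside the cache was mutated, discarding is the entire rollback, mirroring how a CPU squashes wrong-path instructions.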

Latency Reduction Results

Benchmarked on SWE-bench and MetaGPT agent workloads, PASTE reduces average task completion time by 48.5% and increases tool throughput by 1.8x.
