Agent Threat Modeling
Agent threat modeling is the systematic analysis of security vulnerabilities in LLM-based autonomous agents. As agents gain capabilities to execute code, access tools, and interact with external systems, they introduce novel attack surfaces that extend far beyond traditional prompt injection. The OWASP Top 10 for Agentic Applications (2026) and research by Schneier et al. frame these as multi-stage “Promptware Kill Chains” that hijack planning, tools, and propagation across systems.
Prompt Injection Chains
In agentic systems, prompt injections evolve from isolated manipulations into coordinated multi-tool, multi-step attacks:
Direct injection — Malicious instructions embedded in user inputs that subvert agent behavior
Indirect injection — Commands hidden in external data sources (documents,
API responses, emails, web pages) that agents process without adequate validation
Multi-chain injection — “Russian doll” attacks where nested injections propagate across multiple LLM chains in a workflow, each injection activating the next
Memory poisoning — Injections that persist in agent memory or conversation history, affecting future interactions
Recency bias exploitation — Adversarial instructions placed late in context windows to override earlier legitimate instructions
The Promptware Kill Chain (Schneier et al., 2026) models five stages of agentic prompt injection attacks:
Initial Access — Injection via user input, poisoned RAG data, emails, or web content
Privilege Escalation — Exploiting agent tool permissions to gain broader system access
Execution — Triggering unintended tool calls, code execution, or data modifications
Persistence — Embedding malicious instructions in agent memory or external stores
Propagation — Spreading compromised instructions to other agents or downstream systems
Agents inherit user privileges for tools, creating dangerous attack vectors:
Excessive agency — Over-privileged agents with more tool access than tasks require, enabling injected prompts to trigger actions like remote code execution, SSRF, or SQL injection
Tool chaining attacks — Attackers exploit the agent's planning capability to sequence legitimate tool calls in harmful ways (e.g., read credentials then exfiltrate via HTTP)
Plugin vulnerabilities — Third-party tool integrations (LangChain plugins,
API connectors) that lack input validation or have their own security flaws
Iterative refinement — Agents that retry and adjust tool calls may be manipulated into gradually escalating harmful behavior across multiple turns
Data Exfiltration
Compromised agents can leak sensitive data through multiple channels:
Direct exfiltration — Using network tools to send data to attacker-controlled endpoints
Steganographic leakage — Encoding secrets into seemingly innocuous agent outputs (e.g., markdown images with data in URLs)
Inter-agent propagation — In multi-agent systems, compromised agents passing sensitive data to other agents that have external communication capabilities
Side-channel leakage — Information leaking through timing, error messages, or behavioral patterns
Supply-Chain Attacks
Agent supply chains introduce multiple points of compromise:
Poisoned tool descriptions — Malicious instructions embedded in tool/function documentation that agents read during planning
Compromised RAG corpora — Adversarial content injected into retrieval databases that agents consult for knowledge
Malicious API responses — Third-party APIs returning crafted responses designed to hijack agent behavior
Model supply chain — Backdoored fine-tuned models or adapters that activate under specific conditions
Inter-agent message poisoning — Compromised agents in multi-agent systems sending malicious instructions disguised as legitimate coordination
Mitigations
Defense-in-depth strategies for securing LLM agents:
Input/Output Validation:
Sanitize all input sources: prompts, RAG results, tool outputs,
API responses, inter-agent messages
Deploy classifiers (e.g., LLM Guard) to detect injection attempts
Apply semantic checks for instruction-like content in data fields
Enforce length, format, and content-type constraints
Tool Sandboxing and Privilege Minimization:
Grant least-privilege access — agents should only access tools needed for the current task
Validate all tool calls before execution against an allowlist of permitted operations
Implement resource quotas and rate limiting on tool usage
Goal-Lock and Human-in-the-Loop:
Enforce immutable task goals that cannot be overridden by injected instructions
Require human approval for high-impact actions (financial transactions, data deletion, credential access)
Implement “break glass” kill switches for emergency agent termination
Monitoring and Detection:
Continuous behavioral monitoring for anomalous tool usage patterns
Multi-chain injection detectors that identify coordinated attack sequences
Provenance tracking for all data flowing through agent systems
# Example: Agent threat detection middleware
class AgentSecurityMiddleware:
def __init__(self, policy):
self.policy = policy
self.injection_detector = InjectionClassifier()
self.anomaly_detector = BehaviorAnomalyDetector()
def validate_tool_call(self, agent_id, tool_name, arguments):
"""Validate a tool call before execution."""
# Check tool is in agent's allowlist
if tool_name not in self.policy.allowed_tools(agent_id):
raise SecurityViolation(f"Unauthorized tool: {tool_name}")
# Scan arguments for injection attempts
if self.injection_detector.scan(str(arguments)):
raise SecurityViolation("Potential injection in tool args")
# Check for anomalous behavior patterns
if self.anomaly_detector.is_anomalous(agent_id, tool_name):
self.escalate_to_human(agent_id, tool_name, arguments)
return True # Allow execution
References
See Also