Specific defenses against prompt injection attacks in AI agent systems, including input sanitization, output filtering, privilege separation, canary tokens, and instruction hierarchy.
Overview
Prompt injection is a critical vulnerability where adversaries embed malicious instructions in inputs that are processed by LLM-powered agents. Unlike traditional injection attacks (SQL, XSS), prompt injection exploits the fundamental inability of LLMs to distinguish between instructions and data. For agents with tool access, this creates a “Confused Deputy” problem where the agent executes attacker-controlled actions with its own privileges.
This page covers defensive techniques specific to agent systems. For broader security architecture, see threat modeling.
Attack Surface in Agent Systems
Agent-specific injection vectors include:
Direct injection – Malicious instructions in user input
Indirect injection – Hidden instructions in documents, web pages, emails, or tool outputs the agent processes
Multi-modal injection – Instructions embedded in images, audio, or other non-text inputs
Tool-mediated injection – Malicious content returned by external APIs or databases that the agent ingests
Cross-agent injection – In multi-agent systems, one compromised agent passing malicious instructions to another
Input Sanitization
Filter and validate all inputs before they reach the LLM.
Use allowlists for trusted content patterns and treat all external inputs as untrusted
Deploy web application firewalls (WAFs) with custom rules to detect long inputs, suspicious strings, or injection patterns
Implement AI-powered classifiers trained on adversarial prompt datasets to detect injection attempts in real-time
Strip or escape special delimiters, markdown formatting, and control characters from user inputs
importrefrom dataclasses import dataclass
@dataclass
class SanitizationResult:
clean_text: str
flagged: bool
flags: list[str]
INJECTION_PATTERNS =[
r"ignore (all |any )?(previous|prior|above) instructions",
r"you are now",
r"new (system |base )?prompt",
r"disregard (your|all|the) (rules|instructions|guidelines)",
r"\[INST\]|\[/INST\]|<<SYS>>|<\|im_start\|>",
r"act as if you",
r"pretend (you are|to be)",]
CANARY_TOKEN ="CANARY_8f3a2b"def sanitize_input(user_input: str) -> SanitizationResult:
flags =[]
text = user_input.strip()# Check for known injection patternsfor pattern in INJECTION_PATTERNS:
ifre.search(pattern, text,re.IGNORECASE):
flags.append(f"injection_pattern: {pattern}")# Check for excessive length (common in injection payloads)iflen(text)>10000:
flags.append("excessive_length")# Strip invisible unicode characters used to hide instructions
text =re.sub(r"[\u200b-\u200f\u2028-\u202f\u2060\ufeff]","", text)return SanitizationResult(
clean_text=text,
flagged=len(flags)>0,
flags=flags,)def build_prompt_with_canary(system_prompt: str, user_input: str) ->str:
return(
f"{system_prompt}\n"
f"SECURITY TOKEN: {CANARY_TOKEN}\n"
f"---BEGIN USER INPUT---\n"
f"{user_input}\n"
f"---END USER INPUT---\n"
f"If you reference {CANARY_TOKEN} in your response, STOP immediately.")
Output Filtering
Apply dual-layer moderation: input guardrails before LLM processing and output guardrails afterward.
Scan LLM outputs for signs of instruction following from injected content (e.g., the output contains URLs, code, or commands not in the original prompt)
Redact PII that may have been exfiltrated via injection
Apply markdown sanitization and suspicious URL redaction
Use regex filters tailored to application policies
Monitor for canary token leakage in outputs – if a canary appears, the LLM was manipulated
Privilege Separation
Isolate components with fine-grained access controls to limit blast radius.
Assign each agent the minimum permissions needed for its task (principle of least privilege)
Map identity tokens to IAM roles creating trust boundaries between system components
Use separate LLM instances for processing untrusted content vs executing privileged actions
Implement approval gates for high-risk operations (file writes, API calls, data access)
Canary Tokens
Embed hidden sentinel values in system prompts or data to detect tampering.
Place unique, non-obvious tokens in system instructions
If the LLM references these tokens in outputs, flag the interaction as compromised
Monitor for token exposure in real-time via automated systems
Rotate canary tokens periodically to prevent adversaries from learning them
Instruction Hierarchy
Structure prompts to prioritize system instructions over user or external inputs.
Security thought reinforcement – Wrap user content with directives reminding the LLM to ignore adversarial instructions
Use delimited sections with clear boundaries between system, user, and external content
Enforce hierarchy: System prompt > User instructions > Retrieved context > Tool outputs
The CaMeL framework (2025) separates control flow from data flow, preventing untrusted data from impacting program execution
Defense Architecture
graph TD
A[User Input] --> B[WAF / Rate Limiting]
B --> C[Input Sanitizer]
C -->|Flagged| D[Block or Human Review]
C -->|Clean| E[Canary Token Injection]
E --> F[Instruction Hierarchy Wrapper]
F --> G[LLM Processing]
G --> H[Output Filter]
H -->|Canary Leaked| D
H -->|Suspicious Content| I[Privilege Check]
I -->|High Risk Action| J[Approval Gate]
I -->|Low Risk| K[Return Response]
J -->|Approved| K
J -->|Denied| D
style D fill:#f66,stroke:#333
style K fill:#6f6,stroke:#333
Additional Defensive Techniques
AI-powered monitoring – Runtime classifiers that learn from live threats and block novel attacks (Lakera Guard, OpenAI safety classifiers)
Human-in-the-loop – Require human approval for high-risk actions with risk scoring and audit logs
Adversarial testing – Regular red teaming with real-world adversarial datasets to find weaknesses
Model-level safeguards – Fine-tune models with adversarial training data for inherent resilience
ICON framework – Latent space probing to detect injection signatures and attention steering to neutralize attacks while preserving task utility
Content Security Policies – Minimize external data ingestion and restrict outbound connections