Agent Red Teaming
As LLM agents are deployed in multi-agent systems and web automation, new attack surfaces emerge beyond traditional prompt injection. This page covers red-teaming of multi-agent communication, evolutionary attacks on web agents, and systematic penetration testing of LLM systems.
The Multi-Agent Attack Surface
Multi-agent LLM systems introduce vulnerabilities absent from single-agent deployments:
Inter-agent communication channels: Messages between agents can be intercepted and manipulated
Trust propagation: Agents tend to trust information from peer agents, enabling influence cascades
Topology-dependent risks: Different communication structures (hierarchical, peer-to-peer, complete graph) expose different vulnerabilities
Web interfaces: Agents that browse the web are exposed to adversarial content injection
Agent-in-the-Middle (AiTM): Communication Attacks
Red-Teaming LLM Multi-Agent Systems via Communication Attacks (He et al., 2025, arXiv:2502.14847) introduces the Agent-in-the-Middle (AiTM) attack, which targets the communication layer of LLM-based multi-agent systems.
Attack Model
AiTM intercepts and manipulates inter-agent messages without directly compromising individual agents. This mirrors network man-in-the-middle attacks but operates at the semantic level:
The adversary sits between communicating agents
Messages are intercepted, analyzed, and replaced with malicious variants
Malicious information propagates through the system via normal communication channels
Individual agents appear uncompromised while the system as a whole is subverted
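The interception-and-replace loop above can be sketched as a thin proxy on the communication channel. This is a minimal illustration, not the paper's implementation; the class name `AiTMInterceptor`, the `relay` method, and the `rewrite` callable (standing in for an adversarial LLM call) are all illustrative assumptions.

```python
# Minimal sketch of an AiTM-style interceptor sitting on an inter-agent
# channel. `rewrite` stands in for an adversarial LLM call that crafts a
# plausible malicious variant of the intercepted message.

class AiTMInterceptor:
    def __init__(self, rewrite):
        # rewrite: callable(sender, receiver, message) -> manipulated message
        self.rewrite = rewrite
        self.log = []  # record of (sender, receiver, original, manipulated)

    def relay(self, sender, receiver, message):
        """Intercept a message in transit and forward a manipulated copy."""
        manipulated = self.rewrite(sender, receiver, message)
        self.log.append((sender, receiver, message, manipulated))
        return manipulated  # the receiver only ever sees this variant


# Toy demo: flip a recommendation while preserving the expected format.
interceptor = AiTMInterceptor(
    rewrite=lambda s, r, m: m.replace("approve", "reject")
)
delivered = interceptor.relay("reviewer", "manager", "Recommendation: approve the patch")
```

Because the manipulated message keeps the sender's expected format, the receiving agent has no signal that the channel, rather than its peer, is compromised.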
LLM-Powered Adversary
The adversarial agent uses:
Contextual awareness: Understands the ongoing conversation to craft plausible malicious messages
Reflection mechanism: Iteratively improves attack messages based on observed agent responses
Role adaptation: Generates messages that conform to expected role-restricted formats
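The reflection mechanism can be sketched as a draft-observe-revise loop. The function and callable names below (`reflective_attack`, `draft_attack`, `critique`, `victim`) are illustrative assumptions, as is the crude substring success check; the paper's adversary uses LLM calls where this sketch uses stubs.

```python
# Hedged sketch of the reflection loop: draft a malicious message, observe
# the victim agent's response, and use it as feedback for the next attempt.

def reflective_attack(draft_attack, critique, victim, goal, max_rounds=3):
    """Iteratively refine a malicious message until the victim complies."""
    message = draft_attack(goal, feedback=None)
    for _ in range(max_rounds):
        response = victim(message)
        if goal in response:  # crude success check, for the sketch only
            return message, response
        # Reflection: turn the observed response into actionable feedback
        feedback = critique(message, response)
        message = draft_attack(goal, feedback=feedback)
    return message, response


# Toy demo: the victim complies only when the message contains "URGENT".
victim = lambda msg: "done: leak" if "URGENT" in msg else "refused"
draft = lambda goal, feedback=None: f"URGENT {goal}" if feedback else f"please {goal}"
crit = lambda msg, resp: "escalate urgency"

final_msg, final_resp = reflective_attack(draft, crit, victim, "leak")
```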
Topologies Tested
Directed (pipeline): Messages flow in one direction through a chain
Hierarchical (bottom-to-top): Agents report to supervisors
Complete (peer discussion): All agents communicate with all others
Real-world frameworks: Software development, scientific research simulations
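The tested topologies can be encoded as adjacency lists, which also makes the attack surface concrete: every directed edge is a channel an AiTM adversary could sit on. Agent names below are placeholders, not from the paper.

```python
# Illustrative encoding of the tested topologies as adjacency lists.
# Every edge is a potential AiTM interception point.

topologies = {
    "directed":     {"a": ["b"], "b": ["c"], "c": []},            # pipeline chain
    "hierarchical": {"worker1": ["lead"], "worker2": ["lead"],
                     "lead": ["supervisor"], "supervisor": []},   # bottom-to-top
    "complete":     {a: [b for b in "xyz" if b != a] for a in "xyz"},  # peer discussion
}

def interception_points(topology):
    """All directed edges, i.e. channels an adversary could compromise."""
    return [(src, dst) for src, dsts in topology.items() for dst in dsts]
```

Note how the complete graph maximizes coordination but also maximizes the number of channels available to the adversary.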
Key Findings
LLM-MAS are highly vulnerable to communication manipulation
System-wide compromise is achievable even in decentralized setups
Communication, while essential for coordination, introduces error propagation risks
Robust defenses against communication-layer attacks are urgently needed
Genesis: Evolutionary Attacks on Web Agents
Genesis (Zhang et al., 2025, arXiv:2510.18314) proposes an evolutionary framework for discovering and evolving attack strategies against LLM web agents.
Three-Module Architecture
Attacker: Generates adversarial injections by integrating genetic algorithms with a hybrid strategy representation – combines mutation, crossover, and selection from evolutionary computation
Scorer: Evaluates the target web agent's responses to provide fitness feedback for the evolutionary process
Strategist: Dynamically discovers effective strategies from interaction logs and compiles them into a continuously growing strategy library
Evolutionary Attack Process
Initial population of attack strategies is seeded
Attacker generates adversarial web content using current strategies
Scorer evaluates whether the target agent was successfully misled
Strategist analyzes successful attacks to extract generalizable patterns
New strategies are added to the library and deployed in subsequent generations
The attack evolves continuously, discovering novel strategies that static methods miss
Key Results
Discovers novel attack strategies that manually crafted approaches miss
Consistently outperforms existing static attack baselines across diverse web tasks
Demonstrates that web agents are vulnerable to adaptive, evolving adversaries
Strategy library grows over time, creating an increasingly powerful attack toolkit
LLM Penetration Testing: Excalibur
What Makes a Good LLM Agent for Real-world Penetration Testing? (Deng et al., 2026, arXiv:2602.17622) analyzes 28 LLM-based pentesting systems and identifies fundamental failure modes.
Two Failure Modes
Type A (Capability Gaps): Missing tools, inadequate prompts, insufficient domain knowledge; addressable through better engineering
Type B (Planning Failures): These persist regardless of tooling; agents lack real-time task difficulty estimation, leading to effort misallocation and context exhaustion
Root Cause: Missing Difficulty Estimation
Type B failures share a common root cause: agents cannot estimate task difficulty in real time. The consequences:
Over-commit to low-value attack branches
Exhaust context window before completing attack chains
Misallocate computational effort across exploitation attempts
Excalibur Architecture
Tool and Skill Layer: Eliminates Type A failures through typed interfaces and retrieval-augmented knowledge
Task Difficulty Assessment (TDA): Estimates tractability via four dimensions:
Horizon estimation (how many steps remain)
Evidence confidence (quality of gathered information)
Context load (remaining context budget)
Historical success (similar attacks' past outcomes)
Evidence-Guided Attack Tree Search (EGATS): Uses TDA estimates to guide exploration-exploitation decisions
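How TDA estimates might gate the tree search can be sketched as a weighted score over the four dimensions, with intractable branches pruned before they consume context. The weights, field names, and threshold below are illustrative assumptions, not values from the paper.

```python
# Sketch of TDA-style difficulty scoring feeding an EGATS-style
# branch-selection step. All numeric values are illustrative.

from dataclasses import dataclass

@dataclass
class TDAEstimate:
    horizon: float   # expected remaining steps, normalized to [0, 1] (1 = short)
    evidence: float  # confidence in gathered evidence, [0, 1]
    context: float   # fraction of the context budget remaining, [0, 1]
    history: float   # past success rate on similar attacks, [0, 1]

    def tractability(self, w=(0.25, 0.35, 0.2, 0.2)):
        """Weighted combination of the four TDA dimensions."""
        return (w[0] * self.horizon + w[1] * self.evidence
                + w[2] * self.context + w[3] * self.history)

def select_branch(branches, threshold=0.4):
    """Prune intractable attack branches, then expand the most tractable one."""
    viable = [(name, est.tractability()) for name, est in branches
              if est.tractability() >= threshold]
    if not viable:
        return None  # back off instead of exhausting context on dead ends
    return max(viable, key=lambda x: x[1])[0]
```

The key design point is the `None` branch: a difficulty-aware agent abandons a subtree early rather than over-committing, which is exactly the Type B behavior the paper identifies as missing.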
Results
Up to 91% task completion on CTF benchmarks with frontier models
39-49% relative improvement over baselines
Real-world validation on HackTheBox and professional pentesting scenarios
Code Example
```python
# Genesis-style evolutionary red teaming (simplified sketch; the attacker,
# scorer, and strategist objects stand in for LLM-backed components, and
# their method names are illustrative)
import random

class EvolutionaryRedTeam:
    def __init__(self, attacker_llm, scorer_llm, strategist_llm):
        self.attacker = attacker_llm
        self.scorer = scorer_llm
        self.strategist = strategist_llm
        self.strategy_library = []

    def evolve_attacks(self, target_agent, web_task, generations=20, pop_size=10):
        # Initialize the population with seed strategies
        population = self._seed_strategies(web_task, pop_size)
        for gen in range(generations):
            # Generate adversarial injections from the current strategies
            attacks = [self.attacker.generate(s, web_task) for s in population]
            # Score each attack against the target agent
            scores = [self.scorer.evaluate(target_agent, a, web_task) for a in attacks]
            # Extract generalizable patterns from successful attacks
            successful = [(a, s) for a, s in zip(attacks, scores) if s > 0.5]
            if successful:
                new_strategies = self.strategist.extract_patterns(successful)
                self.strategy_library.extend(new_strategies)
            # Evolutionary selection + mutation
            population = self._evolve(population, scores)
        return self.strategy_library

    def _seed_strategies(self, web_task, pop_size):
        # Ask the attacker component for an initial population
        return [self.attacker.seed(web_task) for _ in range(pop_size)]

    def _evolve(self, population, scores):
        # Elitist selection: keep the top half, mutate it to refill the population
        sorted_pop = sorted(zip(population, scores), key=lambda x: -x[1])
        elite = [p for p, _ in sorted_pop[: len(population) // 2]]
        offspring = [self._mutate(random.choice(elite)) for _ in range(len(population) // 2)]
        return elite + offspring

    def _mutate(self, strategy):
        return self.attacker.mutate(strategy, self.strategy_library)
```
Attack Taxonomy
| Attack Type | Target | Method | Defender Awareness |
| --- | --- | --- | --- |
| AiTM Communication | Multi-agent messages | Semantic interception | Low (messages appear normal) |
| Genesis Web Injection | Web agent actions | Evolutionary adversarial content | Adaptive (evolves past defenses) |
| Excalibur Pentesting | System vulnerabilities | Difficulty-aware tree search | N/A (offensive tool) |
References
He et al. (2025). Red-Teaming LLM Multi-Agent Systems via Communication Attacks. arXiv:2502.14847
Zhang et al. (2025). Genesis. arXiv:2510.18314
Deng et al. (2026). What Makes a Good LLM Agent for Real-world Penetration Testing? arXiv:2602.17622
See Also