Agent Red Teaming
As LLM agents are deployed in multi-agent systems and web automation, new attack surfaces emerge beyond traditional prompt injection. This page covers red-teaming of multi-agent communication, evolutionary attacks on web agents, and systematic penetration testing of LLM systems.
The Multi-Agent Attack Surface
Multi-agent LLM systems introduce vulnerabilities absent from single-agent deployments:
Inter-agent communication channels: Messages between agents can be intercepted and manipulated
Trust propagation: Agents tend to trust information from peer agents, enabling influence cascades
Topology-dependent risks: Different communication structures (hierarchical, peer-to-peer, complete graph) expose different vulnerabilities
Web interfaces: Agents that browse the web are exposed to adversarial content injection
Agent-in-the-Middle (AiTM): Communication Attacks
Red-Teaming LLM Multi-Agent Systems via Communication Attacks (He et al., 2025, arXiv:2502.14847) introduces the Agent-in-the-Middle (AiTM) attack, which targets the communication layer of LLM-based multi-agent systems.
Attack Model
AiTM intercepts and manipulates inter-agent messages without directly compromising individual agents. This mirrors network man-in-the-middle attacks but operates at the semantic level:
The adversary sits between communicating agents
Messages are intercepted, analyzed, and replaced with malicious variants
Malicious information propagates through the system via normal communication channels
Individual agents appear uncompromised while the system as a whole is subverted
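The interception-and-replace loop above can be sketched as a thin proxy on the communication channel. This is a minimal illustration, not the paper's implementation; the class name `AiTMInterceptor`, the `relay` method, and the `rewrite` callable (standing in for an adversarial LLM call) are all illustrative assumptions.

```python
# Minimal sketch of an AiTM-style interceptor sitting on an inter-agent
# channel. `rewrite` stands in for an adversarial LLM call that crafts a
# plausible malicious variant of the intercepted message.

class AiTMInterceptor:
    def __init__(self, rewrite):
        # rewrite: callable(sender, receiver, message) -> manipulated message
        self.rewrite = rewrite
        self.log = []  # record of (sender, receiver, original, manipulated)

    def relay(self, sender, receiver, message):
        """Intercept a message in transit and forward a manipulated copy."""
        manipulated = self.rewrite(sender, receiver, message)
        self.log.append((sender, receiver, message, manipulated))
        return manipulated  # the receiver only ever sees this variant


# Toy demo: flip a recommendation while preserving the expected format.
interceptor = AiTMInterceptor(
    rewrite=lambda s, r, m: m.replace("approve", "reject")
)
delivered = interceptor.relay("reviewer", "manager", "Recommendation: approve the patch")
```

Because the manipulated message keeps the sender's expected format, the receiving agent has no signal that the channel, rather than its peer, is compromised.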
LLM-Powered Adversary
The adversarial agent uses:
Contextual awareness: Understands the ongoing conversation to craft plausible malicious messages
Reflection mechanism: Iteratively improves attack messages based on observed agent responses
Role adaptation: Generates messages that conform to expected role-restricted formats
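The reflection mechanism can be sketched as a draft-observe-revise loop. The function and callable names below (`reflective_attack`, `draft_attack`, `critique`, `victim`) are illustrative assumptions, as is the crude substring success check; the paper's adversary uses LLM calls where this sketch uses stubs.

```python
# Hedged sketch of the reflection loop: draft a malicious message, observe
# the victim agent's response, and use it as feedback for the next attempt.

def reflective_attack(draft_attack, critique, victim, goal, max_rounds=3):
    """Iteratively refine a malicious message until the victim complies."""
    message = draft_attack(goal, feedback=None)
    for _ in range(max_rounds):
        response = victim(message)
        if goal in response:  # crude success check, for the sketch only
            return message, response
        # Reflection: turn the observed response into actionable feedback
        feedback = critique(message, response)
        message = draft_attack(goal, feedback=feedback)
    return message, response


# Toy demo: the victim complies only when the message contains "URGENT".
victim = lambda msg: "done: leak" if "URGENT" in msg else "refused"
draft = lambda goal, feedback=None: f"URGENT {goal}" if feedback else f"please {goal}"
crit = lambda msg, resp: "escalate urgency"

final_msg, final_resp = reflective_attack(draft, crit, victim, "leak")
```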
Topologies Tested
Directed (pipeline): Messages flow in one direction through a chain
Hierarchical (bottom-to-top): Agents report to supervisors
Complete (peer discussion): All agents communicate with all others
Real-world frameworks: Software development, scientific research simulations
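The tested topologies can be encoded as adjacency lists, which also makes the attack surface concrete: every directed edge is a channel an AiTM adversary could sit on. Agent names below are placeholders, not from the paper.

```python
# Illustrative encoding of the tested topologies as adjacency lists.
# Every edge is a potential AiTM interception point.

topologies = {
    "directed":     {"a": ["b"], "b": ["c"], "c": []},            # pipeline chain
    "hierarchical": {"worker1": ["lead"], "worker2": ["lead"],
                     "lead": ["supervisor"], "supervisor": []},   # bottom-to-top
    "complete":     {a: [b for b in "xyz" if b != a] for a in "xyz"},  # peer discussion
}

def interception_points(topology):
    """All directed edges, i.e. channels an adversary could compromise."""
    return [(src, dst) for src, dsts in topology.items() for dst in dsts]
```

Note how the complete graph maximizes coordination but also maximizes the number of channels available to the adversary.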
Key Findings
LLM-MAS are highly vulnerable to communication manipulation
System-wide compromise is achievable even in decentralized setups
Communication, while essential for coordination, introduces error propagation risks
Robust defenses against communication-layer attacks are urgently needed
Genesis: Evolutionary Attacks on Web Agents
Genesis (Zhang et al., 2025, arXiv:2510.18314) proposes an evolutionary framework for discovering and evolving attack strategies against LLM web agents.
Three-Module Architecture
Attacker: Generates adversarial injections by integrating genetic algorithms with a hybrid strategy representation – combines mutation, crossover, and selection from evolutionary computation
Scorer: Evaluates the target web agent's responses to provide fitness feedback for the evolutionary process
Strategist: Dynamically discovers effective strategies from interaction logs and compiles them into a continuously growing strategy library
Evolutionary Attack Process
Initial population of attack strategies is seeded
Attacker generates adversarial web content using current strategies
Scorer evaluates whether the target agent was successfully misled
Strategist analyzes successful attacks to extract generalizable patterns
New strategies are added to the library and deployed in subsequent generations
The attack evolves continuously, discovering novel strategies that static methods miss
Key Results
Discovers novel attack strategies that manually crafted approaches miss
Consistently outperforms existing static attack baselines across diverse web tasks
Demonstrates that web agents are vulnerable to adaptive, evolving adversaries
Strategy library grows over time, creating an increasingly powerful attack toolkit
LLM Penetration Testing: Excalibur
What Makes a Good LLM Agent for Real-world Penetration Testing? (Deng et al., 2026, arXiv:2602.17622) analyzes 28 LLM-based pentesting systems and identifies fundamental failure modes.
Two Failure Modes
Type A (Capability Gaps): Missing tools, inadequate prompts, insufficient domain knowledge; addressable through better engineering
Type B (Planning Failures): These persist regardless of tooling; agents lack real-time task difficulty estimation, leading to effort misallocation and context exhaustion
Root Cause: Missing Difficulty Estimation
Type B failures share a common root cause: agents cannot estimate task difficulty in real time. The consequences:
Over-commit to low-value attack branches
Exhaust context window before completing attack chains
Misallocate computational effort across exploitation attempts
Excalibur Architecture
Tool and Skill Layer: Eliminates Type A failures through typed interfaces and retrieval-augmented knowledge
Task Difficulty Assessment (TDA): Estimates tractability via four dimensions:
Horizon estimation (how many steps remain)
Evidence confidence (quality of gathered information)
Context load (remaining context budget)
Historical success (similar attacks' past outcomes)
Evidence-Guided Attack Tree Search (EGATS): Uses TDA estimates to guide exploration-exploitation decisions
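How TDA estimates might gate the tree search can be sketched as a weighted score over the four dimensions, with intractable branches pruned before they consume context. The weights, field names, and threshold below are illustrative assumptions, not values from the paper.

```python
# Sketch of TDA-style difficulty scoring feeding an EGATS-style
# branch-selection step. All numeric values are illustrative.

from dataclasses import dataclass

@dataclass
class TDAEstimate:
    horizon: float   # expected remaining steps, normalized to [0, 1] (1 = short)
    evidence: float  # confidence in gathered evidence, [0, 1]
    context: float   # fraction of the context budget remaining, [0, 1]
    history: float   # past success rate on similar attacks, [0, 1]

    def tractability(self, w=(0.25, 0.35, 0.2, 0.2)):
        """Weighted combination of the four TDA dimensions."""
        return (w[0] * self.horizon + w[1] * self.evidence
                + w[2] * self.context + w[3] * self.history)

def select_branch(branches, threshold=0.4):
    """Prune intractable attack branches, then expand the most tractable one."""
    viable = [(name, est.tractability()) for name, est in branches
              if est.tractability() >= threshold]
    if not viable:
        return None  # back off instead of exhausting context on dead ends
    return max(viable, key=lambda x: x[1])[0]
```

The key design point is the `None` branch: a difficulty-aware agent abandons a subtree early rather than over-committing, which is exactly the Type B behavior the paper identifies as missing.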
Results
Up to 91% task completion on CTF benchmarks with frontier models
39-49% relative improvement over baselines
Real-world validation on HackTheBox and professional pentesting scenarios
Code Example
```python
# Genesis-style evolutionary red teaming (simplified sketch; the attacker,
# scorer, and strategist objects stand in for LLM-backed components, and
# their method names are illustrative)
import random

class EvolutionaryRedTeam:
    def __init__(self, attacker_llm, scorer_llm, strategist_llm):
        self.attacker = attacker_llm
        self.scorer = scorer_llm
        self.strategist = strategist_llm
        self.strategy_library = []

    def evolve_attacks(self, target_agent, web_task, generations=20, pop_size=10):
        # Initialize the population with seed strategies
        population = self._seed_strategies(web_task, pop_size)
        for gen in range(generations):
            # Generate adversarial injections from the current strategies
            attacks = [self.attacker.generate(s, web_task) for s in population]
            # Score each attack against the target agent
            scores = [self.scorer.evaluate(target_agent, a, web_task) for a in attacks]
            # Extract generalizable patterns from successful attacks
            successful = [(a, s) for a, s in zip(attacks, scores) if s > 0.5]
            if successful:
                new_strategies = self.strategist.extract_patterns(successful)
                self.strategy_library.extend(new_strategies)
            # Evolutionary selection + mutation
            population = self._evolve(population, scores)
        return self.strategy_library

    def _seed_strategies(self, web_task, pop_size):
        # Ask the attacker component for an initial population
        return [self.attacker.seed(web_task) for _ in range(pop_size)]

    def _evolve(self, population, scores):
        # Elitist selection: keep the top half, mutate it to refill the population
        sorted_pop = sorted(zip(population, scores), key=lambda x: -x[1])
        elite = [p for p, _ in sorted_pop[: len(population) // 2]]
        offspring = [self._mutate(random.choice(elite)) for _ in range(len(population) // 2)]
        return elite + offspring

    def _mutate(self, strategy):
        return self.attacker.mutate(strategy, self.strategy_library)
```
Attack Taxonomy
| Attack Type | Target | Method | Defender Awareness |
| --- | --- | --- | --- |
| AiTM Communication | Multi-agent messages | Semantic interception | Low (messages appear normal) |
| Genesis Web Injection | Web agent actions | Evolutionary adversarial content | Adaptive (evolves past defenses) |
| Excalibur Pentesting | System vulnerabilities | Difficulty-aware tree search | N/A (offensive tool) |
References
He et al. (2025). Red-Teaming LLM Multi-Agent Systems via Communication Attacks. arXiv:2502.14847
Zhang et al. (2025). Genesis. arXiv:2510.18314
Deng et al. (2026). What Makes a Good LLM Agent for Real-world Penetration Testing? arXiv:2602.17622
See Also