====== Agent Red Teaming ======

As LLM agents are deployed in [[multi_agent_systems|multi-agent systems]] and web automation, new attack surfaces emerge beyond traditional prompt injection. This page covers red-teaming of multi-agent communication, evolutionary attacks on web agents, systematic penetration testing of LLM systems, and a unified framework for classifying agent vulnerabilities.

===== Agent Vulnerability Classification =====

Recent work has identified a taxonomy of attack genres that target different functional areas of [[autonomous_agents|autonomous agents]](([[https://importai.substack.com/p/import-ai-453-breaking-ai-agents|Google DeepMind - AI Agent Attack Genres]])). Attacks can be categorized into six primary genres based on which component or process they compromise:

  * **Content Injection**: Adversarial data inserted into the agent's perception layer (web pages, sensor feeds, documents)
  * **Semantic Manipulation**: Misleading information that exploits the agent's reasoning processes without altering underlying data
  * **Cognitive State Alteration**: Manipulation of the agent's internal state, including memory, reasoning chains, and confidence calibration
  * **Behavioral Control**: Direct hijacking of the agent's action selection and execution
  * **Systemic Disruption**: Attacks on the agent's coordination with other agents, tools, and external systems
  * **[[human_in_the_loop|Human-in-the-Loop]] Exploitation**: Attacks that manipulate human operators or bypass human oversight mechanisms

Effective defenses require layered protections combining technical controls, ecosystem standards, and legal frameworks rather than single-point solutions.
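The six genres can be thought of as targeting distinct layers of an agent's pipeline, which a defense review can make explicit as a small lookup. A minimal sketch of that mapping; the enum names and layer labels below are illustrative, not a schema from the cited taxonomy:

<code python>
from enum import Enum

class AttackGenre(Enum):
    """Six attack genres, keyed by the agent layer they compromise."""
    CONTENT_INJECTION = "perception"               # web pages, sensor feeds, documents
    SEMANTIC_MANIPULATION = "reasoning"            # misleading info, data left unaltered
    COGNITIVE_STATE_ALTERATION = "internal state"  # memory, reasoning chains, confidence
    BEHAVIORAL_CONTROL = "action selection"        # hijacked action execution
    SYSTEMIC_DISRUPTION = "coordination"           # agent-agent and agent-tool links
    HITL_EXPLOITATION = "human oversight"          # manipulated or bypassed operators

def targeted_layer(genre: AttackGenre) -> str:
    # The layer a defense against this genre must protect
    return genre.value
</code>

Enumerating layers this way supports the layered-defense point above: a review can check that every layer returned by ''targeted_layer'' has at least one control attached.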
===== The Multi-Agent Attack Surface =====

Multi-agent LLM systems introduce vulnerabilities absent from single-agent deployments:

  * **Inter-agent communication channels**: Messages between agents can be intercepted and manipulated
  * **Trust propagation**: Agents tend to trust information from peer agents, enabling influence cascades
  * **Topology-dependent risks**: Different communication structures (hierarchical, peer-to-peer, complete graph) expose different vulnerabilities
  * **Web interfaces**: Agents that browse the web are exposed to adversarial content injection

===== Agent-in-the-Middle (AiTM): Communication Attacks =====

**Red-Teaming LLM Multi-Agent Systems via Communication Attacks** (He et al., 2025, arXiv:2502.14847) introduces the Agent-in-the-Middle (AiTM) attack, which targets the communication layer of LLM-based multi-agent systems.((He et al. "Red-Teaming LLM Multi-Agent Systems via Communication Attacks." [[https://arxiv.org/abs/2502.14847|arXiv:2502.14847]], 2025.))

==== Attack Model ====

AiTM intercepts and manipulates inter-agent messages without directly compromising individual agents.
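The interception can be sketched as a thin wrapper around the channel between two agents; ''AiTMChannel'' and the ''adversary_llm'' callable are hypothetical names for illustration, not the paper's implementation:

<code python>
class AiTMChannel:
    """Relays inter-agent messages through an adversarial rewriter.

    Neither endpoint is modified; only the channel between them
    is compromised.
    """

    def __init__(self, adversary_llm):
        # adversary_llm(message, context=...) returns a rewritten message
        self.adversary = adversary_llm
        self.transcript = []  # conversation context for plausible rewrites

    def relay(self, sender: str, receiver: str, message: str) -> str:
        self.transcript.append((sender, receiver, message))
        # Rewrite with full conversational context so the malicious
        # variant matches the role and format the receiver expects
        return self.adversary(message, context=self.transcript)
</code>

Routing every planner-to-worker message through ''relay'' is enough to poison downstream agents while each individual agent remains untouched.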
This mirrors network man-in-the-middle attacks but operates at the semantic level:

  * The adversary sits between communicating agents
  * Messages are intercepted, analyzed, and replaced with malicious variants
  * Malicious information propagates through the system via normal communication channels
  * Individual agents appear uncompromised while the system as a whole is subverted

==== LLM-Powered Adversary ====

The adversarial agent uses:

  * **Contextual awareness**: Understands the ongoing conversation to craft plausible malicious messages
  * **Reflection mechanism**: Iteratively improves attack messages based on observed agent responses
  * **Role adaptation**: Generates messages that conform to expected role-restricted formats

==== Topologies Tested ====

  * **Directed (pipeline)**: Messages flow in one direction through a chain
  * **Hierarchical (bottom-to-top)**: Agents report to supervisors
  * **Complete (peer discussion)**: All agents communicate with all others
  * **Real-world frameworks**: Software development and scientific research simulations

==== Key Findings ====

  * LLM-MAS are highly vulnerable to communication manipulation
  * System-wide compromise is achievable even in decentralized setups
  * Communication, while essential for coordination, introduces error-propagation risks
  * Robust defenses against communication-layer attacks are urgently needed

===== Genesis: Evolutionary Attacks on Web Agents =====

**Genesis** (Zhang et al., 2025, arXiv:2510.18314) proposes an evolutionary framework for discovering and evolving attack strategies against LLM web agents.((Zhang et al. "Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming." [[https://arxiv.org/abs/2510.18314|arXiv:2510.18314]], 2025.))

==== Three-Module Architecture ====

  * **Attacker**: Generates adversarial injections by integrating genetic algorithms with a hybrid strategy representation, combining mutation, crossover, and selection from evolutionary computation
  * **Scorer**: Evaluates the target web agent's responses to provide fitness feedback for the evolutionary process
  * **Strategist**: Dynamically discovers effective strategies from interaction logs and compiles them into a continuously growing strategy library

==== Evolutionary Attack Process ====

  - An initial population of attack strategies is seeded
  - The Attacker generates adversarial web content using current strategies
  - The Scorer evaluates whether the target agent was successfully misled
  - The Strategist analyzes successful attacks to extract generalizable patterns
  - New strategies are added to the library and deployed in subsequent generations
  - The attack evolves continuously, discovering novel strategies that static methods miss

==== Key Results ====

  * Discovers novel attack strategies that manually crafted approaches miss
  * Consistently outperforms existing static attack baselines across diverse web tasks
  * Demonstrates that web agents are vulnerable to adaptive, evolving adversaries
  * The strategy library grows over time, creating an increasingly powerful attack toolkit

===== LLM Penetration Testing: Excalibur =====

**What Makes a Good LLM Agent for Real-world Penetration Testing?** (Deng et al., 2026, arXiv:2602.17622) analyzes 28 LLM-based pentesting systems and identifies fundamental failure modes.((Deng et al. "What Makes a Good LLM Agent for Real-world Penetration Testing?"
[[https://arxiv.org/abs/2602.17622|arXiv:2602.17622]], 2026.))

==== Two Failure Modes ====

  * **Type A (Capability Gaps)**: Missing tools, inadequate prompts, and insufficient domain knowledge; addressable through better engineering
  * **Type B (Planning Failures)**: Persist regardless of tooling; agents lack real-time task-difficulty estimation, leading to effort misallocation and context exhaustion

==== Root Cause: Missing Difficulty Estimation ====

Type B failures share a common root cause: agents cannot estimate task difficulty in real time. Consequences:

  * Over-committing to low-value attack branches
  * Exhausting the context window before completing attack chains
  * Misallocating computational effort across exploitation attempts

==== Excalibur Architecture ====

  * **Tool and Skill Layer**: Eliminates Type A failures through typed interfaces and retrieval-augmented knowledge
  * **Task Difficulty Assessment (TDA)**: Estimates tractability via four dimensions:
    * Horizon estimation (how many steps remain)
    * Evidence confidence (quality of gathered information)
    * Context load (remaining context budget)
    * Historical success (similar attacks' past outcomes)
  * **Evidence-Guided Attack Tree Search (EGATS)**: Uses TDA estimates to guide exploration-exploitation decisions

==== Results ====

  * Up to 91% task completion on CTF benchmarks with frontier models
  * 39-49% relative improvement over baselines
  * Real-world validation on HackTheBox and professional pentesting scenarios

===== Code Example =====

A Genesis-style evolutionary loop, simplified. The ''attacker'', ''scorer'', and ''strategist'' objects stand in for LLM-backed modules; their methods (''generate'', ''evaluate'', ''extract_patterns'', ''mutate'', ''seed'') are hypothetical interfaces, not the paper's API.

<code python>
# Genesis-style evolutionary red teaming (simplified)
import random

class EvolutionaryRedTeam:
    def __init__(self, attacker_llm, scorer_llm, strategist_llm):
        self.attacker = attacker_llm
        self.scorer = scorer_llm
        self.strategist = strategist_llm
        self.strategy_library = []

    def evolve_attacks(self, target_agent, web_task, generations=20, pop_size=10):
        # Initialize population with seed strategies
        population = self._seed_strategies(web_task, pop_size)
        for gen in range(generations):
            # Generate adversarial injections
            attacks = [self.attacker.generate(s, web_task) for s in population]
            # Score attacks against the target
            scores = [self.scorer.evaluate(target_agent, a, web_task) for a in attacks]
            # Extract successful patterns
            successful = [(a, s) for a, s in zip(attacks, scores) if s > 0.5]
            if successful:
                new_strategies = self.strategist.extract_patterns(successful)
                self.strategy_library.extend(new_strategies)
            # Evolutionary selection + mutation
            population = self._evolve(population, scores)
        return self.strategy_library

    def _seed_strategies(self, web_task, pop_size):
        # Initial population drawn from the attacker model
        return [self.attacker.seed(web_task) for _ in range(pop_size)]

    def _evolve(self, population, scores):
        # Truncation (elitist) selection: keep the top half, refill with mutants
        sorted_pop = sorted(zip(population, scores), key=lambda x: -x[1])
        elite = [p for p, _ in sorted_pop[:len(population) // 2]]
        offspring = [self._mutate(random.choice(elite)) for _ in range(len(population) // 2)]
        return elite + offspring

    def _mutate(self, strategy):
        return self.attacker.mutate(strategy, self.strategy_library)
</code>

===== Attack Taxonomy =====

^ Attack Type ^ Target ^ Method ^ Defender Awareness ^
| AiTM Communication | Multi-agent messages | Semantic interception | Low (messages appear normal) |
| Genesis Web Injection | Web agent actions | Evolutionary adversarial content | Adaptive (evolves past defenses) |
| Excalibur Pentesting | System vulnerabilities | Difficulty-aware tree search | N/A (offensive tool) |

(([[https://arxiv.org/abs/2502.14847|Red-Teaming LLM Multi-Agent Systems via Communication Attacks (arXiv:2502.14847)]])) (([[https://arxiv.org/abs/2510.18314|Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming (arXiv:2510.18314)]])) (([[https://arxiv.org/abs/2602.17622|What Makes a Good LLM Agent for Real-world Penetration Testing?
(arXiv:2602.17622)]]))

===== See Also =====

  * [[automated_red_teaming|Automated Red Teaming]]
  * [[agent_resource_management|Agent Resource Management: AgentRM]]
  * [[agentverse|AgentVerse: Facilitating Multi-Agent Collaboration]]
  * [[agent_threat_modeling|Agent Threat Modeling]]
  * [[agenttuning|AgentTuning: Enabling Generalized Agent Capabilities in LLMs]]

===== References =====