====== Agent Red Teaming ======

As LLM agents are deployed in [[multi_agent_systems|multi-agent systems]] and web automation, new attack surfaces emerge beyond traditional prompt injection. This page covers red-teaming of multi-agent communication, evolutionary attacks on web agents, systematic penetration testing of LLM systems, and a unified framework for classifying agent vulnerabilities.

===== Agent Vulnerability Classification =====

Recent work has identified a taxonomy of attack genres that target different functional areas of [[autonomous_agents|autonomous agents]](([[https://importai.substack.com/p/import-ai-453-breaking-ai-agents|Google DeepMind - AI Agent Attack Genres]])). Attacks can be categorized into six primary genres based on which component or process they compromise:

  * **Content Injection**: Adversarial data inserted into the agent's perception layer (web pages, sensor feeds, documents)
  * **Semantic Manipulation**: Misleading information that exploits the agent's reasoning processes without altering underlying data
  * **Cognitive State Alteration**: Manipulation of the agent's internal state, including memory, reasoning chains, and confidence calibration
  * **Behavioral Control**: Direct hijacking of the agent's action selection and execution
  * **Systemic Disruption**: Attacks on the agent's coordination with other agents, tools, and external systems
  * **[[human_in_the_loop|Human-in-the-Loop]] Exploitation**: Attacks that manipulate human operators or bypass human oversight mechanisms

Effective defenses require layered protections combining technical controls, ecosystem standards, and legal frameworks rather than single-point solutions.
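The six genres can be thought of as targeting distinct layers of an agent's pipeline, which a defense review can make explicit as a small lookup. A minimal sketch of that mapping; the enum names and layer labels below are illustrative, not a schema from the cited taxonomy:

<code python>
from enum import Enum

class AttackGenre(Enum):
    """Six attack genres, keyed by the agent layer they compromise."""
    CONTENT_INJECTION = "perception"               # web pages, sensor feeds, documents
    SEMANTIC_MANIPULATION = "reasoning"            # misleading info, data left unaltered
    COGNITIVE_STATE_ALTERATION = "internal state"  # memory, reasoning chains, confidence
    BEHAVIORAL_CONTROL = "action selection"        # hijacked action execution
    SYSTEMIC_DISRUPTION = "coordination"           # agent-agent and agent-tool links
    HITL_EXPLOITATION = "human oversight"          # manipulated or bypassed operators

def targeted_layer(genre: AttackGenre) -> str:
    # The layer a defense against this genre must protect
    return genre.value
</code>

Enumerating layers this way supports the layered-defense point above: a review can check that every layer returned by ''targeted_layer'' has at least one control attached.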
===== The Multi-Agent Attack Surface =====

Multi-agent LLM systems introduce vulnerabilities absent from single-agent deployments:

  * **Inter-agent communication channels**: Messages between agents can be intercepted and manipulated
  * **Trust propagation**: Agents tend to trust information from peer agents, enabling influence cascades
  * **Topology-dependent risks**: Different communication structures (hierarchical, peer-to-peer, complete graph) expose different vulnerabilities
  * **Web interfaces**: Agents that browse the web are exposed to adversarial content injection

===== Agent-in-the-Middle (AiTM): Communication Attacks =====

**Red-Teaming LLM Multi-Agent Systems via Communication Attacks** (He et al., 2025, arXiv:2502.14847) introduces the Agent-in-the-Middle (AiTM) attack, which targets the communication layer of LLM-based multi-agent systems.((He et al. "Red-Teaming LLM Multi-Agent Systems via Communication Attacks." [[https://arxiv.org/abs/2502.14847|arXiv:2502.14847]], 2025.))

==== Attack Model ====

AiTM intercepts and manipulates inter-agent messages without directly compromising individual agents.
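The interception can be sketched as a thin wrapper around the channel between two agents; ''AiTMChannel'' and the ''adversary_llm'' callable are hypothetical names for illustration, not the paper's implementation:

<code python>
class AiTMChannel:
    """Relays inter-agent messages through an adversarial rewriter.

    Neither endpoint is modified; only the channel between them
    is compromised.
    """

    def __init__(self, adversary_llm):
        # adversary_llm(message, context=...) returns a rewritten message
        self.adversary = adversary_llm
        self.transcript = []  # conversation context for plausible rewrites

    def relay(self, sender: str, receiver: str, message: str) -> str:
        self.transcript.append((sender, receiver, message))
        # Rewrite with full conversational context so the malicious
        # variant matches the role and format the receiver expects
        return self.adversary(message, context=self.transcript)
</code>

Routing every planner-to-worker message through ''relay'' is enough to poison downstream agents while each individual agent remains untouched.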
This mirrors network man-in-the-middle attacks but operates at the semantic level:

  * The adversary sits between communicating agents
  * Messages are intercepted, analyzed, and replaced with malicious variants
  * Malicious information propagates through the system via normal communication channels
  * Individual agents appear uncompromised while the system as a whole is subverted

==== LLM-Powered Adversary ====

The adversarial agent uses:

  * **Contextual awareness**: Understands the ongoing conversation to craft plausible malicious messages
  * **Reflection mechanism**: Iteratively improves attack messages based on observed agent responses
  * **Role adaptation**: Generates messages that conform to expected role-restricted formats

==== Topologies Tested ====

  * **Directed (pipeline)**: Messages flow in one direction through a chain
  * **Hierarchical (bottom-to-top)**: Agents report to supervisors
  * **Complete (peer discussion)**: All agents communicate with all others
  * **Real-world frameworks**: Software development and scientific research simulations

==== Key Findings ====

  * LLM-MAS are highly vulnerable to communication manipulation
  * System-wide compromise is achievable even in decentralized setups
  * Communication, while essential for coordination, introduces error-propagation risks
  * Robust defenses against communication-layer attacks are urgently needed

===== Genesis: Evolutionary Attacks on Web Agents =====

**Genesis** (Zhang et al., 2025, arXiv:2510.18314) proposes an evolutionary framework for discovering and evolving attack strategies against LLM web agents.((Zhang et al. "Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming." [[https://arxiv.org/abs/2510.18314|arXiv:2510.18314]], 2025.))

==== Three-Module Architecture ====

  * **Attacker**: Generates adversarial injections by integrating genetic algorithms with a hybrid strategy representation, combining mutation, crossover, and selection from evolutionary computation
  * **Scorer**: Evaluates the target web agent's responses to provide fitness feedback for the evolutionary process
  * **Strategist**: Dynamically discovers effective strategies from interaction logs and compiles them into a continuously growing strategy library

==== Evolutionary Attack Process ====

  - An initial population of attack strategies is seeded
  - The Attacker generates adversarial web content using current strategies
  - The Scorer evaluates whether the target agent was successfully misled
  - The Strategist analyzes successful attacks to extract generalizable patterns
  - New strategies are added to the library and deployed in subsequent generations
  - The attack evolves continuously, discovering novel strategies that static methods miss

==== Key Results ====

  * Discovers novel attack strategies that manually crafted approaches miss
  * Consistently outperforms existing static attack baselines across diverse web tasks
  * Demonstrates that web agents are vulnerable to adaptive, evolving adversaries
  * The strategy library grows over time, creating an increasingly powerful attack toolkit

===== LLM Penetration Testing: Excalibur =====

**What Makes a Good LLM Agent for Real-world Penetration Testing?** (Deng et al., 2026, arXiv:2602.17622) analyzes 28 LLM-based pentesting systems and identifies fundamental failure modes.((Deng et al. "What Makes a Good LLM Agent for Real-world Penetration Testing?"
[[https://arxiv.org/abs/2602.17622|arXiv:2602.17622]], 2026.))

==== Two Failure Modes ====

  * **Type A (Capability Gaps)**: Missing tools, inadequate prompts, and insufficient domain knowledge; addressable through better engineering
  * **Type B (Planning Failures)**: Persist regardless of tooling; agents lack real-time task-difficulty estimation, leading to effort misallocation and context exhaustion

==== Root Cause: Missing Difficulty Estimation ====

Type B failures share a common root cause: agents cannot estimate task difficulty in real time. Consequences:

  * Over-committing to low-value attack branches
  * Exhausting the context window before completing attack chains
  * Misallocating computational effort across exploitation attempts

==== Excalibur Architecture ====

  * **Tool and Skill Layer**: Eliminates Type A failures through typed interfaces and retrieval-augmented knowledge
  * **Task Difficulty Assessment (TDA)**: Estimates tractability via four dimensions:
    * Horizon estimation (how many steps remain)
    * Evidence confidence (quality of gathered information)
    * Context load (remaining context budget)
    * Historical success (similar attacks' past outcomes)
  * **Evidence-Guided Attack Tree Search (EGATS)**: Uses TDA estimates to guide exploration-exploitation decisions

==== Results ====

  * Up to 91% task completion on CTF benchmarks with frontier models
  * 39-49% relative improvement over baselines
  * Real-world validation on HackTheBox and professional pentesting scenarios

===== Code Example =====

A Genesis-style evolutionary loop, simplified. The ''attacker'', ''scorer'', and ''strategist'' objects stand in for LLM-backed modules; their methods (''generate'', ''evaluate'', ''extract_patterns'', ''mutate'', ''seed'') are hypothetical interfaces, not the paper's API.

<code python>
# Genesis-style evolutionary red teaming (simplified)
import random

class EvolutionaryRedTeam:
    def __init__(self, attacker_llm, scorer_llm, strategist_llm):
        self.attacker = attacker_llm
        self.scorer = scorer_llm
        self.strategist = strategist_llm
        self.strategy_library = []

    def evolve_attacks(self, target_agent, web_task, generations=20, pop_size=10):
        # Initialize population with seed strategies
        population = self._seed_strategies(web_task, pop_size)
        for gen in range(generations):
            # Generate adversarial injections
            attacks = [self.attacker.generate(s, web_task) for s in population]
            # Score attacks against the target
            scores = [self.scorer.evaluate(target_agent, a, web_task) for a in attacks]
            # Extract successful patterns
            successful = [(a, s) for a, s in zip(attacks, scores) if s > 0.5]
            if successful:
                new_strategies = self.strategist.extract_patterns(successful)
                self.strategy_library.extend(new_strategies)
            # Evolutionary selection + mutation
            population = self._evolve(population, scores)
        return self.strategy_library

    def _seed_strategies(self, web_task, pop_size):
        # Initial population drawn from the attacker model
        return [self.attacker.seed(web_task) for _ in range(pop_size)]

    def _evolve(self, population, scores):
        # Truncation (elitist) selection: keep the top half, refill with mutants
        sorted_pop = sorted(zip(population, scores), key=lambda x: -x[1])
        elite = [p for p, _ in sorted_pop[:len(population) // 2]]
        offspring = [self._mutate(random.choice(elite)) for _ in range(len(population) // 2)]
        return elite + offspring

    def _mutate(self, strategy):
        return self.attacker.mutate(strategy, self.strategy_library)
</code>

===== Attack Taxonomy =====

^ Attack Type ^ Target ^ Method ^ Defender Awareness ^
| AiTM Communication | Multi-agent messages | Semantic interception | Low (messages appear normal) |
| Genesis Web Injection | Web agent actions | Evolutionary adversarial content | Adaptive (evolves past defenses) |
| Excalibur Pentesting | System vulnerabilities | Difficulty-aware tree search | N/A (offensive tool) |

(([[https://arxiv.org/abs/2502.14847|Red-Teaming LLM Multi-Agent Systems via Communication Attacks (arXiv:2502.14847)]])) (([[https://arxiv.org/abs/2510.18314|Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming (arXiv:2510.18314)]])) (([[https://arxiv.org/abs/2602.17622|What Makes a Good LLM Agent for Real-world Penetration Testing?
(arXiv:2602.17622)]]))

===== See Also =====

  * [[automated_red_teaming|Automated Red Teaming]]
  * [[agent_resource_management|Agent Resource Management: AgentRM]]
  * [[agentverse|AgentVerse: Facilitating Multi-Agent Collaboration]]
  * [[agent_threat_modeling|Agent Threat Modeling]]
  * [[agenttuning|AgentTuning: Enabling Generalized Agent Capabilities in LLMs]]

===== References =====