====== Agent Red Teaming ====== As LLM agents are deployed in multi-agent systems and web automation, new attack surfaces emerge beyond traditional prompt injection. This page covers red-teaming of multi-agent communication, evolutionary attacks on web agents, and systematic penetration testing of LLM systems. ===== The Multi-Agent Attack Surface ===== Multi-agent LLM systems introduce vulnerabilities absent from single-agent deployments: * **Inter-agent communication channels**: Messages between agents can be intercepted and manipulated * **Trust propagation**: Agents tend to trust information from peer agents, enabling influence cascades * **Topology-dependent risks**: Different communication structures (hierarchical, peer-to-peer, complete graph) expose different vulnerabilities * **Web interfaces**: Agents that browse the web are exposed to adversarial content injection ===== Agent-in-the-Middle (AiTM): Communication Attacks ===== **Red-Teaming LLM Multi-Agent Systems via Communication Attacks** (He et al., 2025, arXiv:2502.14847) introduces the Agent-in-the-Middle (AiTM) attack, which targets the communication layer of LLM-based multi-agent systems. === Attack Model === AiTM intercepts and manipulates inter-agent messages without directly compromising individual agents. This mirrors network man-in-the-middle attacks but operates at the semantic level: * The adversary sits between communicating agents * Messages are intercepted, analyzed, and replaced with malicious variants * Malicious information propagates through the system via normal communication channels * Individual agents appear uncompromised while the system as a whole is subverted === LLM-Powered Adversary === The adversarial agent uses: * **Contextual awareness**: Understands the ongoing conversation to craft plausible malicious messages * **Reflection mechanism**: Iteratively improves attack messages based on observed agent responses * **Role adaptation**: Generates messages that conform to expected role-restricted formats === Topologies Tested === * **Directed (pipeline)**: Messages flow in one direction through a chain * **Hierarchical (bottom-to-top)**: Agents report to supervisors * **Complete (peer discussion)**: All agents communicate with all others * **Real-world frameworks**: Software development, scientific research simulations === Key Findings === * LLM-MAS are highly vulnerable to communication manipulation * System-wide compromise is achievable even in decentralized setups * Communication, while essential for coordination, introduces error propagation risks * Robust defenses against communication-layer attacks are urgently needed ===== Genesis: Evolutionary Attacks on Web Agents ===== **Genesis** (Zhang et al., 2025, arXiv:2510.18314) proposes an evolutionary framework for discovering and evolving attack strategies against LLM web agents. === Three-Module Architecture === * **Attacker**: Generates adversarial injections by integrating genetic algorithms with a hybrid strategy representation -- combines mutation, crossover, and selection from evolutionary computation * **Scorer**: Evaluates the target web agent's responses to provide fitness feedback for the evolutionary process * **Strategist**: Dynamically discovers effective strategies from interaction logs and compiles them into a continuously growing strategy library === Evolutionary Attack Process === - Initial population of attack strategies is seeded - Attacker generates adversarial web content using current strategies - Scorer evaluates whether the target agent was successfully misled - Strategist analyzes successful attacks to extract generalizable patterns - New strategies are added to the library and deployed in subsequent generations - The attack evolves continuously, discovering novel strategies that static methods miss === Key Results === * Discovers novel attack strategies that manually crafted approaches miss * Consistently outperforms existing static attack baselines across diverse web tasks * Demonstrates that web agents are vulnerable to adaptive, evolving adversaries * Strategy library grows over time, creating an increasingly powerful attack toolkit ===== LLM Penetration Testing: Excalibur ===== **What Makes a Good LLM Agent for Real-world Penetration Testing?** (Deng et al., 2026, arXiv:2602.17622) analyzes 28 LLM-based pentesting systems and identifies fundamental failure modes. === Two Failure Modes === * **Type A (Capability Gaps)**: Missing tools, inadequate prompts, insufficient domain knowledge -- addressable through better engineering * **Type B (Planning Failures)**: Persist regardless of tooling -- agents lack real-time task difficulty estimation, leading to effort misallocation and context exhaustion === Root Cause: Missing Difficulty Estimation === Type B failures share a common root cause: agents cannot estimate task difficulty in real-time. Consequences: * Over-commit to low-value attack branches * Exhaust context window before completing attack chains * Misallocate computational effort across exploitation attempts === Excalibur Architecture === * **Tool and Skill Layer**: Eliminates Type A failures through typed interfaces and retrieval-augmented knowledge * **Task Difficulty Assessment (TDA)**: Estimates tractability via four dimensions: * Horizon estimation (how many steps remain) * Evidence confidence (quality of gathered information) * Context load (remaining context budget) * Historical success (similar attacks' past outcomes) * **Evidence-Guided Attack Tree Search (EGATS)**: Uses TDA estimates to guide exploration-exploitation decisions === Results === * Up to 91% task completion on CTF benchmarks with frontier models * 39-49% relative improvement over baselines * Real-world validation on HackTheBox and professional pentesting scenarios ===== Code Example ===== # Genesis-style evolutionary red teaming (simplified) import random class EvolutionaryRedTeam: def __init__(self, attacker_llm, scorer_llm, strategist_llm): self.attacker = attacker_llm self.scorer = scorer_llm self.strategist = strategist_llm self.strategy_library = [] def evolve_attacks(self, target_agent, web_task, generations=20, pop_size=10): # Initialize population with seed strategies population = self._seed_strategies(web_task, pop_size) for gen in range(generations): # Generate adversarial injections attacks = [self.attacker.generate(s, web_task) for s in population] # Score attacks against target scores = [self.scorer.evaluate(target_agent, a, web_task) for a in attacks] # Extract successful patterns successful = [(a, s) for a, s in zip(attacks, scores) if s > 0.5] if successful: new_strategies = self.strategist.extract_patterns(successful) self.strategy_library.extend(new_strategies) # Evolutionary selection + mutation population = self._evolve(population, scores) return self.strategy_library def _evolve(self, population, scores): # Tournament selection + crossover + mutation sorted_pop = sorted(zip(population, scores), key=lambda x: -x[1]) elite = [p for p, _ in sorted_pop[:len(population)//2]] offspring = [self._mutate(random.choice(elite)) for _ in range(len(population)//2)] return elite + offspring def _mutate(self, strategy): return self.attacker.mutate(strategy, self.strategy_library) ===== Attack Taxonomy ===== ^ Attack Type ^ Target ^ Method ^ Defender Awareness ^ | AiTM Communication | Multi-agent messages | Semantic interception | Low (messages appear normal) | | Genesis Web Injection | Web agent actions | Evolutionary adversarial content | Adaptive (evolves past defenses) | | Excalibur Pentesting | System vulnerabilities | Difficulty-aware tree search | N/A (offensive tool) | ===== References ===== * [[https://arxiv.org/abs/2502.14847|Red-Teaming LLM Multi-Agent Systems via Communication Attacks (arXiv:2502.14847)]] * [[https://arxiv.org/abs/2510.18314|Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming (arXiv:2510.18314)]] * [[https://arxiv.org/abs/2602.17622|What Makes a Good LLM Agent for Real-world Penetration Testing? (arXiv:2602.17622)]] ===== See Also ===== * [[agentic_uncertainty|Agentic Uncertainty]] -- uncertainty propagation as an exploitable weakness * [[persona_simulation|Persona Simulation]] -- adversarial testing of simulated personas * [[spreading_activation_memory|Spreading Activation Memory]] -- memory poisoning attack vectors