Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Multi-turn jailbreak attacks represent a class of adversarial techniques that exploit the conversational nature of large language models (LLMs) to gradually bypass safety alignment over multiple dialogue turns. Unlike single-turn prompt injection methods that attempt to elicit harmful outputs in a single query, multi-turn approaches leverage psychological escalation patterns, referencing the model's own prior responses to progressively erode safety boundaries. The seminal work in this area is Crescendo by Russinovich et al. from Microsoft, published at USENIX Security 2025.
Crescendo begins with an innocuous prompt related to the target topic and gradually escalates the dialogue by referencing the model's own replies. This “foot-in-the-door” strategy mirrors well-known psychological persuasion techniques, where small initial commitments lead to progressively larger ones. The attack typically succeeds within 5 of a maximum 10 conversational turns.
The key mechanism exploits two LLM tendencies:
The authors introduced Crescendomation, an automation tool that dynamically generates escalation prompts and uses backtracking (editing rejected responses in chat interfaces) to enable efficient repeated attacks without manual intervention.
# Simplified illustration of the Crescendo escalation logic class CrescendoAttack: def __init__(self, target_model, max_turns=10): self.target = target_model self.max_turns = max_turns self.history = [] def escalate(self, goal: str) -> list[dict]: # Start with a benign topic related to the goal prompt = self.generate_benign_seed(goal) for turn in range(self.max_turns): response = self.target.query(prompt, self.history) self.history.append({"role": "user", "content": prompt}) self.history.append({"role": "assistant", "content": response}) if self.judge_success(response, goal): return self.history # Reference model's own words to escalate prompt = self.generate_escalation(goal, self.history) return self.history
Crescendo demonstrated high attack success rates across all evaluated models:
| Model | Manual Success | Automated (Crescendomation) |
|---|---|---|
| ChatGPT (GPT-4) | Near-complete | 98% binary ASR (49/50 tasks) |
| Gemini Pro | Near-complete | 100% binary ASR (50/50 tasks) |
| Gemini Ultra | Near-complete | High |
| LLaMA-2 70B Chat | Near-complete | High |
| LLaMA-3 70B Chat | Near-complete | High |
| Anthropic Chat | Near-complete | High |
On the AdvBench subset, Crescendomation achieved 29-61% higher average success rate on GPT-4 compared to next-best baselines and 49-71% higher on Gemini-Pro.
Multi-turn jailbreaks differ fundamentally from single-turn methods:
The multi-turn approach is stealthier because each individual message appears benign, making it harder to detect with standard input/output filters.
The attack can be formalized as an optimization over conversation trajectories:
<latex> \max_{u_1, \ldots, u_T} P(y_T \in \mathcal{H} \mid u_1, r_1, \ldots, u_T) \quad ext{s.t.} \quad orall t: u_t otin \mathcal{F} </latex>
where $u_t$ are user messages, $r_t$ are model responses, $\mathcal{H}$ is the set of harmful outputs, and $\mathcal{F}$ is the set of messages flagged by input filters. The constraint ensures each individual turn evades detection.
The paper also introduces Multi-Crescendo, which chains multiple Crescendo attacks to achieve compound adversarial goals, such as combining restricted content types (e.g., copyrighted material with profanity) in a single session.