====== Multi-Turn Jailbreak Attacks ======

Multi-turn jailbreak attacks are a class of adversarial techniques that exploit the conversational nature of large language models (LLMs) to gradually bypass safety alignment over multiple dialogue turns. Unlike single-turn prompt injection methods, which attempt to elicit harmful output in a single query, multi-turn approaches rely on psychological escalation patterns, referencing the model's own prior responses to progressively erode safety boundaries. The seminal work in this area is Crescendo by Russinovich et al. from Microsoft, published at USENIX Security 2025.

===== The Crescendo Attack =====

Crescendo begins with an innocuous prompt related to the target topic and gradually escalates the dialogue by referencing the model's own replies. This "foot-in-the-door" strategy mirrors well-known psychological persuasion techniques, in which small initial commitments lead to progressively larger ones. The attack typically succeeds within five of a maximum of ten conversational turns.

The attack exploits several tendencies of LLMs and their safety systems:

  * Models prioritize maintaining conversational coherence with their recent outputs
  * Safety classifiers evaluate individual turns rather than full conversation trajectories
  * Each turn appears benign in isolation, evading per-message content filters

===== Crescendomation: Automated Multi-Turn Attacks =====

The authors introduced Crescendomation, an automation tool that dynamically generates escalation prompts and uses backtracking (editing rejected responses in chat interfaces) to repeat the attack efficiently without manual intervention.
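The backtracking step can be sketched as follows. This is a minimal illustration of the idea, not the tool's actual implementation; ''query_model'', ''is_refusal'', and ''rephrase'' are hypothetical helpers introduced here for the example.

<code python>
# Sketch of Crescendomation-style backtracking: if the model refuses a
# turn, the rejected exchange is never committed to the conversation
# history, and the prompt is rephrased and retried. Refusals therefore
# do not accumulate in the context window.
# `query_model`, `is_refusal`, and `rephrase` are hypothetical helpers.

def send_with_backtracking(history, prompt, query_model, is_refusal,
                           rephrase, max_retries=3):
    for _ in range(max_retries):
        response = query_model(prompt, history)
        if not is_refusal(response):
            # Commit the successful exchange to the visible history
            history.append({"role": "user", "content": prompt})
            history.append({"role": "assistant", "content": response})
            return response
        # Backtrack: discard the rejected turn and try a softer phrasing
        prompt = rephrase(prompt)
    return None  # give up on this escalation step
</code>

Because the refused turn is dropped rather than kept, the model never sees its own refusal in context, which would otherwise make subsequent refusals more likely.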
<code python>
# Simplified illustration of the Crescendo escalation logic
class CrescendoAttack:
    def __init__(self, target_model, max_turns=10):
        self.target = target_model
        self.max_turns = max_turns
        self.history = []

    def escalate(self, goal: str) -> list[dict]:
        # Start with a benign topic related to the goal
        prompt = self.generate_benign_seed(goal)
        for turn in range(self.max_turns):
            response = self.target.query(prompt, self.history)
            self.history.append({"role": "user", "content": prompt})
            self.history.append({"role": "assistant", "content": response})
            if self.judge_success(response, goal):
                return self.history
            # Reference the model's own words to escalate
            prompt = self.generate_escalation(goal, self.history)
        return self.history
</code>

===== Evaluation Results =====

Crescendo demonstrated high attack success rates across all evaluated models:

^ Model ^ Manual Success ^ Automated (Crescendomation) ^
| ChatGPT (GPT-4) | Near-complete | 98% binary ASR (49/50 tasks) |
| Gemini Pro | Near-complete | 100% binary ASR (50/50 tasks) |
| Gemini Ultra | Near-complete | High |
| LLaMA-2 70B Chat | Near-complete | High |
| LLaMA-3 70B Chat | Near-complete | High |
| Anthropic Chat | Near-complete | High |

On the **AdvBench** subset, Crescendomation achieved a **29-61% higher average success rate** on GPT-4 than the next-best baselines, and a **49-71% higher** rate on Gemini-Pro.

===== Distinction from Single-Turn Attacks =====

Multi-turn jailbreaks differ fundamentally from single-turn methods:

  * **GCG** (gradient-based): requires white-box access to compute adversarial suffixes via gradient optimization
  * **PAIR** (Prompt Automatic Iterative Refinement): refines a single prompt across iterations but delivers the attack in one turn
  * **Crescendo**: uses natural multi-turn dialogue; requires no model internals, no adversarial examples, and no long-context injection

The multi-turn approach is stealthier because each individual message appears benign, making it harder to detect with standard input/output filters.
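The per-message blind spot can be made concrete with a toy example. This is an illustration of the evasion argument, not a real safety classifier; the keyword weights and threshold are invented for the sketch.

<code python>
# Toy illustration: a per-message filter scores each turn in isolation,
# while a trajectory-level check scores the concatenated conversation.
# Keyword weights and the 0.5 threshold are made up for this example.
FLAGGED = {"explosive": 0.4, "synthesis": 0.3, "detonator": 0.4}

def score(text: str) -> float:
    t = text.lower()
    return sum(w for k, w in FLAGGED.items() if k in t)

def per_message_flags(turns, threshold=0.5):
    return [score(t) >= threshold for t in turns]

def trajectory_flag(turns, threshold=0.5):
    return score(" ".join(turns)) >= threshold

turns = [
    "What is the chemistry behind fireworks?",        # benign seed: 0.0
    "You mentioned explosive oxidizers -- how so?",   # 0.4, below threshold
    "Interesting. What about the synthesis steps?",   # 0.3, below threshold
]
assert not any(per_message_flags(turns))  # every turn passes in isolation
assert trajectory_flag(turns)             # the full trajectory is flagged
</code>

The same escalating conversation slips past the turn-level check at every step but crosses the threshold once the turns are evaluated together, which is the blind spot Crescendo exploits.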
===== Mathematical Formulation =====

The attack can be formalized as an optimization over conversation trajectories:

$$\max_{u_1, \ldots, u_T} P(r_T \in \mathcal{H} \mid u_1, r_1, \ldots, u_T) \quad \text{s.t.} \quad \forall t: u_t \notin \mathcal{F}$$

where $u_t$ are user messages, $r_t$ are model responses, $\mathcal{H}$ is the set of harmful outputs, and $\mathcal{F}$ is the set of messages flagged by input filters. The constraint ensures that each individual turn evades detection.

===== Multi-Crescendo =====

The paper also introduces Multi-Crescendo, which chains multiple Crescendo attacks to achieve compound adversarial goals, such as combining restricted content types (e.g., copyrighted material with profanity) in a single session.

===== References =====

  * [[https://arxiv.org/abs/2404.01833|Russinovich et al., "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack," arXiv:2404.01833, 2024]]
  * [[https://www.usenix.org/conference/usenixsecurity25/presentation/russinovich|USENIX Security 2025 Proceedings]]
  * [[https://crescendo-the-multiturn-jailbreak.github.io|Crescendo Project Page]]

===== See Also =====

  * [[proactive_jailbreak_defense|Proactive Jailbreak Defense (ProAct)]]
  * [[instruction_following_evaluation|Instruction Following Evaluation (IF-CRITIC)]]