Step-Back Prompting

Step-Back Prompting is a reasoning technique introduced by Zheng et al. at Google DeepMind in 2023 that improves LLM performance on complex tasks by first abstracting the problem to high-level principles before attempting detailed reasoning.1) The method draws inspiration from how human experts approach difficult problems: stepping back to identify the relevant concepts before diving into the specifics.

Overview

Standard prompting and even Chain-of-Thought (CoT) methods can fail on complex reasoning tasks because they attempt to reason directly over low-level details, leading to compounding errors in intermediate steps. Step-Back Prompting addresses this by inserting an abstraction step that identifies the relevant principles, concepts, or frameworks before the model reasons toward a solution.

Method

The technique operates in two phases:

  1. Abstraction Phase: Given the original question, the LLM generates a higher-level “step-back question” that targets the underlying principles. For example:
    • Original: “What happens to the pressure of an ideal gas if temperature increases by factor 2 and volume increases by factor 8?”
    • Step-back question: “What is the Ideal Gas Law and its key relationships?”
  2. Reasoning Phase: The LLM answers the original question by explicitly referencing the derived high-level concepts. The full prompt concatenates the step-back question, its answer, and the original question.

This two-phase approach grounds the reasoning chain in verified principles, reducing the likelihood of hallucination or faulty intermediate steps.
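For the ideal-gas example above, once the abstraction phase surfaces the Ideal Gas Law $PV = nRT$, the reasoning phase reduces to a direct substitution:

$$P' = \frac{nR \cdot (2T)}{8V} = \frac{1}{4} \cdot \frac{nRT}{V} = \frac{P}{4}$$

so the pressure falls to one quarter of its original value.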

Formal Description

Given an original question $q$, the process is:

$$q_{\text{sb}} = \text{StepBack}(q)$$

$$p = \text{LLM}(q_{\text{sb}})$$

$$a = \text{LLM}(q \mid q_{\text{sb}}, p)$$

where $q_{\text{sb}}$ is the step-back question, $p$ is the derived principle or concept, and $a$ is the final answer conditioned on both the abstraction and the original question.

The abstraction function can be viewed as a mapping from a specific instance to a general class:

$$\text{StepBack}: \mathcal{Q}_{\text{specific}} \rightarrow \mathcal{Q}_{\text{abstract}}$$

Code Example

import openai
 
def step_back_prompting(question, client):
    # Phase 1: Generate step-back question
    sb_prompt = (
        "You are an expert at abstracting problems to their core principles.\n"
        f"Given this question: {question}\n"
        "What is a more general step-back question that identifies "
        "the underlying principles needed to solve this?"
    )
    step_back_q = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": sb_prompt}]
    ).choices[0].message.content
 
    # Phase 1b: Answer the step-back question
    principles = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": step_back_q}]
    ).choices[0].message.content
 
    # Phase 2: Reason over original question with principles
    reason_prompt = (
        f"Step-back question: {step_back_q}\n"
        f"Relevant principles: {principles}\n\n"
        f"Now answer the original question using the above principles:\n"
        f"{question}"
    )
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": reason_prompt}]
    ).choices[0].message.content
 
    return answer
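
A minimal usage sketch, assuming the openai Python package (v1 or later) with an OPENAI_API_KEY set in the environment; the example question is purely illustrative:

# Instantiate a client (reads OPENAI_API_KEY from the environment)
client = openai.OpenAI()

question = (
    "What happens to the pressure of an ideal gas if temperature "
    "increases by factor 2 and volume increases by factor 8?"
)
print(step_back_prompting(question, client))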

Experimental Results

Evaluated primarily on PaLM-2L (340B parameters):

^ Benchmark       ^ Baseline (CoT) ^ Step-Back ^ Improvement ^
| MMLU Physics    | 66.4%          | 73.4%     | +7.0%       |
| MMLU Chemistry  | 70.9%          | 81.9%     | +11.0%      |
| TimeQA          | 41.5%          | 68.5%     | +27.0%      |
| MuSiQue         | 35.5%          | 42.5%     | +7.0%       |

Error analysis on MMLU Physics shows Step-Back corrects approximately 20.5% of baseline errors while introducing only 11.9% new errors. Most residual errors stem from the LLM's intrinsic reasoning limits rather than abstraction failures.

Results generalize to GPT-4 and Llama2-70B, indicating the technique is model-agnostic rather than specific to PaLM-2L.

Comparison with Chain-of-Thought

Step-Back Prompting outperforms CoT by up to 36% on select tasks.2) The key difference is that CoT decomposes a problem into a linear sequence of intermediate steps, so an error in an early, detail-heavy step can compound through the rest of the chain. Step-Back preempts this by first establishing the correct conceptual framework and only then reasoning over the specifics. It also outperforms variants such as zero-shot CoT and “take-a-deep-breath” prompting.
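
To make the contrast concrete, here is a sketch of the two prompt styles for the ideal-gas example; the wording is illustrative rather than taken from the paper:

# Chain-of-Thought: reason directly over the specific details
cot_prompt = (
    "What happens to the pressure of an ideal gas if temperature increases by "
    "factor 2 and volume increases by factor 8? Let's think step by step."
)

# Step-Back: establish the governing principle first, then reason
step_back_prompt = (
    "Step-back question: What is the Ideal Gas Law and its key relationships?\n"
    "Relevant principles: PV = nRT, so P = nRT / V.\n\n"
    "Now answer the original question using the above principles:\n"
    "What happens to the pressure of an ideal gas if temperature increases by "
    "factor 2 and volume increases by factor 8?"
)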

See Also

References