Step-Back Prompting

Step-Back Prompting is a reasoning technique introduced by Zheng et al. at Google DeepMind in 2023 that improves LLM performance on complex tasks by first abstracting the problem to high-level principles before attempting detailed reasoning.1) The method draws inspiration from how human experts approach difficult problems: stepping back to identify the relevant concepts before diving into the specifics.

Overview

Standard prompting and even Chain-of-Thought (CoT) methods can fail on complex reasoning tasks because they attempt to reason directly over low-level details, leading to compounding errors in intermediate steps. Step-Back Prompting addresses this by inserting an abstraction step that identifies the relevant principles, concepts, or frameworks before the model reasons toward a solution.

Method

The technique operates in two phases:

  1. Abstraction Phase: Given the original question, the LLM generates a higher-level “step-back question” that targets the underlying principles. For example:
    • Original: “What happens to the pressure of an ideal gas if temperature increases by factor 2 and volume increases by factor 8?”
    • Step-back question: “What is the Ideal Gas Law and its key relationships?”
  2. Reasoning Phase: The LLM answers the original question by explicitly referencing the derived high-level concepts. The full prompt concatenates the step-back question, its answer, and the original question.

This two-phase approach grounds the reasoning chain in verified principles, reducing the likelihood of hallucination or faulty intermediate steps.
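For the ideal-gas example above, once the abstraction phase surfaces the Ideal Gas Law $PV = nRT$, the reasoning phase reduces to a direct substitution:

$$P' = \frac{nR \cdot (2T)}{8V} = \frac{1}{4} \cdot \frac{nRT}{V} = \frac{P}{4}$$

so the pressure falls to one quarter of its original value.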

Formal Description

Given an original question $q$, the process is:

$$q_{\text{sb}} = \text{StepBack}(q)$$

$$p = \text{LLM}(q_{\text{sb}})$$

$$a = \text{LLM}(q \mid q_{\text{sb}}, p)$$

where $q_{\text{sb}}$ is the step-back question, $p$ is the derived principle or concept, and $a$ is the final answer conditioned on both the abstraction and the original question.

The abstraction function can be viewed as a mapping from a specific instance to a general class:

$$\text{StepBack}: \mathcal{Q}_{\text{specific}} \rightarrow \mathcal{Q}_{\text{abstract}}$$

Code Example

import openai
 
def step_back_prompting(question, client):
    # Phase 1: Generate step-back question
    sb_prompt = (
        "You are an expert at abstracting problems to their core principles.\n"
        f"Given this question: {question}\n"
        "What is a more general step-back question that identifies "
        "the underlying principles needed to solve this?"
    )
    step_back_q = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": sb_prompt}]
    ).choices[0].message.content
 
    # Phase 1b: Answer the step-back question
    principles = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": step_back_q}]
    ).choices[0].message.content
 
    # Phase 2: Reason over original question with principles
    reason_prompt = (
        f"Step-back question: {step_back_q}\n"
        f"Relevant principles: {principles}\n\n"
        f"Now answer the original question using the above principles:\n"
        f"{question}"
    )
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": reason_prompt}]
    ).choices[0].message.content
 
    return answer
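
A minimal usage sketch, assuming the openai Python package (v1 or later) with an OPENAI_API_KEY set in the environment; the example question is purely illustrative:

# Instantiate a client (reads OPENAI_API_KEY from the environment)
client = openai.OpenAI()

question = (
    "What happens to the pressure of an ideal gas if temperature "
    "increases by factor 2 and volume increases by factor 8?"
)
print(step_back_prompting(question, client))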

Experimental Results

Evaluated primarily on PaLM-2L (340B parameters):

^ Benchmark       ^ Baseline (CoT) ^ Step-Back ^ Improvement ^
| MMLU Physics    | 66.4%          | 73.4%     | +7.0%       |
| MMLU Chemistry  | 70.9%          | 81.9%     | +11.0%      |
| TimeQA          | 41.5%          | 68.5%     | +27.0%      |
| MuSiQue         | 35.5%          | 42.5%     | +7.0%       |

Error analysis on MMLU Physics shows Step-Back corrects approximately 20.5% of baseline errors while introducing only 11.9% new errors. Most residual errors stem from the LLM's intrinsic reasoning limits rather than abstraction failures.

Results generalize to GPT-4 and Llama2-70B, indicating the technique is model-agnostic rather than specific to PaLM-2L.

Comparison with Chain-of-Thought

Step-Back Prompting outperforms CoT by up to 36% on select tasks.2) The key difference is that CoT decomposes a problem into a linear sequence of intermediate steps, so an error in an early, detail-heavy step can compound through the rest of the chain. Step-Back preempts this by first establishing the correct conceptual framework and only then reasoning over the specifics. It also outperforms variants such as zero-shot CoT and “take-a-deep-breath” prompting.
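
To make the contrast concrete, here is a sketch of the two prompt styles for the ideal-gas example; the wording is illustrative rather than taken from the paper:

# Chain-of-Thought: reason directly over the specific details
cot_prompt = (
    "What happens to the pressure of an ideal gas if temperature increases by "
    "factor 2 and volume increases by factor 8? Let's think step by step."
)

# Step-Back: establish the governing principle first, then reason
step_back_prompt = (
    "Step-back question: What is the Ideal Gas Law and its key relationships?\n"
    "Relevant principles: PV = nRT, so P = nRT / V.\n\n"
    "Now answer the original question using the above principles:\n"
    "What happens to the pressure of an ideal gas if temperature increases by "
    "factor 2 and volume increases by factor 8?"
)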

See Also

References