Reflexion Framework

Reflexion is a framework that reinforces language agents through linguistic self-reflection rather than traditional weight updates. Introduced by Shinn et al., 2023 1) in “Reflexion: Language Agents with Verbal Reinforcement Learning” (NeurIPS 2023), Reflexion equips agents with the ability to reflect on task feedback, store reflective text in an episodic memory buffer, and use those reflections to improve decision-making in subsequent trials. The approach has shown substantial improvements on sequential decision-making, coding, and reasoning tasks by enabling agents to learn from their mistakes without requiring expensive fine-tuning.

graph TD
    A[Attempt Task] --> E[Evaluate Result]
    E -->|Success| D[Done]
    E -->|Failure| R[Reflect on Failure]
    R --> M[Store in Memory]
    M --> A
    style D fill:#6f6,stroke:#333
    style R fill:#fc6,stroke:#333

Self-Reflection Mechanism

Reflexion introduces a self-reflection step where the LLM analyzes a failed trajectory and produces a natural language explanation of what went wrong and how to improve. This reflection is distinct from simple error messages; it captures nuanced insights like “I searched too broadly and missed the specific detail” or “I should have verified the intermediate result before proceeding.”

The framework consists of three components:

  1. Actor: an LLM that generates actions or solutions for the task
  2. Evaluator: scores the Actor's output, producing a pass/fail signal or reward
  3. Self-Reflection model: generates verbal feedback on failed trajectories

The critical insight is that verbal feedback is richer than scalar rewards. A reflection like “I incorrectly assumed the function should handle edge cases internally rather than raising exceptions” conveys far more learning signal than a simple test failure, allowing the agent to make targeted corrections on the next attempt.

Python Example

from openai import OpenAI
 
client = OpenAI()
 
def attempt_task(task: str, reflections: list[str]) -> str:
    """Actor: attempt the task, informed by past reflections."""
    reflection_context = ""
    if reflections:
        reflection_context = "Learn from these past reflections:\n"
        reflection_context += "\n".join(f"- {r}" for r in reflections)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"{reflection_context}\nTask: {task}\nProvide your solution:"
        )}],
    )
    return resp.choices[0].message.content
 
def evaluate(task: str, solution: str) -> tuple[bool, str]:
    """Evaluator: check if the solution is correct."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Task: {task}\nSolution: {solution}\n"
            "Is this correct? Reply YES or NO, then explain briefly."
        )}],
    )
    answer = resp.choices[0].message.content
    return answer.upper().startswith("YES"), answer
 
def reflect(task: str, solution: str, feedback: str) -> str:
    """Self-Reflection: diagnose failure and suggest improvements."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Task: {task}\nMy attempt: {solution}\nFeedback: {feedback}\n"
            "Reflect: what went wrong and how should I improve next time?"
        )}],
    )
    return resp.choices[0].message.content
 
def reflexion_loop(task: str, max_trials: int = 3) -> str:
    """Run the Reflexion trial loop with episodic memory."""
    reflections = []
    for trial in range(1, max_trials + 1):
        solution = attempt_task(task, reflections)
        success, feedback = evaluate(task, solution)
        print(f"Trial {trial}: {'Success' if success else 'Failed'}")
        if success:
            return solution
        reflection = reflect(task, solution, feedback)
        reflections.append(reflection)  # episodic memory buffer
    return solution  # best effort after max trials

Episodic Memory and Reflection Storage

Reflections are stored in an episodic memory buffer implemented as a sliding window of recent reflections. On each subsequent trial, these reflections are prepended to the agent's context, providing a form of experience replay in natural language. The memory acts as a persistent learning signal across trials without modifying model weights.

Two memory strategies are used:

  1. REFLEXION: only the generated reflections are stored and prepended to subsequent trials
  2. LAST_ATTEMPT_AND_REFLEXION: the previous attempt is stored verbatim alongside the reflections

This approach is analogous to experience replay in reinforcement learning, but operates entirely in the space of natural language rather than numerical state-action pairs.
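The sliding-window buffer described above can be sketched as a small helper class. This is a minimal illustration; the class name `EpisodicMemory` and the default window size of 3 are assumptions for the sketch, not part of the official implementation:

```python
from collections import deque

class EpisodicMemory:
    """Sliding-window buffer of verbal reflections (illustrative sketch)."""

    def __init__(self, max_reflections: int = 3):
        # deque with maxlen keeps only the most recent k reflections
        self._buffer = deque(maxlen=max_reflections)

    def add(self, reflection: str) -> None:
        self._buffer.append(reflection)  # oldest entry is evicted automatically

    def as_context(self) -> str:
        """Render stored reflections as text to prepend to the next trial."""
        if not self._buffer:
            return ""
        lines = "\n".join(f"- {r}" for r in self._buffer)
        return f"Learn from these past reflections:\n{lines}"
```

Bounding the window keeps the prompt short while still surfacing the most recent, and usually most relevant, lessons.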

Trial-Based Learning Loop

The Reflexion learning loop proceeds as follows:

  1. Trial 1: The agent attempts the task using its base capabilities (e.g., ReAct with available tools)
  2. Evaluation: The trajectory is evaluated (pass/fail, reward score)
  3. Reflection: If the attempt failed, the self-reflection model analyzes the trajectory and generates a verbal reflection
  4. Trial 2: The agent reattempts the task with the reflection prepended to its context
  5. Repeat: This loop continues until success or a maximum number of trials is reached

Each trial benefits from accumulated reflections, creating a form of linguistic reinforcement learning where the policy improves through natural language self-critique rather than gradient updates.
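The loop above can be exercised without any API calls by stubbing the Actor and Evaluator. The stub behavior below (the actor succeeds once at least one reflection is in memory) is purely illustrative, but it shows how accumulated reflections drive improvement across trials:

```python
def reflexion_loop_demo(max_trials: int = 3) -> tuple[int, str]:
    """Dry run of the Reflexion trial loop with stubbed components."""
    reflections: list[str] = []

    def attempt(task: str, memory: list[str]) -> str:
        # Stub Actor: succeeds once it has at least one reflection to learn from.
        return "good answer" if memory else "bad answer"

    def evaluate(task: str, solution: str) -> tuple[bool, str]:
        # Stub Evaluator: simple string check standing in for an LLM judge.
        ok = solution == "good answer"
        return ok, "correct" if ok else "missed the edge case"

    task = "toy task"
    for trial in range(1, max_trials + 1):
        solution = attempt(task, reflections)
        success, feedback = evaluate(task, solution)
        if success:
            return trial, solution
        reflections.append(f"Trial {trial} failed: {feedback}")
    return max_trials, solution  # best effort after max trials
```

Running `reflexion_loop_demo()` fails on trial 1, stores a reflection, and succeeds on trial 2, mirroring the verbal-reinforcement dynamic.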

Performance on Benchmarks and Tasks

Reflexion demonstrated strong results across diverse task types:

  1. Sequential decision-making (AlfWorld): a 22% absolute improvement over a ReAct-only baseline
  2. Reasoning (HotPotQA): a 20% improvement on multi-hop question answering
  3. Coding (HumanEval): 91% pass@1 with GPT-4, surpassing the previous state of the art of 80%

The official implementation is available at github.com/noahshinn/reflexion, including code and experiment logs for AlfWorld, HotPotQA, and coding tasks, with support for multiple reflection strategies (REFLEXION, LAST_ATTEMPT_AND_REFLEXION).

Limitations: Reflexion depends on the quality of self-evaluation; if the model cannot accurately diagnose its failures, reflections may be misleading. The approach also requires multiple trials, increasing total compute. For very long tasks, the reflection memory may exceed context limits, requiring summarization or retrieval strategies.
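One simple mitigation for the context-limit issue is to keep only the most recent reflections that fit a size budget. The sketch below uses a character budget as a rough stand-in for token counting; the function name and budget value are illustrative assumptions:

```python
def fit_reflections(reflections: list[str], max_chars: int = 2000) -> list[str]:
    """Keep the newest reflections whose combined length fits the budget."""
    kept: list[str] = []
    total = 0
    # Walk from newest to oldest, stopping when the budget would be exceeded.
    for r in reversed(reflections):
        if total + len(r) > max_chars:
            break
        kept.append(r)
        total += len(r)
    return list(reversed(kept))  # restore chronological order
```

Summarizing older reflections into a single condensed entry is an alternative when even recent reflections are too long to fit.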

See Also

References

  1) Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.” Advances in Neural Information Processing Systems (NeurIPS 2023).