====== Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization ======

Retroformer introduces a principled framework for **reinforcing LLM agents by learning a retrospective model** that automatically refines agent prompts from environment feedback through policy gradient optimization.(([[https://arxiv.org/abs/2308.02151|Yao et al. (2023) - Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization]])) Published by Yao et al. at ICLR 2024, it is among the first works to apply gradient-based optimization to language-agent improvement.

===== Overview =====

Most LLM agents use fixed prompts or rely on verbal self-reflection (e.g., Reflexion) without gradient-based learning. Retroformer addresses this gap by training a smaller, fine-tunable **retrospective model** that analyzes failed trajectories and generates improved reflections, optimized via policy gradients computed from actual environment rewards.

The key innovation: rather than hand-crafting reflection prompts or relying on LLM self-assessment, Retroformer **learns** to produce better reflections through reward-driven optimization.(([[https://github.com/weirayao/Retroformer|Retroformer GitHub Repository]]))

===== Architecture =====

<code>
graph TD
    A[Environment Task] --> B[Actor LLM - Frozen]
    B --> C[Execute Trajectory tau_i]
    C --> D[Receive Reward r_i]
    D --> E[Retrospective Model M_r]
    E --> F[Generate Reflection y_k]
    F --> G[Append to Actor Prompt]
    G --> B
    D -->|Policy Gradient| H[Update M_r Parameters]
    H --> E
</code>

The system comprises two models:

* **Actor** M_a: a frozen large LLM (e.g., GPT) that executes tasks; it is treated as part of the environment.
* **Retrospective model** M_r: a smaller, fine-tunable LM that generates self-reflections from failed trajectories.
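The prompt-flow between the two models, packing a failed trajectory and its reward into the retrospective model's input, then appending the resulting reflection to the actor's prompt, can be sketched in plain Python. The helper names (''format_reflection_input'', ''append_reflection'') and the exact message formats are illustrative assumptions, not taken from the Retroformer codebase:

```python
# Illustrative sketch of the actor / retrospective-model prompt flow.
# Helper names and prompt formats are hypothetical, not from the paper's code.

def format_reflection_input(task, trajectory, reward):
    """Pack the failed trajectory and its reward into the
    retrospective model's input x_{k,i} = {tau_i, r_i}."""
    steps = "\n".join(f"{i}. {action} -> {observation}"
                      for i, (action, observation) in enumerate(trajectory, 1))
    return (
        f"Task: {task}\n"
        f"Trajectory:\n{steps}\n"
        f"Reward: {reward}\n"
        "Diagnose the root cause of the failure and give a concise plan."
    )

def append_reflection(prompt, reflection):
    """Carry the reflection into the actor's prompt for the next episode."""
    return f"{prompt}\n\nReflection from last attempt:\n{reflection}"

# Example with a toy AlfWorld-style failed trajectory
trajectory = [("go to drawer 1", "The drawer is closed."),
              ("take knife", "You don't see a knife here.")]
x = format_reflection_input("put a clean knife on the table", trajectory, 0.0)
new_prompt = append_reflection(
    "You are a household agent.",
    "Root cause: searched a closed receptacle without opening it. "
    "Plan: open the drawer before searching it.")
print(x)
print(new_prompt)
```

In the full system, ''x'' would be fed to the retrospective model and ''new_prompt'' to the frozen actor for the next attempt.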
===== Policy Gradient Optimization =====

The retrospective model is optimized using policy gradients.((https://arxiv.org/abs/2308.02151)) Given a trajectory \tau_i and reward r_i, the model generates a reflection y_{k,i} from input x_{k,i} = \{\tau_i, r_i\}. The quality of this reflection is measured by the return of the subsequent episode:

\nabla_\theta J = \mathbb{E}\left[\sum_{k} G_{k,i+1} \nabla_\theta \log P_\theta(y_{k,i} \mid x_{k,i})\right]

where G_{k,i+1} is the return of the next episode after the reflection is applied. This reward signal teaches the retrospective model which kinds of reflections lead to better task performance.

The reflection output summarizes:

* **Root cause** of the failure (e.g., a mismatch with the task description)
* **Action plan** for the next attempt (a concise, high-level strategy)

===== Key Results =====

* On **AlfWorld** household tasks, Retroformer significantly outperforms frozen baselines((https://arxiv.org/abs/2308.02151))
* Agents solve tasks within **3 retries**, with most of the improvement in early iterations
* **Generalizes across environments and tasks** through multi-task reward learning
* Higher LoRA rank (e.g., r=4) yields slight additional gains in the retrospective model
* Outperforms non-gradient baselines (e.g., Reflexion) that use verbal-only feedback
* Enhanced performance on **HotPotQA** question answering validates cross-domain applicability(([[https://proceedings.iclr.cc/paper_files/paper/2024/file/29f421fbdcc82aeb349d784d3aaccdb3-Paper-Conference.pdf|ICLR 2024 Conference Paper]]))

===== Code Example =====

The following sketch illustrates the training loop; ''load_frozen_actor'', ''format_reflection_input'', ''append_reflection'', and ''log_prob'' are placeholder helpers, not real APIs.

<code python>
# Retroformer-style retrospective learning loop (illustrative sketch)
import torch
from transformers import AutoModelForCausalLM

# Frozen actor (large LLM) and trainable retrospective model
actor = load_frozen_actor('gpt-4')  # never updated; part of the environment
retro_model = AutoModelForCausalLM.from_pretrained('retro-base')
optimizer = torch.optim.Adam(retro_model.parameters(), lr=1e-5)

for episode in range(num_episodes):
    # Actor executes the task with the current prompt
    trajectory, reward = actor.execute(task, prompt=current_prompt)

    if reward == 0:  # failed episode
        # Retrospective model generates a reflection from (tau_i, r_i)
        reflection_input = format_reflection_input(trajectory, reward)
        reflection = retro_model.generate(reflection_input)

        # Append the reflection to the actor's prompt
        current_prompt = append_reflection(current_prompt, reflection)

        # Run the next episode to obtain the return G_{k,i+1}
        next_traj, next_reward = actor.execute(task, prompt=current_prompt)

        # REINFORCE-style policy gradient update on the retrospective model
        loss = -next_reward * log_prob(reflection, reflection_input)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
</code>

===== See Also =====

* [[reflexion|Reflexion: Verbal Reinforcement Learning]]
* [[agent_finetuning|Agent Fine-tuning Methods]]
* [[fireact_agent_finetuning|FireAct: Agent Fine-tuning]]

===== References =====