====== Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization ======
Retroformer introduces a principled framework for **reinforcing LLM agents by learning a retrospective model** that automatically refines agent prompts from environment feedback through policy gradient optimization. Published by Yao et al. (2023) at ICLR 2024(([[https://arxiv.org/abs/2308.02151|Yao et al. (2023) - Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization]])), it is among the first works to apply gradient-based optimization to language agent improvement.
===== Overview =====
Most LLM agents use fixed prompts or rely on verbal self-reflection (e.g., Reflexion) without gradient-based learning. Retroformer addresses this gap by training a smaller, fine-tunable **retrospective model** that analyzes failed trajectories and generates improved reflections, optimized via policy gradients from actual environment rewards.
The key innovation: rather than hand-crafting reflection prompts or relying on LLM self-assessment, Retroformer **learns** to produce better reflections through reward-driven optimization.(([[https://github.com/weirayao/Retroformer|Retroformer GitHub Repository]]))
===== Architecture =====
<code>
graph TD
    A[Environment Task] --> B[Actor LLM - Frozen]
    B --> C[Execute Trajectory tau_i]
    C --> D[Receive Reward r_i]
    D --> E[Retrospective Model M_r]
    E --> F[Generate Reflection y_k]
    F --> G[Append to Actor Prompt]
    G --> B
    D -->|Policy Gradient| H[Update M_r Parameters]
    H --> E
</code>
The system comprises two models:
* **Actor** M_a: A frozen large LLM (e.g., GPT) that executes tasks. Treated as part of the environment.
* **Retrospective Model** M_r: A smaller fine-tunable LM that generates self-reflections from failed trajectories.
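The division of labor above can be made explicit with interface sketches. These `Protocol` classes are illustrative (the names `Actor`, `RetrospectiveModel`, and their method signatures are assumptions, not the paper's API): the frozen actor is queried like an environment, and only the retrospective model exposes anything trainable.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Actor(Protocol):
    """M_a: frozen large LLM, treated as part of the environment.

    Running it on a task yields a trajectory and a scalar reward.
    """
    def execute(self, task: str, prompt: str) -> tuple[list[str], float]: ...

@runtime_checkable
class RetrospectiveModel(Protocol):
    """M_r: smaller, fine-tunable LM that turns a failed trajectory
    and its reward into a textual self-reflection."""
    def generate(self, failed_trajectory: list[str], reward: float) -> str: ...

# Any object with matching methods satisfies the protocol structurally:
class DummyActor:
    def execute(self, task: str, prompt: str) -> tuple[list[str], float]:
        return (["look around"], 0.0)

assert isinstance(DummyActor(), Actor)
```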
===== Policy Gradient Optimization =====
The retrospective model is optimized using policy gradients.((https://arxiv.org/abs/2308.02151)) Given a trajectory \tau_i and reward r_i, the model generates a reflection y_{k,i} from input x_{k,i} = \{\tau_i, r_i\}. The quality of this reflection is measured by the subsequent episode return:
\nabla_\theta J = \mathbb{E}\left[\sum_{k} G_{k,i+1} \nabla_\theta \log P_\theta(y_{k,i} | x_{k,i})\right]
where G_{k,i+1} is the return of the next episode after applying the reflection. This enables the retrospective model to learn which types of reflections lead to better task performance.
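The objective above is standard REINFORCE with the reflection tokens as the action. A minimal sketch of the per-reflection loss, using a toy next-token model in place of a real pretrained LM (the model, vocabulary size, and token ids here are placeholders, not from the paper): the loss is -G_{k,i+1} times the summed log-probability of the reflection tokens y_{k,i} conditioned on the input x_{k,i}.

```python
import torch
import torch.nn as nn

# Toy stand-in for the retrospective model M_r: embedding + linear head
# mapping token ids to next-token logits. A real system would use a
# pretrained causal LM here.
vocab_size = 100
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))

def reinforce_loss(input_ids, reflection_ids, episode_return):
    """Return -G_{k,i+1} * log P_theta(y_{k,i} | x_{k,i})."""
    # Teacher forcing over the concatenation of input and reflection.
    full = torch.cat([input_ids, reflection_ids])
    logits = model(full[:-1])                       # logits[t] predicts full[t+1]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Only the reflection tokens y_{k,i} contribute to the policy gradient.
    start = input_ids.shape[0] - 1
    token_lp = log_probs[start:, :].gather(1, full[start + 1:].unsqueeze(1))
    return -episode_return * token_lp.sum()

x = torch.randint(0, vocab_size, (5,))   # x_{k,i}: trajectory + reward, tokenized
y = torch.randint(0, vocab_size, (3,))   # y_{k,i}: generated reflection, tokenized
loss = reinforce_loss(x, y, episode_return=1.0)
loss.backward()                          # gradient of J w.r.t. theta
```

A positive next-episode return pushes the model toward reflections like the one just emitted; a zero return leaves it unchanged.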
The reflection output summarizes:
* **Root cause** of the failure (e.g., mismatch with task description)
* **Action plan** for the next attempt (concise high-level strategy)
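To make the two-field reflection format concrete, here is a hypothetical serialization of x_{k,i} = {tau_i, r_i} and a parser for the generated reflection. The prompt wording and the `Root cause:` / `Action plan:` field labels are illustrative choices, not the paper's exact template.

```python
def format_reflection_input(trajectory, reward):
    """Build x_{k,i} = {tau_i, r_i} as a text prompt for M_r."""
    steps = "\n".join(f"  {i}. {action} -> {obs}"
                      for i, (action, obs) in enumerate(trajectory, 1))
    return (f"Failed trajectory (reward={reward}):\n{steps}\n"
            "Diagnose the root cause and give an action plan.")

def parse_reflection(text):
    """Split a generated reflection into its two summary fields."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("Root cause:"):
            fields["root_cause"] = line.removeprefix("Root cause:").strip()
        elif line.startswith("Action plan:"):
            fields["action_plan"] = line.removeprefix("Action plan:").strip()
    return fields

prompt = format_reflection_input([("go to desk", "nothing there")], reward=0)
parsed = parse_reflection("Root cause: searched wrong room.\n"
                          "Action plan: check the shelf first.")
```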
===== Key Results =====
* On **AlfWorld** household tasks, Retroformer significantly outperforms frozen baselines((https://arxiv.org/abs/2308.02151))
* Agents solve tasks within **3 retries**, with most improvement in early iterations
* **Generalizes across environments and tasks** through multi-task reward learning
* Higher LoRA rank (e.g., r=4) yields slight additional gains in the retrospective model
* Outperforms non-gradient baselines (e.g., Reflexion) that use verbal-only feedback
* Enhanced performance on **HotPotQA** question answering validates cross-domain applicability(([[https://proceedings.iclr.cc/paper_files/paper/2024/file/29f421fbdcc82aeb349d784d3aaccdb3-Paper-Conference.pdf|ICLR 2024 Conference Paper]]))
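The LoRA result above concerns how the retrospective model is fine-tuned: the base weights stay frozen and only a low-rank update of rank r is trained. A minimal from-scratch sketch of one such adapter layer (the class, initialization scale, and `alpha` value are illustrative; in practice a library such as PEFT would wrap a pretrained LM):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A of rank r."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # base model stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        # B starts at zero, so the adapted layer initially equals the base layer.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(64, 64), r=4)
x = torch.randn(2, 64)
out = layer(x)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# rank-4 adapter: 4*64 + 64*4 = 512 trainable parameters, vs 64*64 + 64 frozen
```

Raising r enlarges only the A and B matrices, which is why rank changes shift the retrospective model's capacity (and results) only slightly.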
===== Code Example =====
<code python>
# Retroformer-style retrospective learning loop (simplified sketch).
# Helpers load_frozen_actor, format_reflection_input, append_reflection,
# and log_prob are assumed to be defined elsewhere.
from transformers import AutoModelForCausalLM
import torch

# Frozen actor (large LLM) and trainable retrospective model
actor = load_frozen_actor('gpt-4')
retro_model = AutoModelForCausalLM.from_pretrained('retro-base')
optimizer = torch.optim.Adam(retro_model.parameters(), lr=1e-5)

for episode in range(num_episodes):
    # Actor executes the task with the current (reflection-augmented) prompt
    trajectory, reward = actor.execute(task, prompt=current_prompt)
    if reward == 0:  # failed episode (binary task reward)
        # Retrospective model generates a reflection from {tau_i, r_i}
        reflection_input = format_reflection_input(trajectory, reward)
        reflection = retro_model.generate(reflection_input)
        # Append the reflection to the actor's prompt
        current_prompt = append_reflection(current_prompt, reflection)
        # Run the next episode to obtain G_{k,i+1}
        next_traj, next_reward = actor.execute(task, prompt=current_prompt)
        # REINFORCE update on the retrospective model
        optimizer.zero_grad()
        loss = -next_reward * log_prob(reflection, reflection_input)
        loss.backward()
        optimizer.step()
</code>
===== See Also =====
* [[reflexion|Reflexion: Verbal Reinforcement Learning]]
* [[agent_finetuning|Agent Fine-tuning Methods]]
* [[fireact_agent_finetuning|FireAct: Agent Fine-tuning]]
===== References =====