====== Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization ======

Retroformer introduces a principled framework for **reinforcing LLM agents by learning a retrospective model** that automatically refines agent prompts from environment feedback through policy gradient optimization.(([[https://arxiv.org/abs/2308.02151|Yao et al. (2023) - Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization]])) Published by Yao et al. at ICLR 2024, it is among the first works to apply gradient-based optimization to language-agent improvement.

===== Overview =====

Most LLM agents use fixed prompts or rely on verbal self-reflection (e.g., Reflexion) without gradient-based learning. Retroformer addresses this gap by training a smaller, fine-tunable **retrospective model** that analyzes failed trajectories and generates improved reflections, optimized via policy gradients computed from actual environment rewards.

The key innovation: rather than hand-crafting reflection prompts or relying on LLM self-assessment, Retroformer **learns** to produce better reflections through reward-driven optimization.(([[https://github.com/weirayao/Retroformer|Retroformer GitHub Repository]]))

===== Architecture =====

<code>
graph TD
    A[Environment Task] --> B[Actor LLM - Frozen]
    B --> C[Execute Trajectory tau_i]
    C --> D[Receive Reward r_i]
    D --> E[Retrospective Model M_r]
    E --> F[Generate Reflection y_k]
    F --> G[Append to Actor Prompt]
    G --> B
    D -->|Policy Gradient| H[Update M_r Parameters]
    H --> E
</code>

The system comprises two models:

* **Actor** M_a: a frozen large LLM (e.g., GPT) that executes tasks; it is treated as part of the environment.
* **Retrospective model** M_r: a smaller, fine-tunable LM that generates self-reflections from failed trajectories.
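The prompt-flow between the two models, packing a failed trajectory and its reward into the retrospective model's input, then appending the resulting reflection to the actor's prompt, can be sketched in plain Python. The helper names (''format_reflection_input'', ''append_reflection'') and the exact message formats are illustrative assumptions, not taken from the Retroformer codebase:

```python
# Illustrative sketch of the actor / retrospective-model prompt flow.
# Helper names and prompt formats are hypothetical, not from the paper's code.

def format_reflection_input(task, trajectory, reward):
    """Pack the failed trajectory and its reward into the
    retrospective model's input x_{k,i} = {tau_i, r_i}."""
    steps = "\n".join(f"{i}. {action} -> {observation}"
                      for i, (action, observation) in enumerate(trajectory, 1))
    return (
        f"Task: {task}\n"
        f"Trajectory:\n{steps}\n"
        f"Reward: {reward}\n"
        "Diagnose the root cause of the failure and give a concise plan."
    )

def append_reflection(prompt, reflection):
    """Carry the reflection into the actor's prompt for the next episode."""
    return f"{prompt}\n\nReflection from last attempt:\n{reflection}"

# Example with a toy AlfWorld-style failed trajectory
trajectory = [("go to drawer 1", "The drawer is closed."),
              ("take knife", "You don't see a knife here.")]
x = format_reflection_input("put a clean knife on the table", trajectory, 0.0)
new_prompt = append_reflection(
    "You are a household agent.",
    "Root cause: searched a closed receptacle without opening it. "
    "Plan: open the drawer before searching it.")
print(x)
print(new_prompt)
```

In the full system, ''x'' would be fed to the retrospective model and ''new_prompt'' to the frozen actor for the next attempt.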
===== Policy Gradient Optimization =====

The retrospective model is optimized using policy gradients.((https://arxiv.org/abs/2308.02151)) Given a trajectory \tau_i and reward r_i, the model generates a reflection y_{k,i} from input x_{k,i} = \{\tau_i, r_i\}. The quality of this reflection is measured by the return of the subsequent episode:

\nabla_\theta J = \mathbb{E}\left[\sum_{k} G_{k,i+1} \nabla_\theta \log P_\theta(y_{k,i} \mid x_{k,i})\right]

where G_{k,i+1} is the return of the next episode after the reflection is applied. This reward signal teaches the retrospective model which kinds of reflections lead to better task performance.

The reflection output summarizes:

* **Root cause** of the failure (e.g., a mismatch with the task description)
* **Action plan** for the next attempt (a concise, high-level strategy)

===== Key Results =====

* On **AlfWorld** household tasks, Retroformer significantly outperforms frozen baselines((https://arxiv.org/abs/2308.02151))
* Agents solve tasks within **3 retries**, with most of the improvement in early iterations
* **Generalizes across environments and tasks** through multi-task reward learning
* Higher LoRA rank (e.g., r=4) yields slight additional gains in the retrospective model
* Outperforms non-gradient baselines (e.g., Reflexion) that use verbal-only feedback
* Enhanced performance on **HotPotQA** question answering validates cross-domain applicability(([[https://proceedings.iclr.cc/paper_files/paper/2024/file/29f421fbdcc82aeb349d784d3aaccdb3-Paper-Conference.pdf|ICLR 2024 Conference Paper]]))

===== Code Example =====

The following sketch illustrates the training loop; ''load_frozen_actor'', ''format_reflection_input'', ''append_reflection'', and ''log_prob'' are placeholder helpers, not real APIs.

<code python>
# Retroformer-style retrospective learning loop (illustrative sketch)
import torch
from transformers import AutoModelForCausalLM

# Frozen actor (large LLM) and trainable retrospective model
actor = load_frozen_actor('gpt-4')  # never updated; part of the environment
retro_model = AutoModelForCausalLM.from_pretrained('retro-base')
optimizer = torch.optim.Adam(retro_model.parameters(), lr=1e-5)

for episode in range(num_episodes):
    # Actor executes the task with the current prompt
    trajectory, reward = actor.execute(task, prompt=current_prompt)

    if reward == 0:  # failed episode
        # Retrospective model generates a reflection from (tau_i, r_i)
        reflection_input = format_reflection_input(trajectory, reward)
        reflection = retro_model.generate(reflection_input)

        # Append the reflection to the actor's prompt
        current_prompt = append_reflection(current_prompt, reflection)

        # Run the next episode to obtain the return G_{k,i+1}
        next_traj, next_reward = actor.execute(task, prompt=current_prompt)

        # REINFORCE-style policy gradient update on the retrospective model
        loss = -next_reward * log_prob(reflection, reflection_input)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
</code>

===== See Also =====

* [[reflexion|Reflexion: Verbal Reinforcement Learning]]
* [[agent_finetuning|Agent Fine-tuning Methods]]
* [[fireact_agent_finetuning|FireAct: Agent Fine-tuning]]

===== References =====