====== Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization ======
Retroformer introduces a principled framework for **reinforcing LLM agents by learning a retrospective model** that automatically refines agent prompts from environment feedback through policy gradient optimization. Published by Yao et al. (2023) at ICLR 2024, it is among the first works to apply gradient-based optimization to language agent improvement.
===== Overview =====
Most LLM agents use fixed prompts or rely on verbal self-reflection (e.g., Reflexion) without gradient-based learning. Retroformer addresses this gap by training a smaller, fine-tunable **retrospective model** that analyzes failed trajectories and generates improved reflections, which are fed back into the frozen actor's prompt on subsequent attempts.

The key innovation: rather than hand-crafting reflection prompts or relying on LLM self-assessment, Retroformer uses environment rewards to learn which reflections actually lead to better episodes.
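The reflect-and-retry control flow described above can be sketched as follows. This is a hypothetical illustration, not the paper's code: `run_with_retries`, `env.run`, and the callable signatures are assumptions made for the sketch.

```python
# Hypothetical sketch of the reflect-and-retry loop: after each failed
# episode, a retrospective model summarizes the trajectory into a
# reflection that is prepended to the actor's prompt on the next attempt.

def run_with_retries(actor, retrospective, env, max_retries=3):
    """Return the 1-based attempt on which the task succeeded, or None."""
    reflections = []
    for attempt in range(max_retries):
        prompt = "\n".join(reflections)          # accumulated reflections
        trajectory, success = env.run(actor, prompt)
        if success:
            return attempt + 1
        # failed: distill the trajectory into a new reflection
        reflections.append(retrospective(trajectory))
    return None
```

The actor stays frozen throughout; only the text passed into its prompt changes between attempts, which is what makes the retrospective model the sole trainable component.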
===== Architecture =====
===== Policy Gradient Optimization =====
The retrospective model is optimized using policy gradients. The reward for a reflection is the improvement in episode return after that reflection is added to the actor's prompt, so reflections that lead to better episodes are reinforced and harmful ones are suppressed.
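The credit assignment above can be written out as a minimal sketch. Function names (`reflection_rating`, `reinforce_loss`) and the explicit baseline are my own; the return-difference rating follows the paper's setup.

```python
# Sketch of Retroformer-style credit assignment: the rating of a
# reflection is the change in episode return it produced, and a
# REINFORCE-style loss scales the reflection's log-likelihood by it.

def reflection_rating(return_before: float, return_after: float) -> float:
    """Rating = improvement in episode return after the reflection."""
    return return_after - return_before

def reinforce_loss(token_log_probs, rating, baseline=0.0):
    """Negative log-likelihood of the reflection tokens, weighted by the
    baseline-subtracted rating (higher rating -> stronger reinforcement)."""
    advantage = rating - baseline
    return -advantage * sum(token_log_probs)
```

A reflection that raises the return gets a positive advantage, so minimizing this loss increases the probability of generating similar reflections; a reflection that lowers the return gets a negative advantage and is pushed down.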
===== Key Results =====
  * On **AlfWorld** household tasks, Retroformer significantly outperforms frozen baselines
  * Agents solve tasks within **3 retries**, with most improvement in early iterations
  * **Generalizes across environments and tasks** through multi-task reward learning
  * Higher LoRA rank (e.g., r=4) yields slight additional gains in the retrospective model
  * Outperforms non-gradient baselines (e.g., Reflexion) that use verbal-only feedback
  * Enhanced performance on **HotPotQA** question answering validates cross-domain applicability
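The LoRA setting mentioned in the results can be expressed with a standard PEFT configuration. This is a hedged sketch: `r=4` matches the rank referenced above, but the target modules and other hyperparameters are common defaults, not values from the paper.

```python
from peft import LoraConfig

# Hypothetical low-rank adapter config for the retrospective model;
# r=4 matches the rank discussed above, everything else is a common
# default rather than the paper's documented setting.
lora_config = LoraConfig(
    r=4,                                   # low-rank dimension
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)
```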
===== Code Example =====

A minimal sketch of the retrospective model's policy gradient update. Variable names here are illustrative assumptions, not the paper's released code:

<code python>
# Illustrative REINFORCE-style update for the retrospective model.
# `log_prob` is the summed token log-likelihood of the generated
# reflection; `rating` is the episode-return difference measured
# after vs. before the reflection was applied.
advantage = rating - baseline
loss = -advantage * log_prob   # reinforce reflections that helped

optimizer.zero_grad()
loss.backward()
optimizer.step()
</code>
| - | |||
| - | ===== References ===== | ||
| - | |||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | * [[https:// | ||
===== See Also =====
  * [[agent_finetuning|Agent Fine-tuning Methods]]
  * [[fireact_agent_finetuning|FireAct: Toward Language Agent Fine-tuning]]
| + | |||
| + | ===== References ===== | ||