====== Retrospective Intrinsic Feedback ======

**Retrospective Intrinsic Feedback** is a mechanism for progressive strategy evolution in LLM-based agents, formalized in the **RetroAgent** framework ([[https://arxiv.org/abs/2603.08561|arXiv:2603.08561]]). RetroAgent introduces a dual feedback system combining intrinsic numerical rewards with language-based reflections, enabling agents to learn from experience through online reinforcement learning without requiring external reward signals.

===== Overview =====

Standard RL approaches for LLM agents rely on sparse, binary task-completion rewards — the agent either succeeds or fails, with no signal for partial progress. This makes learning inefficient, especially in complex multi-step environments where full task completion is rare in early training.

RetroAgent addresses this with a **hindsight self-reflection mechanism** that generates two complementary forms of intrinsic feedback after each episode, providing dense learning signals that capture partial progress and reusable strategic insights.

===== Dual Feedback Mechanism =====

=== Intrinsic Numerical Feedback ===

Rather than relying solely on binary success/failure, RetroAgent generates numerical scores that track **incremental subtask completion** relative to prior attempts. This rewards promising exploration paths even when the full task is not completed, providing a shaped reward signal that guides policy improvement. For example, if an agent in ALFWorld completes 3 of 5 required steps in a household task, it receives proportional credit — enabling the RL algorithm to reinforce the successful partial strategy.

=== Intrinsic Language Feedback ===

After each episode, the agent performs hindsight reflection to distill **reusable strategic lessons** into natural language.
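The two feedback channels above can be illustrated side by side. This is a minimal sketch: the subtask names and the lesson text are invented for illustration and do not come from the paper, where both signals are produced by the LLM itself.

<code python>
# Illustrative sketch of RetroAgent's two intrinsic feedback channels.
# Subtasks and lesson text are invented examples.

subtasks = ["find mug", "pick up mug", "go to sink", "clean mug", "place mug"]
completed = ["find mug", "pick up mug", "go to sink"]

# Numerical feedback: proportional credit for partial progress,
# instead of a flat 0.0 under a binary success/failure reward.
numerical_feedback = len(completed) / len(subtasks)
print(numerical_feedback)  # 0.6

# Language feedback: a reusable strategic lesson distilled in hindsight
# (generated by the LLM in the actual framework).
language_feedback = (
    "Locate the target object before navigating to the appliance; "
    "searching after arriving wastes steps."
)
print(language_feedback)
</code>

Under a binary reward this attempt would earn nothing; the shaped score of 0.6 lets the RL update reinforce the partially successful strategy.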
These lessons are stored in a memory buffer and retrieved using a **Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB)** strategy that balances three factors:

  * **Relevance** — semantic similarity between the current task and stored lessons
  * **Utility** — historical effectiveness of each lesson (how often it led to improvement)
  * **Exploration** — a UCB bonus ensuring that less-used lessons are periodically re-evaluated

This retrieval strategy ensures the agent both exploits proven strategies and explores potentially valuable but under-tested ones.

===== Architecture =====

RetroAgent features two implementation variants:

  * **In-Context Retrospection** — the reflection mechanism operates within the LLM's context window, generating feedback as part of the agent's standard inference pass
  * **RL-Trained Retrospection** — a dedicated retrospection policy is jointly optimized alongside the decision-making policy, using REINFORCE for the retrospection component and GRPO for decision-making

Both variants share the dual feedback loop:

  - The agent attempts the task, producing an action trajectory
  - Hindsight reflection generates a numerical score and a language lesson
  - The numerical feedback updates the RL policy via GRPO
  - The language lesson is stored in the memory buffer
  - On the next similar task, SimUtil-UCB retrieves relevant lessons as context

===== Code Example =====

Simplified RetroAgent feedback loop:

<code python>
import math
from dataclasses import dataclass


@dataclass
class Lesson:
    content: str
    task_embedding: list[float]
    utility_score: float = 0.0
    usage_count: int = 0


class RetroAgentFeedback:
    def __init__(self, llm, embedding_model):
        self.llm = llm
        self.embedder = embedding_model
        self.memory: list[Lesson] = []
        self.ucb_c = 1.4  # exploration constant

    def generate_numerical_feedback(
        self, trajectory: list[dict], subtask_checklist: list[str]
    ) -> float:
        # Proportional credit for completed subtasks (shaped reward).
        completed = sum(
            1
            for subtask in subtask_checklist
            if any(subtask in step["action"] for step in trajectory)
        )
        return completed / len(subtask_checklist)

    def generate_language_feedback(
        self, trajectory: list[dict], task: str, score: float
    ) -> str:
        # Hindsight reflection: distill a reusable lesson from the attempt.
        prompt = (
            f"Task: {task}\nScore: {score}\n"
            f"Trajectory: {trajectory[-5:]}\n\n"
            "Distill one reusable strategic lesson from this attempt."
        )
        return self.llm.generate(prompt)

    def retrieve_lessons(self, task: str, k: int = 3) -> list[Lesson]:
        # SimUtil-UCB: relevance * utility plus a UCB exploration bonus.
        task_emb = self.embedder.encode(task)
        total_uses = sum(lesson.usage_count for lesson in self.memory) + 1
        scored = []
        for lesson in self.memory:
            similarity = self._cosine_sim(task_emb, lesson.task_embedding)
            ucb_bonus = self.ucb_c * math.sqrt(
                math.log(total_uses) / (lesson.usage_count + 1)
            )
            score = similarity * lesson.utility_score + ucb_bonus
            scored.append((score, lesson))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [lesson for _, lesson in scored[:k]]

    def _cosine_sim(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x ** 2 for x in a) ** 0.5
        norm_b = sum(x ** 2 for x in b) ** 0.5
        return dot / (norm_a * norm_b + 1e-8)
</code>

===== Benchmark Results =====

RetroAgent achieves state-of-the-art results across four challenging agentic environments, substantially outperforming GRPO-trained baselines:

^ Environment ^ Improvement over GRPO ^
| ALFWorld    | +18.3% |
| WebShop     | +15.4% |
| Sokoban     | +27.1% |
| MineSweeper | +8.9%  |

The framework was validated using **Qwen-2.5-7B-Instruct** and **Llama-3.1-8B-Instruct**, demonstrating strong test-time adaptation and generalization to out-of-distribution scenarios.

===== References =====

  * [[https://arxiv.org/abs/2603.08561|arXiv:2603.08561 — RetroAgent: Retrospective Intrinsic Feedback for Progressive Strategy Evolution]]
  * [[https://huggingface.co/papers/2603.08561|HuggingFace Papers — RetroAgent]]

===== See Also =====

  * [[self_evolving_agents|Self-Evolving Agents]]
  * [[reinforcement_learning_agents|Reinforcement Learning for Agents]]
  * [[agent_memory|Agent Memory Systems]]
  * [[chain_of_thought|Chain-of-Thought Prompting]]