Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Retrospective Intrinsic Feedback is a mechanism for progressive strategy evolution in LLM-based agents, formalized in the RetroAgent framework (arXiv:2603.08561). RetroAgent introduces a dual feedback system combining intrinsic numerical rewards with language-based reflections, enabling agents to learn from experience through online reinforcement learning without requiring external reward signals.
Standard RL approaches for LLM agents rely on sparse, binary task-completion rewards — the agent either succeeds or fails, with no signal for partial progress. This makes learning inefficient, especially in complex multi-step environments where full task completion is rare in early training.
RetroAgent addresses this with a hindsight self-reflection mechanism that generates two complementary forms of intrinsic feedback after each episode, providing dense learning signals that capture partial progress and reusable strategic insights.
Rather than relying solely on binary success/failure, RetroAgent generates numerical scores that track incremental subtask completion relative to prior attempts. This rewards promising exploration paths even when the full task is not completed, providing a shaped reward signal that guides policy improvement.
For example, if an agent in ALFWorld completes 3 of 5 required steps in a household task, it receives proportional credit — enabling the RL algorithm to reinforce the successful partial strategy.
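The proportional-credit idea can be sketched as a simple subtask-completion ratio (a minimal illustration; the checklist, actions, and substring-matching heuristic below are hypothetical stand-ins for the framework's actual subtask detection):

```python
def shaped_reward(actions: list[str], checklist: list[str]) -> float:
    """Fraction of checklist subtasks observed in the trajectory."""
    completed = sum(
        1 for subtask in checklist
        if any(subtask in action for action in actions)
    )
    return completed / len(checklist)

# Hypothetical ALFWorld-style episode: 3 of the 5 required steps occur.
actions = ["go to fridge", "open fridge", "take apple", "go to table"]
checklist = [
    "go to fridge", "open fridge", "take apple",
    "go to microwave", "heat apple",
]
print(shaped_reward(actions, checklist))  # 0.6
```

A binary reward would score this episode 0; the shaped signal of 0.6 lets the RL update reinforce the successful prefix.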
After each episode, the agent performs hindsight reflection to distill reusable strategic lessons into natural language. These lessons are stored in a memory buffer and retrieved using a Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy that balances three factors:

- semantic similarity between the stored lesson's source task and the current task,
- the lesson's accumulated utility score from past retrievals, and
- an exploration (UCB) bonus that favors rarely retrieved lessons.
This retrieval strategy ensures the agent both exploits proven strategies and explores potentially valuable but under-tested ones.
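As a rough numerical illustration (the similarity scores, utilities, usage counts, and exploration constant below are all made up), a SimUtil-UCB-style score can let a similar but rarely tried lesson outrank a well-proven one:

```python
import math

UCB_C = 1.4  # exploration constant (assumed value)

def simutil_ucb(similarity: float, utility: float,
                usage_count: int, total_uses: int) -> float:
    """Exploitation term (similarity * utility) plus a UCB exploration bonus."""
    bonus = UCB_C * math.sqrt(math.log(total_uses) / (usage_count + 1))
    return similarity * utility + bonus

total = 50  # total retrievals across the memory buffer
proven = simutil_ucb(similarity=0.90, utility=0.80, usage_count=30, total_uses=total)
rare = simutil_ucb(similarity=0.85, utility=0.50, usage_count=1, total_uses=total)
print(rare > proven)  # True: the under-tested lesson wins on its exploration bonus
```

As the rare lesson is retrieved more often, its bonus shrinks and its utility estimate takes over, shifting the balance back toward exploitation.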
RetroAgent features two implementation variants:
Both variants share the dual feedback loop: act in the environment, score the episode with an intrinsic numerical reward, reflect in hindsight to produce a language lesson, store the lesson in memory, and retrieve relevant lessons to condition the next attempt.
Simplified RetroAgent feedback loop:
```python
import math
from dataclasses import dataclass

@dataclass
class Lesson:
    content: str
    task_embedding: list[float]
    utility_score: float = 0.0
    usage_count: int = 0

class RetroAgentFeedback:
    def __init__(self, llm, embedding_model):
        self.llm = llm
        self.embedder = embedding_model
        self.memory: list[Lesson] = []
        self.ucb_c = 1.4  # exploration constant

    def generate_numerical_feedback(
        self, trajectory: list[dict], subtask_checklist: list[str]
    ) -> float:
        # Dense intrinsic reward: fraction of checklist subtasks completed.
        completed = sum(
            1 for subtask in subtask_checklist
            if any(subtask in step["action"] for step in trajectory)
        )
        return completed / len(subtask_checklist)

    def generate_language_feedback(
        self, trajectory: list[dict], task: str, score: float
    ) -> str:
        # Hindsight reflection: distill a reusable lesson in natural language.
        prompt = (
            f"Task: {task}\nScore: {score}\n"
            f"Trajectory: {trajectory[-5:]}\n\n"
            "Distill one reusable strategic lesson from this attempt."
        )
        return self.llm.generate(prompt)

    def retrieve_lessons(self, task: str, k: int = 3) -> list[Lesson]:
        # SimUtil-UCB: similarity * utility (exploit) + UCB bonus (explore).
        task_emb = self.embedder.encode(task)
        total_uses = sum(l.usage_count for l in self.memory) + 1
        scored = []
        for lesson in self.memory:
            similarity = self._cosine_sim(task_emb, lesson.task_embedding)
            ucb_bonus = self.ucb_c * math.sqrt(
                math.log(total_uses) / (lesson.usage_count + 1)
            )
            score = similarity * lesson.utility_score + ucb_bonus
            scored.append((score, lesson))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [lesson for _, lesson in scored[:k]]

    def _cosine_sim(self, a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x**2 for x in a) ** 0.5
        norm_b = sum(x**2 for x in b) ** 0.5
        return dot / (norm_a * norm_b + 1e-8)
```
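Putting the pieces together, one training iteration might look like the following self-contained sketch. The reflection function and the episode data are hypothetical stand-ins; in the real framework, the returned reward would feed an online policy-gradient update (GRPO-style) rather than just being printed:

```python
def intrinsic_reward(trajectory: list[str], checklist: list[str]) -> float:
    # Numerical feedback: fraction of subtasks reached.
    done = sum(1 for sub in checklist if any(sub in act for act in trajectory))
    return done / len(checklist)

def hindsight_reflection(trajectory: list[str], reward: float) -> str:
    # Language feedback: stand-in for an LLM reflection call.
    return f"reward={reward:.2f}; promising prefix: {trajectory[:2]}"

memory: list[str] = []  # lesson buffer shared across episodes

def training_iteration(trajectory: list[str], checklist: list[str]) -> float:
    reward = intrinsic_reward(trajectory, checklist)         # dense numeric signal
    memory.append(hindsight_reflection(trajectory, reward))  # store the lesson
    return reward  # in the real system, fed to the RL update

r = training_iteration(
    ["go to fridge", "open fridge"],
    ["go to fridge", "open fridge", "take apple"],
)
print(round(r, 3), len(memory))  # 0.667 1
```

Each subsequent episode would retrieve lessons from `memory` before acting, closing the loop between language feedback and policy improvement.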
RetroAgent achieves state-of-the-art results across four challenging agentic environments, substantially outperforming GRPO-trained baselines:
| Environment | Improvement over GRPO |
|---|---|
| ALFWorld | +18.3% |
| WebShop | +15.4% |
| Sokoban | +27.1% |
| MineSweeper | +8.9% |
The framework was validated using Qwen-2.5-7B-Instruct and Llama-3.1-8B-Instruct, demonstrating strong test-time adaptation and generalization to out-of-distribution scenarios.