Chain of Hindsight (CoH) is a training approach that enables language models to learn from feedback by conditioning on sequences of model outputs paired with their evaluative feedback. Proposed by Liu et al., 2023 in “Chain of Hindsight Aligns Language Models with Feedback,” CoH presents the model with a chain of previous attempts along with hindsight annotations indicating what was good or bad about each attempt, training the model to generate improved outputs conditioned on positive feedback signals. This method effectively converts any form of feedback into a training signal without requiring complex reinforcement learning algorithms.
The core idea behind CoH is to leverage the language model's natural ability to understand and act on textual instructions. Rather than training a separate reward model and running RL optimization (as in RLHF), CoH directly fine-tunes the model on sequences that demonstrate the improvement process itself.
The training procedure works as follows:

1. For each prompt, sample multiple outputs from the model and collect feedback on them (e.g., human rankings or quality annotations).
2. Order the outputs and interleave them with natural-language feedback phrases, so that each training example is a single hindsight chain running from weaker to stronger attempts (as in the sketch below).
3. Fine-tune the model on these chains with the standard autoregressive language-modeling loss, so it learns to generate better outputs conditioned on the preceding attempts and feedback.
4. Regularize to avoid overfitting to the feedback templates, e.g., by randomly masking a fraction of past tokens so the model cannot simply copy earlier attempts.
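A minimal sketch of the data-construction step is shown below. The "Bad:"/"Good:" feedback phrases, the rating scheme, and the function name are illustrative assumptions, not the paper's exact templates; CoH only requires that each attempt be paired with textual hindsight feedback.

```python
# Illustrative CoH training-data construction (templates are assumptions).

def build_hindsight_chain(prompt, attempts):
    """Build one CoH training sequence from rated model outputs.

    `attempts` is a list of (output_text, rating) pairs; higher rating = better.
    """
    ordered = sorted(attempts, key=lambda pair: pair[1])  # worst -> best
    parts = [prompt]
    for rank, (output, _rating) in enumerate(ordered):
        # The best attempt gets the positive prefix; earlier ones the negative.
        prefix = "Good:" if rank == len(ordered) - 1 else "Bad:"
        parts.append(f"{prefix} {output}")
    return "\n".join(parts)


chain = build_hindsight_chain(
    "Summarize: The report covers Q3 revenue growth across three regions.",
    [
        ("Revenue went up.", 1),                            # weak attempt
        ("Q3 revenue grew across all three regions.", 3),   # stronger attempt
    ],
)
print(chain)
```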
At inference time, the model is prompted with positive feedback prefixes (e.g., “Good:”) to elicit the desired high-quality behavior. The model has learned to associate positive feedback signals with the patterns of its best outputs.
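Reusing the hypothetical template above, inference amounts to appending the positive prefix to the prompt and letting the fine-tuned model continue; the model handle below is a placeholder, not a specific API.

```python
# Illustrative inference-time prompt: the "Good:" prefix cues the fine-tuned
# model to continue in the style of its highest-rated outputs.
prompt = (
    "Summarize: The report covers Q3 revenue growth across three regions.\n"
    "Good:"
)
# completion = fine_tuned_model.generate(prompt)  # hypothetical model handle
```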
The key insight is that the model learns not just from good examples, but from the contrast between poor and good outputs. By seeing the trajectory from bad to good, conditioned on feedback, the model internalizes what improvements look like.
CoH is flexible in how feedback is sourced and structured.
The feedback is polarity-agnostic in construction: both positive and negative feedback appear in the training chains. This is a significant advantage over methods like RLHF that primarily learn from preferred vs. rejected pairs, as CoH can incorporate nuanced feedback about specific aspects of quality (factuality, style, completeness, safety).
Data construction scales efficiently because multiple outputs per prompt can be generated cheaply from the base model, and feedback annotation can be partially automated.
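As a sketch of how aspect-level annotations might be turned into textual hindsight feedback, consider the helper below. The aspect names and phrasing are assumptions for illustration, not templates from the paper.

```python
# Convert per-aspect annotations into a natural-language feedback string that
# can be attached to an attempt inside a hindsight chain.

def feedback_from_aspects(aspects):
    """aspects: dict mapping aspect name -> bool (True = satisfactory)."""
    good = [name for name, ok in aspects.items() if ok]
    bad = [name for name, ok in aspects.items() if not ok]
    notes = []
    if good:
        notes.append("good " + " and ".join(good))
    if bad:
        notes.append("poor " + " and ".join(bad))
    return "This answer has " + ", but ".join(notes) + "."


print(feedback_from_aspects({"factuality": True, "completeness": False}))
# -> "This answer has good factuality, but poor completeness."
```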
| Method | Mechanism | Handles Negative Feedback | Optimization | Complexity |
| --- | --- | --- | --- | --- |
| RLHF | Reward model + PPO | Poorly (focuses on preferred outputs) | Multi-stage RL loop, often unstable | High |
| DPO | Direct preference optimization | Via preference pairs only | Single-stage supervised, stable | Medium |
| CoH | Hindsight chains + supervised fine-tuning | Yes, directly in natural language | Standard autoregressive LM loss | Low |
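The last table row can be made concrete with a small sketch of the objective: ordinary next-token cross-entropy over a hindsight chain, with the loss restricted to the model-output tokens (prompt and feedback tokens are masked out). Tensor shapes and the masking convention below are illustrative assumptions, written in PyTorch.

```python
# Sketch of the CoH objective: standard causal-LM cross-entropy, applied only
# to tokens belonging to the model outputs inside a hindsight chain.
import torch
import torch.nn.functional as F

def coh_loss(logits, input_ids, output_token_mask):
    """logits: (B, T, V); input_ids: (B, T);
    output_token_mask: (B, T) bool, True where the token is part of a model
    output (loss is computed there, not on prompt/feedback tokens)."""
    labels = input_ids.clone()
    labels[~output_token_mask] = -100              # ignored by cross_entropy
    shift_logits = logits[:, :-1, :].contiguous()  # position t predicts t+1
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```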
Key advantages of CoH over RLHF:

- No separate reward model and no RL loop: training reduces to supervised fine-tuning with the standard autoregressive LM loss.
- More stable optimization, avoiding the multi-stage PPO pipeline that is often brittle in practice.
- Negative feedback is used directly as natural language rather than being collapsed into a scalar reward or discarded.
- Feedback can address specific aspects of quality (factuality, style, completeness, safety) that a single reward score cannot easily capture.
Compared to DPO, CoH can incorporate feedback beyond simple binary preferences, including partial credit, aspect-specific feedback, and improvement suggestions.
Liu et al. (2023) evaluated CoH on summarization and dialogue tasks, reporting that models trained with hindsight chains outperformed both supervised fine-tuning and RLHF baselines in their evaluations.
Applications of CoH extend to any domain where feedback is available: instruction following, code generation (learning from compiler errors and test failures), creative writing (learning from editorial feedback), and multi-turn dialogue improvement. The approach is particularly valuable when feedback is diverse and multi-dimensional, as natural language can capture subtleties that scalar rewards cannot.