A world model typically combines several components:
  * **Transition model**: Predicts how environmental state changes given an action -- $p(s_{t+1} | s_t, a_t)$
  * **Observation model**: Determines what the agent perceives in each state -- $p(o_t | s_t)$
  * **Reward predictor**: Estimates the reward associated with each state -- $p(r_t | s_t)$
  * **Latent state encoder**: Compresses high-dimensional observations into compact latent representations
The agent can then "imagine" candidate futures inside this learned model, planning without costly interaction with the real environment.
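To make the four components concrete, here is a minimal linear-Gaussian toy sketch (means only, noise omitted). Every matrix, dimension, and function name below is invented for illustration and corresponds to no specific system:

```python
# Toy world model: the four components above as plain functions.
# All weights are random stand-ins for learned networks.
import numpy as np

rng = np.random.default_rng(0)
S, A, O = 4, 2, 8                     # latent-state, action, observation dims
W_s = 0.9 * np.eye(S)                 # transition dynamics
W_a = rng.normal(0, 0.1, (S, A))      # action influence
W_o = rng.normal(0, 0.1, (O, S))      # observation decoder
w_r = rng.normal(0, 0.1, S)           # reward weights

def transition(s, a):                 # mean of p(s_{t+1} | s_t, a_t)
    return W_s @ s + W_a @ a

def observe(s):                       # mean of p(o_t | s_t)
    return W_o @ s

def reward(s):                        # reward predictor
    return float(w_r @ s)

def imagine(s0, actions):
    """Roll the model forward internally -- 'imagining' without acting."""
    s, total = s0, 0.0
    for a in actions:
        s = transition(s, a)
        total += reward(s)
    return s, total

s_final, ret = imagine(np.zeros(S), [np.ones(A)] * 5)
```

In a real agent each of these functions would be a learned network, and the encoder would map observations back into the latent state.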
**DreamerV3** (Nature, 2025) achieves mastery across 150+ diverse tasks with a single configuration:
  * Learns a Recurrent State-Space Model (RSSM) whose deterministic state evolves as:

$$h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1})$$

  * Imagines future trajectories in latent space by rolling out the prior: $\hat{z}_t \sim p_\theta(\hat{z}_t | h_t)$
  * Trains actor and critic entirely within imagined rollouts, optimizing:

$$\mathcal{J}_{\text{actor}}(\psi) = \mathbb{E}_{\text{imagine}}\left[\sum_{t=0}^{H} \gamma^t \hat{r}_t\right]$$
  * Uses symlog normalization and percentile-based scaling for robustness
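The symlog transform and its inverse are simple enough to state directly ($\text{symlog}(x) = \text{sign}(x)\ln(1+|x|)$); the percentile-based return scaling is sketched here roughly as described in the DreamerV3 paper, with the surrounding training loop omitted:

```python
# symlog / symexp: a bijective squashing pair that compresses large
# magnitudes while staying identity-like near zero.
import numpy as np

def symlog(x):
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    return np.sign(x) * np.expm1(np.abs(x))

def return_scale(returns, lo=5, hi=95):
    # Scale denominator from the 5th-95th percentile range of returns,
    # clipped below at 1 so small returns are not amplified.
    return max(1.0, np.percentile(returns, hi) - np.percentile(returns, lo))

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
roundtrip = symexp(symlog(x))   # recovers x exactly
```

Squashing targets with symlog lets one network regress rewards and values whose raw magnitudes vary across many orders of magnitude between tasks.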
World models enable several planning strategies:
  * **Forward rollout**: Simulate action sequences, select the one with highest cumulative predicted reward: $a_{0:H}^* = \arg\max_{a_{0:H}} \sum_{t=0}^{H} \gamma^t \hat{r}_t$
  * **Model Predictive Control (MPC)**: Re-plan at every step using the latest state estimate
  * **Tree search**: Explore branching futures (MCTS-style) within the world model
  * **Latent planning**: Optimize action sequences directly in latent space via gradient descent
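The forward-rollout strategy can be sketched as random shooting: sample candidate action sequences, score each by cumulative predicted reward under the model, and keep the best. The quadratic-reward "model" below is invented purely for illustration:

```python
# Random-shooting forward rollout against a toy learned model.
import numpy as np

rng = np.random.default_rng(0)

def step_model(s, a):                 # stand-in dynamics + reward
    s_next = s + a
    r = -float(s_next @ s_next)       # reward: stay near the origin
    return s_next, r

def plan_forward_rollout(s0, horizon=10, n_candidates=256, gamma=0.99):
    best_seq, best_ret = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, (horizon, s0.shape[0]))
        s, ret = s0.copy(), 0.0
        for t, a in enumerate(seq):
            s, r = step_model(s, a)
            ret += gamma**t * r       # discounted cumulative predicted reward
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq, best_ret

seq, ret = plan_forward_rollout(np.array([2.0, -2.0]))
```

Wrapping this in a loop that executes only `seq[0]` and re-plans from the observed next state yields the MPC strategy from the list above.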
| + | |||
===== Prediction Loss =====

World models are trained by minimizing a composite loss over predicted observations, rewards, and latent dynamics:

$$\mathcal{L}_{\text{world}} = \mathbb{E}\left[\sum_{t=1}^{T}\left(\underbrace{\|o_t - \hat{o}_t\|^2}_{\text{reconstruction}} + \underbrace{(r_t - \hat{r}_t)^2}_{\text{reward prediction}} + \underbrace{D_{\text{KL}}(q(z_t|o_t) \| p(z_t|h_t))}_{\text{latent regularization}}\right)\right]$$

The KL term encourages the prior (imagination) distribution to match the posterior (observation-conditioned) distribution, keeping imagined rollouts consistent with what the agent actually observes.
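For a single timestep with diagonal-Gaussian latents, the three terms can be computed in closed form. This is an illustrative numpy sketch, not any particular implementation; a real system would backpropagate through learned encoder and decoder networks:

```python
# One-timestep evaluation of the composite world-model loss above.
import numpy as np

def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    # Closed-form KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ), summed over dims
    return float(np.sum(np.log(sig_p / sig_q)
                        + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2)
                        - 0.5))

def world_model_loss(o, o_hat, r, r_hat, mu_q, sig_q, mu_p, sig_p):
    recon = float(np.sum((o - o_hat) ** 2))        # reconstruction term
    rew = (r - r_hat) ** 2                         # reward-prediction term
    kl = gaussian_kl(mu_q, sig_q, mu_p, sig_p)     # latent regularization
    return recon + rew + kl
```

When the prior exactly matches the posterior and predictions are perfect, all three terms vanish, which is the fixed point the training objective pushes toward.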
===== Sim-to-Real Transfer =====