====== World Models for Agents ======

**World models** are internal representations that allow AI agents to simulate their environment, predict the outcomes of actions, and plan strategies without interacting directly with the real world. They enable imagination-based planning, where agents reason over future states before committing to actions.

===== Core Architecture =====

A world model typically combines several components:

  * **Transition model**: predicts how the environment state changes given an action -- $p(s_{t+1} | s_t, a_t)$
  * **Observation model**: determines what the agent perceives in each state -- $p(o_t | s_t)$
  * **Reward predictor**: estimates the expected reward for state-action pairs -- $\hat{r}(s_t, a_t)$
  * **Latent state encoder**: compresses high-dimensional observations into compact latent representations -- $z_t = \text{enc}(o_t)$

The agent can then "dream" -- simulate trajectories within the learned world model to evaluate plans without costly real-world interaction.

===== Dreamer Architecture =====

The **Dreamer** family (V1, V2, V3) by Danijar Hafner et al. represents the most successful line of world-model-based RL agents.
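The components listed under Core Architecture can be sketched as a minimal interface. The following is an illustrative toy only, assuming a hypothetical 2-D linear latent space (`ToyWorldModel`, its matrices `A` and `B`, and the `goal` reward shape are all invented for illustration, not any published architecture):

```python
import numpy as np

class ToyWorldModel:
    """Toy world model exposing the four components described above."""

    def __init__(self, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.A = np.array([[1.0, 0.1], [0.0, 0.9]])  # latent transition dynamics
        self.B = np.array([[0.0], [0.5]])            # effect of the action on the state
        self.goal = np.array([1.0, 0.0])             # reward peaks at this latent point

    def encode(self, observation):
        """Latent state encoder z_t = enc(o_t); here observations are already low-dimensional."""
        return np.asarray(observation, dtype=float)

    def transition(self, z, a, noise=0.0):
        """Transition model p(s_{t+1} | s_t, a_t); mean dynamics plus optional Gaussian noise."""
        return self.A @ z + self.B @ np.atleast_1d(a) + noise * self.rng.normal(size=2)

    def reward(self, z, a):
        """Reward predictor r_hat(s_t, a_t): higher when the latent state is near the goal."""
        return -float(np.sum((z - self.goal) ** 2))
```

With these pieces in hand, a single imagined step is `model.reward(model.transition(model.encode(o), a), a)` -- no real environment interaction required.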
**DreamerV3** (Nature, 2025) achieves mastery across 150+ diverse tasks with a single configuration:

  * Learns a Recurrent State-Space Model (RSSM) with deterministic state $h_t$ and stochastic state $z_t$: $$h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1}), \quad z_t \sim q_\theta(z_t | h_t, o_t)$$
  * Imagines future trajectories in latent space by rolling out the prior: $\hat{z}_t \sim p_\theta(\hat{z}_t | h_t)$
  * Trains the actor and critic entirely within imagined rollouts, optimizing: $$\mathcal{J}_{\text{actor}}(\psi) = \mathbb{E}_{\text{imagine}}\left[\sum_{t=0}^{H} \gamma^t \hat{r}_t\right]$$
  * Uses symlog normalization and percentile-based return scaling for robustness

Key achievement: DreamerV3 was the first algorithm to collect a diamond in Minecraft from scratch, without human demonstrations -- a long-horizon task requiring hundreds of sequential decisions across multiple subgoals.

<code python>
# Simplified Dreamer-style imagination loop
class WorldModel:
    def __init__(self, rssm, reward_head, decoder):
        self.rssm = rssm                # recurrent state-space model (transition prior)
        self.reward_head = reward_head  # predicts reward from a latent state
        self.decoder = decoder          # reconstructs observations (used in training)

    def imagine(self, initial_state, policy, horizon=15):
        """Generate an imagined trajectory in latent space for planning."""
        states, rewards = [initial_state], []
        state = initial_state
        for t in range(horizon):
            action = policy(state)                         # act from the imagined state
            state = self.rssm.predict_next(state, action)  # roll out the learned prior
            reward = self.reward_head(state)
            states.append(state)
            rewards.append(reward)
        return states, rewards
</code>

===== Voyager: LLM-Powered World Knowledge =====

**Voyager** (NVIDIA, 2023) takes a fundamentally different approach -- using an LLM as the world model and planner for an embodied agent in Minecraft:

  * **Automatic curriculum**: the LLM proposes progressively harder exploration goals
  * **Skill library**: stores executable code snippets for complex behaviors, retrieved and composed for novel situations
  * **Environment feedback**: execution results, game state, and error messages feed back to the LLM

Unlike Dreamer's learned latent dynamics, Voyager leverages the LLM's pretrained
world knowledge. It continuously discovers new skills without human intervention, demonstrating lifelong learning in an open-ended environment.

===== LLMs as World Models =====

Recent research (2025-2026) demonstrates that LLMs can serve directly as environment simulators:

  * Fine-tuned **Qwen2.5-7B** and **Llama-3.1-8B** achieved >99% accuracy predicting state transitions in ALFWorld and ~98.6% in SciWorld
  * Even without fine-tuning, Claude Sonnet achieved 77% accuracy with just 3 examples
  * Architecture: one "World Model LLM" simulates the environment while a separate "Agent LLM" plans and acts

This decoupled approach enables training agents in simulated environments generated by LLMs, dramatically reducing the cost of environment interaction.

===== Imagination-Based Planning =====

World models enable several planning strategies:

  * **Forward rollout**: simulate action sequences and select the one with the highest cumulative predicted reward: $a_{0:H}^* = \arg\max_{a_{0:H}} \sum_{t=0}^{H} \gamma^t \hat{r}(s_t, a_t)$
  * **Model Predictive Control (MPC)**: re-plan at every step using the latest state estimate
  * **Tree search**: explore branching futures (MCTS-style) within the world model
  * **Latent planning**: optimize action sequences directly in latent space by gradient ascent, using $\nabla_{a_{0:H}} \sum_t \hat{r}(s_t, a_t)$

===== Prediction Loss =====

World models are trained by minimizing a composite loss over predicted observations, rewards, and latent state distributions:

$$\mathcal{L}_{\text{world}} = \mathbb{E}\left[\sum_{t=1}^{T}\left(\underbrace{\|o_t - \hat{o}_t\|^2}_{\text{reconstruction}} + \underbrace{(r_t - \hat{r}_t)^2}_{\text{reward prediction}} + \underbrace{D_{\text{KL}}(q(z_t|o_t) \| p(z_t|h_t))}_{\text{latent regularization}}\right)\right]$$

The KL term encourages the prior (imagination) distribution to match the posterior (observation-conditioned) distribution, ensuring that imagined rollouts remain faithful to the real dynamics.
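The forward-rollout and MPC strategies above can be combined in a simple random-shooting planner: sample candidate action sequences, score each inside the world model, and execute only the first action of the best one. A minimal sketch, assuming a hypothetical 1-D toy dynamics (`ToyDynamics`, `rollout_return`, and `mpc_step` are illustrative names, not from any library):

```python
import numpy as np

class ToyDynamics:
    """Hypothetical stand-in for a learned world model: 1-D latent state,
    reward is highest when the state is near 1.0."""
    def transition(self, z, a):
        return 0.9 * z + 0.1 * a
    def reward(self, z, a):
        return -(z - 1.0) ** 2

def rollout_return(model, z0, actions, gamma=0.99):
    """Cumulative discounted predicted reward along one imagined trajectory."""
    z, total = z0, 0.0
    for t, a in enumerate(actions):
        z = model.transition(z, a)
        total += (gamma ** t) * model.reward(z, a)
    return total

def mpc_step(model, z0, horizon=10, n_candidates=256, rng=None):
    """Random-shooting forward rollout: sample candidate action sequences,
    score each in imagination, and return the first action of the best."""
    rng = rng or np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    scores = [rollout_return(model, z0, seq) for seq in candidates]
    return float(candidates[int(np.argmax(scores))][0])
```

Calling `mpc_step` once per environment step, each time with the latest state estimate, gives the re-planning loop that MPC describes; replacing the uniform sampling with gradient ascent on the actions yields the latent-planning variant.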
===== Sim-to-Real Transfer =====

For embodied agents, world models bridge simulation and reality:

  * Train policies in learned world models or physics simulators
  * Transfer them to real robots with domain randomization
  * World models generate diverse training data for factory automation, warehouse robotics, and autonomous vehicles
  * **NVIDIA Omniverse** and similar platforms use world models for synthetic data generation at scale
  * Key challenge: modeling the "reality gap" between simulated and real dynamics

===== Collaborative and Multi-Agent World Models =====

Advanced world agents maintain structured beliefs about both the environment and other agents:

  * **Collaborative Belief Worlds (CBW)**: each agent tracks physical facts (zero-order beliefs) and models of collaborators' mental states (first-order beliefs)
  * **Aspective Agentic AI**: partitions agents into information-based aspects, where each agent observes only "their world"
  * Enables intent-aware communication under partial observability

===== Genie 3 (2025) =====

**Genie 3** (DeepMind, August 2025) represents a breakthrough in generative world models:

  * Generates diverse, interactive 3D environments from text prompts
  * Simulates realistic physics within the generated worlds
  * Serves as a training ground for AI agents
  * Positioned as a milestone toward world-model-based AGI

===== References =====

  * [[https://arxiv.org/abs/2506.22355|arXiv:2506.22355 - World Models for Agents]]
  * [[https://arxiv.org/abs/2301.04104|arXiv:2301.04104 - DreamerV3: Mastering Diverse Domains]]
  * [[https://arxiv.org/abs/2305.16291|arXiv:2305.16291 - Voyager: Open-Ended Embodied Agent with LLMs]]

===== See Also =====

  * [[latent_reasoning|Latent Reasoning]] - reasoning in continuous latent space
  * [[agentic_reinforcement_learning|Agentic Reinforcement Learning]] - RL for training LLM agents
  * [[test_time_compute_scaling|Test-Time Compute Scaling]] - compute scaling via internal simulation