World models are internal computational simulators that let AI agents predict the next state of a dynamic system, represent its underlying physics and causal relationships, and plan without interacting directly with the real world1). Used this way, a world model enables imagination-based planning: the agent reasons over predicted future states before committing to an action, which moves AI from narrating the world to operating competently in physical environments2). The gain in decision-making comes from predictive simulation rather than from improving the agent's reasoning directly3). World models are a critical frontier in Physical AI, which moves beyond low-bandwidth language abstractions to interact with the physical world through an understanding of spatial geometry and 4D reality (three spatial dimensions plus time)4).
While Large Language Models function as brilliant narrators that excel at text prediction and abstraction, world models aim to capture the ground truth of physics and causality5). LLMs operate on low-bandwidth abstractions of reality: they are effective for language generation but limited in their ability to understand spatial geometry and causal mechanics. World models, by contrast, act as competent operators that directly model physical phenomena and state transitions. This distinction marks a shift in AI development, from pure token prediction toward physical simulation and embodied reasoning about dynamic environments.
A world model typically combines several components:
The agent can then "dream": it simulates trajectories inside the learned world model to evaluate plans without costly real-world interaction.
The Dreamer family (V1, V2, V3) by Danijar Hafner et al. is among the most successful lines of world-model-based RL agents.
DreamerV36) (Nature, 2025) achieves mastery across 150+ diverse tasks with a single configuration:
$$h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1}), \quad z_t \sim q_\theta(z_t | h_t, o_t)$$
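The two update rules above can be sketched with scalar states. This is a toy illustration rather than DreamerV3's actual architecture: the `tanh` transition, the fixed weights, and the Gaussian posterior with a hand-picked standard deviation are all assumptions chosen for readability.

```python
import math
import random

def deterministic_step(h_prev, z_prev, a_prev, w=(0.5, 0.3, 0.2)):
    """h_t = f(h_{t-1}, z_{t-1}, a_{t-1}); f is a toy tanh of a weighted sum (assumed)."""
    return math.tanh(w[0] * h_prev + w[1] * z_prev + w[2] * a_prev)

def posterior_sample(h_t, o_t, rng, std=0.1):
    """z_t ~ q(z_t | h_t, o_t); a Gaussian centred between h_t and o_t (assumed)."""
    mean = 0.5 * (h_t + o_t)
    return rng.gauss(mean, std)

rng = random.Random(0)
h, z = 0.0, 0.0                    # initial deterministic and stochastic state
observations = [0.2, 0.4, 0.1]     # o_1, o_2, o_3
actions = [1.0, -1.0, 0.5]         # a_0, a_1, a_2
history = []
for o_t, a_prev in zip(observations, actions):
    h = deterministic_step(h, z, a_prev)   # uses only past information
    z = posterior_sample(h, o_t, rng)      # corrected by the new observation
    history.append((h, z))
```

The split matters for imagination: the deterministic step needs no observation, so it can keep running when the agent plans ahead, while the observation-conditioned sample is only available during real interaction.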
$$\mathcal{J}_{\text{actor}}(\psi) = \mathbb{E}_{\text{imagine}}\left[\sum_{t=0}^{H} \gamma^t \hat{r}_t\right]$$
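As a minimal sketch of the actor objective above, the expectation can be estimated by averaging discounted returns over a batch of imagined reward sequences. The function names and the example numbers are illustrative assumptions; practical Dreamer implementations use more elaborate return estimates than this plain discounted sum.

```python
def imagined_return(rewards, gamma=0.99):
    """Discounted sum of predicted rewards for one imagined trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def actor_objective(trajectories, gamma=0.99):
    """Monte Carlo estimate of the expectation: mean return over imagined rollouts."""
    returns = [imagined_return(rs, gamma) for rs in trajectories]
    return sum(returns) / len(returns)

# Two hypothetical imagined reward sequences of length 3.
rollouts = [[1.0, 0.0, 1.0], [0.0, 0.0, 2.0]]
value = actor_objective(rollouts, gamma=0.5)
print(value)  # 0.875
```

With `gamma=0.5` the first rollout returns 1 + 0.25 = 1.25 and the second returns 0.5, so the estimate is their mean.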
Key achievement: DreamerV3 was the first algorithm to collect a diamond in Minecraft from scratch without human demonstrations, a long-horizon task requiring hundreds of sequential decisions across multiple subgoals.
```python
# Simplified Dreamer imagination loop
class WorldModel:
    def __init__(self, rssm, reward_head, decoder):
        self.rssm = rssm                # latent dynamics model
        self.reward_head = reward_head  # predicts reward from a latent state
        self.decoder = decoder          # reconstructs observations (not used while imagining)

    def imagine(self, initial_state, policy, horizon=15):
        """Generate imagined trajectory for planning."""
        # The trajectory is a list that starts with the initial state.
        states, rewards = [initial_state], []
        state = initial_state
        for t in range(horizon):
            action = policy(state)
            state = self.rssm.predict_next(state, action)
            reward = self.reward_head(state)
            states.append(state)
            rewards.append(reward)
        return states, rewards
```
Voyager7) (NVIDIA, 2023) takes a fundamentally different approach, using an LLM as the world model and planner for an embodied agent in Minecraft:
Unlike Dreamer's learned latent dynamics, Voyager leverages the LLM's pretrained world knowledge. It continuously discovers new skills without human intervention, demonstrating lifelong learning in an open-ended environment.
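Voyager's paper describes storing discovered skills as executable code that later tasks can retrieve and reuse. The sketch below illustrates that idea only: the `SkillLibrary` class, its method names, and the word-overlap retrieval are hypothetical stand-ins (Voyager itself retrieves skills by embedding similarity), and the skills here are plain Python functions rather than generated game code.

```python
class SkillLibrary:
    """Toy skill store mapping a text description to an executable skill."""

    def __init__(self):
        self._skills = {}

    def add(self, description, fn):
        self._skills[description] = fn

    def retrieve(self, task):
        """Return the stored description sharing the most words with the task, or None."""
        task_words = set(task.lower().split())
        best, best_overlap = None, 0
        for description in self._skills:
            overlap = len(task_words & set(description.lower().split()))
            if overlap > best_overlap:
                best, best_overlap = description, overlap
        return best

    def run(self, description, *args):
        return self._skills[description](*args)

library = SkillLibrary()
library.add("mine wood log", lambda inventory: inventory + ["log"])
library.add("craft wooden pickaxe", lambda inventory: inventory + ["wooden_pickaxe"])

match = library.retrieve("craft a pickaxe")
inventory = library.run(match, ["log"])
```

Because each verified skill is added once and then reused, later tasks can be composed from earlier ones instead of being solved from scratch.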
Recent research (2025-2026) demonstrates that LLMs can serve directly as environment simulators:
This decoupled approach enables training agents in simulated environments generated by LLMs, dramatically reducing the cost of environment interaction.
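A minimal sketch of the idea, assuming the LLM is asked to return each transition as JSON. The `LLMSimulatedEnv` class, the prompt wording, and the JSON field names are hypothetical, and `scripted_llm` is a rule-based stand-in for a real model call so that the example runs offline.

```python
import json
from typing import Callable

class LLMSimulatedEnv:
    """Environment whose transitions are produced by a text-in, text-out model."""

    def __init__(self, llm: Callable[[str], str], initial_state: str):
        self.llm = llm
        self.state = initial_state

    def step(self, action: str):
        prompt = (
            "You simulate an environment.\n"
            f"Current state: {self.state}\n"
            f"Agent action: {action}\n"
            'Reply with JSON: {"next_state": str, "reward": float, "done": bool}'
        )
        reply = json.loads(self.llm(prompt))
        self.state = reply["next_state"]
        return reply["next_state"], float(reply["reward"]), bool(reply["done"])

def scripted_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: opens the door only when asked to."""
    if "Agent action: open door" in prompt:
        return json.dumps({"next_state": "door open", "reward": 1.0, "done": True})
    return json.dumps({"next_state": "door closed", "reward": 0.0, "done": False})

env = LLMSimulatedEnv(scripted_llm, initial_state="door closed")
first = env.step("wait")
second = env.step("open door")
```

Keeping the simulator behind a `step` interface is what decouples it from the agent: the same agent code could be pointed at a real environment or at a different simulator without changes.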
NVIDIA Cosmos represents a new class of world models: foundation models designed to serve as physics engines for large-scale synthetic data generation8). Cosmos compresses spatiotemporal reality into tokenized representations that enable the model to process, predict, and generate complex physical phenomena. Such world foundation models are central to the industry's broader shift toward grounding machine intelligence in the physical world, enabling large-scale generation of synthetic training environments without explicit physics simulators.
Genie 3 (Google DeepMind, August 2025) is a generative world model that produces interactive, playable environments from a text prompt9). Unlike purely predictive world models, Genie 3 generates action-controllable 3D environments and simulates plausible physics within them, so agents can train in dynamically generated worlds. It has been framed as a step toward world-model-based AGI, and it demonstrates that world models can be constructive simulators rather than passive observers.
D4RT (DeepMind) represents an architectural leap in world model design by reconstructing dynamic 4D environments through unified perception and tracking10). Rather than predicting future states frame-by-frame, D4RT provides a highly parallelized, queryable interface for understanding how environments evolve across spatial and temporal dimensions. This approach advances spatiotemporal reasoning by enabling agents to query environment states at arbitrary spatial locations and time steps, supporting more sophisticated planning and scene understanding than purely predictive models alone.
A new generation of world models extends beyond video prediction and environment simulation to generate editable 3D scenes with production-ready properties. Models such as HYWorld 2.011) focus on engine-readiness by generating assets with proper topology, UV mapping, and rigging. These capabilities allow AI-generated 3D environments and objects to be directly integrated into production pipelines for gaming, virtual reality, and digital content creation without requiring manual asset refinement. This represents a significant practical advance in world models, bridging the gap between generative simulation and real-time interactive applications where geometric and material properties must meet strict engineering standards.
World models enable several planning strategies:
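As one hedged illustration of planning with a world model, random-shooting model-predictive control samples candidate action sequences, scores each by rolling it out inside the model, and executes only the first action of the best sequence. The one-dimensional `dynamics` and `reward` functions below are hypothetical stand-ins for a learned model and reward head.

```python
import random

ACTIONS = [-1.0, 0.0, 1.0]

def dynamics(state, action):
    """Hypothetical learned transition: a point on a line moved by the action."""
    return state + action

def reward(state, goal=5.0):
    """Hypothetical reward head: negative distance to the goal."""
    return -abs(goal - state)

def rollout_return(state, plan):
    """Score a plan by simulating it inside the model."""
    total = 0.0
    for action in plan:
        state = dynamics(state, action)
        total += reward(state)
    return total

def plan_random_shooting(state, horizon=5, candidates=200, seed=0):
    rng = random.Random(seed)
    best_return, best_plan = float("-inf"), None
    for _ in range(candidates):
        plan = [rng.choice(ACTIONS) for _ in range(horizon)]
        total = rollout_return(state, plan)
        if total > best_return:
            best_return, best_plan = total, plan
    return best_plan, best_return

plan, score = plan_random_shooting(state=0.0)
first_action = plan[0]  # in MPC only this action is executed before replanning
```

Replanning after every real step limits the damage from model error, because mistakes in long imagined rollouts are corrected by fresh observations.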
World models are trained by minimizing a composite loss over predicted observations, rewards, and latent state distributions:
$$\mathcal{L}_{\text{world}} = \mathbb{E}\left[\sum_{t=1}^{T}\left(\underbrace{\|o_t - \hat{o}_t\|^2}_{\text{reconstruction}} + \underbrace{(r_t - \hat{r}_t)^2}_{\text{reward prediction}} + \underbrace{D_{\text{KL}}(q(z_t|h_t, o_t) \| p(z_t|h_t))}_{\text{latent regularization}}\right)\right]$$
The KL term encourages the prior (imagination) distribution to match the posterior (observation-conditioned) distribution, ensuring that imagined rollouts remain faithful to real dynamics.
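The composite loss can be sketched for a single sequence as follows. For simplicity this assumes scalar Gaussian latents, for which the KL divergence has a closed form; DreamerV3 itself uses categorical latents, and the dictionary layout of each step is a hypothetical convention for this example.

```python
import math

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(N(mu_q, std_q^2) || N(mu_p, std_p^2)) for scalar Gaussians."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2)
            - 0.5)

def world_model_loss(steps):
    """Sum reconstruction, reward, and KL terms over the time steps of one sequence."""
    total = 0.0
    for s in steps:
        recon = sum((o - o_hat) ** 2 for o, o_hat in zip(s["obs"], s["obs_hat"]))
        rew = (s["reward"] - s["reward_hat"]) ** 2
        kl = gaussian_kl(*s["posterior"], *s["prior"])  # each is a (mean, std) pair
        total += recon + rew + kl
    return total

steps = [{
    "obs": [1.0, 0.0], "obs_hat": [0.0, 0.0],   # squared error 1.0
    "reward": 1.0, "reward_hat": 0.5,           # squared error 0.25
    "posterior": (0.0, 1.0), "prior": (0.0, 1.0),  # identical, so KL is 0
}]
loss = world_model_loss(steps)
print(loss)  # 1.25
```

When the prior matches the posterior exactly the KL term vanishes, which is the condition under which imagined rollouts (driven by the prior) behave like observation-conditioned ones.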
For embodied agents, world models bridge simulation and reality through a continuous learning cycle:
Advanced world agents maintain structured beliefs about environments and other agents: