World Models for Agents

World models are internal computational simulators that allow AI agents to predict the next state of a dynamic system, represent underlying physics and causal relationships, and plan strategies without directly interacting with the real world¹⁾. An agent's world model serves as an internal simulator that enables imagination-based planning: the agent reasons over predicted future states before committing to an action, moving AI from being a narrator to being a competent operator in physical environments²⁾. These integrated systems improve decision-making through predictive simulation rather than through better direct reasoning³⁾. World models are a critical frontier in Physical AI, which moves beyond low-bandwidth language abstractions to interact with the physical world by understanding spatial geometry and 4D reality⁴⁾.

World Models vs. Large Language Models

While Large Language Models function as brilliant narrators that excel at text prediction and abstraction, world models aim to capture the ground truth of physics and causality⁵⁾. LLMs operate as low-bandwidth abstractions of reality: effective for language generation, but limited in their ability to understand spatial geometry and causal mechanics. World models, by contrast, function as competent operators that directly model physical phenomena and state transitions. This distinction marks a fundamental shift in AI development, from the era of pure token prediction toward the era of physical simulation and embodied reasoning about dynamic environments.

Core Architecture

A world model typically combines several components:

Transition model: Predicts how environmental state changes given an action, $p(s_{t+1} | s_t, a_t)$
Observation model: Determines what the agent perceives in each state, $p(o_t | s_t)$
Reward predictor: Estimates expected reward for state-action pairs, $\hat{r}(s_t, a_t)$
Latent state encoder: Compresses high-dimensional observations into compact latent representations $z_t = \text{enc}(o_t)$

The agent can then “dream”: it simulates trajectories within the learned world model to evaluate plans without costly real-world interaction.
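The four components above can be sketched as a small container with a "dream" rollout. This is a minimal illustration, not any published implementation: the class name `ToyWorldModel`, its field names, and the one-dimensional toy dynamics are all assumptions made for the example, and each distribution is replaced by its deterministic mean.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyWorldModel:
    """Hypothetical container for the four components listed above."""
    encode: Callable[[float], float]             # z_t = enc(o_t)
    transition: Callable[[float, float], float]  # mean of p(s_{t+1} | s_t, a_t)
    observe: Callable[[float], float]            # mean of p(o_t | s_t), unused while dreaming
    reward: Callable[[float, float], float]      # r_hat(s_t, a_t)

    def dream(self, first_obs: float, actions: List[float]) -> Tuple[List[float], float]:
        """Roll the model forward without touching a real environment."""
        state = self.encode(first_obs)
        states, total = [state], 0.0
        for action in actions:
            total += self.reward(state, action)
            state = self.transition(state, action)
            states.append(state)
        return states, total


# Toy dynamics (an assumption for illustration): a point on a line moves by
# `action` and is rewarded for ending each step close to the origin.
model = ToyWorldModel(
    encode=lambda o: o,
    transition=lambda s, a: s + a,
    observe=lambda s: s,
    reward=lambda s, a: -abs(s + a),
)

states, total_reward = model.dream(first_obs=2.0, actions=[-1.0, -1.0, 0.0])
```

Comparing `total_reward` across several candidate action lists is the simplest way to evaluate plans inside the model before acting.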

Dreamer Architecture

The Dreamer family (V1, V2, V3) by Danijar Hafner et al. is one of the most successful lines of world-model-based RL agents.

DreamerV3⁶⁾ (Nature, 2025) achieves mastery across 150+ diverse tasks with a single configuration:

Learns a Recurrent State-Space Model (RSSM) with deterministic state $h_t$ and stochastic state $z_t$:

$$h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1}), \quad z_t \sim q_\theta(z_t | h_t, o_t)$$

Imagines future trajectories in latent space by rolling out the prior: $\hat{z}_t \sim p_\theta(\hat{z}_t | h_t)$
Trains actor and critic entirely within imagined rollouts, optimizing:

$$\mathcal{J}_{\text{actor}}(\psi) = \mathbb{E}_{\text{imagine}}\left[\sum_{t=0}^{H} \gamma^t \hat{r}_t\right]$$

Uses symlog normalization and percentile-based scaling for robustness
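The RSSM recurrence and the discounted objective can be illustrated with scalar states. This is a rough sketch under strong assumptions: fixed `tanh` and Gaussian maps stand in for the learned networks, the coefficients are arbitrary, and all function names are hypothetical. It is not the DreamerV3 implementation.

```python
import math
import random


def deterministic_step(h_prev, z_prev, a_prev):
    """h_t = f(h_{t-1}, z_{t-1}, a_{t-1}); a fixed tanh map replaces the learned network."""
    return math.tanh(0.5 * h_prev + 0.3 * z_prev + 0.2 * a_prev)


def posterior_sample(h, obs, rng):
    """z_t ~ q(z_t | h_t, o_t): conditioned on the real observation."""
    return rng.gauss(0.5 * h + 0.5 * obs, 0.1)


def prior_sample(h, rng):
    """z_hat_t ~ p(z_hat_t | h_t): no observation, used for imagination."""
    return rng.gauss(0.5 * h, 0.1)


def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * r_hat_t, the quantity inside the actor objective."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def imagine_rewards(h, z, policy, reward_fn, horizon, rng):
    """Roll out the prior in latent space and collect predicted rewards."""
    rewards = []
    for _ in range(horizon):
        a = policy(h, z)
        h = deterministic_step(h, z, a)
        z = prior_sample(h, rng)
        rewards.append(reward_fn(h, z))
    return rewards


rng = random.Random(0)
h0 = deterministic_step(0.0, 0.0, 0.0)
z0 = posterior_sample(h0, obs=1.0, rng=rng)  # ground the start state in one real observation
rewards = imagine_rewards(h0, z0, policy=lambda h, z: -z,
                          reward_fn=lambda h, z: -abs(z), horizon=5, rng=rng)
objective = discounted_return(rewards)
```

Only the first latent state uses the posterior. Every later step samples from the prior, which is why imagined rollouts need no environment interaction.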

Key achievement: DreamerV3 was the first algorithm to collect a diamond in Minecraft from scratch without human demonstrations, a long-horizon task requiring hundreds of sequential decisions across multiple subgoals.

# Simplified Dreamer imagination loop
class WorldModel:
    def __init__(self, rssm, reward_head, decoder):
        self.rssm = rssm                # latent dynamics model (RSSM)
        self.reward_head = reward_head  # predicts reward from a latent state
        self.decoder = decoder          # reconstructs observations; not used while imagining

    def imagine(self, initial_state, policy, horizon=15):
        """Generate an imagined trajectory for planning."""
        # Start the trajectory as a list so later states can be appended.
        states, rewards = [initial_state], []
        state = initial_state
        for _ in range(horizon):
            action = policy(state)
            state = self.rssm.predict_next(state, action)
            rewards.append(self.reward_head(state))
            states.append(state)
        return states, rewards

Voyager: LLM-Powered World Knowledge

Voyager⁷⁾ (NVIDIA, 2023) takes a fundamentally different approach, using an LLM as the world model and planner for an embodied agent in Minecraft:

Automatic curriculum: LLM proposes progressively harder exploration goals
Skill library: Stores executable code snippets for complex behaviors, retrieved and composed for novel situations
Environment feedback: Execution results, game state, and error messages feed back to the LLM

Unlike Dreamer's learned latent dynamics, Voyager leverages the LLM's pretrained world knowledge. It continuously discovers new skills without human intervention, demonstrating lifelong learning in an open-ended environment.
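A skill library of this kind can be sketched as a store of code snippets keyed by natural-language descriptions. The sketch below is illustrative only: the class and method names are hypothetical, the stored snippets are placeholders, and plain word overlap (Jaccard similarity) stands in for whatever retrieval method a real system uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


@dataclass
class SkillLibrary:
    """Hypothetical skill store: description -> executable code snippet."""
    skills: Dict[str, str] = field(default_factory=dict)

    def add(self, description: str, code: str) -> None:
        self.skills[description] = code

    def _score(self, task: str, description: str) -> float:
        # Jaccard word overlap; a stand-in for a real similarity search.
        query, words = _tokens(task), _tokens(description)
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    def retrieve(self, task: str, k: int = 2) -> List[str]:
        """Return up to k skill descriptions most relevant to the task."""
        ranked = sorted(self.skills, key=lambda d: self._score(task, d), reverse=True)
        return [d for d in ranked[:k] if self._score(task, d) > 0]


library = SkillLibrary()
# The code strings are placeholders, not real skills.
library.add("craft wooden pickaxe", "<code for crafting a wooden pickaxe>")
library.add("mine iron ore", "<code for mining iron ore>")
library.add("cook raw beef", "<code for cooking beef>")

matches = library.retrieve("craft stone pickaxe", k=1)
```

The retrieved snippets would then be placed in the LLM's prompt so that a new skill can be composed from existing ones.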

LLMs as World Models

Recent research (2025-2026) demonstrates that LLMs can serve directly as environment simulators:

Fine-tuned Qwen2.5-7B and Llama-3.1-8B achieved >99% accuracy predicting state transitions in ALFWorld and ~98.6% in SciWorld
Even without fine-tuning, Claude Sonnet achieved 77% accuracy with just 3 examples
Architecture: One “World Model LLM” simulates the environment while a separate “Agent LLM” plans and acts

This decoupled approach enables training agents in simulated environments generated by LLMs, dramatically reducing the cost of environment interaction.
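The decoupled setup can be sketched with two plain callables standing in for the two LLMs. Everything here is an assumption for illustration: the stub functions replace real model calls, the text states are invented, and exact string match is only one possible way to score transition accuracy.

```python
from typing import Callable, List, Tuple

Transition = Tuple[str, str, str]  # (state, action, true next state)


def transition_accuracy(world_model: Callable[[str, str], str],
                        data: List[Transition]) -> float:
    """Fraction of transitions whose predicted next state matches exactly."""
    correct = sum(world_model(s, a).strip() == nxt.strip() for s, a, nxt in data)
    return correct / len(data)


def rollout(world_model: Callable[[str, str], str],
            agent: Callable[[str], str], state: str, steps: int) -> List[str]:
    """Let the agent act inside the simulated environment only."""
    trajectory = [state]
    for _ in range(steps):
        action = agent(state)
        state = world_model(state, action)
        trajectory.append(state)
    return trajectory


def stub_world_model(state: str, action: str) -> str:
    # Stand-in for the "World Model LLM": it only knows how to open the fridge.
    if state == "fridge closed" and action == "open fridge":
        return "fridge open"
    return state


def stub_agent(state: str) -> str:
    # Stand-in for the "Agent LLM".
    return "open fridge" if state == "fridge closed" else "wait"


held_out = [
    ("fridge closed", "open fridge", "fridge open"),
    ("fridge open", "wait", "fridge open"),
    ("fridge open", "close fridge", "fridge closed"),  # the stub gets this one wrong
    ("fridge closed", "wait", "fridge closed"),
]
accuracy = transition_accuracy(stub_world_model, held_out)
trajectory = rollout(stub_world_model, stub_agent, "fridge closed", steps=2)
```

Because the agent only ever calls `world_model`, swapping the stub for a stronger simulator changes nothing in the agent loop.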

Foundation Models and World Engines

NVIDIA Cosmos represents a new class of world models: foundation models designed to serve as physics engines for large-scale synthetic data generation⁸⁾. Cosmos compresses spatiotemporal reality into tokenized representations that enable the model to process, predict, and generate complex physical phenomena. Such world foundation models are central to the industry's broader shift toward grounding machine intelligence in the physical world, enabling large-scale generation of synthetic training environments without explicit physics simulators.

Genie 3 (Google DeepMind, August 2025) is a generative world model that produces fully interactive, playable environments from a text prompt⁹⁾. Unlike purely predictive world models, Genie 3 generates action-controllable 3D environments and simulates realistic physics within them, enabling agents to train in dynamically generated worlds. This positions Genie 3 as a milestone toward world-model-based AGI, demonstrating that world models can be constructive simulators rather than passive observers.

Spatial-Temporal Scene Reconstruction

D4RT (DeepMind) represents an architectural leap in world model design by reconstructing dynamic 4D environments through unified perception and tracking¹⁰⁾. Rather than predicting future states frame-by-frame, D4RT provides a highly parallelized, queryable interface for understanding how environments evolve across spatial and temporal dimensions. This approach advances spatiotemporal reasoning by enabling agents to query environment states at arbitrary spatial locations and time steps, supporting more sophisticated planning and scene understanding than purely predictive models alone.

3D World Models and Production-Ready Assets

A new generation of world models extends beyond video prediction and environment simulation to generate editable 3D scenes with production-ready properties. Models such as HYWorld 2.0¹¹⁾ focus on engine-readiness by generating assets with proper topology, UV mapping, and rigging. These capabilities allow AI-generated 3D environments and objects to be directly integrated into production pipelines for gaming, virtual reality, and digital content creation without requiring manual asset refinement. This represents a significant practical advance in world models, bridging the gap between generative simulation and real-time interactive applications where geometric and material properties must meet strict engineering standards.

Imagination-Based Planning

World models enable several planning strategies:

Forward rollout: Simulate action sequences, select the one with highest cumulative predicted reward: $a_{0:H}^* = \arg\max_{a_{0:H}} \sum_{t=0}^{H} \gamma^t \hat{r}(s_t, a_t)$
Model Predictive Control (MPC): Re-plan at every step using the latest state estimate
Tree search: Explore branching futures (MCTS-style) within the world model
Latent planning: Optimize action sequences directly in latent space via gradient ascent on the predicted return, using the gradient $\nabla_{a_{0:H}} \sum_t \hat{r}(s_t, a_t)$
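Forward rollout and MPC combine naturally: score every candidate action sequence inside the model, execute only the first action of the best one, then re-plan. The sketch below assumes a small discrete action set so that all sequences can be enumerated exhaustively. The function names and the one-dimensional toy model are hypothetical, and `horizon` here counts actions in the plan.

```python
import itertools

ACTIONS = (-1.0, 0.0, 1.0)  # assumed discrete action set


def predicted_return(model, reward_fn, state, actions, gamma=0.99):
    """sum_t gamma^t * r_hat(s_t, a_t), computed by rolling the model forward."""
    total = 0.0
    for t, action in enumerate(actions):
        total += (gamma ** t) * reward_fn(state, action)
        state = model(state, action)
    return total


def plan_forward_rollout(model, reward_fn, state, horizon, gamma=0.99):
    """Enumerate every action sequence and return the highest-scoring one."""
    return max(
        itertools.product(ACTIONS, repeat=horizon),
        key=lambda seq: predicted_return(model, reward_fn, state, seq, gamma),
    )


def mpc(model, reward_fn, env_step, state, steps, horizon=3):
    """Model Predictive Control: re-plan at every step, execute one action."""
    visited = [state]
    for _ in range(steps):
        plan = plan_forward_rollout(model, reward_fn, state, horizon)
        state = env_step(state, plan[0])
        visited.append(state)
    return visited


def toy_model(s, a):
    # Toy world: a point on a line moves by the chosen action.
    return s + a


def toy_reward(s, a):
    # Reward is the negative distance to the origin after the move.
    return -abs(s + a)


# For simplicity the "real" environment is the model itself.
visited = mpc(toy_model, toy_reward, env_step=toy_model, state=3.0, steps=5)
```

Exhaustive enumeration grows exponentially with the horizon, which is why practical planners sample candidates or use tree search instead.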

Prediction Loss

World models are trained by minimizing a composite loss over predicted observations, rewards, and latent state distributions:

$$\mathcal{L}_{\text{world}} = \mathbb{E}\left[\sum_{t=1}^{T}\left(\underbrace{\|o_t - \hat{o}_t\|^2}_{\text{reconstruction}} + \underbrace{(r_t - \hat{r}_t)^2}_{\text{reward prediction}} + \underbrace{D_{\text{KL}}(q_\theta(z_t | h_t, o_t) \| p_\theta(z_t | h_t))}_{\text{latent regularization}}\right)\right]$$

The KL term encourages the prior (imagination) distribution to match the posterior (observation-conditioned) distribution, ensuring that imagined rollouts remain faithful to real dynamics.
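The three terms can be computed directly for a single sequence. The sketch below assumes scalar Gaussian posterior and prior distributions so that the KL divergence has a closed form. The dictionary layout and function names are hypothetical, and the outer expectation is left out.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) for scalars."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)


def world_model_loss(steps):
    """Sum reconstruction, reward-prediction and KL terms over one sequence."""
    total = 0.0
    for step in steps:
        reconstruction = sum((o - o_hat) ** 2
                             for o, o_hat in zip(step["obs"], step["obs_hat"]))
        reward_error = (step["reward"] - step["reward_hat"]) ** 2
        kl = gaussian_kl(*step["posterior"], *step["prior"])
        total += reconstruction + reward_error + kl
    return total


sequence = [
    {
        "obs": [1.0, 2.0], "obs_hat": [1.0, 1.0],   # squared error 1.0
        "reward": 0.5, "reward_hat": 0.0,           # squared error 0.25
        "posterior": (0.0, 1.0),                    # (mean, std) of q
        "prior": (0.0, 1.0),                        # (mean, std) of p; identical, so KL is 0
    },
]
loss = world_model_loss(sequence)
```

If the prior mean drifted to 1.0 with the same variance, the KL term alone would add 0.5, penalizing imagination that disagrees with what the observation implies.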

Sim-to-Real Transfer

For embodied agents, world models bridge simulation and reality through a continuous learning cycle:

Simulation-based training: Agents practice policies within learned world models or physics simulators before deployment to physical hardware
Data bottleneck solution: Millions of failures and iterations can occur safely in simulation, addressing the critical constraint of costly real-world robot interaction
Physics grounding: World models provide physics-grounded training environments that support embodied agent development across factory automation, warehouse robotics, and autonomous vehicles
Domain randomization: Transfer policies to real robots by training on diverse simulated dynamics
Reality gap modeling: Continuous feedback loops allow agents to adapt learned behaviors to account for systematic differences between simulated and real environments
NVIDIA Omniverse and similar platforms: Generate diverse synthetic training data at scale using world models
Iterative refinement: Embodied agents can incrementally improve performance by alternating between simulation training and real-world deployment cycles
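Domain randomization can be illustrated by scoring a controller across many simulated variants of the same task. The sketch below is a toy under stated assumptions: the cart dynamics, the parameter ranges, and all names are invented for the example, and "training" is reduced to choosing the best of three controller gains.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Physical parameters that differ between simulation and reality."""
    mass: float
    friction: float


def sample_params(rng: random.Random) -> SimParams:
    # Ranges are illustrative assumptions, not measured values.
    return SimParams(mass=rng.uniform(0.8, 1.2), friction=rng.uniform(0.05, 0.3))


def simulate(params: SimParams, gain: float, steps: int = 50, dt: float = 0.1) -> float:
    """Push a 1-D cart toward position 0; return its final distance from 0."""
    pos, vel = 1.0, 0.0
    for _ in range(steps):
        force = -gain * pos - params.friction * vel
        vel += (force / params.mass) * dt
        pos += vel * dt
    return abs(pos)


def randomized_score(gain: float, n_envs: int = 20, seed: int = 0) -> float:
    """Average final error over simulators with randomized dynamics."""
    rng = random.Random(seed)
    return sum(simulate(sample_params(rng), gain) for _ in range(n_envs)) / n_envs


candidate_gains = [0.5, 1.0, 2.0]
best_gain = min(candidate_gains, key=randomized_score)
```

Selecting on the average over randomized dynamics favors controllers that tolerate parameter mismatch, which is the property that matters when the real robot differs from any single simulator.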

Collaborative and Multi-Agent World Models

Advanced world agents maintain structured beliefs about environments and other agents:

Collaborative Belief Worlds (CBW): Each agent tracks physical facts (zero-order beliefs) and models of collaborators' mental states (first-order beliefs)
Aspective Agentic AI: Partitions agents into information-based aspects where each observes only “their world”
Structured beliefs of this kind enable intent-aware communication under partial observability
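Zero-order and first-order beliefs can be represented as nested dictionaries. The sketch below is a hypothetical illustration, not the CBW algorithm: the class name, the fact keys, and the rule "only send what the collaborator is not believed to know" are assumptions chosen to show why tracking others' beliefs reduces communication.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AgentBeliefs:
    # Zero-order beliefs: what this agent thinks is true about the world.
    facts: Dict[str, str] = field(default_factory=dict)
    # First-order beliefs: what this agent thinks each collaborator believes.
    others: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def observe(self, key: str, value: str) -> None:
        self.facts[key] = value

    def messages_for(self, collaborator: str) -> Dict[str, str]:
        """Facts the collaborator is not yet believed to share."""
        believed = self.others.get(collaborator, {})
        return {k: v for k, v in self.facts.items() if believed.get(k) != v}

    def tell(self, collaborator: str, receiver: "AgentBeliefs") -> Dict[str, str]:
        """Send only the missing facts and update the model of the receiver."""
        update = self.messages_for(collaborator)
        receiver.facts.update(update)
        self.others.setdefault(collaborator, {}).update(update)
        return update


alice, bob = AgentBeliefs(), AgentBeliefs()
alice.observe("door", "open")
alice.observe("key", "on table")
alice.others["bob"] = {"door": "open"}  # alice assumes bob already saw the door

sent = alice.tell("bob", bob)
remaining = alice.messages_for("bob")
```

Under partial observability each agent sees only part of the world, so the first-order model determines which observations are worth transmitting.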