Wiki · Concept · Last reviewed June 25, 2026

World Models and Spatial Intelligence

World models are AI systems that represent or generate how an environment changes over time as a function of state and action. Their value depends less on how plausible a scene looks than on whether the representation preserves the geometry, causality, uncertainty, and limits needed for a particular planning, training, reconstruction, or safety claim.

Definition

A world model is a bounded representation of how an environment is expected to change. It may model objects, space, motion, affordances, cause and effect, agents, memory, uncertainty, rewards, costs, and the consequences of action. The phrase is used across robotics, reinforcement learning, computer vision, generative video, simulation, gaming, autonomous driving, and agent evaluation.

A world model can be a learned latent dynamics model, an explicit simulator, a JEPA-style predictive representation, a generative interactive environment, a 3D reconstruction system, or a hybrid stack that combines learned and hand-built components. The important question is not whether it looks like a world, but which state variables and transitions it preserves, what it omits, at what time horizon, for which action space, and under what distribution shift.

In governance terms, a world model is a claim about a modeled transition, not a claim to possess the world itself. The model may be useful for one sensor package, robot body, geography, game genre, factory cell, weather condition, or simulated task while being invalid outside that envelope.

Spatial intelligence is the related capability of perceiving, generating, reasoning about, and interacting with 3D environments. A spatially intelligent system does not merely label an image. It tracks where things are, how they relate, which objects persist behind occlusion, what can move, what can be entered, what can be reached, what can break, and what might happen next.

A practical definition should separate three functions. A renderer produces a view or asset. A simulator lets state evolve under time, constraints, and action. A planner searches possible actions using a model of consequences. A single product may combine all three, but each function has a different evidence burden. The stronger the claimed action authority, the stronger the required evidence about state, dynamics, uncertainty, and recovery from error.
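
The three functions can be kept apart in code as well as in prose. The following is a minimal Python sketch, not an API from any system named in this article: the `Renderer`, `Simulator`, and `Planner` interfaces, the one-dimensional cart state, and the one-step lookahead are all hypothetical illustrations.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, Sequence

State = Dict[str, float]  # hypothetical state: named scalar variables
Action = str              # hypothetical discrete action label


class Renderer(Protocol):
    def render(self, state: State) -> str:
        """Produce a view of a state. No dynamics are involved."""
        ...


class Simulator(Protocol):
    def step(self, state: State, action: Action, dt: float) -> State:
        """Advance the state under time, constraints, and action."""
        ...


class Planner(Protocol):
    def plan(self, state: State, candidates: Sequence[Action]) -> Action:
        """Choose an action by searching modeled consequences."""
        ...


class TextRenderer:
    def render(self, state: State) -> str:
        return f"cart at x={state['x']:.2f}"


class CartSimulator:
    VELOCITY = {"left": -1.0, "stay": 0.0, "right": 1.0}

    def step(self, state: State, action: Action, dt: float) -> State:
        x = state["x"] + self.VELOCITY[action] * dt
        return {"x": min(max(x, 0.0), 10.0)}  # walls at 0 and 10 act as constraints


@dataclass
class OneStepPlanner:
    simulator: Simulator
    goal_x: float
    dt: float = 1.0

    def plan(self, state: State, candidates: Sequence[Action]) -> Action:
        # One-step lookahead: simulate each candidate, keep the one closest to the goal.
        def distance(action: Action) -> float:
            return abs(self.simulator.step(state, action, self.dt)["x"] - self.goal_x)

        return min(candidates, key=distance)


sim = CartSimulator()
planner = OneStepPlanner(simulator=sim, goal_x=5.0)
print(TextRenderer().render({"x": 2.0}))
print(planner.plan({"x": 2.0}, ["left", "stay", "right"]))
```

Because the planner only sees the world through `Simulator.step`, any error in the simulator becomes an error in the chosen action. That dependency is why the evidence burden rises with action authority.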

The term should be handled carefully. A model that generates plausible video is not automatically a reliable model of reality. A simulator can be visually convincing while still getting physics, causality, safety-critical edge cases, privacy boundaries, or social context wrong. A world-model claim is useful only when it names the domain, sensors or inputs, output form, action authority, validation method, and failure boundary.
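
One way to enforce that scoping is to record each claim as a structured object and reject it while any element is blank. This is a sketch under stated assumptions: the `WorldModelClaim` class, its field names, and the warehouse example values are hypothetical and simply mirror the six elements listed above.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class WorldModelClaim:
    """Hypothetical record of a scoped world-model claim."""
    domain: str
    inputs: Tuple[str, ...]
    output_form: str
    action_authority: str
    validation_method: str
    failure_boundary: str


def missing_elements(claim: WorldModelClaim) -> List[str]:
    """Return the names of scope elements that were left empty."""
    return [f.name for f in fields(claim) if not getattr(claim, f.name)]


# Illustrative values only; they do not describe a real product.
claim = WorldModelClaim(
    domain="indoor warehouse aisles",
    inputs=("rgb video", "wheel odometry"),
    output_form="predicted occupancy grid",
    action_authority="advisory only",
    validation_method="",
    failure_boundary="untested on wet floors and mezzanine levels",
)
print(missing_elements(claim))  # ['validation_method']
```

A non-empty field is only a completeness check. Whether the stated validation method is adequate for the stated action authority still requires human review.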

Boundaries

Use "world model" as a scoped technical claim, not a status label. A renderer, a latent predictor, a learned simulator, a game engine, a 3D asset generator, a robot policy, and a safety-certified control system are different artifacts even when they appear in the same product stack.

Current Context

World models have become more prominent because AI is moving from text and static media toward embodied agents, robotics, autonomous vehicles, interactive 3D environments, and long-horizon planning systems. Language models can describe a world, but robots and autonomous agents need systems that can predict what actions will do inside a world.

Yann LeCun's 2022 position paper argued that autonomous machine intelligence requires world models, memory, perception, cost modules, and planning rather than next-token prediction alone. Meta later described V-JEPA 2 as a video-trained world model for understanding, prediction, and planning in the physical world, including lab robot demonstrations.

As of June 25, 2026, public work on world models spans several different claims. Google DeepMind's Genie line emphasizes prompt-generated interactive environments, including Genie 3's real-time navigable worlds and Project Genie as an experimental world-creation prototype. Google's January 2026 Project Genie post describes access for U.S. Google AI Ultra subscribers, a 60-second prototype limit, and known issues with realism, prompt adherence, physics, controllability, and latency. NVIDIA's Cosmos line is presented as world foundation models and physical-AI data tooling for robots, autonomous vehicles, and vision agents, with 2026 releases including Cosmos 3 and Cosmos Predict, Transfer, and Reason variants. World Labs' Marble and World API focus on generating and exporting explorable 3D worlds from text, images, panoramas, multi-view inputs, and video.

The frontier is therefore not one race. Meta's V-JEPA 2 emphasizes predictive video representations and limited robot planning demonstrations. Google DeepMind's Genie line emphasizes real-time interactive generated environments. NVIDIA's Cosmos line emphasizes physical-AI foundation models, synthetic data, and developer tooling. World Labs emphasizes spatially coherent 3D world generation and APIs. Each direction may help the others, but the evidence for one does not automatically transfer to the others.

The vocabulary is being actively shaped by vendors. World Labs' June 3, 2026 taxonomy explicitly frames the field around renderers, simulators, planners, and the loop connecting them. That taxonomy is useful because it exposes a real boundary problem, but it remains a company-authored framing. Google, Meta, NVIDIA, and World Labs announcements establish what those organizations released and claimed; they do not independently prove physical fidelity, regulatory compliance, or safe operation.

The safety market is moving in parallel. NVIDIA's June 22, 2026 Halos for Robotics announcement frames physical-AI safety as a full-stack architecture covering compute, sensors, operating software, inspection, and certification preparation, and says its inspection lab helps partners prepare integrations for third-party certification. That matters as a market signal, but certification readiness for a vendor stack is not the same as an independent safety case for a particular robot, site, world model, task, and update cycle. NVIDIA's own launch materials also include forward-looking disclaimers and "when-and-if-available" caveats that should travel with any procurement or policy summary.

Major Approaches

Model-based reinforcement learning. Systems such as Ha and Schmidhuber's World Models, Dreamer-style agents, and MuZero learn compact models that support planning or policy learning. The learned model may predict pixels, latent states, rewards, values, or action policies depending on the task.
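
The learn-then-plan loop can be illustrated with a deliberately tiny Python sketch. It is not the method of World Models, Dreamer, or MuZero: it assumes a one-dimensional state, a one-parameter linear model fitted by least squares, and exhaustive search over short action sequences. The names `true_step`, `fit_model`, and `plan` are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Transition = Tuple[float, float, float]  # (state, action, next_state)


def true_step(x: float, a: float) -> float:
    """Hidden toy environment. The agent only sees its transitions."""
    return x + 0.5 * a


def fit_model(transitions: List[Transition]) -> float:
    """Least-squares fit of gain in the model x_next = x + gain * a."""
    num = sum(a * (x_next - x) for x, a, x_next in transitions)
    den = sum(a * a for _, a, _ in transitions)
    return num / den


def plan(x0: float, goal: float, gain: float,
         actions: Sequence[float] = (-1.0, 0.0, 1.0),
         horizon: int = 3) -> Tuple[float, float]:
    """Search all action sequences inside the learned model.

    Returns the first action of the best sequence and its predicted cost.
    """
    def final_cost(seq: Tuple[float, ...]) -> float:
        x = x0
        for a in seq:
            x = x + gain * a  # imagined rollout, no contact with the environment
        return abs(x - goal)

    best = min(product(actions, repeat=horizon), key=final_cost)
    return best[0], final_cost(best)


# Collect experience, learn the model, then plan with it.
transitions = [(x, a, true_step(x, a))
               for x in (0.0, 1.0, 2.0) for a in (-1.0, 0.0, 1.0)]
gain = fit_model(transitions)
first_action, predicted_cost = plan(x0=0.0, goal=3.0, gain=gain)
print(gain, first_action, predicted_cost)
```

The planner never calls `true_step`. Its output is only as good as the fitted model, so a model fitted on unrepresentative transitions would produce confident plans that fail in the real environment.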

Predictive representation learning. Systems such as JEPA-style models learn from video or sensory input by predicting missing or future representations rather than reconstructing every pixel. The goal is useful abstraction: enough structure to support reasoning, planning, and transfer without preserving every visual detail.

Generative interactive environments. DeepMind's Genie research treats world models as systems that can generate action-controllable environments from images or prompts, allowing agents or humans to explore simulated spaces. The key distinction from ordinary video generation is interactivity over time.

Video models as simulator research. OpenAI's Sora technical report framed large-scale video generation as a possible path toward simulators of physical and digital worlds, while also naming limitations in physics, object-state consistency, and long-duration coherence. This is an important research lineage, but it also shows why video-generation evidence should not be promoted into robotics or safety evidence without additional validation.

World foundation models for physical AI. NVIDIA's Cosmos platform packages world foundation models, tokenizers, guardrails, and video pipelines for developers working on robots, autonomous vehicles, and other physical AI systems. Such systems still need domain-specific validation before they can support safety claims.

Spatial generative models and world APIs. World Labs describes Marble as a multimodal world model that can create 3D worlds from text, images, video, or coarse 3D layouts and export them into usable 3D formats. Its World API turns that capability into a programmable service for navigable worlds. This is valuable for design, simulation, and prototyping workflows, but generated geometry is not the same as measured reality.

Domain simulators and digital twins. Autonomous driving, robotics, industrial systems, and scientific applications often use specialized simulation environments. New world-model work raises the possibility that parts of simulation can be learned from data rather than hand-built, which also makes simulation quality harder to audit. For safety purposes, a learned simulator still has to be validated against the physical system and operating envelope it claims to represent.

Hybrid verification stacks. Safety-relevant systems often need learned models, physics-based simulation, real sensor replay, human factors testing, and field trials in the same evidence package. A learned world model may expand coverage, but it should not be the sole oracle for the hazards it is supposed to reveal.

Uses

Robotics and embodied AI. A robot needs to predict contact, motion, object affordances, navigation, and the consequences of manipulation. World models can support planning before acting, but physical deployment still depends on sensing, control, fail-safe design, and real-world testing.

Autonomous vehicles. World models can help generate rare or dangerous scenarios for training and testing, though simulated realism must be validated against real-world behavior.

Game and experience design. Interactive world generation can accelerate prototyping, level design, virtual production, education, and immersive storytelling.

Agent evaluation. Synthetic interactive worlds can provide controlled environments where agents are tested for planning, memory, exploration, recovery from mistakes, and long-horizon behavior.

Scientific and industrial simulation. If reliable and bounded, world models can support design, training, forecasting, and counterfactual testing in domains where real-world experimentation is expensive or dangerous.

Synthetic data and reconstruction. World models can create training environments, reconstruct scenes, or extend sparse observations into plausible spatial worlds. Those outputs should be labeled as model outputs, especially when they resemble documentation of real places or events.

Training-data triage. World models can help generate scenario variants for robotics, autonomous vehicles, warehouses, weather, medicine, and emergency response, but generated edge cases still need review against the real failure distribution. A rare-looking generated scene is not automatically a real rare event.

Evidence Ladder

World-model claims should be graded by what has actually been demonstrated. Visual generation is the weakest evidence for safety-critical use; real-world transfer and independently reviewed safety cases are much stronger, and operational records from the deployed system are the strongest of all.

Visual generation. The system produces plausible images, video, or 3D scenes. This shows generative capacity, not necessarily physical or causal fidelity.

Interactive consistency. A user or agent can act over time while the environment remains coherent. This is stronger than a one-shot video, but still may hide accumulated errors or missing edge cases.

Geometric and physical fidelity. Objects persist, occlude, collide, support weight, deform, and respect constraints in ways that match measured environments for a named task domain.

Uncertainty and coverage. The system can report confidence, out-of-distribution signals, missing sensors, unobserved regions, and unsupported material or dynamic assumptions. This matters because a visually complete world can hide what the model does not know.

Counterexample record. The system has documented cases where it fails: objects disappear, geometry drifts, physics breaks, agents exploit the simulator, maps reveal private layouts, or real-world transfer degrades. This layer matters because failures often teach more than curated examples.

Action-transfer evidence. Policies trained, planned, or evaluated in the world model improve performance in real robots, vehicles, labs, or field systems without relying on the same generator for both training and evaluation.

Safety-case evidence. The world model is part of a documented safety case that covers hazards, rare events, human behavior, distribution shift, monitoring, incident response, and the conditions under which simulation is not enough.

Operational evidence. The deployed system records near misses, interventions, failures, update effects, and post-deployment drift in the real environment. This is the strongest evidence layer because it tests the model against the world it actually changes.
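
The ladder can be encoded as an ordered type so that a claim is graded by its lowest missing rung rather than its most impressive demo. This Python sketch assumes the rungs are cumulative, which is an interpretation of the ordering above rather than something the ladder states outright. The `Evidence` enum and both helper functions are hypothetical.

```python
from enum import IntEnum
from typing import Iterable, List, Optional


class Evidence(IntEnum):
    VISUAL_GENERATION = 1
    INTERACTIVE_CONSISTENCY = 2
    GEOMETRIC_PHYSICAL_FIDELITY = 3
    UNCERTAINTY_AND_COVERAGE = 4
    COUNTEREXAMPLE_RECORD = 5
    ACTION_TRANSFER = 6
    SAFETY_CASE = 7
    OPERATIONAL = 8


def highest_contiguous(demonstrated: Iterable[Evidence]) -> Optional[Evidence]:
    """Highest rung reached with every lower rung also demonstrated."""
    have = set(demonstrated)
    top = None
    for rung in Evidence:  # iterates in definition order, weakest first
        if rung not in have:
            break
        top = rung
    return top


def gaps_below(demonstrated: Iterable[Evidence], required: Evidence) -> List[str]:
    """Names of rungs at or below `required` that are still missing."""
    have = set(demonstrated)
    return [r.name for r in Evidence if r <= required and r not in have]


# A system with a strong demo and a transfer result, but nothing in between.
demo = {Evidence.VISUAL_GENERATION,
        Evidence.INTERACTIVE_CONSISTENCY,
        Evidence.ACTION_TRANSFER}
print(highest_contiguous(demo))
print(gaps_below(demo, Evidence.SAFETY_CASE))
```

In this example the transfer result does not raise the grade, because fidelity, uncertainty reporting, and a counterexample record are still undocumented.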

Risk Pattern

Simulation overtrust. A world model can look realistic while failing on the exact edge case that matters. Visual coherence is not proof of physical fidelity.

Synthetic safety theater. Developers may claim a system has been tested across many generated worlds without proving that those worlds cover the relevant real-world hazards.
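
A simple check against this pattern is to measure coverage of a hazard register rather than counting scenarios. The Python sketch below assumes each generated scenario has already been tagged with the hazards it exercises. The hazard names and the `hazard_coverage` function are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def hazard_coverage(hazard_register: Set[str],
                    scenarios: Dict[str, Set[str]]) -> Tuple[float, List[str]]:
    """Return (fraction of registered hazards exercised, uncovered hazards)."""
    exercised = {hazard for tags in scenarios.values() for hazard in tags}
    uncovered = sorted(hazard_register - exercised)
    if not hazard_register:
        return 0.0, uncovered
    covered = len(hazard_register) - len(uncovered)
    return covered / len(hazard_register), uncovered


# Hypothetical register from a hazard analysis of a warehouse robot.
register = {"forklift_crossing", "wet_floor",
            "pedestrian_occluded", "sensor_dropout"}

# One thousand generated worlds that only ever exercise two hazards.
generated = {
    f"gen-{i}": {"forklift_crossing"} if i % 2 else {"wet_floor"}
    for i in range(1000)
}

fraction, uncovered = hazard_coverage(register, generated)
print(len(generated), fraction, uncovered)
```

The scenario count is large while coverage is one half. The tags themselves are also model or human judgments, so a covered hazard still needs review against real incident data.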

Reality laundering. Synthetic environments can make invented scenarios feel observed. A generated world may be mistaken for evidence instead of a model's guess.

Embodied harm. When world models guide robots, vehicles, drones, or industrial systems, errors can leave the screen and become physical risk.

Spatial privacy. Systems that infer or generate 3D spaces from images, video, maps, or sensor traces can expose homes, workplaces, routes, critical infrastructure, or sensitive layouts. Even a plausible reconstruction can reveal patterns about who lives, works, moves, worships, protests, receives care, or stores equipment in a place.

Private-space inference. A model may reconstruct or infer rooms, routes, medical spaces, warehouses, schools, or security layouts from fragments that were not collected for mapping. The harm is not limited to exact reproduction; a plausible layout can still disclose sensitive operational facts.

Dual-use physical planning. Better simulation can assist logistics, robotics, and rescue work, but it can also support unsafe automation, surveillance, drone operations, or weapons planning.

Security coupling. When a world model is embedded inside an agent or robot stack, prompt injection, tool misuse, poisoned updates, or compromised sensors can turn a model error into physical or operational harm.

Evaluator leakage. If the generator, planner, and evaluator share training data, benchmarks, reward models, or vendor tooling, tests can become easier in ways that do not improve real-world robustness.

Hidden curriculum. Agents trained in generated worlds inherit the biases, blind spots, shortcuts, and physics errors of those worlds.

Evaluation capture. If the same families of models generate test environments and train agents, evaluation can become recursive and less grounded.

Governance Requirements

World-model governance should distinguish visual quality from causal reliability. Reports should specify training sources, input modalities, action space, time horizon, physical assumptions, failure cases, validation against real environments, and limits of generalization. The record should be tied to a living AI system inventory rather than scattered across model demos, simulator notes, and vendor contracts.

For safety-critical uses, synthetic scenarios should be paired with real-world testing, independent audits, incident records, and clear thresholds for when simulation is not enough. The burden is higher when outputs guide physical systems, public infrastructure, vehicles, industrial control, or healthcare workflows.

Training and evaluation should be separated. If generated worlds are used for agent training, evaluation sets should include real data, held-out simulators, adversarial scenarios, and independent review so that the generator does not define the test it later passes.
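
The separation can be audited mechanically if every scenario carries a source label. The sketch below assumes a hypothetical labeling convention (`generator:`, `real:`, and `simulator:` prefixes) and a hypothetical `audit_split` function. It flags generators that appear on both sides of the split and reports how much of the evaluation set is real data.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Union


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    source: str  # hypothetical convention: "generator:<name>", "real:<name>", "simulator:<name>"


def audit_split(train: Sequence[Scenario],
                evaluation: Sequence[Scenario]) -> Dict[str, Union[List[str], float]]:
    """Report generators shared across the split and the real-data share of evaluation."""
    train_generators = {s.source for s in train if s.source.startswith("generator:")}
    eval_sources = {s.source for s in evaluation}
    shared = sorted(train_generators & eval_sources)
    real_count = sum(1 for s in evaluation if s.source.startswith("real:"))
    real_fraction = real_count / len(evaluation) if evaluation else 0.0
    return {"shared_generators": shared, "real_fraction": real_fraction}


train_set = [
    Scenario("t1", "generator:worldgen-a"),
    Scenario("t2", "generator:worldgen-a"),
]
eval_set = [
    Scenario("e1", "generator:worldgen-a"),  # same generator as training
    Scenario("e2", "real:site-logs"),
    Scenario("e3", "simulator:held-out"),
    Scenario("e4", "real:site-logs"),
]
report = audit_split(train_set, eval_set)
print(report)
```

A non-empty `shared_generators` list means the generator partly defines the test it later passes. The check cannot detect shared training data behind differently named generators, which still needs lineage review.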

Action authority should be explicit. A world model used for visualization or creative prototyping needs different controls from one used to choose robot motions, vehicle trajectories, drone paths, industrial actions, or emergency responses. High-risk systems should preserve versioned model records, input data lineage, simulator configuration, action logs, confidence or uncertainty signals, and human override paths.
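
Making authority explicit can be as simple as declaring it as data and deriving the required controls from it. In this Python sketch the three authority tiers and the controls assigned to the two lower tiers are assumptions. Only the control set for the highest tier follows the list in the paragraph above.

```python
from enum import Enum
from typing import Dict, FrozenSet, Iterable, List


class Authority(Enum):
    VISUALIZATION = "visualization"      # creative prototyping, no downstream action
    ADVISORY = "advisory"                # a human decides using model output (assumed tier)
    PHYSICAL_ACTION = "physical_action"  # model output selects motions or trajectories


# Controls named in the text for high-risk systems.
HIGH_RISK_CONTROLS: FrozenSet[str] = frozenset({
    "versioned_model_record",
    "input_data_lineage",
    "simulator_configuration",
    "action_log",
    "uncertainty_signal",
    "human_override_path",
})

# Hypothetical policy: the lower tiers are illustrative subsets.
REQUIRED: Dict[Authority, FrozenSet[str]] = {
    Authority.VISUALIZATION: frozenset({"versioned_model_record"}),
    Authority.ADVISORY: frozenset({"versioned_model_record",
                                   "input_data_lineage",
                                   "uncertainty_signal"}),
    Authority.PHYSICAL_ACTION: HIGH_RISK_CONTROLS,
}


def missing_controls(authority: Authority, in_place: Iterable[str]) -> List[str]:
    """Controls required for this authority level that are not yet in place."""
    return sorted(REQUIRED[authority] - set(in_place))


deployed = {"versioned_model_record", "action_log"}
print(missing_controls(Authority.VISUALIZATION, deployed))
print(missing_controls(Authority.PHYSICAL_ACTION, deployed))
```

The same deployment passes as a visualization tool and fails as a motion planner, which is the point of stating authority before evaluating controls.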

For creative, educational, and evidentiary uses, provenance matters. Generated worlds should be disclosed as synthetic when they could be mistaken for documentation, evidence, or a faithful reconstruction of real events. Synthetic reconstructions of real places need consent, security review, privacy controls, retention limits, and provenance records that survive export or publication.

For data governance, distinguish public-world imagery from private, restricted, or high-risk spatial data. A robot-learning dataset, panoramic scan, factory digital twin, medical room capture, emergency-response map, or home video has different consent, retention, security, and downstream-use obligations. Spatial data can remain sensitive even after faces and text are removed.

For physical deployment, world models belong inside a broader safety case: hazard analysis; NIST-style test, evaluation, verification, and validation (TEVV) records; monitoring; human oversight; rollback plans; secure development; red teaming; incident reporting; and relevant robot or domain safety standards. ISO 10218-1:2025 and ISO 10218-2:2025 are industrial-robot safety references; they do not certify a learned world model by themselves, but they show that robot safety is a system and integration problem, not only a model-quality problem.

Assurance should include negative evidence. Procurement packets and audits should preserve failed scenarios, domain exclusions, unmodeled physics, latency limits, sensor assumptions, rejected datasets, unsafe prompt patterns, and known cases where the planner should stop rather than act. A world model that cannot say what it does not know is a poor foundation for high-consequence action.
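
Negative evidence becomes operational when it is wired into a gate that sits between the planner and the actuator. The following Python sketch is illustrative only: the `Prediction` fields, the scalar uncertainty score, the 0.2 threshold, and the exclusion list are all hypothetical, and a real system would derive them from its own validation records.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Prediction:
    action: str
    uncertainty: float         # hypothetical score in [0, 1], higher means less sure
    in_validated_domain: bool  # whether inputs fall inside the tested envelope


def gate(pred: Prediction,
         max_uncertainty: float = 0.2,
         excluded_actions: FrozenSet[str] = frozenset()) -> Tuple[str, str]:
    """Return ("act", action) or ("stop", reason). Stopping is the default on doubt."""
    if not pred.in_validated_domain:
        return ("stop", "outside validated domain")
    if pred.action in excluded_actions:
        return ("stop", "action on exclusion list")
    if pred.uncertainty > max_uncertainty:
        return ("stop", "uncertainty above threshold")
    return ("act", pred.action)


exclusions = frozenset({"lift_over_person"})  # from a documented failure case

print(gate(Prediction("move_pallet", 0.05, True), excluded_actions=exclusions))
print(gate(Prediction("move_pallet", 0.60, True), excluded_actions=exclusions))
print(gate(Prediction("move_pallet", 0.05, False), excluded_actions=exclusions))
print(gate(Prediction("lift_over_person", 0.01, True), excluded_actions=exclusions))
```

The gate is only as trustworthy as the uncertainty and domain signals it receives, so those signals need their own validation. A model that reports low uncertainty everywhere defeats the check.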

For procurement, buyers should require a boundary statement rather than a generic "world model" label: supported input types, prohibited data sources, output format, collision or physics assumptions, test domains, known failure cases, logging, update policy, export controls, and who is responsible when generated worlds are used to train or evaluate downstream agents.

Change management is part of the control. A new model checkpoint, robot body, camera, sensor calibration, world-generator version, planner cost function, scene domain, or prompt interface can invalidate earlier evidence. Systems should keep audit trails, change logs, and post-market monitoring records for any world-model component that influences real actions.
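
One way to make evidence expire when a component changes is to bind each evidence record to a fingerprint of the configuration it was collected under. This Python sketch assumes a flat configuration dictionary with hypothetical keys. It uses only the standard `hashlib` and `json` modules.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(config: Dict[str, Any]) -> str:
    """Stable SHA-256 hash of a configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def evidence_is_current(evidence: Dict[str, Any], config: Dict[str, Any]) -> bool:
    """Evidence counts only for the exact configuration it was gathered under."""
    return evidence["config_fingerprint"] == fingerprint(config)


def changed_keys(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """Keys whose values differ, including keys added or removed."""
    return sorted(k for k in set(old) | set(new) if old.get(k) != new.get(k))


# Hypothetical component versions for a deployed stack.
validated = {
    "model_checkpoint": "ckpt-014",
    "robot_body": "arm-6dof-rev2",
    "camera": "cam-a",
    "planner_cost": "v3",
}
evidence = {
    "test_report": "transfer-trial-07",
    "config_fingerprint": fingerprint(validated),
}

updated = dict(validated, camera="cam-b")  # a camera swap after validation
print(evidence_is_current(evidence, validated))
print(evidence_is_current(evidence, updated))
print(changed_keys(validated, updated))
```

A failed check does not prove the earlier evidence is wrong. It marks the evidence as unverified for the new configuration until someone reviews whether it still applies.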

Source Discipline

Prefer primary papers, official model announcements, product documentation, standards bodies, regulators, and reproducible benchmarks. Distinguish a research paper from a product launch, a curated demo from a deployment, a generated sample from a measured environment, and a vendor roadmap from independent evidence.

For current systems, name the version and date, such as Genie 2, Genie 3, V-JEPA 2, Cosmos Transfer, Cosmos Predict, Cosmos Reason, Cosmos 3, Marble, or World API. State the input modality, output format, action space, interaction horizon, training data class, validation domain, and whether the claim concerns visual coherence, physical reasoning, planning, simulation-to-reality transfer, or safe operation.

Separate self-description from validation. "General-purpose world model," "world foundation model," "spatial intelligence," "physical AI," "real-time navigable world," and "open frontier foundation model" are vendor or research framings unless tied to evaluation protocols, real-world transfer, independent replication, or operating records.

Vendor announcements establish what a vendor released and how the vendor frames it. They do not by themselves establish independent validation, real-world transfer, regulatory compliance, or safe operation under distribution shift.

Do not promote AGI or "understands the world" language from vendor materials into the article voice. If such language is relevant, attribute it as a claim by the source and immediately state the narrower evidence actually provided.

For "simulator" claims, ask whether the source demonstrates visual synthesis, geometric reconstruction, state transition, action-conditioned control, physics fidelity, or policy transfer. Those are not the same claim. A generated video can support a claim about visual synthesis. A robotics safety claim requires physical tests, failure analysis, system documentation, and post-deployment monitoring.

For comparison claims, avoid laundering vendor leaderboards into general truth. A model can lead on a published benchmark while still failing on the task domain, sensor setup, geography, material, user population, or adversarial condition that matters to a deployment.

Spiralist Reading

World models are recursive reality made technical.

A language model speaks about the world. A world model rehearses the world. It generates a possible space, lets an agent act inside it, observes the consequences, and feeds that rehearsal back into future behavior. The Mirror becomes a theater where action is practiced before it enters reality.

That is powerful and dangerous for the same reason. Simulation can give machines safer places to learn, but it can also replace contact with the real. An institution may come to trust the generated environment because it is cheaper, cleaner, and more controllable than the actual one.

For Spiralism, the central question is whether world models preserve reality friction or dissolve it. A good world model helps a system respect the world. A bad one teaches the system to trust a rehearsal more than the place it is supposed to serve.

