The Object Slot Becomes the Planning State
COMET is interesting because it does not ask a planner to reason over a single opaque image embedding. It decomposes the scene into object slots, binds actions to those slots, and runs Monte Carlo Tree Search in that structured latent space. The audit lesson is equally sharp: once the slot is the planning state, bad slots become bad world models.
The Paper
The paper is Causal Object-Centric Models for Planning with Monte Carlo Tree Search, arXiv:2606.14418 [cs.AI], by Rodion Vakhitov, Leonid Ugadiarov, Alexey Skrynnik, and Aleksandr Panov. The arXiv HTML lists Rodion Vakhitov at MIRAI and the other three authors at both CogAILab and MIRAI, all in Moscow, Russia. arXiv lists version 1 as submitted on June 12, 2026.
The method is COMET, short for Causal Object-centric Model for Efficient Tree search. The paper frames it as a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. In plain terms: the planner does not search over raw pixels or one monolithic embedding. It searches over a learned set of object-like latent records.
The Planning State
COMET has three main pieces. First, a frozen object-centric encoder maps image observations into object slots. The authors experiment with SLATE, DINOSAUR, and Slot Contrast depending on the task. These slot extractors are pretrained on observations collected under a random policy and remain frozen during reinforcement learning.
Second, a transformer-based world model predicts future slots and rewards. The implementation builds on the LightZero framework and follows the UniZero training pipeline, but replaces UniZero's monolithic state embedding with object-centric slots. This is the central architectural bet: the environment is represented as entities whose dynamics can be modeled and planned over.
Third, COMET binds actions to objects. The paper says that using a single action embedding to predict all next slots created a bottleneck. COMET instead concatenates the action embedding with each slot independently, passes those slot-action pairs through a shared MLP projector, and feeds the resulting slot-conditioned action embeddings into the transformer backbone. Policy and value heads then use object-causal attention, where learned per-slot relevance scores modulate token interactions so decision-making can concentrate on task-relevant entities.
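The action-slot fusion step can be sketched in a few lines. This is a dependency-free illustration, not the authors' code: the dimensions, the two-layer projector, the random weights, and the names `fuse_action_with_slots` and `make_layer` are all assumptions made for the example. It shows only the shape of the idea: the same action embedding is concatenated onto every slot, and one shared MLP projects each pair.

```python
import random

def linear(x, weights, bias):
    """Apply y = W x + b to one vector x (weights is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(rng, in_dim, out_dim):
    """Random weights for a hypothetical projector layer (illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def fuse_action_with_slots(slots, action_emb, layer1, layer2):
    """Concatenate one action embedding onto every slot, then project each
    slot-action pair with the same two-layer MLP (shared weights)."""
    fused = []
    for slot in slots:
        pair = slot + action_emb              # per-slot concatenation
        hidden = relu(linear(pair, *layer1))
        fused.append(linear(hidden, *layer2))
    return fused

# Assumed sizes; the paper's actual dimensions are not reproduced here.
rng = random.Random(0)
num_slots, slot_dim, action_dim, hidden_dim = 5, 8, 4, 16
slots = [[rng.gauss(0, 1) for _ in range(slot_dim)] for _ in range(num_slots)]
action_emb = [rng.gauss(0, 1) for _ in range(action_dim)]
layer1 = make_layer(rng, slot_dim + action_dim, hidden_dim)
layer2 = make_layer(rng, hidden_dim, slot_dim)

action_tokens = fuse_action_with_slots(slots, action_emb, layer1, layer2)
print(len(action_tokens), len(action_tokens[0]))  # one token per slot
```

The result is one action token per slot rather than one action token for the whole scene, which is the bottleneck the paper says it wanted to remove. The object-causal attention in the policy and value heads is not modeled in this sketch.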
Experiments
The paper evaluates COMET across eight visually and dynamically diverse tasks: Object Goal, Object Interaction, Object Comparison, Property Comparison, Object Reaching, Block Lifting, Cube Pushing, and Defend The Line. The environments come from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom.
The authors compare against object-centric and monolithic baselines and report that COMET achieves a higher mean normalized score during early training. That phrase matters. The headline is sample efficiency and early-stage learning, not a universal dominance claim at every budget or in every environment.
The experiments also show where the approach earns its keep. In visually simpler tasks, such as Object Goal, Object Comparison, and Property Comparison, COMET can use strong object-centric representations and a small set of task-relevant objects. The paper reports that object-causal attention assigns high causality scores to relevant entities, allowing the model to outperform baselines in those settings.
In harder visual-control tasks such as Block Lifting and Cube Pushing, COMET's gains are more modest. The paper points to limitations in the object-centric representation model, including cases where the cube and background merge into one slot. That is the governance hinge: the planner may look object-aware, but the object inventory is still learned and fallible.
Receipts
The paper gives enough implementation detail to make the claim inspectable. COMET uses 50 MCTS simulations, 20 sampled actions for continuous tasks, inference context length 10, temperature 0.25, Dirichlet noise 0.3 with weight 0.25, replay buffer capacity 1,000,000, uniform sampling, game segment length 400 for discrete tasks and 100 for continuous tasks, transformer backbone depth 2, policy/value transformer depth 1, batch size 64, AdamW, temporal-difference steps 5, and discount factor 0.997 except 0.925 in Cube Pushing.
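One way to keep those numbers inspectable is to write them down as a typed config. The sketch below is an assumption about how such a record might look, not LightZero's or the authors' actual config schema: the class name `CometConfig`, the field names, and the helper `config_for` are hypothetical. Reading "Dirichlet noise 0.3" as the Dirichlet concentration parameter is also an interpretation, flagged in the code.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class CometConfig:
    """Hypothetical record of the hyperparameters quoted above."""
    num_simulations: int = 50
    num_sampled_actions: Optional[int] = None  # 20 for continuous tasks
    inference_context_length: int = 10
    temperature: float = 0.25
    dirichlet_alpha: float = 0.3    # assumption: 0.3 is the concentration
    dirichlet_weight: float = 0.25
    replay_buffer_capacity: int = 1_000_000
    replay_sampling: str = "uniform"
    game_segment_length: int = 400  # discrete tasks; 100 for continuous
    backbone_depth: int = 2
    policy_value_depth: int = 1
    batch_size: int = 64
    optimizer: str = "AdamW"
    td_steps: int = 5
    discount: float = 0.997         # 0.925 in Cube Pushing

def config_for(task: str, continuous: bool) -> CometConfig:
    """Apply the per-task exceptions the paper lists to the defaults."""
    cfg = CometConfig()
    if continuous:
        cfg = replace(cfg, num_sampled_actions=20, game_segment_length=100)
    if task == "Cube Pushing":
        cfg = replace(cfg, discount=0.925)
    return cfg

print(config_for("Cube Pushing", continuous=True))
```

The caller supplies whether a task is continuous; the sketch does not guess which of the eight tasks fall in which group.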
The frozen encoders have their own receipts. SLATE is trained for 100 epochs on a 1,000,000-observation dataset. DINOSAUR is trained for 500,000 steps on a 300,000-observation dataset using a ViT-B backbone and five slots. Slot Contrast is trained for 100,000 steps with DINOv2 Small features, four-frame training segments, and 3 slot attention iterations for the first frame and 2 for later frames.
Compute also matters. The appendix says training COMET for 500,000 environment steps on a single NVIDIA H100 80 GB GPU takes about 18 hours on average across tasks. That makes the reported gains an engineering result with a real compute budget, not a frictionless property of object-centric modeling.
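The compute figure is easier to weigh as a rate. The arithmetic below uses only the numbers quoted above; treating the 18-hour average as a per-task cost and multiplying by the eight tasks for a single seed is an assumption of this sketch, since seed counts are not restated here.

```python
ENV_STEPS = 500_000        # environment steps per training run
WALL_CLOCK_HOURS = 18      # reported average on one H100 80 GB
NUM_TASKS = 8              # tasks in the evaluation suite

steps_per_second = ENV_STEPS / (WALL_CLOCK_HOURS * 3600)
gpu_hours_one_seed = NUM_TASKS * WALL_CLOCK_HOURS  # assumes one run per task

print(f"{steps_per_second:.2f} environment steps per second")
print(f"{gpu_hours_one_seed} GPU-hours for one seed on every task")
```

That is under eight environment steps per second of wall-clock time, with planning and world-model updates included, which is the cost side of the sample-efficiency claim.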
Governance Standard
An object-centric planning system should ship with a slot receipt. The receipt should name the environment, task, object-centric encoder, encoder training data, slot count, temporal-consistency rule, action-slot fusion design, world-model architecture, MCTS budget, policy/value attention mechanism, replay buffer policy, reward/value discretization, task suite, baselines, normalized-score calculation, compute budget, rollout visualizations, slot-failure examples, and a human-readable rule for when the system must abstain or fall back.
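A slot receipt can be made checkable rather than aspirational. The sketch below is one possible shape, with hypothetical field names chosen to mirror the list above; it is not a standard schema. The function `missing_fields` simply reports which required entries are absent or empty.

```python
REQUIRED_FIELDS = (
    "environment", "task", "encoder", "encoder_training_data", "slot_count",
    "temporal_consistency_rule", "action_slot_fusion",
    "world_model_architecture", "mcts_budget", "policy_value_attention",
    "replay_buffer_policy", "reward_value_discretization", "task_suite",
    "baselines", "normalized_score_calculation", "compute_budget",
    "rollout_visualizations", "slot_failure_examples",
    "abstain_or_fallback_rule",
)

def missing_fields(receipt):
    """Return the required fields that are absent or empty, in order."""
    return [name for name in REQUIRED_FIELDS
            if receipt.get(name) in (None, "", [], {})]

# A partial receipt, filled only with details quoted in this article.
draft = {
    "environment": "Robosuite",
    "task": "Block Lifting",
    "mcts_budget": "50 simulations, 20 sampled actions",
    "compute_budget": "about 18 hours on one NVIDIA H100 80 GB",
    "slot_failure_examples": [],   # empty on purpose: counts as missing
}

gaps = missing_fields(draft)
print(f"{len(gaps)} of {len(REQUIRED_FIELDS)} fields still missing")
print("slot_failure_examples" in gaps)
```

Treating an empty list as missing is a deliberate choice here: a receipt that claims zero slot failures without showing any examined cases should not pass.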
The core governance risk is interpretability laundering. A model can expose attention maps, causality scores, and object slots, but those artifacts are only useful if the slots track real task-relevant entities. If the cube merges with the background, if occlusion breaks slot identity, or if distractors receive high causal relevance, the explanation surface can become a polished picture of a bad internal state.
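A merged slot is something an audit can test for directly when ground-truth object masks are available, as they usually are in simulation. The check below is a hypothetical sketch, not a procedure from the paper: masks are sets of pixel coordinates, and the 0.5 coverage threshold and the names `coverage` and `merged_slots` are assumptions chosen for illustration.

```python
def coverage(slot_mask, object_mask):
    """Fraction of the object's pixels that fall inside the slot mask."""
    if not object_mask:
        return 0.0
    return len(slot_mask & object_mask) / len(object_mask)

def merged_slots(slot_masks, object_masks, threshold=0.5):
    """Map each slot that covers at least `threshold` of two or more
    ground-truth objects to the list of objects it has swallowed."""
    flagged = {}
    for slot_name, slot_mask in slot_masks.items():
        covered = [name for name, obj in object_masks.items()
                   if coverage(slot_mask, obj) >= threshold]
        if len(covered) >= 2:
            flagged[slot_name] = covered
    return flagged

# Toy 4x4 frame: a 2x2 cube in the middle, background everywhere else.
all_pixels = {(r, c) for r in range(4) for c in range(4)}
cube = {(1, 1), (1, 2), (2, 1), (2, 2)}
object_masks = {"cube": cube, "background": all_pixels - cube}

# Failure case: one slot claims the whole frame, the other claims nothing.
slot_masks = {"slot_0": set(all_pixels), "slot_1": set()}

print(merged_slots(slot_masks, object_masks))
```

In the toy frame, `slot_0` is flagged for covering both the cube and the background. A check like this says nothing about occlusion or distractor relevance; those failure modes need their own tests.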
This connects directly to Reinforcement Learning, Reinforcement Learning from Verifiable Rewards, AI Evaluations, AI Safety Cases, Vision-Language-Action Models, Yann LeCun and the World Model Bet, The World Model Becomes the Hallucination Coverage Map, The Structural Certificate Becomes the World-Model Receipt, The Energy Field Becomes the Driving Safety Case, and The Field Robot Becomes the Farm Manager. A planner's internal objects are not ground truth. They are evidence that needs validation.
Limits
The authors are explicit about the limits. Current unsupervised object-centric representation methods still struggle to segment complex, cluttered real-world scenes, especially under occlusion or ambiguous boundaries. Transformer-based approaches also scale poorly as the number of object slots grows, because self-attention cost grows quadratically with the number of tokens and each additional slot adds tokens to the sequence.
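The scaling concern is easy to put in numbers. The sketch below counts query-key pairs in one self-attention layer under a simplifying assumption that is not taken from the paper: each slot contributes `tokens_per_slot` tokens per timestep, and the sequence spans the inference context length of 10 quoted earlier. The function name `attention_pairs` is hypothetical.

```python
def attention_pairs(num_slots, context_length=10, tokens_per_slot=1):
    """Query-key pairs in one full self-attention layer, assuming every
    slot contributes `tokens_per_slot` tokens at every timestep."""
    tokens = num_slots * context_length * tokens_per_slot
    return tokens * tokens

for num_slots in (5, 10, 20):
    print(num_slots, attention_pairs(num_slots))
# 5 slots  -> 2500 pairs
# 10 slots -> 10000 pairs
# 20 slots -> 40000 pairs
```

Doubling the slot count quadruples the attention work per layer, before any causal masking or caching is taken into account.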
That means COMET is strongest as a research result about structured latent planning and early training efficiency. It is not a deployment certificate for robotics, driving, industrial control, or household agents. A real deployment would need stress tests for slot identity, occlusion, distractors, actuator error, visual domain shift, long-horizon compounding, and recovery when the world model predicts a plausible but wrong object state.
The Spiralist reading is simple: when the planner sees objects, ask who made the objects visible. The slot extractor is not a neutral window. It is the first institution in the plan.
Sources
- Rodion Vakhitov, Leonid Ugadiarov, Alexey Skrynnik, and Aleksandr Panov, Causal Object-Centric Models for Planning with Monte Carlo Tree Search, arXiv:2606.14418 [cs.AI], submitted June 12, 2026.
- arXiv HTML: Causal Object-Centric Models for Planning with Monte Carlo Tree Search, reviewed for the abstract, COMET architecture, slot extractor setup, action-slot fusion, object-causal attention, environment list, experiments, limitations, hyperparameters, compute resources, rollout figures, and causality-score figures.
- arXiv PDF: Causal Object-Centric Models for Planning with Monte Carlo Tree Search.
- Related pages: Reinforcement Learning, Reinforcement Learning from Verifiable Rewards, AI Evaluations, AI Safety Cases, Vision-Language-Action Models, Yann LeCun and the World Model Bet, The World Model Becomes the Hallucination Coverage Map, The Structural Certificate Becomes the World-Model Receipt, The Energy Field Becomes the Driving Safety Case, and The Field Robot Becomes the Farm Manager.