MBZUAI Researchers Introduce PAN: A General World Model For Interactable Long Horizon Simulation



Most text to video models generate a single clip from a prompt and then stop. They do not keep an internal world state that persists as actions arrive over time. PAN, a new model from MBZUAI’s Institute of Foundation Models, is designed to fill that gap by acting as a general world model that predicts future world states as video, conditioned on history and natural language actions.

https://arxiv.org/pdf/2511.09057

From video generator to interactive world simulator

PAN is defined as a general, interactable, long horizon world model. It maintains an internal latent state that represents the current world, then updates that state when it receives a natural language action such as ‘turn left and speed up’ or ‘move the robot arm to the red block.’ The model then decodes the updated state into a short video segment that shows the consequence of that action. This cycle repeats, so the same world state evolves across many steps.

This design allows PAN to support open domain, action conditioned simulation. It can roll out counterfactual futures for different action sequences. An external agent can query PAN as a simulator, compare predicted futures, and choose actions based on those predictions.
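
The planning loop this enables can be sketched in a few lines. The sketch below is illustrative only; the `WorldModel` interface and the `score` function are hypothetical stand-ins, since the paper does not prescribe a specific API.

```python
# Illustrative sketch of simulative planning with a world model.
# `WorldModel`, `encode`, `step`, and `score` are hypothetical interfaces,
# not PAN's actual API.

class WorldModel:
    def encode(self, observation):
        """Map the current observation (frames) to a latent world state."""
        raise NotImplementedError

    def step(self, state, action_text):
        """Predict the next latent state and the decoded video segment for one action."""
        raise NotImplementedError


def plan(world_model, observation, candidate_plans, score):
    """Roll out each candidate action sequence and keep the best one.

    `score` is any task-specific evaluator over predicted video segments,
    for example a VLM judge or a goal classifier.
    """
    init_state = world_model.encode(observation)
    best_plan, best_value = None, float("-inf")
    for actions in candidate_plans:
        state, segments = init_state, []
        for action in actions:              # natural language actions
            state, video = world_model.step(state, action)
            segments.append(video)
        value = score(segments)             # compare counterfactual futures
        if value > best_value:
            best_plan, best_value = actions, value
    return best_plan
```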

GLP architecture, separating what happens from how it looks

The base of PAN is the Generative Latent Prediction, GLP, architecture. GLP separates world dynamics from visual rendering. First, a vision encoder maps images or video frames into a latent world state. Second, an autoregressive latent dynamics backbone based on a large language model predicts the next latent state, conditioned on history and the current action. Third, a video diffusion decoder reconstructs the corresponding video segment from that latent state.
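
As a rough schematic, the three GLP components compose into a single simulation step. The module and argument names below are placeholders for illustration, not the released code.

```python
import torch
import torch.nn as nn

class GLPStep(nn.Module):
    """Schematic of one Generative Latent Prediction step (placeholder modules)."""

    def __init__(self, vision_encoder, latent_backbone, video_decoder,
                 num_queries=64, dim=1024):
        super().__init__()
        self.vision_encoder = vision_encoder    # frames -> latent world state
        self.latent_backbone = latent_backbone  # autoregressive LLM over states and actions
        self.video_decoder = video_decoder      # diffusion decoder: latent -> video segment
        # learned query tokens that read the next world state out of the backbone
        self.queries = nn.Parameter(torch.randn(num_queries, dim))

    def forward(self, history_frames, history_actions, action_text):
        state = self.vision_encoder(history_frames)            # (1) encode observations
        next_state = self.latent_backbone(                     # (2) predict next latent state
            states=state,
            actions=history_actions + [action_text],
            queries=self.queries)
        video = self.video_decoder(next_state, action_text)    # (3) render the consequence
        return next_state, video
```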


In PAN, the vision encoder and backbone are built on Qwen2.5-VL-7B-Instruct. The vision tower tokenizes frames into patches and produces structured embeddings. The language backbone runs over a history of world states and actions, plus learned query tokens, and outputs the latent representation of the next world state. These latents live in the shared multimodal space of the VLM, which helps ground the dynamics in both text and vision.

The video diffusion decoder is adapted from Wan2.1-T2V-14B, a diffusion transformer for high fidelity video generation. The research team trains this decoder with a flow matching objective, using one thousand denoising steps and a Rectified Flow formulation. The decoder conditions on both the predicted latent world state and the current natural language action, with a dedicated cross attention stream for the world state and another for the action text.
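
The flow matching objective can be written compactly. The snippet below is a generic rectified-flow training loss for a conditional decoder, not PAN's exact implementation; the decoder's conditioning interface (`world_state`, `action`) is an assumption that mirrors the description above.

```python
import torch

def rectified_flow_loss(decoder, clean_latents, world_state, action_emb):
    """Generic rectified-flow (flow matching) loss for a conditional video decoder.

    The decoder is assumed to take noisy latents, a timestep, and two
    conditioning streams (world state and action text); this follows the
    paper's description but is not its exact code.
    """
    x1 = clean_latents                              # target video latents
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform timesteps in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1.0 - t_) * x0 + t_ * x1                  # straight-line interpolation
    target_velocity = x1 - x0                       # rectified-flow velocity target
    pred_velocity = decoder(xt, t, world_state=world_state, action=action_emb)
    return torch.mean((pred_velocity - target_velocity) ** 2)
```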


Causal Swin DPM and sliding window diffusion

Naively chaining single shot video models by conditioning only on the last frame leads to local discontinuities and rapid quality degradation over long rollouts. PAN addresses this with Causal Swin DPM, which augments the Shift Window Denoising Process Model with chunk wise causal attention.

The decoder operates on a sliding temporal window that holds two chunks of video frames at different noise levels. During denoising, one chunk moves from high noise to clean frames and then leaves the window. A new noisy chunk enters at the other end. Chunk wise causal attention ensures that the later chunk can only attend to the earlier one, not to unseen future actions. This keeps transitions between chunks smooth and reduces error accumulation over long horizons.
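
One way to picture chunk wise causal attention is as a block mask over the chunks in the sliding window. The sketch below builds such a mask from the description above; the exact masking used by PAN may differ.

```python
import torch

def chunkwise_causal_mask(frames_per_chunk, num_chunks=2):
    """Boolean attention mask where frames in a later chunk may attend to
    earlier chunks and to their own chunk, but never to future chunks.

    This mirrors the chunk wise causal attention described for Causal Swin DPM;
    it is an illustrative reconstruction, not the released implementation.
    """
    total = frames_per_chunk * num_chunks
    chunk_id = torch.arange(total) // frames_per_chunk        # chunk index per frame
    # allowed[i, j] is True when query frame i may attend to key frame j
    allowed = chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)
    return allowed

# Example: a 2-chunk window with 4 frames per chunk.
mask = chunkwise_causal_mask(frames_per_chunk=4)
print(mask.int())
```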

PAN also adds controlled noise to the conditioning frame, rather than using a perfectly sharp frame. This suppresses incidental pixel details that do not matter for dynamics and encourages the model to focus on stable structure such as objects and layout.
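
Noising the conditioning frame is a small operation. A minimal sketch, assuming a simple Gaussian perturbation at a fixed noise level; the actual mixing rule and schedule are design choices of the paper that are not reproduced here.

```python
import torch

def noise_conditioning_frame(frame, noise_level=0.1):
    """Lightly perturb the conditioning frame so the model keys on stable
    structure (objects, layout) rather than incidental pixel detail.
    The mixing rule and noise level are illustrative assumptions."""
    noise = torch.randn_like(frame)
    return (1.0 - noise_level) * frame + noise_level * noise
```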


Training stack and data construction

PAN is trained in two stages. In the first stage, the research team adapts Wan2.1-T2V-14B into the Causal Swin DPM architecture. They train the decoder in BFloat16 with AdamW, a cosine learning rate schedule, gradient clipping, FlashAttention-3 and FlexAttention kernels, and a hybrid sharded data parallel scheme across 960 NVIDIA H200 GPUs.
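
A hypothetical training-loop configuration echoing the ingredients listed above (AdamW, cosine schedule, gradient clipping, BFloat16) is sketched below. The learning rate, weight decay, and function names are placeholders; sharding and the attention kernels are omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_stage1(decoder, total_steps, lr=1e-4):
    """Placeholder optimizer setup: AdamW with a cosine schedule.
    Values are illustrative, not the paper's hyperparameters."""
    optimizer = AdamW(decoder.parameters(), lr=lr, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler

def training_step(decoder, optimizer, scheduler, loss_fn, batch, max_norm=1.0):
    """One decoder update in BFloat16 autocast with gradient clipping."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(decoder, **batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm)  # gradient clipping
    optimizer.step()
    scheduler.step()
    return loss.detach()
```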

In the second stage, they integrate the frozen Qwen2.5-VL-7B-Instruct backbone with the video diffusion decoder under the GLP objective. The vision language backbone stays frozen, while the learned query embeddings and the decoder are trained so that predicted latents and reconstructed videos stay consistent. This joint training also uses sequence parallelism and Ulysses style attention sharding to handle long context sequences. Although the schedule allows 5 epochs, early stopping ends training after 1 epoch once validation converges.
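
The parameter split in this stage can be sketched as follows. The attribute names (`backbone`, `queries`, `video_decoder`) are assumptions for illustration, not PAN's actual module names.

```python
import itertools
from torch.optim import AdamW

def configure_stage2(model, lr=1e-4):
    """Freeze the vision language backbone; train only the learned query
    embeddings and the diffusion decoder. Attribute names and learning
    rate are illustrative placeholders."""
    for p in model.backbone.parameters():
        p.requires_grad_(False)                      # frozen Qwen2.5-VL backbone
    trainable = itertools.chain([model.queries], model.video_decoder.parameters())
    return AdamW(trainable, lr=lr, weight_decay=0.01)
```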

Training data comes from widely used publicly accessible video sources that cover everyday activities, human object interactions, natural environments, and multi agent scenarios. Long form videos are segmented into coherent clips using shot boundary detection. A filtering pipeline removes static or overly dynamic clips, low aesthetic quality, heavy text overlays, and screen recordings using rule based metrics, pretrained detectors, and a custom VLM filter. The research team then re-captions clips with dense, temporally grounded descriptions that emphasize motion and causal events.
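
A curation pipeline of this shape can be expressed as a sequence of clip-level predicates. The callables below (`detect_shots`, the filter functions, `recaption`) are stand-ins for the paper's actual detectors and VLM filter.

```python
def curate_clips(videos, detect_shots, filters, recaption):
    """Illustrative curation pipeline: segment long videos at shot boundaries,
    drop clips that fail any filter (static or overly dynamic motion, low
    aesthetics, text overlays, screen recordings), then attach dense,
    temporally grounded captions. All callables are placeholders."""
    curated = []
    for video in videos:
        for clip in detect_shots(video):             # shot boundary detection
            if all(keep(clip) for keep in filters):  # rule-based + model-based filters
                clip.caption = recaption(clip)       # dense, motion-focused caption
                curated.append(clip)
    return curated
```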

Benchmarks, action fidelity, long horizon stability, planning

The research team evaluates the model along three axes: action simulation fidelity, long horizon forecast, and simulative reasoning and planning, comparing against both open source and commercial video generators and world models. Baselines include Wan2.1 and Wan2.2, Cosmos 1 and Cosmos 2, V JEPA 2, and commercial systems such as KLING, MiniMax Hailuo, and Gen 3.

For action simulation fidelity, a VLM based judge scores how well the model executes language specified actions while maintaining a stable background. PAN reaches 70.3% accuracy on agent simulation and 47% on environment simulation, for an overall score of 58.6%. It achieves the highest fidelity among open source models and surpasses most commercial baselines.

For long horizon forecast, the research team measures Transition Smoothness and Simulation Consistency. Transition Smoothness uses optical flow acceleration to quantify how smooth motion is across action boundaries. Simulation Consistency uses metrics inspired by WorldScore to monitor degradation over extended sequences. PAN scores 53.6% on Transition Smoothness and 64.1% on Simulation Consistency and exceeds all baselines, including KLING and MiniMax, on these metrics.
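
The smoothness idea can be approximated from frame-to-frame optical flow: large changes in flow across an action boundary indicate a jerky transition. The snippet below is a generic proxy for such a metric, not the paper's exact Transition Smoothness formula, and it assumes flow fields from any off-the-shelf estimator.

```python
import numpy as np

def flow_acceleration(flows):
    """Mean magnitude of the second temporal difference of optical flow.

    `flows` has shape (T, H, W, 2) with per-frame flow fields. Lower values
    mean smoother motion across frames and action boundaries. This is a
    generic proxy, not the metric as defined in the paper.
    """
    flows = np.asarray(flows, dtype=np.float32)
    accel = np.diff(flows, n=2, axis=0)              # second difference over time
    return float(np.linalg.norm(accel, axis=-1).mean())
```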

For simulative reasoning and planning, PAN is used as an internal simulator inside an OpenAI-o3 based agent loop. In step wise simulation, PAN achieves 56.1% accuracy, the best among open source world models.


Key Takeaways

  • PAN implements the Generative Latent Prediction architecture, combining a Qwen2.5-VL-7B based latent dynamics backbone with a Wan2.1-T2V-14B based video diffusion decoder, to unify latent world reasoning and realistic video generation.
  • The Causal Swin DPM mechanism introduces a sliding window, chunk wise causal denoising process that conditions on partially noised past chunks, which stabilizes long horizon video rollouts and reduces temporal drift compared to naive last frame conditioning.
  • PAN is trained in two stages, first adapting the Wan2.1 decoder to Causal Swin DPM on 960 NVIDIA H200 GPUs with a flow matching objective, then jointly training the GLP stack with a frozen Qwen2.5-VL backbone and learned query embeddings plus decoder.
  • The training corpus consists of large scale video action pairs from diverse domains, processed with segmentation, filtering, and dense temporal recaptioning, enabling PAN to learn action conditioned, long range dynamics instead of isolated short clips.
  • PAN achieves state of the art open source results on action simulation fidelity, long horizon forecasting, and simulative planning, with reported scores such as 70.3% agent simulation, 47% environment simulation, 53.6% transition smoothness, and 64.1% simulation consistency, while remaining competitive with leading commercial systems.
  • Comparison Table

| Dimension | PAN | Cosmos video2world WFM | Wan2.1-T2V-14B | V JEPA 2 |
|---|---|---|---|---|
| Organization | MBZUAI Institute of Foundation Models | NVIDIA Research | Wan AI and Open Laboratory | Meta AI |
| Primary role | General world model for interactive, long horizon world simulation with natural language actions | World foundation model platform for Physical AI with video to world generation for control and navigation | High quality text to video and image to video generator for general content creation and editing | Self supervised video model for understanding, prediction, and planning tasks |
| World model framing | Explicit GLP world model with latent state, action, and next observation defined; focuses on simulative reasoning and planning | Described as a world foundation model that generates future video worlds from past video and a control prompt; aimed at Physical AI, robotics, driving, navigation | Framed as a video generation model, not primarily as a world model; no persistent internal world state described in docs | Joint embedding predictive architecture for video; focuses on latent prediction rather than explicit generative supervision in observation space |
| Core architecture | GLP stack: vision encoder from Qwen2.5-VL-7B, LLM based latent dynamics backbone, video diffusion decoder with Causal Swin DPM | Family of diffusion based and autoregressive world models with video2world generation, plus a diffusion decoder and prompt upsampler based on a language model | Spatio temporal variational autoencoder and diffusion transformer T2V model at 14 billion parameters; supports multiple generative tasks and resolutions | JEPA style encoder plus predictor architecture that matches latent representations of consecutive video observations |
| Backbone and latent space | Multimodal latent space from Qwen2.5-VL-7B, used both for encoding observations and for autoregressive latent prediction under actions | Token based video2world model with text prompt conditioning and an optional diffusion decoder for refinement; latent space details depend on model variant | Latent space from a VAE plus diffusion transformer, driven mainly by text or image prompts; no explicit agent action sequence interface | Latent space built from a self supervised video encoder with a predictive loss in representation space, not a generative reconstruction loss |
| Action or control input | Natural language actions in dialogue format, applied at every simulation step; the model predicts the next latent state and decodes video conditioned on action and history | Control input as a text prompt and optionally camera pose for navigation and downstream tasks such as humanoid control and autonomous driving | Text prompts and image inputs for content control; no explicit multi step agent action interface described as world model control | Does not focus on natural language actions; used more as a visual representation and predictor module inside larger agents or planners |
| Long horizon design | Causal Swin DPM sliding window diffusion, chunk wise causal attention, conditioning on a slightly noised last frame to reduce drift and maintain stable long horizon rollouts | Video2world model generates future video given a past window and prompt; supports navigation and long sequences, but the paper does not describe a Causal Swin DPM style mechanism | Can generate several seconds at 480P and 720P; focuses on visual quality and motion; long horizon stability is evaluated through Wan Bench but without an explicit world state mechanism | Long temporal reasoning comes from predictive latent modeling and self supervised training, not from generative video rollouts with explicit diffusion windows |
| Training data focus | Large scale video action pairs across diverse physical and embodied domains, with segmentation, filtering, and dense temporal recaptioning for action conditioned dynamics | Mix of proprietary and public Internet videos focused on Physical AI categories such as driving, manipulation, human activity, navigation, and nature dynamics, with a dedicated curation pipeline | Large open domain video and image corpora for general visual generation, with Wan Bench evaluation prompts; not targeted specifically at agent environment rollouts | Large scale unlabelled video data for self supervised representation learning and prediction; details in the V JEPA 2 paper |

PAN is an important step because it operationalizes Generative Latent Prediction with production scale components such as Qwen2.5-VL-7B and Wan2.1-T2V-14B, then validates this stack on well defined benchmarks for action simulation, long horizon forecasting, and simulative planning. The training and evaluation pipeline is clearly documented by the research team, the metrics are reproducible, and the model is released within a transparent world modeling framework rather than as an opaque video demo. Overall, PAN shows how a vision language backbone plus a diffusion video decoder can function as a practical world model instead of a pure generative toy.


