Learning to Generate Object Interactions
with Physics-Guided Video Diffusion

David Romero¹, Ariana Bermudez¹, Hao Li¹,², Fabio Pizzati¹, Ivan Laptev¹

¹MBZUAI, UAE   ²Pinscreen, USA

Paper · Code (Coming Soon) · Data (Coming Soon)

Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce an approach for physics-guided video generation that enables realistic rigid-body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy, we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements in object interactions in real scenes. Furthermore, our method integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, providing effective support for the synthesis of complex dynamical phenomena. Extensive experiments show that our method achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs.

• We enable object-based control with a novel training strategy. Paired with synthetic data constructed for the task, our approach enables pretrained diffusion models to synthesize realistic object interactions in real-world input scenes.

[Figure: method overview — two-stage training (left) and inference pipeline (right)]

• We encode our low-level control signal as a per-frame mask that captures the instantaneous velocity of the moving objects, and use it to train a ControlNet (left) in two stages on Blender-generated videos of objects in motion. In the first stage, we train with velocity masks for all frames, whereas in the second stage we randomly drop the masks for part of the final frames. We also provide high-level textual control extracted by a VLM. At inference (right), we construct the low-level conditioning with SAM and use GPT to infer high-level outcomes of the object motion from a single frame.
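For concreteness, the snippet below is a minimal sketch of the second-stage conditioning schedule described above. All names (mask_out_future_frames, controlnet_step, the (T, C, H, W) mask layout) are illustrative assumptions, not the released training code.

```python
# Minimal sketch of the two-stage conditioning schedule (hypothetical names).
import random
import torch

def mask_out_future_frames(velocity_masks: torch.Tensor, stage: int) -> torch.Tensor:
    """velocity_masks: (T, C, H, W) per-frame velocity masks (assumed layout).

    Stage 1: keep velocity supervision for all T frames.
    Stage 2: zero out the masks for a random suffix of final frames,
    removing future motion supervision so the model must infer it.
    Assumes T >= 2.
    """
    if stage == 1:
        return velocity_masks
    T = velocity_masks.shape[0]
    keep = random.randint(1, T - 1)   # keep at least one leading frame, drop at least one
    masked = velocity_masks.clone()
    masked[keep:] = 0.0               # future frames receive no motion conditioning
    return masked

# Usage inside a (hypothetical) training loop:
# for video, vel_masks, caption in loader:
#     cond = mask_out_future_frames(vel_masks, stage=2)
#     loss = controlnet_step(video, cond, caption)   # standard diffusion loss
```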

[Figure: synthetic dataset samples with the corresponding velocity masks]

• We generate a synthetic dataset using Blender. We render cubes and cylinders with random colors, moving on textured backgrounds and interacting with each other. We show samples of our synthetic dataset along with the corresponding velocity masks. Note that masks are extracted only for objects that move at the beginning of the video; their color and intensity vary with the motion direction and velocity magnitude.
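As an illustration, one plausible encoding of such a velocity mask maps motion direction to hue and speed to brightness, painted only inside the moving object's segmentation (similar to optical-flow color coding). The exact mapping used for our renders may differ; the function below is only a sketch.

```python
# Sketch of a direction/magnitude color coding for velocity masks (illustrative).
import colorsys
import numpy as np

def velocity_mask(obj_mask: np.ndarray, velocity_xy: tuple[float, float],
                  max_speed: float) -> np.ndarray:
    """obj_mask: (H, W) boolean segmentation of a moving object.
    velocity_xy: the object's instantaneous (vx, vy) in this frame.
    Returns an (H, W, 3) RGB mask; static objects and background stay black."""
    vx, vy = velocity_xy
    speed = float(np.hypot(vx, vy))
    hue = (np.arctan2(vy, vx) + np.pi) / (2 * np.pi)   # direction -> hue in [0, 1]
    value = min(speed / max_speed, 1.0)                # magnitude -> brightness
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
    rgb = np.zeros((*obj_mask.shape, 3), dtype=np.float32)
    rgb[obj_mask] = (r, g, b)
    return rgb
```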

Our method can perform low-level motion control, moving different objects along different degrees of freedom and at different speeds. Despite being trained on basic synthetic data, it generalizes motion control to real-world scenes. Multi-object control examples coming soon.

Direction
Speed
Object

Our model, trained on object interactions, is tested with three different velocities on a real scene. As the velocity increases, the resulting interactions also change, indicating that the model captures the causal structure of motion. In particular, the final position of the second television varies with the velocity of the first, moving further if the first hits it at higher speed. This property is valuable for world modeling, as it enables the analysis of different outcomes of object interactions and supports informed planning.
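As a back-of-the-envelope check of the behavior we expect the model to reproduce, in a simple 1D collision followed by sliding friction the pushed object's travel distance grows with the impact speed. The numbers below are arbitrary and only for intuition; they are not measurements from our experiments.

```python
# 1D collision with restitution e, then sliding friction: d = v^2 / (2 * mu * g).
MU, G, RESTITUTION = 0.3, 9.81, 0.6   # friction coefficient, gravity, bounciness (assumed)

def pushed_distance(impact_speed: float, mass_a: float = 1.0, mass_b: float = 1.0) -> float:
    # velocity transferred to the initially static object B
    v_b = (1 + RESTITUTION) * mass_a / (mass_a + mass_b) * impact_speed
    # B slides until friction dissipates its kinetic energy
    return v_b ** 2 / (2 * MU * G)

for label, v in [("low", 0.5), ("medium", 1.0), ("high", 2.0)]:
    print(f"{label:>6} impact ({v} m/s): pushed ~{pushed_distance(v):.2f} m")
```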

Low Velocity
Medium Velocity
High Velocity

Our method generates realistic interactions with other objects when they lie in the path of the initially moving object, showing a correct understanding of rigid-body dynamics. We also show complex interactions that require implicit 3D understanding, such as making a pot or a glass of juice fall and shatter as a result of the motion, as well as multi-object motion and interactions (examples coming soon). The input motion direction and object consistency are preserved across different types of real-world scenarios, showing strong generalization of the knowledge acquired from simulated videos.

Comparison of Our Method with Other Baselines
