¹MBZUAI, UAE   ²Pinscreen, USA
KineMask is an approach for physics-guided video generation that enables realistic rigid-body control and interactions. Given a single image and an object velocity, KineMask generates videos with inferred motions and future object interactions, predicting dynamics from the input image and enabling the generation of complex effects.
Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid-body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy, we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for the synthesis of complex dynamical phenomena. Extensive experiments show that our method achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs.
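As a rough illustration of how the two conditioning signals can be combined in a video diffusion model, the sketch below injects ControlNet-style residuals computed from the per-frame velocity masks into a denoising UNet that is also conditioned on the predictive scene description via text cross-attention. This is a minimal sketch under assumed interfaces: the `ConditionedDenoiser` class, the `control_residuals` keyword, and the module signatures are placeholders, not the actual KineMask implementation.

```python
import torch.nn as nn


class ConditionedDenoiser(nn.Module):
    """Hypothetical wrapper combining a video denoising UNet with a ControlNet branch."""

    def __init__(self, unet: nn.Module, controlnet: nn.Module):
        super().__init__()
        self.unet = unet
        self.controlnet = controlnet

    def forward(self, noisy_latents, timestep, text_emb, velocity_masks):
        # Low-level control: the ControlNet branch turns per-frame velocity
        # masks into residual features injected into the denoising UNet.
        control_residuals = self.controlnet(noisy_latents, timestep, velocity_masks)
        # High-level control: the predictive scene description conditions the
        # UNet through text cross-attention.
        return self.unet(
            noisy_latents,
            timestep,
            encoder_hidden_states=text_emb,
            control_residuals=control_residuals,
        )
```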
• We enable object-based control with a novel training strategy. Paired with synthetic data constructed for the task, KineMask enables pretrained diffusion models to synthesize realistic object interactions in real-world input scenes.
• We encode the low-level control signal as a per-frame mask of the instantaneous velocity of the moving objects and use it to train a ControlNet (left) in two stages on Blender-generated videos of objects in motion. In the first stage we train with velocity masks for all frames, whereas in the second stage we randomly drop the masks of part of the final frames (see the sketch after this list). We also provide high-level textual control extracted by a VLM. At inference (right), we construct the low-level conditioning with SAM and use GPT to infer high-level outcomes of object motion from a single frame.
• We generate a synthetic dataset using Blender, rendering cubes and cylinders with random colors that move on textured backgrounds and interact with each other. We show our synthetic dataset along with the velocity masks for each case. Note that masks are extracted only for objects that move at the beginning of the video; their color and intensity change with motion direction and velocity magnitude (see the sketch below).
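For concreteness, the sketch below illustrates, under our own naming rather than the released code, how such per-frame velocity masks could be rendered (direction mapped to hue, speed to intensity) and how a two-stage schedule could drop the masks of a random suffix of final frames. The function names `velocity_mask_frame` and `drop_future_masks`, and the `v_max` normalization constant, are illustrative assumptions.

```python
import colorsys

import numpy as np
import torch


def velocity_mask_frame(seg_masks, velocities, v_max=5.0):
    """Render one conditioning frame: motion direction -> hue, speed -> intensity.

    seg_masks: list of HxW boolean arrays, one per moving object.
    velocities: list of (vx, vy) instantaneous velocities for those objects.
    Returns an HxWx3 float image in [0, 1]; static regions stay black.
    """
    h, w = seg_masks[0].shape
    frame = np.zeros((h, w, 3), dtype=np.float32)
    for mask, (vx, vy) in zip(seg_masks, velocities):
        speed = float(np.hypot(vx, vy))
        if speed == 0.0:
            continue  # only objects moving at the start of the video are encoded
        hue = (np.arctan2(vy, vx) + np.pi) / (2.0 * np.pi)  # direction -> hue
        value = min(speed / v_max, 1.0)                      # magnitude -> intensity
        frame[mask] = colorsys.hsv_to_rgb(hue, 1.0, value)
    return frame


def drop_future_masks(vel_masks: torch.Tensor, stage: int) -> torch.Tensor:
    """Stage 1: keep velocity masks for all frames.

    Stage 2: zero out the masks of a random suffix of final frames, so the
    model must infer future motion and interactions on its own.
    vel_masks: (B, T, C, H, W) per-frame conditioning.
    """
    if stage == 1:
        return vel_masks
    out = vel_masks.clone()
    b, t = vel_masks.shape[:2]
    for i in range(b):
        keep = int(torch.randint(1, t + 1, (1,)))  # random number of visible frames
        out[i, keep:] = 0.0
    return out


# Example (stage-2 training): zero future-frame masks before conditioning.
# cond = drop_future_masks(vel_mask_video, stage=2)
```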
KineMask can perform low-level motion control, moving different objects along different degrees of freedom and at different speeds. Despite being trained on basic synthetic data, it generalizes motion control to real-world scenes. Examples of multi-object control are coming soon.
KineMask trained on object interactions and tested with three different velocities on a real scene. As the velocity increases, the resulting interactions also change, indicating that the model captures the causal structure of motion. In particular, the final position of the second television varies with the velocity of the first, moving further when the first hits it at higher speed. This property is valuable for world modeling, as it enables the analysis of different outcomes of object interactions and supports informed planning.
KineMask generates realistic interactions with other objects when they lie in the path of the initially moving object, showing a correct understanding of rigid-body dynamics. We also show complex interactions that require implicit 3D understanding, such as making a pot or a glass of juice fall and crash as a result of motion, as well as multi-object motion and interactions (examples coming soon). KineMask preserves the input motion direction and object consistency across different types of real-world scenarios, demonstrating strong generalization of the knowledge acquired from simulated videos.
@misc{romero2025learninggenerateobjectinteractions,
      title={Learning to Generate Object Interactions with Physics-Guided Video Diffusion},
      author={David Romero and Ariana Bermudez and Hao Li and Fabio Pizzati and Ivan Laptev},
      year={2025},
      eprint={2510.02284},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02284}
}