David Romero


I’m a PhD student in Computer Vision at MBZUAI, advised by Ivan Laptev. Previously, I worked on Vision-Language models with Thamar Solorio at the Ritual Lab. Before that, I worked on NLP and speech processing at the Ixa Research Group, University of the Basque Country, with Eneko Agirre, and at the Universidad Politecnica de Madrid with Luis Fernando D’Haro. I am an Electronic Engineer, originally from Ecuador 🇪🇨.

My current research focuses on large-scale diffusion models for world modeling. I’m also interested in Vision-Language Models for video understanding.

News

Oct 31, 2025 My poster presentation on KineMask won the second-best poster award at the MBZUAI Research Showcase Event 2025.
Mar 15, 2025 I gave a talk on CVQA at the Adapt Centre, School of Computing, Dublin City University.
Dec 15, 2024 Microsoft featured my work CVQA in a blog post and a podcast.
Oct 15, 2024 CVQA has been accepted to NeurIPS 2024 Datasets and Benchmarks as an oral paper (top 1%).
Aug 15, 2024 I started my PhD at MBZUAI.
Mar 15, 2024 I gave a talk on Vision-Language Models for Video Understanding at the International Research Experience for Students (IRES) program, University of Houston (UH) and INAOE.
Feb 15, 2024 Q-ViD has been accepted to ACL-Findings 2024.
Sep 15, 2023 I’ve joined the Ritual Lab to work with Thamar Solorio.
Jun 15, 2023 I am a co-author of the recently published book “Tos por COVID-19. Caracterización desde la inteligencia artificial” (“COVID-19 Cough: Characterization through Artificial Intelligence”).
Nov 01, 2022 I was awarded the HAP-LAP scholarship by the Ixa Research Group at the University of the Basque Country.
Oct 01, 2022 My paper has been accepted to ICASSP 2022.

Selected publications

  1. Learning to Generate Object Interactions with Physics-Guided Video Diffusion
     David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, and Ivan Laptev
     In arXiv, 2025
  2. CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
     David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, et al.
     In NeurIPS (oral, top 1%), 2024
  3. Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
     David Romero and Thamar Solorio
     In ACL Findings, 2024
  4. Phonotactic Language Recognition Using a Universal Phoneme Recognizer and a Transformer Architecture
     David Romero, Luis Fernando D’Haro, Marcos Estecha-Garitagoitia, and Christian Salamea
     In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

Professional Service

Reviewer: ICLR 2026, NeurIPS 2025, ICCV 2025, EMNLP 2024.

Talks

Virtual Talk: CVQA. Adapt Centre, School of Computing, Dublin City University, 2024.
Virtual Talk: Vision-Language Models for Video Understanding. International Research Experience for Students (IRES), University of Houston (UH) and INAOE, 2024.