David Romero


I’m a PhD student in Computer Vision at MBZUAI, advised by Ivan Laptev. Previously, I worked on Vision-Language models with Thamar Solorio at the Ritual Lab. Before that, I worked on NLP and speech processing at the Ixa Research Group, University of the Basque Country, with Eneko Agirre, and at the Universidad Politecnica de Madrid with Luis Fernando D’Haro. I am an Electronic Engineer, originally from Ecuador 🇪🇨.

My current research focuses on large-scale diffusion models for world modeling. I’m also interested in Vision-Language Models for video understanding.

News

Oct 31, 2025 My poster presentation on KineMask won the second-best poster award at the MBZUAI Research Showcase Event 2025.
Mar 15, 2025 I gave a talk on CVQA at the Adapt Centre, School of Computing, Dublin City University.
Dec 15, 2024 Microsoft featured my work CVQA in a blog post and a podcast.
Oct 15, 2024 CVQA has been accepted to NeurIPS 2024 Datasets and Benchmarks as an oral paper (top 1%).
Aug 15, 2024 I started my PhD at MBZUAI.
Mar 15, 2024 I gave a talk on Vision-Language Models for Video Understanding at the International Research Experience for Students (IRES) program, University of Houston (UH) and INAOE.
Feb 15, 2024 Q-ViD has been accepted to ACL-Findings 2024.
Sep 15, 2023 I’ve joined the Ritual Lab to work with Thamar Solorio.
Jun 15, 2023 I am a co-author of the recently published book “Tos por COVID-19. Caracterización desde la inteligencia artificial” (“COVID-19 Cough: Characterization through Artificial Intelligence”).
Nov 01, 2022 I was awarded the HAP-LAP scholarship by the Ixa Research Group at the University of the Basque Country.
Oct 01, 2022 My paper has been accepted to ICASSP 2022.

Selected publications

  1. Learning to Generate Object Interactions with Physics-Guided Video Diffusion
     David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, and Ivan Laptev
     In arXiv, 2025
  2. CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
     David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, et al.
     In NeurIPS (oral, top 1%), 2024
  3. Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
     David Romero and Thamar Solorio
     In ACL Findings, 2024
  4. Phonotactic Language Recognition Using a Universal Phoneme Recognizer and a Transformer Architecture
     David Romero, Luis Fernando D’Haro, Marcos Estecha-Garitagoitia, and Christian Salamea
     In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

Professional Service

Reviewer: ICLR 2026, NeurIPS 2025, ICCV 2025, EMNLP 2024.

Talks

Virtual Talk: CVQA. Adapt Centre, School of Computing, Dublin City University, 2024.
Virtual Talk: Vision-Language Models for Video Understanding. International Research Experience for Students (IRES), University of Houston (UH) and INAOE, 2024.