TR2025-049

Programmatic Video Prediction Using Large Language Models


Abstract:

Estimating a world model that describes the dynamics of a real-world process is of immense importance for anticipating and preparing for future outcomes, and finds widespread use in applications such as video surveillance, robotics, and autonomous driving. This task entails synthesizing plausible visual futures given a few frames of a video, which set the visual context for the synthesis. Towards this end, and in contrast to end-to-end deep learning based approaches for video frame prediction, we propose ProgGen, which performs video frame prediction by synthesizing computer programs that represent the dynamics of the video through a set of neuro-symbolic, human-interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLMs/VLMs). In particular, ProgGen utilizes LLMs/VLMs to synthesize computer programs that: (i) estimate the states of the video, given the visual context (i.e., the frames); (ii) predict the states corresponding to future time steps by estimating the transition dynamics; and (iii) render the predicted states as visual RGB frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and editability, attesting to its effectiveness and generalizability.
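The three-stage pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the state variables, the Euler-step dynamics, and all function names (`estimate_states`, `predict_next`, `render`) are hypothetical stand-ins for the programs that ProgGen would synthesize via an LLM/VLM.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-frame symbolic state for a Cart Pole-like scene;
# in ProgGen the state schema itself is produced by the LLM/VLM.
@dataclass
class CartPoleState:
    cart_x: float       # horizontal cart position
    pole_angle: float   # pole angle from vertical (radians)
    cart_vx: float      # cart velocity
    pole_omega: float   # pole angular velocity

def estimate_states(frames) -> List[CartPoleState]:
    """Stage (i): perception. A synthesized program would extract one
    symbolic state per context frame; stubbed with fixed values here."""
    return [CartPoleState(0.0, 0.05, 0.1, 0.02) for _ in frames]

def predict_next(s: CartPoleState, dt: float = 0.02) -> CartPoleState:
    """Stage (ii): transition dynamics. A placeholder Euler step;
    the actual dynamics program is inferred from the estimated states."""
    return CartPoleState(
        cart_x=s.cart_x + dt * s.cart_vx,
        pole_angle=s.pole_angle + dt * s.pole_omega,
        cart_vx=s.cart_vx,
        pole_omega=s.pole_omega,
    )

def render(s: CartPoleState):
    """Stage (iii): rendering. A synthesized program would draw an RGB
    frame from the state; here we simply return the state."""
    return s

# Roll out predicted frames from a few context frames.
context_frames = [None, None, None]  # placeholders for RGB frames
states = estimate_states(context_frames)
current = states[-1]
future = []
for _ in range(5):
    current = predict_next(current)
    future.append(render(current))
```

Separating perception, dynamics, and rendering into distinct programs is what makes the intermediate states human-interpretable and editable: one can modify a predicted state directly (e.g., displace `cart_x`) and re-render, enabling the counterfactual reasoning noted above.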