TR2025-049

Programmatic Video Prediction Using Large Language Models


    •  Tang, H., Ellis, K., Lohit, S., Jones, M.J., Chatterjee, M., "Programmatic Video Prediction Using Large Language Models", International Conference on Learning Representations Workshops (ICLRW), April 2025.
      BibTeX:
      @inproceedings{Tang2025apr,
        author = {Tang, Hao and Ellis, Kevin and Lohit, Suhas and Jones, Michael J. and Chatterjee, Moitreya},
        title = {{Programmatic Video Prediction Using Large Language Models}},
        booktitle = {International Conference on Learning Representations Workshops (ICLRW)},
        year = 2025,
        month = apr,
        url = {https://www.merl.com/publications/TR2025-049}
      }
  • Research Areas: Artificial Intelligence, Computer Vision, Machine Learning

Abstract:

Estimating a world model that describes the dynamics of a real-world process is important for anticipating and preparing for future outcomes, and it is widely used in applications such as video surveillance, robotics, and autonomous driving. The task entails synthesizing plausible visual futures given a few frames of a video, which set the visual context for the synthesis. To this end, and in contrast to end-to-end deep-learning approaches to video frame prediction, we propose ProgGen, which performs video frame prediction by synthesizing computer programs that represent the dynamics of the video using a set of neuro-symbolic, human-interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen uses an LLM/VLM to synthesize computer programs that: (i) estimate the states of the video, given the visual context (i.e., the frames); (ii) predict the states corresponding to future time steps by estimating the transition dynamics; and (iii) render the predicted states as visual RGB frames. Empirical evaluations show that the proposed method outperforms competing techniques at video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and editability, attesting to its effectiveness and generalizability.
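
To make the three-stage structure in the abstract concrete, the following toy sketch shows one way the stages could fit together. It is not the ProgGen implementation: in the paper these programs are synthesized by an LLM/VLM, whereas here they are hand-written stand-ins. The names `State`, `estimate_state`, `fit_transition`, `render`, and `predict_frames` are hypothetical, and the single-pixel object with constant velocity is an assumption chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List

# Toy frame format (an assumption): a 2D grid where 1 marks the object, 0 the background.
Frame = List[List[int]]


@dataclass
class State:
    """Human-readable symbolic state for one frame: the object's pixel position."""
    x: int
    y: int


def render(state: State, width: int = 8, height: int = 8) -> Frame:
    """Stage (iii): turn a symbolic state back into a frame."""
    frame = [[0] * width for _ in range(height)]
    if 0 <= state.x < width and 0 <= state.y < height:
        frame[state.y][state.x] = 1
    return frame


def estimate_state(frame: Frame) -> State:
    """Stage (i): recover the symbolic state from a frame."""
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if value:
                return State(x, y)
    raise ValueError("no object found in frame")


def fit_transition(states: List[State]) -> Callable[[State], State]:
    """Stage (ii): estimate dynamics; here, constant velocity from the last two states."""
    dx = states[-1].x - states[-2].x
    dy = states[-1].y - states[-2].y
    return lambda s: State(s.x + dx, s.y + dy)


def predict_frames(context: List[Frame], horizon: int) -> List[Frame]:
    """Chain the stages: frames -> states -> future states -> future frames."""
    states = [estimate_state(f) for f in context]
    step = fit_transition(states)
    height, width = len(context[0]), len(context[0][0])
    state, future = states[-1], []
    for _ in range(horizon):
        state = step(state)
        future.append(render(state, width, height))
    return future


# Two context frames with the object moving diagonally from (1, 1) to (2, 2).
context = [render(State(1, 1)), render(State(2, 2))]
future = predict_frames(context, horizon=2)
# The object continues along the diagonal: (3, 3), then (4, 4).
```

Because the intermediate representation is an explicit `State`, a user could edit it (for example, move the object) before calling `render`, which is the kind of editability and counterfactual querying the abstract attributes to a programmatic representation.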