TR2025-049

Programmatic Video Prediction Using Large Language Models


Abstract:

Estimating a world model that describes the dynamics of a real-world process is of immense importance for anticipating and preparing for future outcomes, and finds widespread use in applications such as video surveillance, robotics, and autonomous driving. This task entails synthesizing plausible visual futures given a few frames of a video, which set the visual context for the synthesis. Towards this end, and in contrast to end-to-end deep learning based approaches for video frame prediction, we propose ProgGen, which performs video frame prediction by synthesizing computer programs that represent the dynamics of the video through a set of neuro-symbolic, human-interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLMs/VLMs). In particular, ProgGen utilizes LLMs/VLMs to synthesize computer programs that: (i) estimate the states of the video, given the visual context (i.e., the frames); (ii) predict the states corresponding to future time steps by estimating the transition dynamics; and (iii) render the predicted states as visual RGB frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and editability, attesting to its effectiveness and generalizability.
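The three-stage pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the state variables, the Euler-step dynamics, and all function names (`estimate_states`, `predict_next`, `render`) are hypothetical stand-ins for the programs that ProgGen would synthesize via an LLM/VLM.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-frame symbolic state for a Cart Pole-like scene;
# in ProgGen the state schema itself is produced by the LLM/VLM.
@dataclass
class CartPoleState:
    cart_x: float       # horizontal cart position
    pole_angle: float   # pole angle from vertical (radians)
    cart_vx: float      # cart velocity
    pole_omega: float   # pole angular velocity

def estimate_states(frames) -> List[CartPoleState]:
    """Stage (i): perception. A synthesized program would extract one
    symbolic state per context frame; stubbed with fixed values here."""
    return [CartPoleState(0.0, 0.05, 0.1, 0.02) for _ in frames]

def predict_next(s: CartPoleState, dt: float = 0.02) -> CartPoleState:
    """Stage (ii): transition dynamics. A placeholder Euler step;
    the actual dynamics program is inferred from the estimated states."""
    return CartPoleState(
        cart_x=s.cart_x + dt * s.cart_vx,
        pole_angle=s.pole_angle + dt * s.pole_omega,
        cart_vx=s.cart_vx,
        pole_omega=s.pole_omega,
    )

def render(s: CartPoleState):
    """Stage (iii): rendering. A synthesized program would draw an RGB
    frame from the state; here we simply return the state."""
    return s

# Roll out predicted frames from a few context frames.
context_frames = [None, None, None]  # placeholders for RGB frames
states = estimate_states(context_frames)
current = states[-1]
future = []
for _ in range(5):
    current = predict_next(current)
    future.append(render(current))
```

Separating perception, dynamics, and rendering into distinct programs is what makes the intermediate states human-interpretable and editable: one can modify a predicted state directly (e.g., displace `cart_x`) and re-render, enabling the counterfactual reasoning noted above.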