TR2025-068

KitchenVLA: Iterative Vision-Language Corrections for Robotic Execution of Human Tasks


    •  Lu, K., Ma, C., Hori, C., Romeres, D., "KitchenVLA: Iterative Vision-Language Corrections for Robotic Execution of Human Tasks", IEEE International Conference on Robotics and Automation Workshop on Safely Leveraging Vision-Language Foundation Models in Robotics (SafeLVMs@ICRA), May 2025.
      BibTeX:
      @inproceedings{Lu2025may,
        author = {Lu, Kai and Ma, Chenyang and Hori, Chiori and Romeres, Diego},
        title = {{KitchenVLA: Iterative Vision-Language Corrections for Robotic Execution of Human Tasks}},
        booktitle = {IEEE International Conference on Robotics and Automation Workshop on Safely Leveraging Vision-Language Foundation Models in Robotics (SafeLVMs@ICRA)},
        year = 2025,
        month = may,
        url = {https://www.merl.com/publications/TR2025-068}
      }
Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Robotics, Speech & Audio

Abstract:

In this paper, we present KitchenVLA, a Vision-Language-Action (VLA) framework for generating and optimizing executable robot actions from human instructional videos. While recent advances in video understanding and step generation have shown promising results, translating these steps into robot-executable actions remains challenging, particularly for complex, long-horizon tasks such as those in kitchen environments. These challenges arise from domain discrepancies between human videos and robotic settings, as well as mismatches between human actions and robot capabilities. To address these issues, we propose a zero-shot action planning and correction framework, where a Vision-Language Model (VLM) acts as an evaluator to analyze both the original human video and the robot’s observations to detect domain mismatches. The system assesses differences in object states and action feasibility, and generates corrective actions to align the robot’s execution with the intended task. By incorporating keyframe selection, language-guided segmentation, and simulation-based verification, KitchenVLA iteratively refines robotic plans to ensure contextual accuracy and executability. Through domain-aware evaluation and correction, our framework enhances the adaptability and robustness of robotic task execution in kitchen environments, advancing the integration of VLMs into robot learning and executable plan correction.
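
The Python sketch below is a minimal illustration of the iterative evaluate-and-correct loop summarized in the abstract (VLM as evaluator, corrective re-planning, simulation-based verification). It is an assumed reading of the described pipeline, not the authors' released implementation; every function name and signature (vlm_evaluate, vlm_correct, verify_in_simulation, refine_plan) is a hypothetical placeholder.

    # Illustrative sketch only: the structure mirrors the abstract's description,
    # but all functions below are hypothetical placeholders, not the paper's code.

    def vlm_evaluate(human_video, robot_obs, plan):
        # Placeholder for a VLM query that compares the human demonstration
        # with the robot's observations and returns detected mismatches
        # (e.g., differing object states, actions the robot cannot perform).
        return []

    def vlm_correct(plan, mismatches):
        # Placeholder for a VLM query that rewrites the plan with corrective
        # actions addressing each reported mismatch.
        return plan

    def verify_in_simulation(plan):
        # Placeholder for simulation-based verification of executability.
        return True

    def refine_plan(human_video, robot_obs, plan, max_iters=5):
        # Iteratively evaluate and correct a robot plan derived from a human video.
        for _ in range(max_iters):
            mismatches = vlm_evaluate(human_video, robot_obs, plan)
            if not mismatches and verify_in_simulation(plan):
                return plan  # aligned with the demonstration and executable
            plan = vlm_correct(plan, mismatches)
        return plan  # best-effort plan after max_iters iterations

In this reading, the loop terminates either when the VLM evaluator reports no remaining domain mismatches and the plan passes simulation-based verification, or after a fixed iteration budget.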