TR2023-104

Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos


    •  Hori, C., Peng, P., Harwath, D., Liu, X., Ota, K., Jain, S., Corcodel, R., Jha, D.K., Romeres, D., Le Roux, J., "Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos", Interspeech, DOI: 10.21437/Interspeech.2023-1983, August 2023, pp. 4663-4667.
      BibTeX:
      @inproceedings{Hori2023aug,
        author = {Hori, Chiori and Peng, Puyuan and Harwath, David and Liu, Xinyu and Ota, Kei and Jain, Siddarth and Corcodel, Radu and Jha, Devesh K. and Romeres, Diego and {Le Roux}, Jonathan},
        title = {{Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos}},
        booktitle = {Interspeech},
        year = 2023,
        pages = {4663--4667},
        month = aug,
        doi = {10.21437/Interspeech.2023-1983},
        url = {https://www.merl.com/publications/TR2023-104}
      }
  • Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Robotics, Speech & Audio

Abstract:

To realize human-robot collaboration, robots need to execute actions for new tasks according to human instructions, given finite prior knowledge. Human experts can share their knowledge of how to perform a task with a robot through multi-modal instructions in their demonstrations, showing a sequence of short-horizon steps to achieve a long-horizon goal. This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech into a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data. We built a system that accomplishes various cooking actions, in which an arm robot executes a DMP sequence acquired from a cooking video using the audio-visual Transformer. Experiments with the Epic-Kitchens-100, YouCookII, QuerYD, and in-house instruction video datasets show that the proposed method improves the quality of DMP sequences, achieving 2.3 times the METEOR score of a baseline video-to-action Transformer. The model achieved a 32% task success rate when given task knowledge of the objects.
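For readers unfamiliar with dynamic movement primitives, the sketch below shows the standard discrete DMP formulation (a critically damped spring-damper system shaped by a learned forcing term) that each action in the predicted sequence would parameterize. This is a minimal illustration under common textbook conventions (Ijspeert et al.); the class, parameter values, and two-primitive usage example are assumptions for exposition, not the paper's implementation.

    import numpy as np

    class DMP:
        """Single-DoF discrete dynamic movement primitive (illustrative sketch)."""

        def __init__(self, n_basis=20, alpha=25.0, beta=6.25, alpha_x=1.0):
            self.alpha, self.beta, self.alpha_x = alpha, beta, alpha_x
            # Gaussian basis functions placed along the phase variable x in (0, 1].
            self.centers = np.exp(-alpha_x * np.linspace(0, 1, n_basis))
            self.widths = n_basis / self.centers ** 2
            # Forcing-term weights; learned from demonstrations in practice,
            # left at zero here so the rollout simply converges to the goal.
            self.weights = np.zeros(n_basis)

        def rollout(self, y0, goal, tau=1.0, dt=0.01, steps=200):
            y, yd, x = y0, 0.0, 1.0
            traj = []
            for _ in range(steps):
                psi = np.exp(-self.widths * (x - self.centers) ** 2)
                # Forcing term: weighted basis mix, scaled to vanish as x -> 0.
                f = (psi @ self.weights) / (psi.sum() + 1e-10) * x * (goal - y0)
                # Transformation system: tau^2 * ydd = alpha*(beta*(g - y) - tau*yd) + f
                ydd = (self.alpha * (self.beta * (goal - y) - tau * yd) + f) / tau ** 2
                yd += ydd * dt
                y += yd * dt
                x += -self.alpha_x * x / tau * dt  # canonical (phase) system
                traj.append(y)
            return np.array(traj)

    # Hypothetical two-primitive action sequence, executed one DMP at a time.
    reach = DMP().rollout(y0=0.0, goal=0.5)
    pour = DMP().rollout(y0=0.5, goal=1.2)

In the system described in the abstract, the audio-visual Transformer decodes such a primitive sequence from the video and speech; each primitive's forcing-term weights would be fit to demonstration trajectories rather than left at zero as in this sketch.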

 

  • Related Publication

  •  Hori, C., Peng, P., Harwath, D., Liu, X., Ota, K., Jain, S., Corcodel, R., Jha, D.K., Romeres, D., Le Roux, J., "Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos", arXiv, June 2023.
    BibTeX:
    @article{Hori2023jun,
      author = {Hori, Chiori and Peng, Puyuan and Harwath, David and Liu, Xinyu and Ota, Kei and Jain, Siddarth and Corcodel, Radu and Jha, Devesh K. and Romeres, Diego and {Le Roux}, Jonathan},
      title = {{Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos}},
      journal = {arXiv},
      year = 2023,
      month = jun,
      url = {https://arxiv.org/abs/2306.15644}
    }