TR2023-104
Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos
"Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos", Interspeech, DOI: 10.21437/Interspeech.2023-1983, August 2023, pp. 4663-4667.
@inproceedings{Hori2023aug,
  author = {Hori, Chiori and Peng, Puyuan and Harwath, David and Liu, Xinyu and Ota, Kei and Jain, Siddarth and Corcodel, Radu and Jha, Devesh K. and Romeres, Diego and Le Roux, Jonathan},
  title = {Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos},
  booktitle = {Interspeech},
  year = 2023,
  pages = {4663--4667},
  month = aug,
  doi = {10.21437/Interspeech.2023-1983},
  url = {https://www.merl.com/publications/TR2023-104}
}
Research Areas:
Artificial Intelligence, Computer Vision, Machine Learning, Robotics, Speech & Audio
Abstract:
To realize human-robot collaboration, robots need to execute actions for new tasks according to human instructions given finite prior knowledge. Human experts can share their knowledge of how to perform a task with a robot through multi-modal instructions in their demonstrations, showing a sequence of short-horizon steps to achieve a long-horizon goal. This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data. We built a system that accomplishes various cooking actions, where an arm robot executes a DMP sequence acquired from a cooking video using the audio-visual Transformer. Experiments with Epic-Kitchen-100, YouCookII, QuerYD, and in-house instruction video datasets show that the proposed method improves the quality of DMP sequences by 2.3 times the METEOR score obtained with a baseline video-to-action Transformer. The model achieved 32% of the task success rate with the task knowledge of the object.
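For readers unfamiliar with the action representation the abstract mentions: a dynamic movement primitive is, in its standard textbook form, a critically damped spring-damper system driven by a phase-dependent forcing term, so that the motion converges to a goal state. The sketch below is purely illustrative background on that formulation, not code from the paper; the function name, constants, and Euler integration scheme are our own assumptions.

```python
import numpy as np

def rollout_dmp(x0, g, tau=1.0, alpha=25.0, dt=0.01, steps=100, forcing=None):
    """Integrate a 1-D discrete dynamic movement primitive (Euler method).

    Without a forcing term, the system is a critically damped
    spring-damper that converges from the start x0 to the goal g.
    A learned forcing term f(s), keyed to the decaying phase s,
    shapes the transient into an arbitrary demonstrated motion.
    """
    beta = alpha / 4.0          # critical damping: beta = alpha / 4
    alpha_s = 4.0               # decay rate of the canonical system
    x, v, s = x0, 0.0, 1.0      # position, scaled velocity, phase (1 -> 0)
    traj = [x]
    for _ in range(steps):
        f = forcing(s) if forcing is not None else 0.0
        dv = (alpha * (beta * (g - x) - v) + f) / tau
        v += dv * dt
        x += (v / tau) * dt
        s += (-alpha_s * s / tau) * dt
        traj.append(x)
    return np.array(traj)
```

In this formulation, a DMP sequence like the one the Transformer emits would correspond to a list of goal/forcing-term parameter sets executed one after another, each converging before the next begins.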
Related Publication
@article{Hori2023jun,
  author = {Hori, Chiori and Peng, Puyuan and Harwath, David and Liu, Xinyu and Ota, Kei and Jain, Siddarth and Corcodel, Radu and Jha, Devesh K. and Romeres, Diego and Le Roux, Jonathan},
  title = {Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos},
  journal = {arXiv},
  year = 2023,
  month = jun,
  url = {https://arxiv.org/abs/2306.15644}
}