TR2021-093
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers
- "Optimizing Latency for Online Video Captioning Using Audio-VisualTransformers", Interspeech, DOI: 10.21437/Interspeech.2021-1975, August 2021, pp. 586–590.BibTeX TR2021-093 PDF
BibTeX:
@inproceedings{Hori2021aug2,
  author    = {Hori, Chiori and Hori, Takaaki and Le Roux, Jonathan},
  title     = {Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers},
  booktitle = {Interspeech},
  year      = 2021,
  pages     = {586--590},
  month     = aug,
  publisher = {ISCA},
  doi       = {10.21437/Interspeech.2021-1975},
  url       = {https://www.merl.com/publications/TR2021-093}
}
Research Areas:
Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio
Abstract:
Video captioning is an essential technology for understanding scenes and describing events in natural language. To apply it to real surveillance systems, it is important not only to describe incidents accurately but also to produce captions as soon as possible. Low-latency captioning is required to realize such functionality, but this research area has not yet been pursued. This paper proposes a novel approach that optimizes the output timing of each caption based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions from only a small number of frames, without seeing all video frames, and also to mimic the outputs of a pre-trained Transformer that is given all the frames. A CNN-based timing detector is also trained to detect the timing at which the captions generated by the two Transformers become sufficiently close to each other. With the jointly trained Transformer and timing detector, a caption can be generated early in the video clip, as soon as an event happens or when it can be forecasted. Experiments with the ActivityNet Captions dataset show that our approach achieves 90% of the caption quality obtained for complete video clips while using only 20% of the frames.
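The sketch below illustrates the two components the abstract describes: a Transformer captioner that only sees the audio-visual features observed so far, and a small CNN-based timing detector that estimates whether the caption from the partial input is already close to what a full-video model would produce. This is a minimal sketch, not the authors' released code; the PyTorch framework choice, module names, and all dimensions are assumptions.

```python
# Minimal sketch (assumed PyTorch; hypothetical names and sizes), not the paper's code.
import torch
import torch.nn as nn


class PartialVideoCaptioner(nn.Module):
    """Encoder-decoder Transformer that captions a truncated feature sequence."""

    def __init__(self, feat_dim=1024, d_model=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)  # project audio-visual features
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, av_feats, caption_tokens):
        # av_feats: (B, T_partial, feat_dim) -- only the frames seen so far
        # caption_tokens: (B, L) -- shifted caption tokens for teacher forcing
        src = self.proj(av_feats)
        tgt = self.embed(caption_tokens)
        causal = self.transformer.generate_square_subsequent_mask(
            caption_tokens.size(1)).to(av_feats.device)
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)  # (B, L, vocab_size) token logits


class TimingDetector(nn.Module):
    """1-D CNN over the partial-input states; outputs P(caption is 'ready')."""

    def __init__(self, d_model=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(128, 1),
        )

    def forward(self, states):
        # states: (B, T_partial, d_model) -> (B,) probability that the caption
        # from the partial input already matches the full-video caption
        return torch.sigmoid(self.net(states.transpose(1, 2))).squeeze(-1)
```

In the setup the abstract describes, the captioner would be trained both on ground-truth captions and to mimic the full-video teacher, while the detector is trained to fire once the two captions become sufficiently close; at inference, a caption is emitted as soon as the detector's output crosses a threshold, trading a small quality loss for much lower latency.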