TR2023-152

Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction


Abstract:

Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers. Building upon the achievements of the state-of-the-art (SOTA) time-frequency speaker separation model TF-GridNet, we propose AV-GridNet, a visual-grounded variant that incorporates the face recording of a target speaker as a conditioning factor during the extraction process. Recognizing the inherent dissimilarities between speech and noise signals as interfering sources, we also propose SAV-GridNet, a scenario-aware model that identifies the type of interfering scenario first and then applies a dedicated expert model trained specifically for that scenario. Our proposed model achieves SOTA results on the second COG-MHEAR Audio-Visual Speech Enhancement Challenge, outperforming other models by a significant margin, objectively and in a listening test. We also perform an extensive analysis of the results under the two scenarios.
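
To make the scenario-aware design concrete, the following is a minimal inference sketch, not the authors' released code: a lightweight classifier first decides whether the dominant interference is a competing speaker or background noise, and the mixture together with the target speaker's face recording is then routed to the expert model trained for that scenario. The classifier architecture, tensor shapes, and expert interface below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ScenarioClassifier(nn.Module):
        """Toy stand-in that labels a mixture as speaker- or noise-corrupted."""
        def __init__(self, n_fft=512):
            super().__init__()
            self.n_fft = n_fft
            self.net = nn.Sequential(
                nn.Linear(n_fft // 2 + 1, 128),
                nn.ReLU(),
                nn.Linear(128, 2),  # logits: 0 = competing speaker, 1 = noise
            )

        def forward(self, mixture):  # mixture: (batch, samples)
            window = torch.hann_window(self.n_fft, device=mixture.device)
            spec = torch.stft(mixture, self.n_fft, window=window,
                              return_complex=True).abs()  # (batch, freq, frames)
            return self.net(spec.mean(dim=-1))            # pool over time

    def extract_target(mixture, face_video, classifier, experts):
        """Two-stage inference: detect the scenario, then run that expert."""
        scenario = classifier(mixture).argmax(dim=-1).item()  # assumes batch of 1
        return experts[scenario](mixture, face_video)

    # Placeholders standing in for the two trained expert AV-GridNet models.
    experts = {0: lambda mix, vid: mix, 1: lambda mix, vid: mix}
    mixture = torch.randn(1, 16000)          # 1 s of audio at 16 kHz
    face_video = torch.randn(1, 25, 96, 96)  # 25 face crops of 96x96 pixels
    enhanced = extract_target(mixture, face_video, ScenarioClassifier(), experts)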

 

  • Related News & Events

    •  AWARD: MERL team wins the Audio-Visual Speech Enhancement (AVSE) 2023 Challenge
      Date: December 16, 2023
      Awarded to: Zexu Pan, Gordon Wichern, Yoshiki Masuyama, François Germain, Sameer Khurana, Chiori Hori, and Jonathan Le Roux
      MERL Contacts: François Germain; Chiori Hori; Sameer Khurana; Jonathan Le Roux; Gordon Wichern; Yoshiki Masuyama
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL’s Speech & Audio team ranked 1st out of 12 teams in the 2nd COG-MHEAR Audio-Visual Speech Enhancement Challenge (AVSE). The team was led by Zexu Pan, and also included Gordon Wichern, Yoshiki Masuyama, François Germain, Sameer Khurana, Chiori Hori, and Jonathan Le Roux.

        The AVSE challenge aims to design better speech enhancement systems by harnessing the visual aspects of speech (such as lip movements and gestures) in a manner similar to the brain’s multi-modal integration strategies. MERL’s system was a scenario-aware audio-visual TF-GridNet that incorporates the face recording of a target speaker as a conditioning factor and also recognizes whether the predominant interference signal is speech or background noise. In addition to outperforming all competing systems in terms of objective metrics by a wide margin, MERL’s model achieved the best overall word intelligibility score in a listening test, at 84.54%, compared to 57.56% for the baseline and 80.41% for the next best team. Fisher’s least significant difference (LSD) was 2.14%, indicating that MERL’s model offered statistically significant speech intelligibility improvements over all other systems, as the quick check below illustrates.
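
        As a quick illustration of how the LSD threshold is read (the scores are taken from the listening-test results above; the snippet itself is an illustration, not part of the challenge tooling):

          scores = {"MERL": 84.54, "next best team": 80.41, "baseline": 57.56}
          lsd = 2.14  # Fisher's least significant difference, in points

          # A gap larger than the LSD indicates a statistically significant
          # difference between two systems' word intelligibility scores.
          for name in ("next best team", "baseline"):
              gap = scores["MERL"] - scores[name]
              verdict = "significant" if gap > lsd else "not significant"
              print(f"MERL vs {name}: gap = {gap:.2f} points ({verdict})")

        Both gaps (4.13 points over the next best team and 26.98 points over the baseline) exceed the 2.14-point threshold.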
  • Related Publication

  •  Pan, Z., Wichern, G., Masuyama, Y., Germain, F.G., Khurana, S., Hori, C., Le Roux, J., "Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction", arXiv, October 2023.
    BibTeX:
    @article{Pan2023oct,
      author = {Pan, Zexu and Wichern, Gordon and Masuyama, Yoshiki and Germain, François G. and Khurana, Sameer and Hori, Chiori and Le Roux, Jonathan},
      title = {Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction},
      journal = {arXiv},
      year = 2023,
      month = oct,
      url = {http://arxiv.org/abs/2310.19644}
    }