TR2019-098

Unidirectional Neural Network Architectures for End-to-End Automatic Speech Recognition

- Moritz, N., Hori, T., Le Roux, J., "Unidirectional Neural Network Architectures for End-to-End Automatic Speech Recognition", Interspeech, DOI: 10.21437/Interspeech.2019-2837, September 2019, pp. 76-80.
  BibTeX TR2019-098 PDF
  - @inproceedings{Moritz2019sep,
  - author = {Moritz, Niko and Hori, Takaaki and {Le Roux}, Jonathan},
  - title = {{Unidirectional Neural Network Architectures for End-to-End Automatic Speech Recognition}},
  - booktitle = {Interspeech},
  - year = 2019,
  - pages = {76--80},
  - month = sep,
  - doi = {10.21437/Interspeech.2019-2837},
  - url = {https://www.merl.com/publications/TR2019-098}
  - }
MERL Contact:
- Jonathan
  Le Roux
Research Areas:

Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

In hybrid automatic speech recognition (ASR) systems, neural networks are used as acoustic models (AMs) to recognize phonemes that are composed to words and sentences using pronunciation dictionaries, hidden Markov models, and language models, which can be jointly represented by a weighted finite state transducer (WFST). The importance of capturing temporal context by an AM has been studied and discussed in prior work. In an end-to-end ASR system, however, all components are merged into a single neural network, i.e., the breakdown into an AM and the different parts of the WFST model is no longer possible. This implies that end-to-end neural network architectures have even stronger requirements for processing long contextual information. Bidirectional long short-term memory (BLSTM) neural networks have demonstrated state-of-the-art results in end-to-end ASR but are unsuitable for streaming applications. Latency-controlled BLSTMs account for this by limiting the future context seen by the backward directed recurrence using chunk-wise processing. In this paper, we propose two new unidirectional neural network architectures, the timedelay LSTM (TDLSTM) and the parallel time-delayed LSTM (PTDLSTM) streams, which both limit the processing latency to a fixed size and demonstrate significant improvements compared to prior art on a variety of ASR tasks.

Related News & Events

NEWS MERL Speech & Audio Researchers Presenting 7 Papers and a Tutorial at Interspeech 2019
Date: September 15, 2019 - September 19, 2019
Where: Graz, Austria
MERL Contacts: Chiori Hori; Jonathan Le Roux; Gordon Wichern
Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
Brief
- MERL Speech & Audio Team researchers will be presenting 7 papers at the 20th Annual Conference of the International Speech Communication Association INTERSPEECH 2019, which is being held in Graz, Austria from September 15-19, 2019. Topics to be presented include recent advances in end-to-end speech recognition, speech separation, and audio-visual scene-aware dialog. Takaaki Hori is also co-presenting a tutorial on end-to-end speech processing.
  
  Interspeech is the world's largest and most comprehensive conference on the science and technology of spoken language processing. It gathers around 2000 participants from all over the world.

MERL Contact:

JonathanLe Roux

Research Areas:

Abstract:

Jonathan
Le Roux