TR2019-159

Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models


    •  Moritz, N., Hori, T., Le Roux, J., "Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), December 2019, pp. 936-943.
      @inproceedings{Moritz2019dec,
        author = {Moritz, Niko and Hori, Takaaki and Le Roux, Jonathan},
        title = {Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models},
        booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
        year = 2019,
        pages = {936--943},
        month = dec,
        isbn = {978-1-7281-0305-1},
        url = {https://www.merl.com/publications/TR2019-159}
      }
  • Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

In this paper, we present a one-pass decoding algorithm for streaming recognition with joint connectionist temporal classification (CTC) and attention-based end-to-end automatic speech recognition (ASR) models. The decoding scheme is based on a frame-synchronous CTC prefix beam search algorithm and the recently proposed triggered attention concept. To achieve a fully streaming end-to-end ASR system, the CTC-triggered attention decoder is combined with a unidirectional encoder neural network based on parallel time-delayed long short-term memory (PTDLSTM) streams, which outperformed various other streaming encoder architectures in earlier work. A new type of pre-training method is studied to further improve our streaming ASR models: residual connections are added to the encoder neural network and then removed layer by layer during training. The proposed joint CTC-triggered attention decoding algorithm, which enables streaming recognition with attention-based ASR systems, achieves ASR results similar to those of offline CTC-attention decoding and significantly better than those of CTC prefix beam search decoding alone.
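The abstract builds on frame-synchronous CTC prefix beam search. As a rough illustration of that component only, the following is a minimal sketch of a textbook CTC prefix beam search over per-frame log posteriors. It is not the authors' implementation and leaves out the triggered attention decoder and the joint scoring entirely. The function name `ctc_prefix_beam_search`, its parameters, and the toy input are assumptions made for this example.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) for a few scalars."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_prefix_beam_search(log_probs, beam_size=4, blank=0):
    """Hypothetical helper: frame-synchronous CTC prefix beam search.

    log_probs: list of frames, each a list of log posteriors over labels.
    Returns the best label prefix and its total log probability.
    """
    # Each prefix stores (log p of paths ending in blank,
    #                     log p of paths ending in a non-blank label).
    beam = {(): (0.0, NEG_INF)}
    for frame in log_probs:  # one step per encoder frame
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beam.items():
            for label, lp in enumerate(frame):
                if label == blank:
                    # Blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                    continue
                extended = prefix + (label,)
                b, nb = nxt[extended]
                if prefix and prefix[-1] == label:
                    # A repeated label extends the prefix only after a blank ...
                    nxt[extended] = (b, logsumexp(nb, p_b + lp))
                    # ... otherwise the repeat collapses into the same prefix.
                    b2, nb2 = nxt[prefix]
                    nxt[prefix] = (b2, logsumexp(nb2, p_nb + lp))
                else:
                    nxt[extended] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        # Prune to the best prefixes by total probability.
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beam = dict(ranked[:beam_size])
    best, (p_b, p_nb) = max(beam.items(), key=lambda kv: logsumexp(*kv[1]))
    return list(best), logsumexp(p_b, p_nb)


# Toy input (assumed): two frames, label 0 is blank, label 1 is a character.
toy = [[math.log(0.4), math.log(0.6)], [math.log(0.4), math.log(0.6)]]
best_prefix, best_score = ctc_prefix_beam_search(toy)
```

On the toy input, the prefix `[1]` collects the probability of the three alignments that collapse to it (blank-1, 1-blank, 1-1), about 0.84 in total, while the empty prefix keeps 0.16. Because the search advances one frame at a time and never looks ahead, it fits a streaming setting. In a joint scheme such as the one the abstract describes, attention decoder scores would additionally be combined with these CTC prefix scores.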