TR2017-103

Joint CTC/attention decoding for end-to-end speech recognition


    •  Hori, T., Watanabe, S., Hershey, J.R., "Joint CTC/attention decoding for end-to-end speech recognition", Association for Computational Linguistics (ACL), DOI: 10.18653/v1/P17-1048, July 2017, pp. 518-529.
      @inproceedings{Hori2017jul,
        author = {Hori, Takaaki and Watanabe, Shinji and Hershey, John R.},
        title = {Joint CTC/attention decoding for end-to-end speech recognition},
        booktitle = {Association for Computational Linguistics (ACL)},
        year = 2017,
        pages = {518--529},
        month = jul,
        doi = {10.18653/v1/P17-1048},
        url = {https://www.merl.com/publications/TR2017-103}
      }
  • Research Areas: Artificial Intelligence, Speech & Audio

Abstract:

End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as pronunciation dictionaries, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, while connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively combines the advantages of both approaches in decoding. We applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese) and achieved performance comparable to conventional state-of-the-art DNN/HMM ASR systems, without using linguistic resources.
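
The interpolation at the heart of the joint decoder can be sketched compactly. The Python snippet below is a minimal illustration, not the authors' implementation: joint_score, beam_step, the toy scorers, and the weight value 0.3 are hypothetical stand-ins, assuming the paper's rule of scoring each partial hypothesis by a weighted sum of CTC and attention log-probabilities during beam search.

    # Minimal sketch (assumptions labeled): each hypothesis is scored as
    #   log p(y|x) = lam * log p_ctc(y|x) + (1 - lam) * log p_att(y|x),
    # then beam search keeps the highest joint scores. The scorer callables
    # are hypothetical stand-ins for trained CTC-prefix and attention models.

    def joint_score(log_p_ctc: float, log_p_att: float, lam: float = 0.3) -> float:
        """Interpolate CTC and attention log-probabilities for one hypothesis."""
        return lam * log_p_ctc + (1.0 - lam) * log_p_att

    def beam_step(hyps, candidates, ctc_scorer, att_scorer, lam=0.3, beam=5):
        """Extend each hypothesis by every candidate symbol; keep the best.

        hyps:        list of (symbol_sequence, running_joint_score) pairs
        candidates:  iterable of next-symbol ids to try
        ctc_scorer:  callable(seq) -> log p_ctc(seq | x)   (hypothetical)
        att_scorer:  callable(seq) -> log p_att(seq | x)   (hypothetical)
        """
        extended = []
        for seq, _ in hyps:
            for sym in candidates:
                new_seq = seq + (sym,)
                score = joint_score(ctc_scorer(new_seq), att_scorer(new_seq), lam)
                extended.append((new_seq, score))
        # Prune to the `beam` highest-scoring partial hypotheses.
        extended.sort(key=lambda h: h[1], reverse=True)
        return extended[:beam]

    if __name__ == "__main__":
        # Toy scorers that mildly prefer short sequences ending in symbol 1,
        # just to exercise the mechanics; real scores come from trained models.
        ctc = lambda seq: -0.5 * len(seq) + (0.2 if seq[-1] == 1 else 0.0)
        att = lambda seq: -0.4 * len(seq) + (0.3 if seq[-1] == 1 else 0.0)
        hyps = [((0,), 0.0)]  # start from an <sos>-like symbol
        for _ in range(3):
            hyps = beam_step(hyps, [1, 2, 3], ctc_scorer=ctc, att_scorer=att)
        print(hyps[0])

In the actual system, the CTC term penalizes hypotheses whose alignments deviate from the monotonic frame-to-symbol order, while the attention term supplies richer label-context modeling; the interpolation weight balances the two.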
