TR2017-021

End-to-end ASR without using morphological analyzer, pronunciation dictionary and language model


    •  Watanabe, S., Hori, T., Hayashi, T., Kim, S., "End-to-end ASR without using morphological analyzer, pronunciation dictionary and language model", Acoustical Society of Japan Spring Meeting (ASJ), March 2017.
      BibTeX:
      @inproceedings{Watanabe2017mar2,
        author = {Watanabe, Shinji and Hori, Takaaki and Hayashi, Tomoki and Kim, Suyoun},
        title = {End-to-end ASR without using morphological analyzer, pronunciation dictionary and language model},
        booktitle = {Acoustical Society of Japan Spring Meeting (ASJ)},
        year = 2017,
        month = mar,
        url = {https://www.merl.com/publications/TR2017-021}
      }
  • Research Areas:

    Artificial Intelligence, Speech & Audio

Abstract:

This paper introduces a Japanese end-to-end ASR system based on a joint CTC/attention scheme [1], which extends attention-based ASR [2] by using multi-task learning to incorporate the Connectionist Temporal Classification (CTC) objective. Unlike conventional Japanese ASR systems based on DNN/HMM hybrids [3] or end-to-end systems that use Japanese syllable characters (i.e., hiragana or katakana) [4], this method directly predicts a Japanese sentence using a standard Japanese character set that includes kanji, hiragana, and katakana characters, Roman and Greek letters, Arabic numerals, and so on. Thus, the method does not use a pronunciation dictionary, which must be hand-crafted by humans. In addition, because recognition is character-based, it does not require a morphological analyzer to segment a character sequence into a word sequence. Finally, unlike a Japanese end-to-end system based on CTC [5], the attention mechanism itself provides a language-model-like function in the decoder network. Therefore, the method does not require a separate language model module, which makes system construction and decoding very simple.
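
To make two ideas from the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It illustrates (a) building output targets directly from characters, so no morphological analyzer or pronunciation dictionary is involved, and (b) combining a CTC loss and an attention loss in a multi-task objective. The abstract does not give the combination formula, so the linear interpolation with a weight `ctc_weight` is an assumption, as are the function names, the example transcripts, and the convention of reserving id 0 for the CTC blank symbol.

```python
# Illustrative sketch only: all names, values, and conventions below are
# assumptions made for this example, not details taken from the paper.


def build_char_vocab(transcripts):
    """Map every distinct character in the transcripts to an integer id.

    Id 0 is left unused here so it can serve as the CTC blank symbol
    (an assumed convention).
    """
    chars = sorted({ch for text in transcripts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Turn a sentence into character ids; no word segmentation is needed."""
    return [vocab[ch] for ch in text]


def joint_loss(ctc_loss, attention_loss, ctc_weight=0.5):
    """Interpolate the two objectives with a weight in [0, 1] (assumed form)."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Hypothetical transcripts mixing kanji and hiragana.
transcripts = ["音声認識", "認識する"]
vocab = build_char_vocab(transcripts)
ids = encode("音声認識", vocab)

# Placeholder loss values standing in for the outputs of the two branches.
total = joint_loss(ctc_loss=2.0, attention_loss=4.0, ctc_weight=0.25)
```

In this sketch, setting `ctc_weight` to 0 leaves only the attention loss, and setting it to 1 leaves only the CTC loss. In a real system the two loss values would come from the CTC and attention branches of a shared encoder rather than from constants.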