E2E SPEECH RECOGNITION WITH CTC AND LOCAL ATTENTION

Chen, Jiahao; Nishimura, Ryota; Kitaoka, Norihide

oai:lib.tokushima-u.ac.jp:repository/115877

Authors: Jiahao Chen, Ryota Nishimura, Norihide Kitaoka
Publication date: 11 June 2021
Publisher: Cambridge University Press (CUP)

Abstract

Many end-to-end, large-vocabulary continuous speech recognition systems now achieve better recognition performance than conventional systems. However, most of these approaches are based on bidirectional networks and sequence-to-sequence modeling, so automatic speech recognition (ASR) systems using such techniques must wait for an entire segment of voice input before they can begin processing it. The resulting time lag can be a serious drawback in some applications. An obvious solution to this problem is to develop a speech recognition algorithm capable of processing streaming data. In this paper we therefore explore the possibility of a streaming, online ASR system for Japanese, using a model based on unidirectional LSTMs trained with the connectionist temporal classification (CTC) criterion and combined with local attention. Such an approach has not been well investigated for Japanese, as most Japanese-language ASR systems employ bidirectional networks. The best result for our proposed system during experimental evaluation was a character error rate of 9.87%.
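
Two ideas in the abstract lend themselves to a small illustration: a CTC output can be decoded frame by frame without waiting for the end of the utterance, and the reported figure is a character error rate. The sketch below is not the authors' model; it omits the unidirectional LSTM and the local attention entirely and assumes that per-frame label scores are already available. The names `streaming_ctc_greedy`, `character_error_rate`, `VOCAB`, and the toy frames are hypothetical, and the blank label is assumed to have index 0.

```python
from typing import Iterable, Iterator, Sequence

BLANK = 0  # assumption: index 0 is the CTC blank label


def streaming_ctc_greedy(
    frames: Iterable[Sequence[float]], blank: int = BLANK
) -> Iterator[int]:
    """Greedy CTC decoding that emits each label as soon as it is decided.

    Only the previous frame's best label is remembered, so no future
    context is needed. This is what makes the decoding streamable.
    """
    prev = blank
    for scores in frames:
        best = max(range(len(scores)), key=scores.__getitem__)
        # CTC collapse rule: drop blanks and repeats of the previous frame.
        if best != blank and best != prev:
            yield best
        prev = best


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                         # deletion
                current[j - 1] + 1,                      # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(reference)


# Toy example: one-hot scores standing in for real network outputs.
VOCAB = ["<blank>", "こ", "ん", "に", "ち", "は"]
path = [1, 1, 0, 2, 3, 3, 0, 4, 5]  # per-frame best labels, with repeats and blanks
frames = [[1.0 if k == label else 0.0 for k in range(len(VOCAB))] for label in path]

hypothesis = "".join(VOCAB[i] for i in streaming_ctc_greedy(frames))
# hypothesis == "こんにちは"
# character_error_rate("こんばんは", hypothesis) == 0.4  (2 substitutions over 5 characters)
```

Because the decoder is a generator, each character can be shown to the user as soon as its frame arrives. A bidirectional model cannot do this, since its output for any frame depends on the whole utterance.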

Last time updated on 08/09/2021

This paper was published in Tokushima University Institutional Repository.