Segment boundary detection directed attention for online end-to-end speech recognition

Abstract: Attention-based encoder-decoder models have recently shown competitive performance for automatic speech recognition (ASR) compared to conventional ASR systems. However, how to employ attention models for online speech recognition still needs to be explored. Different from conventional atten...

Bibliographic Details
Main Authors: Junfeng Hou, Wu Guo, Yan Song, Li-Rong Dai
Format: Article
Language: English
Published: SpringerOpen 2020-01-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
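Subjects: Encoder-decoder; Online recognition; Boundary detection; Attention mechanism; Reinforcement learning; Policy gradient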
Online Access: https://doi.org/10.1186/s13636-020-0170-z
id doaj-b3a603727b4349f38c534610797ae686
record_format Article
spelling doaj-b3a603727b4349f38c534610797ae686; 2021-01-31T16:11:04Z; eng; SpringerOpen; EURASIP Journal on Audio, Speech, and Music Processing; 1687-4722; 2020-01-01; 2020; 1; 1; 16; 10.1186/s13636-020-0170-z; Segment boundary detection directed attention for online end-to-end speech recognition; Junfeng Hou, Wu Guo, Yan Song, Li-Rong Dai (each: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China); https://doi.org/10.1186/s13636-020-0170-z; Encoder-decoder; Online recognition; Boundary detection; Attention mechanism; Reinforcement learning; Policy gradient
collection DOAJ
language English
format Article
sources DOAJ
author Junfeng Hou
Wu Guo
Yan Song
Li-Rong Dai
title Segment boundary detection directed attention for online end-to-end speech recognition
publisher SpringerOpen
series EURASIP Journal on Audio, Speech, and Music Processing
issn 1687-4722
publishDate 2020-01-01
description Abstract: Attention-based encoder-decoder models have recently shown competitive performance for automatic speech recognition (ASR) compared to conventional ASR systems. However, how to employ attention models for online speech recognition still needs to be explored. Unlike conventional attention models, in which the soft alignment is obtained by a pass over the entire input sequence, attention models for online recognition must learn an online alignment that attends to part of the input sequence monotonically when generating output symbols. Based on the fact that every output symbol corresponds to a segment of the input sequence, we propose a new attention mechanism for learning online alignment by decomposing the conventional alignment into two parts: segmentation (segment boundary detection with a hard decision) and segment-directed attention (information aggregation within the segment with soft attention). Boundary detection is conducted along the time axis from left to right, and a decision is made for each input frame about whether it is a segment boundary. When a boundary is detected, the decoder generates an output symbol by attending to the inputs within the corresponding segment. With the proposed attention mechanism, online speech recognition can be realized. Experimental results on the TIMIT and WSJ datasets show that the proposed attention mechanism achieves online performance comparable to that of state-of-the-art models.
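The two-part alignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a hand-written boundary test (`is_boundary`), a fixed `query` vector, and dot-product scoring, whereas the paper's detector and attention are learned. All function and parameter names here are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def segment_attention(segment: Sequence[Vector], query: Vector) -> List[float]:
    """Soft attention restricted to one segment (assumed dot-product scoring)."""
    scores = [sum(x * q for x, q in zip(frame, query)) for frame in segment]
    weights = softmax(scores)
    dim = len(segment[0])
    # Context vector: attention-weighted average of the frames in the segment.
    return [sum(w * frame[d] for w, frame in zip(weights, segment))
            for d in range(dim)]


def online_decode(
    frames: Sequence[Vector],
    is_boundary: Callable[[Vector], bool],
    query: Vector,
) -> List[Tuple[Tuple[int, int], List[float]]]:
    """Scan frames left to right; emit one context vector per detected segment."""
    outputs = []
    start = 0
    for t, frame in enumerate(frames):
        # Hard decision per frame: is this frame a segment boundary?
        if is_boundary(frame):
            segment = frames[start:t + 1]
            outputs.append(((start, t), segment_attention(segment, query)))
            start = t + 1
    # Frames after the last boundary are not emitted: no boundary seen yet.
    return outputs


if __name__ == "__main__":
    # Toy input: the second feature acts as a stand-in boundary indicator.
    toy_frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0], [5.0, 1.0]]
    for span, context in online_decode(toy_frames, lambda f: f[1] > 0.5, [0.0, 0.0]):
        print(span, context)
```

Because each context vector depends only on frames up to the detected boundary, the loop never looks ahead, which is what makes the scheme usable online. In a real decoder the query would depend on the decoder state rather than being fixed.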
topic Encoder-decoder
Online recognition
Boundary detection
Attention mechanism
Reinforcement learning
Policy gradient
url https://doi.org/10.1186/s13636-020-0170-z