End-to-End Speech Recognition Sequence Training With Reinforcement Learning

End-to-end sequence modeling has become a popular choice for automatic speech recognition (ASR) because of its simpler pipeline compared to conventional systems and its excellent performance. However, end-to-end ASR model training has several drawbacks: the prediction at the current time step on the target side is conditioned on the ground-truth transcription and the speech features. In the inference stage the conditions differ, because the model has no access to the ground-truth target sequence, so any mistake may accumulate and degrade the decoding result over time. Another issue arises from the discrepancy between the training and evaluation objectives: in the training stage the maximum likelihood estimation criterion is used as the objective function, whereas ASR quality is evaluated by the word error rate, computed via the Levenshtein distance. We therefore present an alternative way of optimizing an end-to-end ASR model with a reinforcement learning method called policy gradient. A model trained with the proposed approach has several advantages: (1) it simulates the inference stage by a free-running sampling process and uses its own samples as input, and (2) it is optimized with a reward function correlated with the ASR evaluation metric (e.g., the negative Levenshtein distance). Based on our experimental results, the proposed method significantly improves performance compared to a model trained only with teacher forcing and the maximum likelihood objective function.

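The training objective described in the abstract can be summarized as follows: instead of teacher forcing, the model samples a transcription from its own output distribution, the sample is scored with a reward correlated with the evaluation metric (the negative Levenshtein distance to the reference), and the model is updated with the policy gradient. The sketch below is a minimal illustration of that REINFORCE-style objective, not the authors' implementation; the model hooks init_state, step, sos_id, and eos_id are hypothetical placeholders for an attention-based encoder-decoder.

import torch

def edit_distance(hyp, ref):
    # Levenshtein distance by dynamic programming over two token sequences.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[len(hyp)][len(ref)]

def policy_gradient_loss(model, speech_feats, ref_tokens, max_len=200):
    # Free-running sampling: the decoder consumes its own previous sample,
    # mirroring the inference stage rather than the ground-truth transcription.
    log_probs, sampled = [], []
    state = model.init_state(speech_feats)        # hypothetical encoder/decoder hook
    token = torch.tensor(model.sos_id)            # model.sos_id / eos_id are assumed attributes
    for _ in range(max_len):
        logits, state = model.step(token, state)  # hypothetical one-step decode
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()
        log_probs.append(dist.log_prob(token))
        sampled.append(int(token))
        if int(token) == model.eos_id:
            break
    # Reward correlated with the ASR metric: negative Levenshtein distance
    # between the sampled transcription and the reference token sequence.
    reward = -float(edit_distance(sampled, ref_tokens))
    # REINFORCE: minimizing this loss maximizes the expected reward.
    return -reward * torch.stack(log_probs).sum()

In practice a variance-reducing baseline is usually subtracted from the reward, and the policy gradient term is combined with the standard maximum likelihood loss; see the paper itself for the exact formulation.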

Bibliographic Details
Main Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Format: Article
Language: English
Published: IEEE, 2019-01-01
Series: IEEE Access
Subjects: End-to-end sequence model; speech recognition; policy gradient optimization; reinforcement learning
Online Access: https://ieeexplore.ieee.org/document/8735756/
id doaj-70855970c18b40568b9aa2877c43bc09
record_format Article
spelling doaj-70855970c18b40568b9aa2877c43bc09 (indexed 2021-03-30T00:09:57Z)
Title: End-to-End Speech Recognition Sequence Training With Reinforcement Learning
Authors: Andros Tjandra (https://orcid.org/0000-0003-1246-5908), Sakriani Sakti, Satoshi Nakamura; all with Nara Institute of Science and Technology, Nara, Japan
Journal: IEEE Access (IEEE), ISSN 2169-3536, 2019-01-01, vol. 7, pp. 79758-79769
Language: English
DOI: 10.1109/ACCESS.2019.2922617; IEEE Xplore article no. 8735756
Abstract: as given in the description field below
URL: https://ieeexplore.ieee.org/document/8735756/
Keywords: End-to-end sequence model; speech recognition; policy gradient optimization; reinforcement learning
collection DOAJ
language English
format Article
sources DOAJ
author Andros Tjandra
Sakriani Sakti
Satoshi Nakamura
spellingShingle Andros Tjandra
Sakriani Sakti
Satoshi Nakamura
End-to-End Speech Recognition Sequence Training With Reinforcement Learning
IEEE Access
End-to-end sequence model
speech recognition
policy gradient optimization
reinforcement learning
author_facet Andros Tjandra
Sakriani Sakti
Satoshi Nakamura
author_sort Andros Tjandra
title End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_short End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_full End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_fullStr End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_full_unstemmed End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_sort end-to-end speech recognition sequence training with reinforcement learning
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description End-to-end sequence modeling has become a popular choice for automatic speech recognition (ASR) because of its simpler pipeline compared to conventional systems and its excellent performance. However, end-to-end ASR model training has several drawbacks: the prediction at the current time step on the target side is conditioned on the ground-truth transcription and the speech features. In the inference stage the conditions differ, because the model has no access to the ground-truth target sequence, so any mistake may accumulate and degrade the decoding result over time. Another issue arises from the discrepancy between the training and evaluation objectives: in the training stage the maximum likelihood estimation criterion is used as the objective function, whereas ASR quality is evaluated by the word error rate, computed via the Levenshtein distance. We therefore present an alternative way of optimizing an end-to-end ASR model with a reinforcement learning method called policy gradient. A model trained with the proposed approach has several advantages: (1) it simulates the inference stage by a free-running sampling process and uses its own samples as input, and (2) it is optimized with a reward function correlated with the ASR evaluation metric (e.g., the negative Levenshtein distance). Based on our experimental results, the proposed method significantly improves performance compared to a model trained only with teacher forcing and the maximum likelihood objective function.
topic End-to-end sequence model
speech recognition
policy gradient optimization
reinforcement learning
url https://ieeexplore.ieee.org/document/8735756/
work_keys_str_mv AT androstjandra endtoendspeechrecognitionsequencetrainingwithreinforcementlearning
AT sakrianisakti endtoendspeechrecognitionsequencetrainingwithreinforcementlearning
AT satoshinakamura endtoendspeechrecognitionsequencetrainingwithreinforcementlearning
_version_ 1724188555648958464