End-to-End Speech Recognition Sequence Training With Reinforcement Learning

End-to-end sequence modeling has become a popular choice for automatic speech recognition (ASR) because of its simpler pipeline compared to conventional systems and its excellent performance. However, end-to-end ASR model training has several drawbacks: the prediction at the current time step on the target side is conditioned on the ground-truth transcription and the speech features. In the inference stage the conditions differ, because the model has no access to the ground-truth target sequence, so any mistake may accumulate and degrade the decoding result over time. Another issue arises from the discrepancy between the training and evaluation objectives: in the training stage the maximum likelihood estimation criterion is used as the objective function, whereas ASR quality is evaluated by the word error rate, computed via the Levenshtein distance. We therefore present an alternative way of optimizing an end-to-end ASR model with a reinforcement learning method called policy gradient. A model trained with the proposed approach has several advantages: (1) it simulates the inference stage by a free-running sampling process and uses its own samples as input, and (2) it is optimized with a reward function correlated with the ASR evaluation metric (e.g., the negative Levenshtein distance). Based on our experimental results, the proposed method significantly improves performance compared to a model trained only with teacher forcing and the maximum likelihood objective function.

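The training objective described in the abstract can be summarized as follows: instead of teacher forcing, the model samples a transcription from its own output distribution, the sample is scored with a reward correlated with the evaluation metric (the negative Levenshtein distance to the reference), and the model is updated with the policy gradient. The sketch below is a minimal illustration of that REINFORCE-style objective, not the authors' implementation; the model hooks init_state, step, sos_id, and eos_id are hypothetical placeholders for an attention-based encoder-decoder.

import torch

def edit_distance(hyp, ref):
    # Levenshtein distance by dynamic programming over two token sequences.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[len(hyp)][len(ref)]

def policy_gradient_loss(model, speech_feats, ref_tokens, max_len=200):
    # Free-running sampling: the decoder consumes its own previous sample,
    # mirroring the inference stage rather than the ground-truth transcription.
    log_probs, sampled = [], []
    state = model.init_state(speech_feats)        # hypothetical encoder/decoder hook
    token = torch.tensor(model.sos_id)            # model.sos_id / eos_id are assumed attributes
    for _ in range(max_len):
        logits, state = model.step(token, state)  # hypothetical one-step decode
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()
        log_probs.append(dist.log_prob(token))
        sampled.append(int(token))
        if int(token) == model.eos_id:
            break
    # Reward correlated with the ASR metric: negative Levenshtein distance
    # between the sampled transcription and the reference token sequence.
    reward = -float(edit_distance(sampled, ref_tokens))
    # REINFORCE: minimizing this loss maximizes the expected reward.
    return -reward * torch.stack(log_probs).sum()

In practice a variance-reducing baseline is usually subtracted from the reward, and the policy gradient term is combined with the standard maximum likelihood loss; see the paper itself for the exact formulation.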

Bibliographic Details
Main Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Format: Article
Language: English
Published: IEEE, 2019-01-01
Series: IEEE Access
Subjects: End-to-end sequence model; speech recognition; policy gradient optimization; reinforcement learning
Online Access: https://ieeexplore.ieee.org/document/8735756/
id doaj-70855970c18b40568b9aa2877c43bc09
record_format Article
spelling doaj-70855970c18b40568b9aa2877c43bc09 (indexed 2021-03-30T00:09:57Z)
Title: End-to-End Speech Recognition Sequence Training With Reinforcement Learning
Authors: Andros Tjandra (https://orcid.org/0000-0003-1246-5908), Sakriani Sakti, Satoshi Nakamura; all with Nara Institute of Science and Technology, Nara, Japan
Journal: IEEE Access (IEEE), ISSN 2169-3536, 2019-01-01, vol. 7, pp. 79758-79769
Language: English
DOI: 10.1109/ACCESS.2019.2922617; IEEE Xplore article no. 8735756
Abstract: as given in the description field below
URL: https://ieeexplore.ieee.org/document/8735756/
Keywords: End-to-end sequence model; speech recognition; policy gradient optimization; reinforcement learning
collection DOAJ
language English
format Article
sources DOAJ
author Andros Tjandra
Sakriani Sakti
Satoshi Nakamura
spellingShingle Andros Tjandra
Sakriani Sakti
Satoshi Nakamura
End-to-End Speech Recognition Sequence Training With Reinforcement Learning
IEEE Access
End-to-end sequence model
speech recognition
policy gradient optimization
reinforcement learning
author_facet Andros Tjandra
Sakriani Sakti
Satoshi Nakamura
author_sort Andros Tjandra
title End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_short End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_full End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_fullStr End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_full_unstemmed End-to-End Speech Recognition Sequence Training With Reinforcement Learning
title_sort end-to-end speech recognition sequence training with reinforcement learning
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description End-to-end sequence modeling has become a popular choice for automatic speech recognition (ASR) because of its simpler pipeline compared to conventional systems and its excellent performance. However, end-to-end ASR model training has several drawbacks: the prediction at the current time step on the target side is conditioned on the ground-truth transcription and the speech features. In the inference stage the conditions differ, because the model has no access to the ground-truth target sequence, so any mistake may accumulate and degrade the decoding result over time. Another issue arises from the discrepancy between the training and evaluation objectives: in the training stage the maximum likelihood estimation criterion is used as the objective function, whereas ASR quality is evaluated by the word error rate, computed via the Levenshtein distance. We therefore present an alternative way of optimizing an end-to-end ASR model with a reinforcement learning method called policy gradient. A model trained with the proposed approach has several advantages: (1) it simulates the inference stage by a free-running sampling process and uses its own samples as input, and (2) it is optimized with a reward function correlated with the ASR evaluation metric (e.g., the negative Levenshtein distance). Based on our experimental results, the proposed method significantly improves performance compared to a model trained only with teacher forcing and the maximum likelihood objective function.
topic End-to-end sequence model
speech recognition
policy gradient optimization
reinforcement learning
url https://ieeexplore.ieee.org/document/8735756/
work_keys_str_mv AT androstjandra endtoendspeechrecognitionsequencetrainingwithreinforcementlearning
AT sakrianisakti endtoendspeechrecognitionsequencetrainingwithreinforcementlearning
AT satoshinakamura endtoendspeechrecognitionsequencetrainingwithreinforcementlearning
_version_ 1724188555648958464