End-to-End Speech Recognition Sequence Training With Reinforcement Learning
End-to-end sequence modeling has become a popular choice for automatic speech recognition (ASR) because of its simpler pipeline compared to the conventional system and its excellent performance. However, end-to-end ASR model training has several drawbacks: the current time-step prediction on the target side is conditioned on the ground-truth transcription and the speech features. At inference the condition is different, because the model has no access to the ground-truth target sequence, so any mistakes may accumulate and degrade the decoding result over time. Another issue arises from the discrepancy between the training and evaluation objectives: training uses the maximum likelihood estimation criterion as the objective function, whereas ASR system quality is evaluated by word error rate via Levenshtein distance. Therefore, we present an alternative for optimizing the end-to-end ASR model with a reinforcement learning method called policy gradient. A model trained with the proposed approach has several advantages: (1) it simulates the inference stage by a free-sampling process and uses its own samples as input; and (2) it is optimized with a reward function correlated with the ASR evaluation metric (e.g., negative Levenshtein distance). Based on our experimental results, the proposed method significantly improves model performance compared to a model trained only with teacher forcing and the maximum likelihood objective function.
Main Authors: | Andros Tjandra; Sakriani Sakti; Satoshi Nakamura |
Format: | Article |
Language: | English |
Published: | IEEE, 2019-01-01 |
Series: | IEEE Access |
Subjects: | End-to-end sequence model; speech recognition; policy gradient optimization; reinforcement learning |
Online Access: | https://ieeexplore.ieee.org/document/8735756/ |
id |
doaj-70855970c18b40568b9aa2877c43bc09 |
record_format |
Article |
spelling |
doaj-70855970c18b40568b9aa2877c43bc09 | 2021-03-30T00:09:57Z | eng | IEEE | IEEE Access | ISSN 2169-3536 | 2019-01-01 | Vol. 7, pp. 79758-79769 | DOI 10.1109/ACCESS.2019.2922617 | Article 8735756
End-to-End Speech Recognition Sequence Training With Reinforcement Learning
Andros Tjandra (https://orcid.org/0000-0003-1246-5908), Sakriani Sakti, Satoshi Nakamura; Nara Institute of Science and Technology, Nara, Japan
End-to-end sequence modeling has become a popular choice for automatic speech recognition (ASR) because of its simpler pipeline compared to the conventional system and its excellent performance. However, end-to-end ASR model training has several drawbacks: the current time-step prediction on the target side is conditioned on the ground-truth transcription and the speech features. At inference the condition is different, because the model has no access to the ground-truth target sequence, so any mistakes may accumulate and degrade the decoding result over time. Another issue arises from the discrepancy between the training and evaluation objectives: training uses the maximum likelihood estimation criterion as the objective function, whereas ASR system quality is evaluated by word error rate via Levenshtein distance. Therefore, we present an alternative for optimizing the end-to-end ASR model with a reinforcement learning method called policy gradient. A model trained with the proposed approach has several advantages: (1) it simulates the inference stage by a free-sampling process and uses its own samples as input; and (2) it is optimized with a reward function correlated with the ASR evaluation metric (e.g., negative Levenshtein distance). Based on our experimental results, the proposed method significantly improves model performance compared to a model trained only with teacher forcing and the maximum likelihood objective function.
https://ieeexplore.ieee.org/document/8735756/
Keywords: End-to-end sequence model; speech recognition; policy gradient optimization; reinforcement learning |
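The abstract's training idea can be illustrated with a small, self-contained sketch. This is a toy under stated assumptions (a pure-Python categorical policy over fixed per-step logits, REINFORCE without a baseline), not the paper's implementation: it samples an output sequence from the model's own distribution instead of teacher forcing, scores the sample with the reward the abstract names (negative Levenshtein distance to the reference), and weights the log-probability gradient by that reward.

```python
# Toy sketch of policy-gradient (REINFORCE) sequence training with a
# negative-Levenshtein-distance reward. The "model" here is just a list of
# per-step logits; all names are illustrative assumptions, not the paper's API.
import math
import random

def levenshtein(a, b):
    """Edit distance between two token sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_episode(logits_per_step, rng):
    """Free-running sampling: draw one token per step from the policy itself."""
    tokens, log_prob_grads = [], []
    for logits in logits_per_step:
        probs = softmax(logits)
        tok = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(tok)
        # d log p(tok) / d logits_k = 1[k == tok] - probs[k]
        log_prob_grads.append([(1.0 if k == tok else 0.0) - p
                               for k, p in enumerate(probs)])
    return tokens, log_prob_grads

def reinforce_gradient(logits_per_step, reference, rng, n_samples=32):
    """Monte Carlo policy-gradient estimate with reward
    R(sample) = -levenshtein(sample, reference)."""
    grad = [[0.0] * len(l) for l in logits_per_step]
    for _ in range(n_samples):
        tokens, lp_grads = sample_episode(logits_per_step, rng)
        reward = -levenshtein(tokens, reference)
        for t, g in enumerate(lp_grads):
            for k, gk in enumerate(g):
                grad[t][k] += reward * gk / n_samples
    return grad
```

In a real system the per-step logits would come from an attention-based decoder conditioned on the model's own previous samples, and a baseline (e.g., the mean reward of the batch) is typically subtracted from the reward to reduce the variance of the gradient estimate.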
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Andros Tjandra; Sakriani Sakti; Satoshi Nakamura |
author_sort |
Andros Tjandra |
title |
End-to-End Speech Recognition Sequence Training With Reinforcement Learning |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2019-01-01 |
topic |
End-to-end sequence model; speech recognition; policy gradient optimization; reinforcement learning |
url |
https://ieeexplore.ieee.org/document/8735756/ |
work_keys_str_mv |
AT androstjandra endtoendspeechrecognitionsequencetrainingwithreinforcementlearning AT sakrianisakti endtoendspeechrecognitionsequencetrainingwithreinforcementlearning AT satoshinakamura endtoendspeechrecognitionsequencetrainingwithreinforcementlearning |
_version_ |
1724188555648958464 |