End-to-End Speech Recognition Sequence Training With Reinforcement Learning


Bibliographic Details
Main Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/8735756/
Description
Summary: End-to-end sequence modeling has become a popular choice for automatic speech recognition (ASR) because of its simpler pipeline compared to conventional systems and its excellent performance. However, there are several drawbacks in end-to-end ASR model training, where the prediction at the current time step is conditioned on the ground-truth transcription and the speech features. In the inference stage, the condition is different because the model has no access to the ground-truth target sequence, so any mistake may accumulate and degrade the decoding result over time. Another issue arises from the discrepancy between the training and evaluation objectives. In the training stage, the maximum likelihood estimation criterion is used as the objective function, whereas the quality of an ASR system is evaluated by the word error rate via the Levenshtein distance. Therefore, we present an alternative for optimizing the end-to-end ASR model with a reinforcement learning method called policy gradient. A model trained with the proposed approach has several advantages: (1) it simulates the inference stage through a free-sampling process and uses its own samples as input, and (2) it is optimized with a reward function correlated with the ASR evaluation metric (e.g., negative Levenshtein distance). Based on our experimental results, the proposed method significantly improves model performance compared to a model trained only with teacher forcing and the maximum likelihood objective function.
ISSN:2169-3536
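
The summary above describes policy-gradient (REINFORCE-style) sequence training with a negative Levenshtein-distance reward. The following is a minimal sketch of that idea, not the authors' implementation; the function names, tensor shapes, and the PyTorch setting are assumptions made only for illustration, and practical systems typically add a baseline to reduce gradient variance.

```python
# Minimal sketch (assumed PyTorch setup) of REINFORCE-style sequence training
# with a negative Levenshtein-distance reward, as outlined in the abstract.
import torch
import torch.nn.functional as F

def levenshtein(a, b):
    """Edit distance between two token sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            cur = min(dp[j] + 1,          # deletion
                      dp[j - 1] + 1,      # insertion
                      prev + (x != y))    # substitution
            prev, dp[j] = dp[j], cur
    return dp[-1]

def policy_gradient_loss(logits, sampled, reference):
    """REINFORCE loss for one utterance (hypothetical helper).

    logits:    (T, V) decoder outputs obtained while feeding the model's
               own sampled tokens back as input (simulating inference)
    sampled:   (T,)   token ids drawn from the softmax at each step
    reference: list of ground-truth token ids, used only to compute the reward
    """
    log_probs = F.log_softmax(logits, dim=-1)
    sample_log_prob = log_probs[torch.arange(len(sampled)), sampled].sum()
    reward = -levenshtein(sampled.tolist(), reference)  # higher is better
    # Minimizing -reward * log p(sample) ascends the expected reward.
    return -reward * sample_log_prob
```

In use, `logits` and `sampled` would come from running the decoder in free-sampling mode on a training utterance, and the returned loss would be backpropagated alongside (or instead of) the usual maximum likelihood loss; the exact weighting between the two objectives is a design choice not specified by the summary.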