Deep Factorized and Variational Learning for Speech Recognition


Bibliographic Details
Main Authors: Shen, Chen, 沈辰
Other Authors: Chien, Jen-Tzung
Format: Others
Language: en_US
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/03158094775165006006
id ndltd-TW-105NCTU5435021
record_format oai_dc
spelling ndltd-TW-105NCTU5435021 2017-09-06T04:22:27Z http://ndltd.ncl.edu.tw/handle/03158094775165006006 Deep Factorized and Variational Learning for Speech Recognition 深層分解及變異學習於語音辨識之研究 Shen, Chen 沈辰 Master's === National Chiao Tung University === Institute of Communications Engineering === 105 === The deep neural network (DNN) has been recognized as a new trend in modern speech recognition. Many extensions and realizations have been developed to further improve system performance and to discover meaningful insights from different perspectives. This study aims to explore the structural information in multi-way speech observations and to incorporate a stochastic point of view into representation learning. We present factorized and variational learning for speech recognition based on the recurrent neural network (RNN), where the hidden states from neighboring time steps are merged as a memory for cyclic and temporal modeling. This RNN model is also extended to the long short-term memory (LSTM), where a number of gates and cells are introduced to capture long-term dependencies. The LSTM also alleviates the exploding and vanishing gradient problems in RNN training. Two new types of models are proposed: the matrix-factorized neural network (MFNN) and the variational recurrent neural network (VRNN). First, we address the limited capability of the conventional DNN caused by the loss of contextual correlation in the temporal and spatial horizons when the time-frequency observation matrices are unfolded into vector-based inputs. The MFNN is a generalization of the vector-based neural network (NN) which performs matrix factorization and nonlinear activation on the input matrices in the layer-wise forward computation. The affine transformation in the NN is replaced by a Tucker decomposition in the MFNN. This calculation not only preserves the spatial information in the frequency domain but also extracts the temporal patterns in the time domain.
In this study, a deep model based on the MFNN is built by cascading a number of factorization layers with fully-connected layers before connecting to the softmax outputs for speech recognition. This model is further extended to the matrix-factorized LSTM, where the multiplications in the input gate and output gate are replaced by Tucker decompositions. Multiple acoustic features are also treated as a tensor input to realize the tensor-factorized NN (TFNN). On the other hand, we propose a VRNN for acoustic modeling, which is seen as a stochastic realization of the RNN. By reflecting the stochastic nature of the RNN, we improve both the representation capability and the speech recognition performance. To do so, we conduct variational inference for a latent variable model based on the RNN. Motivated by the variational auto-encoder (VAE), we develop a new type of stochastic back-propagation algorithm in which a sampling method is used for efficient implementation and approximation during training. In this recurrent VAE, we introduce class targets and optimize the variational lower bound for the supervised RNN, which is composed of two parts. One is the Kullback-Leibler divergence between the posterior distribution and the variational distribution of the latent variables. The other is the cross-entropy error function between network outputs and class targets. Beyond the traditional RNN, the proposed VRNN characterizes the dependencies between latent variables across subsequent time steps. In the experiments, we implement the proposed methods using the open-source Kaldi toolkit. The multi-way feature extraction ability of the MFNN is illustrated by showing the behavior of hidden neurons due to the factor weights in the individual ways. Word error rates (WER) for speech recognition on TIMIT and Aurora-4 are reported. Chien, Jen-Tzung 簡仁宗 2016 degree thesis ; thesis 133 en_US
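The factorization layer described in the abstract can be sketched as follows: the affine map on a flattened vector is replaced by a Tucker-2 style bilinear map on the time-frequency matrix, so the two-way structure of the input survives the forward pass. This is a minimal illustrative sketch, not the thesis implementation; all shapes, names, and the choice of tanh activation are assumptions.

```python
import numpy as np

def mfnn_layer(X, U, V, B):
    """One factorization layer of a matrix-factorized NN (illustrative sketch).

    Instead of W @ vec(X) + b on a flattened input, apply a Tucker-2 style
    bilinear map U^T X V + B to the time-frequency matrix X, preserving the
    frequency (spatial) and time (temporal) modes. Shapes are assumptions:
      X: (F, T)  input feature matrix (frequency x time)
      U: (F, F') frequency-mode factor matrix
      V: (T, T') time-mode factor matrix
      B: (F', T') bias matrix
    """
    Z = U.T @ X @ V + B   # bilinear (Tucker-2) transformation
    return np.tanh(Z)     # nonlinear activation, as in a dense layer

# Toy usage: a hypothetical 40x11 log-mel patch mapped to a 16x4 hidden matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 11))
U = rng.standard_normal((40, 16)) * 0.1
V = rng.standard_normal((11, 4)) * 0.1
B = np.zeros((16, 4))
H = mfnn_layer(X, U, V, B)
print(H.shape)  # (16, 4)
```

Note the parameter saving: the flattened affine map would need a (16*4) x (40*11) weight matrix, while the two factor matrices here have only 40*16 + 11*4 entries.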
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Chiao Tung University === Institute of Communications Engineering === 105 === The deep neural network (DNN) has been recognized as a new trend in modern speech recognition. Many extensions and realizations have been developed to further improve system performance and to discover meaningful insights from different perspectives. This study aims to explore the structural information in multi-way speech observations and to incorporate a stochastic point of view into representation learning. We present factorized and variational learning for speech recognition based on the recurrent neural network (RNN), where the hidden states from neighboring time steps are merged as a memory for cyclic and temporal modeling. This RNN model is also extended to the long short-term memory (LSTM), where a number of gates and cells are introduced to capture long-term dependencies. The LSTM also alleviates the exploding and vanishing gradient problems in RNN training. Two new types of models are proposed: the matrix-factorized neural network (MFNN) and the variational recurrent neural network (VRNN). First, we address the limited capability of the conventional DNN caused by the loss of contextual correlation in the temporal and spatial horizons when the time-frequency observation matrices are unfolded into vector-based inputs. The MFNN is a generalization of the vector-based neural network (NN) which performs matrix factorization and nonlinear activation on the input matrices in the layer-wise forward computation. The affine transformation in the NN is replaced by a Tucker decomposition in the MFNN. This calculation not only preserves the spatial information in the frequency domain but also extracts the temporal patterns in the time domain. In this study, a deep model based on the MFNN is built by cascading a number of factorization layers with fully-connected layers before connecting to the softmax outputs for speech recognition.
This model is further extended to the matrix-factorized LSTM, where the multiplications in the input gate and output gate are replaced by Tucker decompositions. Multiple acoustic features are also treated as a tensor input to realize the tensor-factorized NN (TFNN). On the other hand, we propose a VRNN for acoustic modeling, which is seen as a stochastic realization of the RNN. By reflecting the stochastic nature of the RNN, we improve both the representation capability and the speech recognition performance. To do so, we conduct variational inference for a latent variable model based on the RNN. Motivated by the variational auto-encoder (VAE), we develop a new type of stochastic back-propagation algorithm in which a sampling method is used for efficient implementation and approximation during training. In this recurrent VAE, we introduce class targets and optimize the variational lower bound for the supervised RNN, which is composed of two parts. One is the Kullback-Leibler divergence between the posterior distribution and the variational distribution of the latent variables. The other is the cross-entropy error function between network outputs and class targets. Beyond the traditional RNN, the proposed VRNN characterizes the dependencies between latent variables across subsequent time steps. In the experiments, we implement the proposed methods using the open-source Kaldi toolkit. The multi-way feature extraction ability of the MFNN is illustrated by showing the behavior of hidden neurons due to the factor weights in the individual ways. Word error rates (WER) for speech recognition on TIMIT and Aurora-4 are reported.
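The two-part variational lower bound mentioned in the description can be written in the standard supervised-VAE form; this is the generic objective under common VAE assumptions, not necessarily the exact notation used in the thesis:

```latex
\mathcal{L}(\theta,\phi) \;=\;
\underbrace{\mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}
  \bigl[\log p_\theta(\mathbf{y}\mid\mathbf{x},\mathbf{z})\bigr]}_{\text{negative cross entropy (outputs vs. class targets)}}
\;-\;
\underbrace{\mathrm{KL}\bigl(q_\phi(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z})\bigr)}_{\text{divergence between variational and latent distributions}}
```

The expectation is intractable in general and is approximated by drawing samples of the latent variable \(\mathbf{z}\) through the reparameterization trick, which is what makes the stochastic back-propagation described above possible.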
author2 Chien, Jen-Tzung
author_facet Chien, Jen-Tzung
Shen, Chen
沈辰
author Shen, Chen
沈辰
spellingShingle Shen, Chen
沈辰
Deep Factorized and Variational Learning for Speech Recognition
author_sort Shen, Chen
title Deep Factorized and Variational Learning for Speech Recognition
title_short Deep Factorized and Variational Learning for Speech Recognition
title_full Deep Factorized and Variational Learning for Speech Recognition
title_fullStr Deep Factorized and Variational Learning for Speech Recognition
title_full_unstemmed Deep Factorized and Variational Learning for Speech Recognition
title_sort deep factorized and variational learning for speech recognition
publishDate 2016
url http://ndltd.ncl.edu.tw/handle/03158094775165006006
work_keys_str_mv AT shenchen deepfactorizedandvariationallearningforspeechrecognition
AT chénchén deepfactorizedandvariationallearningforspeechrecognition
AT shenchen shēncéngfēnjiějíbiànyìxuéxíyúyǔyīnbiànshízhīyánjiū
AT chénchén shēncéngfēnjiějíbiànyìxuéxíyúyǔyīnbiànshízhīyánjiū
_version_ 1718527900683075584