Mandarin Speech Recognition in Noisy Environments

Bibliographic Details
Main Authors: HUNG WEI-WEN, 洪偉文
Other Authors: WANG HSIAO-CHUAN
Format: Others
Language: en_US
Published: 2000
Online Access: http://ndltd.ncl.edu.tw/handle/13530364988027329234
Description
Summary: Ph.D. === National Tsing Hua University === Department of Electrical Engineering === 88 === When an automatic speech recognition (ASR) system is deployed in the real world, environmental interference causes a mismatch between the testing speech and the reference models and seriously degrades recognition accuracy. This interference is generally attributed to distortion factors such as ambient noise, channel effects, the Lombard effect, and variation in speaker characteristics. In this dissertation, we focus on ambient noise and channel effects. A number of environment-robust techniques are proposed to alleviate environmental interference by compensating for these distortion factors in three domains: the speech model domain, the speech feature domain, and the speech signal domain. In the speech model domain, we first investigate the robustness of a state duration model to noise contamination. When a speech signal is contaminated by ambient noise and/or channel effects, the decoded state sequence may be distorted: it may stay in some states too long or too briefly, even when a state duration model is employed. To make the decoding process more robust in a noisy environment, we present a proportional alignment decoding (PAD) algorithm for re-training a hidden Markov model. With the PAD method, the discrimination capability of the re-trained hidden Markov model can be significantly enhanced. In the speech feature domain, a novel frame-dependent fuzzy channel compensation (FD-FCC) method employing two-stage bias subtraction is proposed to minimize the channel effect embedded in a telephone speech signal. First, based on maximum likelihood (ML) estimation, a set of mixture biases is derived by averaging the cepstral differences between the input utterance and the best-matched model.
Then, instead of a single bias, a frame-dependent bias is calculated for each input frame so as to equalize the channel variation within the input utterance. This frame-dependent bias is a convex combination of the mixture biases, weighted by a fuzzy membership function. To increase the robustness of speech features, a fuzzy membership function is presented that equalizes the variances of the cepstral coefficients by weighting the frequency sequence of log filter-bank energies (LFBE) in a simple, direct and effective way. In addition, we develop a reduced form of the FFBA technique to alleviate the uncertainty in determining the fuzzy factor of the FFBA technique. The reduced form is comparably efficient at equalizing cepstral variances and achieves syllable recognition rates close to those of the FFBA technique. To address the spectral variation of speech features caused by environmental mismatch, an adaptive signal limiter (ASL) is developed that smooths the instantaneous and dynamic spectral features of the reference models and the testing speech by performing an arcsin transform in the speech signal domain. The associated covariance matrices of the reference models are also smoothed by means of maximum likelihood (ML) estimation. In our approach, the degree of smoothing applied by the signal limiter is tied to the signal-to-noise ratio (SNR) of each testing frame and is determined adaptively on a frame-by-frame basis. Finally, we show that the adaptive signal limiter can be combined with a noise-robust technique (NRT) through an interpolation scheme to obtain a further performance improvement.
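
As an illustration of the two-stage bias subtraction described in the summary for FD-FCC, the following Python sketch first averages cepstral differences per mixture and then forms a frame-dependent bias as a convex combination of those mixture biases. The summary does not specify the fuzzy membership function, so the fuzzy c-means style membership used here, the `fuzzifier` parameter, the hard frame-to-mixture `alignment` input, and all function names are assumptions made for illustration; this is not the dissertation's implementation.

```python
def mixture_biases(frames, means, alignment):
    """Stage 1: per-mixture average of the cepstral difference (frame - mean).

    frames:    list of cepstral vectors, one per input frame
    means:     list of mixture mean vectors from the best-matched model
    alignment: assumed hard assignment, alignment[t] = mixture index of frame t
    """
    dim = len(means[0])
    sums = [[0.0] * dim for _ in means]
    counts = [0] * len(means)
    for x, m in zip(frames, alignment):
        counts[m] += 1
        for d in range(dim):
            sums[m][d] += x[d] - means[m][d]
    # A mixture with no aligned frames keeps a zero bias.
    return [[s / c if c else 0.0 for s in row] for row, c in zip(sums, counts)]


def fuzzy_memberships(x, means, biases, fuzzifier=2.0):
    """Assumed fuzzy c-means style membership of frame x in each mixture.

    Distances are measured to the bias-shifted means. The returned weights
    are non-negative and sum to 1, so they define a convex combination.
    """
    eps = 1e-12  # avoids division by zero when a frame sits on a shifted mean
    exponent = 1.0 / (fuzzifier - 1.0)
    inverse = []
    for mu, b in zip(means, biases):
        dist2 = sum((xd - (md + bd)) ** 2 for xd, md, bd in zip(x, mu, b))
        inverse.append((1.0 / (dist2 + eps)) ** exponent)
    total = sum(inverse)
    return [v / total for v in inverse]


def compensate(frames, means, alignment, fuzzifier=2.0):
    """Stage 2: subtract a frame-dependent bias from every input frame."""
    biases = mixture_biases(frames, means, alignment)
    compensated = []
    for x in frames:
        weights = fuzzy_memberships(x, means, biases, fuzzifier)
        frame_bias = [
            sum(w * b[d] for w, b in zip(weights, biases))
            for d in range(len(x))
        ]
        compensated.append([xd - bd for xd, bd in zip(x, frame_bias)])
    return compensated
```

When every mixture bias is identical, the frame-dependent bias reduces to that single bias, which is the ordinary single-bias subtraction that the frame-dependent scheme generalizes.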