The Research on the Voice Activity Detection and Speech Enhancement for Noisy Speech Recognition

Bibliographic Details
Main Authors: Chen-Wei Lai, 賴辰瑋
Other Authors: Jeih-Weih Hung
Format: Others
Language: zh-TW
Published: 2005
Online Access: http://ndltd.ncl.edu.tw/handle/06070640933840270072
Description
Summary: Master's thesis === National Chi Nan University === Department of Electrical Engineering === 93 === When a speech recognizer is deployed in a real environment, its performance is often seriously degraded by additive noise. Various approaches have been proposed to improve the robustness of recognition systems under noisy conditions. One direction is to detect the presence of noise, estimate its characteristics, and then remove or alleviate the noise in the speech signal. In this thesis, we first study several voice activity detection (VAD, or endpoint detection) approaches, which detect the noise-only portions of a speech sequence so that the noise statistics can be estimated from them. These approaches include the order statistic filter (OSF), subband order statistic filter (SOSF), long-term spectral divergence (LTSD), Kullback-Leibler (KL) distance, energy, and entropy. Experimental results show that the KL distance method performs best; that is, it gives the endpoints of noise-only portions closest to those obtained manually. Second, we study speech enhancement approaches, which try to reduce the noise component of the speech signal in different domains. For example, nonlinear spectral subtraction (NSS) and the Wiener filter (WF) operate in the linear spectral domain, and mel spectral subtraction (MSS) operates in the mel spectral domain. Furthermore, we propose the cepstral statistics compensation (CSC) method, which operates in the cepstral domain. We find that the effect of these back-end speech enhancement approaches generally depends on the accuracy of the front-end VAD, and that CSC gives the highest recognition rates among all the approaches. CSC even performs better than two popular temporal filtering approaches, cepstral mean subtraction (CMS) and cepstral normalization (CN). In conclusion, robust VAD and speech enhancement approaches can effectively improve noisy speech recognition, and they have one special advantage: because they operate only on the speech to be recognized, there is no need to adjust the recognition models.
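
The pipeline outlined in the summary (detect noise-only frames, estimate the noise from them, then compensate the speech) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes a simple energy-threshold VAD, plain power spectral subtraction with a spectral floor, and textbook cepstral mean subtraction, and it does not reproduce the proposed CSC method, whose details the summary does not give. All function names and parameters (frame_len, noise_frames, ratio, floor) are hypothetical choices made for illustration.

```python
def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def energy_vad(energies, noise_frames=3, ratio=4.0):
    """Flag a frame as speech when its energy exceeds ratio * noise floor.

    Assumption: the first `noise_frames` frames contain noise only.
    """
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    return [e > ratio * noise_floor for e in energies]


def spectral_subtraction(power_frames, is_speech, floor=0.01):
    """Subtract the mean noise power spectrum from every frame.

    The noise spectrum is averaged over the frames the VAD marked as
    non-speech. Each bin is clamped to `floor` times its original power
    so that the result never becomes negative.
    """
    noise = [f for f, s in zip(power_frames, is_speech) if not s]
    if not noise:
        # No noise-only frames were found, so nothing can be estimated.
        return [list(f) for f in power_frames]
    n_bins = len(power_frames[0])
    noise_mean = [sum(f[k] for f in noise) / len(noise) for k in range(n_bins)]
    return [
        [max(f[k] - noise_mean[k], floor * f[k]) for k in range(n_bins)]
        for f in power_frames
    ]


def cepstral_mean_subtraction(cepstra):
    """Remove the per-coefficient mean over all frames (CMS).

    Assumption: `cepstra` is a non-empty list of equal-length frames.
    """
    n_coef = len(cepstra[0])
    mean = [sum(f[k] for f in cepstra) / len(cepstra) for k in range(n_coef)]
    return [[f[k] - mean[k] for k in range(n_coef)] for f in cepstra]
```

In this sketch the VAD flags feed directly into the noise estimate used by spectral_subtraction, which mirrors the summary's observation that the enhancement stage depends on the accuracy of the front-end VAD: a speech frame mislabeled as noise would inflate the noise estimate and cause over-subtraction.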