Generalized Cepstral Normalization with Higher Power/Moment Order for Robust Speech Recognition

Bibliographic Details
Main Authors: Chang-Wen Hsu, 許長文
Other Authors: Lin-Shan Lee
Format: Others
Language: en_US
Published: 2008
Online Access: http://ndltd.ncl.edu.tw/handle/63371777527400813195
Description
Summary: Ph.D. === National Taiwan University === Graduate Institute of Communication Engineering === 96 === Cepstral normalization is widely used as an effective way to produce robust features for speech recognition. Well-known examples include Cepstral Mean Subtraction (CMS) and Cepstral Mean and Variance Normalization (CMVN), which normalize either the first moment or both the first and second moments of the Mel-frequency Cepstral Coefficients (MFCCs). In this dissertation, we propose a family of generalized cepstral normalization techniques with higher power/moment order, based on two closely related approaches. The first approach normalizes the MFCC parameters with respect to a few higher-order moments, i.e., moments of order higher than 1 or 2. The basic idea is that higher-order moments are dominated by samples with larger values, which are very likely the primary sources of the asymmetry and the abnormal flatness or tail size of the parameter distributions. Normalizing with respect to these moments therefore puts more emphasis on these signal components and constrains the distributions to be more symmetric, with more reasonable flatness and tail size. This is referred to as Higher Order Cepstral Moment Normalization (HOCMN) in this dissertation. The second approach, Powered Cepstral Normalization (P-CN), is an improved approach that normalizes the MFCC parameters in the r1-th powered domain, where r1 > 1.0. The basic idea is that when the MFCC parameters are raised to the r1-th power, the harmful parts of environmental disturbances may be emphasized more than the speech features, which are relatively smooth. Performing the normalization in this higher-power domain may therefore be more helpful. The features are then transformed back by a power of 1/r2 to a recognition domain in which the acoustic events can be better distinguished.
The unified formulation of generalized cepstral normalization with higher power/moment order presented in this dissertation can be reduced to either HOCMN or P-CN as described above, or can integrate both of them. Experimental results on the AURORA 2.0 testing environment show that the proposed approaches improve recognition accuracy significantly and consistently for all types of noise and all SNR conditions. The fundamental principles behind the proposed approaches are also analyzed and discussed in terms of the statistical properties of the distributions of the MFCC parameters.
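The normalization steps named in the summary can be illustrated with a minimal Python sketch. The `cmvn` function is the standard mean and variance normalization applied to one cepstral coefficient over an utterance. The `powered_cn` function is only a rough illustration of the powered-domain idea and is not the dissertation's formulation: the function names, the sign-preserving power used to keep negative coefficients real, the use of mean and variance normalization inside the powered domain, and the default values of `r1` and `r2` are all assumptions made here.

```python
import math


def cmvn(track):
    """Normalize one coefficient track to zero mean and unit variance."""
    n = len(track)
    mean = sum(track) / n
    var = sum((x - mean) ** 2 for x in track) / n
    # Guard against a constant track, where the variance is zero.
    std = math.sqrt(var) if var > 0 else 1.0
    return [(x - mean) / std for x in track]


def signed_power(x, p):
    """Raise |x| to the power p and keep the sign of x.

    Assumption: MFCCs can be negative, so a sign-preserving power is
    used here to stay in the real numbers. The record does not say how
    the dissertation handles negative values.
    """
    return math.copysign(abs(x) ** p, x)


def powered_cn(track, r1=2.0, r2=2.0):
    """Hypothetical sketch of normalization in an r1-th powered domain.

    Steps: raise to the r1-th power, normalize there, then map back
    with a 1/r2 power. The choice of CMVN as the normalizer and the
    default r1, r2 values are illustrative assumptions.
    """
    powered = [signed_power(x, r1) for x in track]
    normalized = cmvn(powered)
    return [signed_power(x, 1.0 / r2) for x in normalized]
```

In this sketch each cepstral dimension would be processed independently, one utterance at a time, for example `powered_cn(c1_track, r1=2.0, r2=2.0)` for a hypothetical list `c1_track` holding the first cepstral coefficient of every frame.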