Summary: | Doctoral === National Taiwan University === Graduate Institute of Communication Engineering === 96 === Cepstral normalization has been widely used as a powerful approach to produce
robust features for speech recognition. Good examples include the well-known
Cepstral Mean Subtraction (CMS) and Cepstral Mean and Variance
Normalization (CMVN), in which either the first or both the first and the second
moments of the Mel-frequency Cepstral Coefficients (MFCCs) are normalized. In
this dissertation, we propose a family of generalized cepstral normalization
techniques with higher power/moment order based on two closely related approaches.
The first approach normalizes the MFCC parameters with respect to a few
higher-order moments, i.e., moments of order higher than 1 or 2. The basic idea is that
the higher order moments are more dominated by samples with larger values, which
are very likely the primary sources of the asymmetry and abnormal flatness or tail
size of the parameter distributions. Normalization with respect to these moments
therefore puts more emphasis on these signal components and constrains the
distributions to be more symmetric with more reasonable flatness and tail size. This is
referred to as the Higher Order Cepstral Moment Normalization (HOCMN) in this
dissertation.
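
To make the moment-based idea concrete, the following is a minimal sketch rather than the dissertation's actual HOCMN formulation, which the abstract does not specify. It shows conventional CMVN on a single cepstral coefficient track, and then a hypothetical generalization that removes the mean and rescales the track so that an even moment of a chosen order takes a target value. The function names, the use of the mean for centering, and the default target of 1.0 are all assumptions made for illustration.

```python
import math


def moment(xs, order):
    """Sample moment of the given order about zero."""
    return sum(x ** order for x in xs) / len(xs)


def cmvn(xs):
    """CMVN for one cepstral coefficient track: zero mean, unit variance."""
    mean = moment(xs, 1)
    centered = [x - mean for x in xs]
    std = math.sqrt(moment(centered, 2))  # assumes a non-constant track
    return [c / std for c in centered]


def even_moment_norm(xs, order, target=1.0):
    """Illustrative assumption, not the dissertation's method:
    remove the mean, then rescale so that the even moment of the
    given order equals `target`."""
    if order % 2 != 0:
        raise ValueError("order must be even")
    mean = moment(xs, 1)
    centered = [x - mean for x in xs]
    # Dividing by this scale makes moment(result, order) == target.
    scale = (moment(centered, order) / target) ** (1.0 / order)
    return [c / scale for c in centered]
```

With `order=2` the second function coincides with CMVN. With a larger even order, frames with large magnitudes weigh more heavily in the scale factor, which is the effect the paragraph above describes.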
The second approach, Powered Cepstral Normalization (P-CN), is an improved
approach proposed to normalize the MFCC parameters in the r1-th powered domain,
where r1 > 1.0. The basic idea is that when the MFCC parameters are raised to a
higher-order power, or the r1-th power, the harmful parts of environmental
disturbances may be more emphasized than the speech features which are relatively
smooth. Therefore, performing the normalization in the domain of a higher-order
power may be more helpful. We then transform the features back with a 1/r2 power
order to a recognition domain where the acoustic events can be better distinguished.
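
The powered-domain pipeline can be sketched as follows. This is a hedged illustration under stated assumptions, not the dissertation's P-CN definition: the abstract does not say how negative MFCC values are handled when raised to a power, nor which normalization is applied in the powered domain. Here a sign-preserving power (`signed_power`, a hypothetical helper) and mean and variance normalization are assumed, and the default values `r1=2.0` and `r2=2.0` are arbitrary examples.

```python
import math


def signed_power(x, p):
    """Sign-preserving power: sign(x) * |x|**p (an assumption for negative values)."""
    return math.copysign(abs(x) ** p, x)


def powered_cn(xs, r1=2.0, r2=2.0):
    """Sketch of normalization in a powered domain for one coefficient track."""
    # 1. Raise the features to the r1-th power.
    powered = [signed_power(x, r1) for x in xs]
    # 2. Normalize mean and variance in the powered domain (assumed step).
    mean = sum(powered) / len(powered)
    centered = [v - mean for v in powered]
    std = math.sqrt(sum(c * c for c in centered) / len(centered))
    normalized = [c / std for c in centered]
    # 3. Map back with a 1/r2 power to the recognition domain.
    return [signed_power(v, 1.0 / r2) for v in normalized]
```

Under these assumptions, setting `r1 = r2 = 1.0` reduces the sketch to ordinary CMVN, which makes it easy to compare against the baseline.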
The unified formulation of the generalized cepstral normalization with higher
power/moment order presented in this dissertation can be reduced to either HOCMN
or P-CN as mentioned above, or can integrate both. Experimental results
based on the AURORA 2.0 testing environment showed that recognition accuracy
can be significantly and consistently improved with the approaches proposed here for all
types of noise and all SNR conditions. Fundamental principles behind the approaches
proposed here are also analyzed and discussed based on the statistical properties of
the distributions of the MFCC parameters.
|