Summary: | This thesis investigates the ability of artificial neural networks to learn a mapping from a symbolic representation of CVC triphones to a continuous representation of vowel formant tracks, and the influence of a number of factors on that ability. The mapping is interesting because, apart from being a necessary part of any text-to-speech system and having no accepted definitive solution, it goes from a discrete symbolic representation to a continuous non-symbolic one. Neural networks provide one method of learning such mappings automatically, and they prove capable of doing so in this particular case. The input representation appears to have little effect on the performance of the neural networks: a feature-based representation does no better than a 1-of-n coding of the phonemes. The representation of the vowel formant tracks, produced as the output of the neural networks, has a far greater effect on performance. Simple representations consisting of the initial, central and final frequencies of the formant tracks outperform polynomial and Fourier coefficient representations, which encode more information about the shape of the formant tracks. The back-propagation and conjugate gradient training algorithms produced neural networks with similar performance, and the use of cross-validation made no difference to generalisation (although the cross-validation data set was far too small). Interestingly, neural networks with no hidden layer proved as capable of learning the mapping as those with a hidden layer, indicating that the mapping is not substantially non-linear.
|