A Study on Data Fusion Strategy for Audio-Visual Emotion Recognition

Bibliographic Details
Main Authors: Jen-Chun Lin, 林仁俊
Other Authors: Chung-Hsien Wu
Format: Others
Language: en_US
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/39485230414627899693
Description
Summary: Doctoral dissertation === National Cheng Kung University === Department of Computer Science and Information Engineering === 102 === Recent years have seen increased attention given to automatic audio-visual emotion recognition. To increase recognition accuracy, the data fusion strategy, that is, how to effectively integrate the audio and visual cues, has become the major research issue. The fusion operations reported for audio-visual emotion recognition can be classified into three major categories: feature-level fusion, decision-level fusion, and model-level fusion. Each strategy has its own characteristics, advantages, and disadvantages. Based on an analysis of the characteristics of current data fusion strategies, this dissertation first presents a hybrid fusion method, named Error Weighted Semi-Coupled Hidden Markov Model (EWSC-HMM), that combines the advantages of a model-level fusion method, the Semi-Coupled Hidden Markov Model (SC-HMM), and a decision-level fusion method, Error Weighted Classifier Combination (EWC), to obtain the optimal emotion recognition result based on audio-visual bimodal fusion. A state-based bimodal alignment strategy is proposed in SC-HMM to align the temporal relationship between the audio and visual streams. The Bayesian classifier weighting scheme EWC is then adopted to explore the contributions of the SC-HMM-based classifiers for different audio-visual feature pairs and to make the final emotion recognition decision. For performance evaluation, two databases are considered: the posed MHMC database and the spontaneous SEMAINE database. Experimental results show that the proposed method not only outperforms other fusion-based bimodal emotion recognition methods for posed expressions but also provides acceptable results for spontaneous expressions.
In natural face-to-face conversation, a complete emotional expression typically follows a complex temporal course. This dissertation therefore further explores the temporal evolution of an emotional expression for audio-visual emotion recognition. Previous psychological research has shown that, in terms of the manner and intensity of expression, a complete emotional expression can be characterized by three sequential temporal phases: onset (application), apex (release), and offset (relaxation). In natural conversation, however, a complete emotional expression spans more than one utterance, and each utterance may itself contain several temporal phases of the expression. Accordingly, this dissertation presents a novel data fusion method based on temporal course modeling, named Two-Level Hierarchical Alignment-Based Semi-Coupled Hidden Markov Model (2H-SC-HMM), which addresses the complex temporal structure of an emotional expression and models the temporal relationship between the audio and visual streams in order to improve audio-visual emotion recognition in a conversational utterance. Finally, the experimental results demonstrate that the proposed 2H-SC-HMM substantially improves the performance of audio-visual emotion recognition.
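
The decision-level step mentioned in the summary, in which several classifiers are combined according to how reliable each has been, can be illustrated with a minimal Python sketch. This is not the dissertation's EWC formulation, which the summary does not spell out. It assumes that each classifier outputs a posterior over the emotion classes, that a confusion matrix of counts indexed as [true class][predicted class] is available from held-out data, and that classifier weights default to uniform. The function name `error_weighted_fusion` and all of the numbers are hypothetical.

```python
def error_weighted_fusion(posteriors, confusions, weights=None):
    """Fuse class posteriors from several classifiers (illustrative sketch).

    posteriors: one list per classifier, P(predicted class = j | input).
    confusions: one matrix per classifier, counts[true i][predicted j],
                assumed to come from held-out validation data.
    weights:    optional per-classifier weights; uniform if omitted.
    """
    n_clf = len(posteriors)
    n_cls = len(posteriors[0])
    if weights is None:
        weights = [1.0 / n_clf] * n_clf

    fused = [0.0] * n_cls
    for p, conf, w in zip(posteriors, confusions, weights):
        for j in range(n_cls):
            # Column sum: how often this classifier predicted class j.
            col_sum = sum(conf[i][j] for i in range(n_cls))
            if col_sum == 0:
                continue  # class j was never predicted; no evidence to use
            for i in range(n_cls):
                # P(true = i | predicted = j) corrects the raw posterior
                # by the classifier's empirical error pattern.
                fused[i] += w * (conf[i][j] / col_sum) * p[j]

    total = sum(fused)
    return [v / total for v in fused]


# Hypothetical two-class example: classifier A is reliable, B is at chance.
posteriors = [[0.6, 0.4], [0.3, 0.7]]
confusions = [[[9, 1], [1, 9]], [[5, 5], [5, 5]]]
fused = error_weighted_fusion(posteriors, confusions)
```

In this toy example the chance-level classifier contributes an uninformative 0.5/0.5 vote after correction, so the fused decision follows the reliable classifier even though the raw posteriors disagree.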