Attentively-Coupled Long Short-Term Memory for Audio-Visual Emotion Recognition

Bibliographic Details
Main Authors: Jia-Hao Hsu, 徐嘉昊
Other Authors: Chung-Hsien Wu
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/3e43ac
Description
Summary: Master's === National Cheng Kung University === Department of Computer Science and Information Engineering === 107 === With the continuous evolution of human-computer interaction products, many smart devices can support our daily needs, such as smart speakers, home robots, and self-driving cars. Adding emotion recognition to the interaction with these products makes them more humane and increases the flexibility of the interaction, and studies on emotion recognition have grown accordingly. Among existing audio-visual emotion recognition systems, however, few focus on segment-based recognition of emotion expression, in contrast to utterance-based recognition. Segment-based analysis reveals the fluctuations of emotion expression at a finer granularity. This thesis uses segments as the recognition unit to capture the speaker's facial expressions and audio signals, analyzes the distinct characteristics of the facial and audio features, and models the temporal dependencies before and after each segment. During segmentation, the segments that most strongly influence the emotion expressed by the whole utterance are identified first and given higher attention in the overall recognition, which improves the recognition accuracy of each segment. Unlike single-modal emotion recognition, a multi-modal architecture considers data from different modalities; this thesis therefore focuses on improving the fusion mechanism for segment-based emotion recognition by using an attentively-coupled long short-term memory (LSTM) model. With the attention mechanism, at each fusion step the coupling unit simultaneously considers the mutual influence between the two modalities' features when updating its cell state, and incorporates the attention assigned to each sequential segment into the emotion recognition. The LSTM controls the flow of information to learn the long- and short-term dependencies of the signals. The model produces an emotion prediction for each segment and recognizes emotion from both the facial and the audio expressions of the speaker. In the experiments, the proposed audio-visual emotion recognition system achieved an accuracy of 70.1%, outperforming existing traditional audio-visual emotion recognition systems. The results showed that the proposed attentively-coupled LSTM model achieves good performance in both multi-modal fusion and segment-based attention for emotion recognition.
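
To make the described fusion mechanism concrete, below is a minimal sketch of an attentively-coupled LSTM in PyTorch. It is an illustration reconstructed from the abstract, not the thesis implementation: the class and layer names, the feature dimensions, and the exact coupling scheme (each modality's LSTM cell receiving the other modality's previous hidden state, with softmax attention over the fused segment states) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivelyCoupledLSTM(nn.Module):
    """Illustrative coupled-LSTM fusion: at each segment step, the audio
    and visual LSTM cells condition on each other's previous hidden state,
    and a softmax attention over segments weights the fused representations."""

    def __init__(self, audio_dim, visual_dim, hidden_dim, num_emotions):
        super().__init__()
        # Each modality's cell input = its own features + the other
        # modality's previous hidden state (the "coupling").
        self.audio_cell = nn.LSTMCell(audio_dim + hidden_dim, hidden_dim)
        self.visual_cell = nn.LSTMCell(visual_dim + hidden_dim, hidden_dim)
        self.attn = nn.Linear(2 * hidden_dim, 1)        # segment attention score
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, audio_seq, visual_seq):
        # audio_seq: (batch, num_segments, audio_dim); visual_seq analogous.
        batch, num_segments, _ = audio_seq.shape
        hidden = self.audio_cell.hidden_size
        h_a = audio_seq.new_zeros(batch, hidden)
        c_a = audio_seq.new_zeros(batch, hidden)
        h_v = audio_seq.new_zeros(batch, hidden)
        c_v = audio_seq.new_zeros(batch, hidden)
        fused = []
        for t in range(num_segments):
            prev_h_a, prev_h_v = h_a, h_v
            # Coupled updates: each modality sees the other's previous state.
            h_a, c_a = self.audio_cell(
                torch.cat([audio_seq[:, t], prev_h_v], dim=-1), (prev_h_a, c_a))
            h_v, c_v = self.visual_cell(
                torch.cat([visual_seq[:, t], prev_h_a], dim=-1), (prev_h_v, c_v))
            fused.append(torch.cat([h_a, h_v], dim=-1))
        fused = torch.stack(fused, dim=1)               # (batch, segments, 2*hidden)
        # Attention over segments: influential segments get higher weight.
        weights = F.softmax(self.attn(fused).squeeze(-1), dim=1)
        segment_logits = self.classifier(fused)         # per-segment predictions
        pooled = (weights.unsqueeze(-1) * fused).sum(dim=1)
        utterance_logits = self.classifier(pooled)      # attention-pooled prediction
        return segment_logits, utterance_logits, weights
```

A quick shape check with random inputs (all dimensions are placeholders):

```python
model = AttentivelyCoupledLSTM(audio_dim=88, visual_dim=136, hidden_dim=128, num_emotions=4)
seg_logits, utt_logits, attn = model(torch.randn(2, 10, 88), torch.randn(2, 10, 136))
print(seg_logits.shape, utt_logits.shape, attn.shape)
# torch.Size([2, 10, 4]) torch.Size([2, 4]) torch.Size([2, 10])
```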