Exploration of an Independent Training Framework for Speech Emotion Recognition

Speech emotion recognition (SER) plays an indispensable role in human-computer interaction tasks, where the ultimate performance is determined by the features used, such as empirically learned features (ELFs) and automatically learned features (ALFs). Although fusing ELFs and ALFs can contribute complementary information for SER, training the fused features under a single softmax layer is inappropriate because ELFs and ALFs perform differently for emotion recognition. Motivated by this observation, this paper proposes an independent training framework that fully exploits the complementary advantages of human knowledge and the powerful learning ability of deep models. Specifically, we first feed Mel-frequency cepstral coefficient (MFCC) features and openSMILE features into a pair of independent models: an attention-based convolutional long short-term memory (LSTM) network and a fully connected network, respectively. We then design a feedback mechanism for each model to extract ALFs and ELFs independently, in which hard example mining and re-training with a hard example loss focus feature extraction on hard examples during training. Finally, a classifier distinguishes emotions using the independently extracted ALFs and ELFs together. Extensive experiments on three public speech emotion datasets (IEMOCAP, EMODB, and CASIA) show that the proposed independent training framework outperforms conventional feature-fusion methods.
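
The abstract outlines a concrete two-branch pipeline: MFCC frames pass through an attention-based conv-LSTM to produce ALFs, an openSMILE vector passes through a fully connected network to produce ELFs, each branch is trained independently (with re-training driven by a hard example loss), and a final classifier consumes both feature sets. The sketch below illustrates one way to realize that layout in PyTorch; all layer sizes, the names ALFBranch, ELFBranch, and hard_example_loss, the 1582-dimensional openSMILE input (the INTERSPEECH 2010 feature set), and the focal-style reweighting are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the two-branch independent training idea; names and sizes
# are illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALFBranch(nn.Module):
    """Attention-based conv-LSTM over MFCC frames -> automatically learned features (ALFs)."""
    def __init__(self, n_mfcc=40, hidden=128, n_classes=4):
        super().__init__()
        self.conv = nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # scores each time step
        self.head = nn.Linear(2 * hidden, n_classes)  # branch-local softmax head

    def forward(self, mfcc):                          # mfcc: (B, T, n_mfcc)
        h = F.relu(self.conv(mfcc.transpose(1, 2))).transpose(1, 2)  # (B, T, 64)
        h, _ = self.lstm(h)                           # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)        # attention weights over time
        feat = (w * h).sum(dim=1)                     # attention-pooled ALFs
        return feat, self.head(feat)

class ELFBranch(nn.Module):
    """Fully connected network over an openSMILE vector -> empirically learned features (ELFs)."""
    def __init__(self, n_in=1582, hidden=256, n_classes=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(n_in, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                             # x: (B, n_in)
        feat = self.fc(x)
        return feat, self.head(feat)

def hard_example_loss(logits, target, gamma=2.0):
    """Focal-style reweighting that emphasizes misclassified (hard) examples.

    This is one common realization of a 'hard example loss'; the paper's
    exact formulation may differ."""
    ce = F.cross_entropy(logits, target, reduction="none")
    p_t = torch.exp(-ce)                              # confidence on the true class
    return ((1.0 - p_t) ** gamma * ce).mean()
```

Under this reading, each branch is first trained against its own head (the re-training pass swapping in hard_example_loss for the mined hard examples); only afterwards are the pooled ALF and ELF vectors concatenated and fed to a separate final classifier, rather than training everything jointly under one softmax layer.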

Bibliographic Details
Main Authors: Shunming Zhong (ORCID: 0000-0002-8321-8616), Baoxian Yu (ORCID: 0000-0002-8068-3766), Han Zhang (ORCID: 0000-0002-4037-3026)
Affiliation: School of Physics and Telecommunication Engineering, South China Normal University (SCNU), Guangzhou, China
Format: Article
Language: English
Published: IEEE, 2020-01-01
Series: IEEE Access, vol. 8, pp. 222533-222543
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.3043894
Subjects: Data imbalance; hard example; feature fusion; independent training; speech emotion recognition
Online Access: https://ieeexplore.ieee.org/document/9290046/