High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism

In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether...


Bibliographic Details
Main Authors: Tianhao Qiao, Shunqing Zhang, Shan Cao, Shugong Xu
Format: Article
Language: English
Published: MDPI AG 2021-08-01
Series: Sensors
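Subjects: environmental sound classification; convolutional recurrent neural network; sub-spectrogram segmentation; score level fusion; temporal-frequency attention mechanism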
Online Access: https://www.mdpi.com/1424-8220/21/16/5500
id doaj-56ab849c8e554f609064660f5d55aab0
record_format Article
spelling doaj-56ab849c8e554f609064660f5d55aab0 | 2021-08-26T14:19:16Z | eng | MDPI AG | Sensors | 1424-8220 | 2021-08-01 | 21 | 5500 | 5500 | 10.3390/s21165500 | High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism | Tianhao Qiao (0) | Shunqing Zhang (1) | Shan Cao (2) | Shugong Xu (3) | Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai 200444, China | Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai 200444, China | Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai 200444, China | Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai 200444, China | In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether the effective representative features can be extracted from the environmental sound. In this paper, we firstly propose a sub-spectrogram segmentation with score level fusion based ESC classification framework, and we adopt the proposed convolutional recurrent neural network (CRNN) for improving the classification accuracy. By evaluating numerous truncation schemes, we numerically figure out the optimal number of sub-spectrograms and the corresponding band ranges, and, on this basis, we propose a joint attention mechanism with temporal and frequency attention mechanisms and use the global attention mechanism when generating the attention map. Finally, the numerical results show that the two frameworks we proposed can achieve 82.1% and 86.4% classification accuracy on the public environmental sound dataset ESC-50, respectively, which is equivalent to more than 13.5% improvement over the traditional baseline scheme. | https://www.mdpi.com/1424-8220/21/16/5500 | environmental sound classification | convolutional recurrent neural network | sub-spectrogram segmentation | score level fusion | temporal-frequency attention mechanism
collection DOAJ
language English
format Article
sources DOAJ
author Tianhao Qiao
Shunqing Zhang
Shan Cao
Shugong Xu
spellingShingle Tianhao Qiao
Shunqing Zhang
Shan Cao
Shugong Xu
High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism
Sensors
environmental sound classification
convolutional recurrent neural network
sub-spectrogram segmentation
score level fusion
temporal-frequency attention mechanism
author_facet Tianhao Qiao
Shunqing Zhang
Shan Cao
Shugong Xu
author_sort Tianhao Qiao
title High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism
title_short High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism
title_full High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism
title_fullStr High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism
title_full_unstemmed High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism
title_sort high accurate environmental sound classification: sub-spectrogram segmentation versus temporal-frequency attention mechanism
publisher MDPI AG
series Sensors
issn 1424-8220
publishDate 2021-08-01
description In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether the effective representative features can be extracted from the environmental sound. In this paper, we firstly propose a sub-spectrogram segmentation with score level fusion based ESC classification framework, and we adopt the proposed convolutional recurrent neural network (CRNN) for improving the classification accuracy. By evaluating numerous truncation schemes, we numerically figure out the optimal number of sub-spectrograms and the corresponding band ranges, and, on this basis, we propose a joint attention mechanism with temporal and frequency attention mechanisms and use the global attention mechanism when generating the attention map. Finally, the numerical results show that the two frameworks we proposed can achieve 82.1% and 86.4% classification accuracy on the public environmental sound dataset ESC-50, respectively, which is equivalent to more than 13.5% improvement over the traditional baseline scheme.
topic environmental sound classification
convolutional recurrent neural network
sub-spectrogram segmentation
score level fusion
temporal-frequency attention mechanism
url https://www.mdpi.com/1424-8220/21/16/5500
work_keys_str_mv AT tianhaoqiao highaccurateenvironmentalsoundclassificationsubspectrogramsegmentationversustemporalfrequencyattentionmechanism
AT shunqingzhang highaccurateenvironmentalsoundclassificationsubspectrogramsegmentationversustemporalfrequencyattentionmechanism
AT shancao highaccurateenvironmentalsoundclassificationsubspectrogramsegmentationversustemporalfrequencyattentionmechanism
AT shugongxu highaccurateenvironmentalsoundclassificationsubspectrogramsegmentationversustemporalfrequencyattentionmechanism
_version_ 1721190040486281216