Integrating Dilated Convolution into DenseLSTM for Audio Source Separation

Herein, we propose a multi-scale, multi-band dilated time-frequency densely connected convolutional network (DenseNet) with long short-term memory (LSTM) for audio source separation. Because the spectrogram of an acoustic signal can be treated both as an image and as time-series data, it is well suited to a convolutional recurrent neural network (CRNN) architecture. We improve audio source separation performance by adding a dilated block, built on dilated convolutions, to the CRNN architecture. The dilated block effectively enlarges the receptive field over the spectrogram, and it is designed around the acoustic observation that the frequency and time axes of a spectrogram vary under independent influences, such as pitch and speech rate, respectively. In speech enhancement experiments, we estimated the speech signal from a mixture of music, noise, and speech using various deep learning architectures, and evaluated the estimated speech subjectively as well as by speech quality, intelligibility, separation, and speech recognition measures. In music signal separation experiments, we estimated the music signal from a mixture of music and speech using several deep learning architectures, and then measured separation performance and music identification accuracy on the estimated music signal. Overall, the proposed architecture outperformed the other deep learning architectures in both the speech and the music experiments.
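
The architecture outlined in the abstract, dilated 2-D convolutions inside a densely connected block followed by an LSTM over time, can be illustrated with a minimal PyTorch sketch. The paper's exact layer counts, channel widths, and dilation schedule are not given here, so every name (TFDilatedDenseBlock, DenseLSTMSketch) and hyperparameter below is an assumption for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn


class TFDilatedDenseBlock(nn.Module):
    """Densely connected 2-D convolutions whose dilation can grow
    independently along the frequency and time axes of a
    (batch, channel, freq, time) spectrogram."""

    def __init__(self, in_ch: int, growth: int = 16, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            d_f, d_t = 2 ** i, 2 ** i  # per-axis dilation; the two schedules may differ
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch + i * growth, growth, kernel_size=3,
                          padding=(d_f, d_t), dilation=(d_f, d_t)),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.layers:
            # dense connectivity: each layer sees all earlier feature maps
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)


class DenseLSTMSketch(nn.Module):
    """Dilated dense block -> LSTM over time -> soft separation mask."""

    def __init__(self, n_freq: int = 513, in_ch: int = 1,
                 growth: int = 16, n_layers: int = 3):
        super().__init__()
        self.block = TFDilatedDenseBlock(in_ch, growth, n_layers)
        feat_ch = in_ch + n_layers * growth  # channels after dense concat
        self.lstm = nn.LSTM(feat_ch * n_freq, 256, batch_first=True)
        self.mask = nn.Linear(256, n_freq)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_freq, n_frames) magnitude spectrogram
        h = self.block(spec)                              # (B, C, F, T)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)    # (B, T, C*F)
        h, _ = self.lstm(h)                               # (B, T, 256)
        m = torch.sigmoid(self.mask(h))                   # (B, T, F)
        return m.permute(0, 2, 1).unsqueeze(1) * spec     # masked spectrogram


# Usage (shapes only): est = DenseLSTMSketch()(torch.randn(2, 1, 513, 100).abs())
```

On the receptive-field claim: with a 3x3 kernel, a stack of three layers with dilations 1, 2, and 4 along one axis covers 1 + 2(1 + 2 + 4) = 15 bins on that axis, versus 7 for the same stack undilated; choosing d_f and d_t separately lets the field grow at different rates along frequency and time, matching the pitch/speech-rate independence the abstract describes.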

Bibliographic Details
Main Authors: Woon-Haeng Heo, Hyemi Kim, Oh-Wook Kwon
Affiliations: School of Electronics Engineering, Chungbuk National University, Cheongju 28644, Korea (W.-H. Heo, O.-W. Kwon); Creative Content Research Division, Electronics and Telecommunications Research Institute, Daejeon 34129, Korea (H. Kim)
Format: Article
Language: English
Published: MDPI AG, 2021-01-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app11020789
Subjects: dilated convolution; audio source separation; speech enhancement; speech recognition; music signal separation; music identification
Online Access: https://www.mdpi.com/2076-3417/11/2/789