TFSWA-ResUNet: music source separation with time–frequency sequence and shifted window attention-based ResUNet

Bibliographic Details
Published in: EURASIP Journal on Advances in Signal Processing
Main Authors: Zhenyu Yao, Yuping Su, Honghong Yang, Yumei Zhang, Xiaojun Wu
Format: Article
Language: English
Published: SpringerOpen 2025-09-01
Online Access: https://doi.org/10.1186/s13634-025-01249-0
Description
Summary: CNN-based UNet is a widely used network architecture for music source separation (MSS), and spectrogram features of music audio, which carry both time and frequency information, are commonly used as inputs for MSS tasks. However, CNN-based UNet models in the spectrogram domain still do not exploit the global and local correlations of spectrograms efficiently. In this paper, we propose a novel ResUNet architecture named TFSWA-ResUNet, in which a temporal-frequency and shifted window attention (TFSWA)-based module is designed as the UNet’s bottleneck block. In the proposed TFSWA block, a time sequence attention (TSA) block and a frequency sequence attention (FSA) block capture the global correlations of music spectrogram features along the time sequence and the frequency sequence, respectively. To further capture the local correlations of spectrogram features, a shifted window attention-based Swin transformer is also introduced into the TFSWA module; it computes self-attention within local non-overlapping windows and captures correlations across both the temporal and frequency dimensions. Experimental results on the MUSDB18 dataset indicate that the proposed TFSWA-ResUNet model achieves strong separation performance with a relatively small number of parameters, which suggests that the approach offers a good tradeoff between performance and computational cost and is therefore practical to adopt.
ISSN: 1687-6180
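
The three attention patterns named in the summary can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the record gives no layer details, so the sketch assumes that TSA attends across time frames within each frequency bin, that FSA attends across frequency bins within each time frame, and that the shifted window step follows the usual Swin recipe (cyclic shift, then non-overlapping windows). All function names are hypothetical, the attention has no learned projections (Q = K = V), and the masking that Swin applies to wrapped-around windows is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention over a list of vectors.

    Assumption for illustration: no learned weights, so Q = K = V = seq.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def time_sequence_attention(spec):
    """spec[f][t] is a feature vector; attend across time within each frequency bin."""
    return [self_attention(row) for row in spec]


def frequency_sequence_attention(spec):
    """Attend across frequency bins within each time frame."""
    n_f, n_t = len(spec), len(spec[0])
    cols = [self_attention([spec[f][t] for f in range(n_f)]) for t in range(n_t)]
    return [[cols[t][f] for t in range(n_t)] for f in range(n_f)]


def cyclic_shift(grid, shift_f, shift_t):
    """Roll the grid so that position (f, t) takes the value from (f + shift_f, t + shift_t)."""
    n_f, n_t = len(grid), len(grid[0])
    return [[grid[(f + shift_f) % n_f][(t + shift_t) % n_t] for t in range(n_t)]
            for f in range(n_f)]


def window_partition(grid, win_f, win_t):
    """Split the grid into non-overlapping win_f x win_t windows (flattened row-major)."""
    n_f, n_t = len(grid), len(grid[0])
    if n_f % win_f or n_t % win_t:
        raise ValueError("grid size must be divisible by the window size")
    return [[grid[f][t] for f in range(f0, f0 + win_f) for t in range(t0, t0 + win_t)]
            for f0 in range(0, n_f, win_f)
            for t0 in range(0, n_t, win_t)]


def shifted_window_attention(spec, win_f, win_t):
    """Local attention inside each window of the cyclically shifted grid.

    Returns one attended sequence per window; merging back is left out for brevity.
    """
    shifted = cyclic_shift(spec, win_f // 2, win_t // 2)
    return [self_attention(w) for w in window_partition(shifted, win_f, win_t)]
```

On a 4 x 4 grid whose cells are numbered row by row from 0, `window_partition(grid, 2, 2)[0]` is `[0, 1, 4, 5]`, while the same call on `cyclic_shift(grid, 1, 1)` gives `[5, 6, 9, 10]`: the shifted window straddles the boundaries of the original windows, which is how information crosses between neighbouring local regions.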