| Summary: | The heterogeneity and complexity of multi-modal data in high-resolution remote sensing imagery pose a severe challenge to existing cross-modal networks that fuse complementary information from high-resolution optical imagery and elevation data (digital surface models, DSMs) to achieve accurate semantic segmentation. To address this problem, we propose LTFCNet, a weighted feature fusion network based on large-kernel convolution and Transformer. The model uses two parallel encoders to extract features from the two modalities, an improved cross-fusion module to strengthen the encoders' feature extraction, and a gate module based on large-kernel convolution and Transformer to fuse the multi-modal features. Finally, a Difference information Feature Fusion Module (DFFM), which attends to differential regions, performs cross-level feature fusion and improves small-object detection. To evaluate the network, we compare it with several state-of-the-art (SOTA) models on the Potsdam and Vaihingen datasets. The experimental results demonstrate that the proposed model outperforms the other SOTA models by approximately 2% in mIoU, validating its effectiveness in multi-modal feature fusion.
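
To make the dual-encoder, gated weighted-fusion idea concrete, the following is a minimal PyTorch sketch. Every name and design detail in it (the GatedFusion and DualEncoderSegNet modules, channel sizes, and the use of a depthwise large-kernel convolution with a sigmoid gate) is an illustrative assumption, not the paper's actual LTFCNet implementation; the cross-fusion module, Transformer branch, and DFFM are omitted.

```python
# Hypothetical sketch of a two-stream (optical + DSM) segmentation network
# with gated weighted fusion; not the authors' code.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Weighted fusion of optical and DSM features via a learned gate.

    A large-kernel depthwise convolution stands in for the large-kernel
    branch; the Transformer branch is omitted for brevity.
    """

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        # Depthwise large-kernel conv gathers wide spatial context.
        self.large_kernel = nn.Conv2d(
            channels * 2, channels * 2, kernel_size,
            padding=kernel_size // 2, groups=channels * 2,
        )
        # 1x1 conv + sigmoid produces a per-pixel fusion weight in (0, 1).
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_opt: torch.Tensor, f_dsm: torch.Tensor) -> torch.Tensor:
        ctx = self.large_kernel(torch.cat([f_opt, f_dsm], dim=1))
        w = self.gate(ctx)  # how much to trust the optical branch per pixel
        return w * f_opt + (1.0 - w) * f_dsm


class DualEncoderSegNet(nn.Module):
    """Two parallel encoders (3-band optical, 1-band DSM) + gated fusion."""

    def __init__(self, num_classes: int = 6, channels: int = 64):
        super().__init__()
        # Toy single-stage encoders; a real model would use deep backbones.
        self.enc_opt = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        self.enc_dsm = nn.Sequential(
            nn.Conv2d(1, channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        self.fuse = GatedFusion(channels)
        # Decoder stub: upsample back to input resolution, then classify.
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, num_classes, 1),
        )

    def forward(self, rgb: torch.Tensor, dsm: torch.Tensor) -> torch.Tensor:
        return self.head(self.fuse(self.enc_opt(rgb), self.enc_dsm(dsm)))


if __name__ == "__main__":
    model = DualEncoderSegNet()
    rgb = torch.randn(2, 3, 256, 256)  # optical image batch
    dsm = torch.randn(2, 1, 256, 256)  # elevation (DSM) batch
    print(model(rgb, dsm).shape)       # torch.Size([2, 6, 256, 256])
```

The sigmoid gate makes the fusion a per-pixel convex combination of the two modality features, so the network can lean on elevation where optical cues are ambiguous (e.g., buildings vs. roads) and vice versa; this is one common way to realize the "weighted feature fusion" the summary describes.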
|