MR-Former: Improving universal image segmentation via refined masked-attention transformer
Mask-based universal image segmentation architectures, which focus attention within the mask region, have significantly enhanced both segmentation accuracy and speed, becoming a key component of universal image segmentation frameworks. However, existing methods often generate masks from the predicted probability map using a hard threshold, which may introduce biased guidance due to missing or inaccurate object predictions. To address this issue, we propose a novel mask refinement method, called MR-Former. The method divides the segmentation output into two categories: ‘object’ and ‘non-object’ masks. For ‘object’ masks, we further refine the mask into interior and edge regions, and apply independent attention branches to learn their respective attention distributions, which are then adaptively fused. For ‘non-object’ masks, we introduce a mask focusing technique that selects key regions to ensure attention is concentrated on high-probability areas, effectively eliminating irrelevant distractions. Experimental results demonstrate that MR-Former, with a ResNet-50 backbone, significantly improves segmentation performance across multiple state-of-the-art architectures, including Mask2Former, OneFormer, and PEM. On the Cityscapes dataset, MR-Former achieves improvements of up to 0.9 PQ, 0.7 AP, and 1.3 mIoU with a maximum Params increase of only 1.4%; on the ADE20K dataset, it shows improvements of up to 1.1 PQ, 1.4 AP, and 1.3 mIoU accompanied by a maximum FLOPs increment of just 1G; and on the Mapillary Vistas dataset, it demonstrates improvements of up to 1.1 PQ and 1.2 mIoU with a maximum FPS loss of merely 3.4.
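The abstract describes the refined masked-attention mechanism only at a high level. The PyTorch-style sketch below illustrates one plausible reading of it: queries whose thresholded mask contains at least one foreground pixel (‘object’ masks) are split into an interior region and an edge band, while empty (‘non-object’) masks fall back to a focus region built from the most confident pixels. The function name `refine_attention_mask`, the 0.5 threshold, the 3×3 erosion kernel, and the top-k fallback are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the refined masked-attention idea described above.
# NOT the authors' code: threshold, kernel size, and top-k are assumptions.
import torch
import torch.nn.functional as F


def refine_attention_mask(mask_prob: torch.Tensor,
                          threshold: float = 0.5,
                          edge_kernel: int = 3,
                          topk: int = 64):
    """Split per-query mask predictions into attention regions.

    mask_prob: (B, Q, H, W) per-query mask probabilities in [0, 1].
    Returns boolean tensors (interior, edge, focus) of the same shape.
    """
    B, Q, H, W = mask_prob.shape
    hard = mask_prob > threshold                 # binarised 'hard' mask
    has_object = hard.flatten(2).any(dim=-1)     # (B, Q): object vs. non-object query

    # Erode the hard mask: a pixel is 'interior' only if no pixel in its
    # kernel-sized neighbourhood lies outside the mask.
    pad = edge_kernel // 2
    dilated_bg = F.max_pool2d((~hard).float(), edge_kernel, stride=1, padding=pad)
    interior = hard & (dilated_bg < 0.5)
    edge = hard & ~interior                      # thin band along the boundary

    # Mask focusing for 'non-object' queries: keep only the top-k most
    # confident pixels so attention still has a high-probability region.
    flat = mask_prob.flatten(2)                  # (B, Q, H*W)
    idx = flat.topk(min(topk, H * W), dim=-1).indices
    focus = torch.zeros_like(flat, dtype=torch.bool).scatter_(-1, idx, True)
    focus = focus.view(B, Q, H, W) & ~has_object[:, :, None, None]

    return interior, edge, focus


# Example: 2 images, 100 queries, 64x64 mask predictions
probs = torch.rand(2, 100, 64, 64)
interior, edge, focus = refine_attention_mask(probs)
```

In a Mask2Former-style decoder, the interior, edge, and focus regions would drive separate attention branches whose outputs are adaptively fused; that fusion step is outside the scope of this sketch.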
| Published in: | Alexandria Engineering Journal |
|---|---|
| Main Authors: | Xingliang Zhu, Weiwei Yu, Xiaoyu Dong, Wei Zhang, Bin Kong |
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-11-01 |
| Subjects: | Universal image segmentation; Masked transformer; Refined masked attention |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1110016825010361 |
| author | Xingliang Zhu; Weiwei Yu; Xiaoyu Dong; Wei Zhang; Bin Kong |
|---|---|
| collection | DOAJ |
| container_title | Alexandria Engineering Journal |
| description | Mask-based universal image segmentation architectures, which focus attention within the mask region, have significantly enhanced both segmentation accuracy and speed, becoming a key component of universal image segmentation frameworks. However, existing methods often generate masks from the predicted probability map using a hard threshold, which may introduce biased guidance due to missing or inaccurate object predictions. To address this issue, we propose a novel mask refinement method, called MR-Former. The method divides the segmentation output into two categories: ‘object’ and ‘non-object’ masks. For ‘object’ masks, we further refine the mask into interior and edge regions, and apply independent attention branches to learn their respective attention distributions, which are then adaptively fused. For ‘non-object’ masks, we introduce a mask focusing technique that selects key regions to ensure attention is concentrated on high-probability areas, effectively eliminating irrelevant distractions. Experimental results demonstrate that MR-Former, with a ResNet-50 backbone, significantly improves segmentation performance across multiple state-of-the-art architectures, including Mask2Former, OneFormer, and PEM. On the Cityscapes dataset, MR-Former achieves improvements of up to 0.9 PQ, 0.7 AP, and 1.3 mIoU with a maximum Params increase of only 1.4%; on the ADE20K dataset, it shows improvements of up to 1.1 PQ, 1.4 AP, and 1.3 mIoU accompanied by a maximum FLOPs increment of just 1G; and on the Mapillary Vistas dataset, it demonstrates improvements of up to 1.1 PQ and 1.2 mIoU with a maximum FPS loss of merely 3.4. |
| format | Article |
| id | doaj-art-fd0e2355a86e4c028cdf3cdedc52cfcd |
| institution | Directory of Open Access Journals |
| issn | 1110-0168 |
| language | English |
| publishDate | 2025-11-01 |
| publisher | Elsevier |
| record_format | Article |
| doi | 10.1016/j.aej.2025.09.072 |
| container_volume | 131 |
| container_start_page | 232 |
| container_end_page | 244 |
| affiliation | Xingliang Zhu, Weiwei Yu, Xiaoyu Dong, Wei Zhang, Bin Kong: Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, 230031, China; University of Science and Technology of China, Hefei, 230026, China |
| affiliation | Bin Kong (corresponding author), additionally: Anhui Engineering Laboratory for Intelligent Driving Technology and Application, Hefei, Anhui, China; Innovation Research Institute of Robotics and Intelligent Manufacturing (Hefei), Chinese Academy of Sciences, Hefei, Anhui, China |
| title | MR-Former: Improving universal image segmentation via refined masked-attention transformer |
| topic | Universal image segmentation; Masked transformer; Refined masked attention |
| url | http://www.sciencedirect.com/science/article/pii/S1110016825010361 |
