Mutual Complementarity: Multi-Modal Enhancement Semantic Learning for Micro-Video Scene Recognition

Scene recognition is an active topic in micro-video understanding, where multi-modal information is commonly used because of its strong representational ability. However, exploiting multi-modal information is challenging because the semantic consistency among the modalities of a micro-video is weaker than in traditional videos, and the contributions of the individual modalities usually differ. To address these issues, this study proposes a multi-modal enhancement semantic learning method for micro-video scene recognition. In the proposed method, the visual modality is treated as the main modality, whereas other modalities, such as text and audio, are treated as auxiliary modalities. We propose a deep multi-modal fusion network for scene recognition that enhances the semantics of the auxiliary modalities using the main modality. Furthermore, the fusion weights of the modalities are learned adaptively. Experiments demonstrate the effectiveness of semantic enhancement and adaptive weight learning in multi-modal fusion for micro-video scene recognition.
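The abstract describes two mechanisms: enhancing auxiliary modalities (text, audio) with the main visual modality, and fusing all modalities with adaptively learned weights. The following is a minimal sketch of that general idea, not the authors' implementation; the gating projections, dimensions, and function names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def enhance(aux, main, W):
    # Hypothetical enhancement step: gate the auxiliary features with a
    # sigmoid projection of the main (visual) modality, so the visual
    # semantics modulate the auxiliary representation.
    gate = 1.0 / (1.0 + np.exp(-(W @ main)))
    return aux * gate

def fuse(main, auxiliaries, gate_mats, weight_logits):
    # Enhance each auxiliary modality with the main modality, then combine
    # all modalities with softmax-normalized fusion weights. In training,
    # weight_logits would be learned; here they are fixed for illustration.
    feats = [main] + [enhance(a, main, W) for a, W in zip(auxiliaries, gate_mats)]
    w = softmax(weight_logits)  # one scalar weight per modality
    return sum(wi * f for wi, f in zip(w, feats))

rng = np.random.default_rng(0)
d = 8  # toy feature dimension
visual, text, audio = (rng.normal(size=d) for _ in range(3))
gate_mats = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]
logits = np.zeros(3)  # uniform weights stand in for learned ones
fused = fuse(visual, [text, audio], gate_mats, logits)
print(fused.shape)
```

The fused vector would then feed a scene classifier; the key design point the abstract emphasizes is that the weights are not fixed a priori but adapted per the modalities' reliability.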

Bibliographic Details
Main Authors: Jie Guo, Xiushan Nie, Yilong Yin
Format: Article
Language:English
Published: IEEE 2020-01-01
Series:IEEE Access
Subjects: Micro-video scene recognition; multi-modal fusion; semantic enhancement; adaptive weight learning
Online Access:https://ieeexplore.ieee.org/document/8993777/
DOAJ record ID: doaj-33a3ea8682d84afb99f11696bcbfd8f8
DOI: 10.1109/ACCESS.2020.2973240
ISSN: 2169-3536
Published in: IEEE Access, vol. 8, pp. 29518-29524, 2020
Authors and affiliations:
Jie Guo (ORCID: 0000-0002-2859-0159), School of Computer Science and Technology, Shandong University, Jinan, China
Xiushan Nie, School of Computer Science and Technology, Shandong Jianzhu University, Jinan, China
Yilong Yin, School of Software, Shandong University, Jinan, China
Keywords: Micro-video scene recognition; multi-modal fusion; semantic enhancement; adaptive weight learning