Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching

Matching images and text with deep models has been extensively studied in recent years. Mining the correlation between image and text to learn effective multi-modal features is crucial for image-text matching. However, most existing approaches model the different types of correlation independently. In this work, we propose a novel model named Adversarial Attentive Multi-modal Embedding Learning (AAMEL) for image-text matching. It combines adversarial networks and attention mechanisms to learn effective and robust multi-modal embeddings for better matching between image and text. Adversarial learning is implemented as an interplay between two processes. First, two attention models are proposed to exploit two types of correlation between image and text for multi-modal embedding learning and to confuse the other process. A discriminator then tries to distinguish the two types of multi-modal embeddings learned by the two attention models, whereby the two attention models are mutually reinforced. Through adversarial learning, each of the two embeddings is expected to exploit both types of correlation well enough to deceive the discriminator into believing it was generated by the other attention model. By integrating the attention mechanism and adversarial learning, the learned multi-modal embeddings are more effective for image-text matching. Extensive experiments on the benchmark Flickr30K and MSCOCO datasets demonstrate the superiority of the proposed approach over state-of-the-art image-text retrieval methods.
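For context, below is a minimal PyTorch sketch of the GAN-style interplay the abstract describes: two attention-based embedding models act as generators and a discriminator tries to tell which of them produced a given joint embedding. This is not the authors' implementation; the module names (AttentiveEmbedder, Discriminator), feature dimensions, attention form, and binary cross-entropy losses are illustrative assumptions.

```python
# Hypothetical sketch (not the AAMEL code): two attention models produce joint
# image-text embeddings; a discriminator tries to tell them apart; adversarial
# training pushes each model to produce embeddings that could pass for the other's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveEmbedder(nn.Module):
    """One attention model: attends over image regions conditioned on the text."""
    def __init__(self, img_dim=2048, txt_dim=1024, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)
        self.out = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, img_regions, txt_feat):
        # img_regions: (B, R, img_dim); txt_feat: (B, txt_dim)
        img = self.img_proj(img_regions)                   # (B, R, emb_dim)
        txt = self.txt_proj(txt_feat)                      # (B, emb_dim)
        attn = torch.softmax(torch.bmm(img, txt.unsqueeze(2)).squeeze(2), dim=1)  # (B, R)
        attended_img = (attn.unsqueeze(2) * img).sum(dim=1)                       # (B, emb_dim)
        return self.out(torch.cat([attended_img, txt], dim=1))                    # joint embedding

class Discriminator(nn.Module):
    """Predicts which attention model an embedding came from."""
    def __init__(self, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb):
        return self.net(emb)  # logit: >0 means "from model A"

# One illustrative adversarial step on random tensors standing in for features.
emb_a, emb_b, disc = AttentiveEmbedder(), AttentiveEmbedder(), Discriminator()
opt_g = torch.optim.Adam(list(emb_a.parameters()) + list(emb_b.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

img_regions = torch.randn(8, 36, 2048)  # e.g. 36 detected regions per image
txt_feat = torch.randn(8, 1024)         # sentence-level text feature

# Discriminator step: separate embeddings of model A (label 1) from model B (label 0).
za, zb = emb_a(img_regions, txt_feat), emb_b(img_regions, txt_feat)
d_loss = F.binary_cross_entropy_with_logits(disc(za.detach()), torch.ones(8, 1)) + \
         F.binary_cross_entropy_with_logits(disc(zb.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: each attention model tries to make its embedding look like the other's.
za, zb = emb_a(img_regions, txt_feat), emb_b(img_regions, txt_feat)
g_loss = F.binary_cross_entropy_with_logits(disc(za), torch.zeros(8, 1)) + \
         F.binary_cross_entropy_with_logits(disc(zb), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In the paper, such embeddings would additionally be trained with a matching objective on paired image-text data; only the adversarial interplay is sketched here.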

Bibliographic Details
Main Authors: Kaimin Wei, Zhibo Zhou
Format: Article
Language: English
Published: IEEE, 2020-01-01
Series: IEEE Access
Subjects: Adversarial learning; attention; multi-modal; embedding; image-text matching
Online Access:https://ieeexplore.ieee.org/document/9097848/
DOAJ Record ID: doaj-5509786042714c80b3c750bb7dac6a14
Author Affiliations: Kaimin Wei, College of Information Science and Technology, Jinan University, Guangzhou, China; Zhibo Zhou (https://orcid.org/0000-0003-1527-4146), School of Computer Science and Engineering, Beihang University, Beijing, China
ISSN: 2169-3536
Volume: 8, Pages: 96237-96248
DOI: 10.1109/ACCESS.2020.2996407
Article Number: 9097848