Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching

Matching images and text with deep models has been extensively studied in recent years. Mining the correlation between image and text to learn effective multi-modal features is crucial for image-text matching. However, most existing approaches model the different types of correlation independently. In this work, we propose a novel model named Adversarial Attentive Multi-modal Embedding Learning (AAMEL) for image-text matching. It combines adversarial networks and attention mechanisms to learn effective and robust multi-modal embeddings for better matching between image and text. Adversarial learning is implemented as an interplay between two processes. First, two attention models are proposed to exploit two types of correlation between image and text for multi-modal embedding learning and to confuse the other process. A discriminator then tries to distinguish the two types of multi-modal embeddings learned by the two attention models, whereby the two attention models are mutually reinforced. Through adversarial learning, each of the two embeddings is expected to exploit both types of correlation well enough to deceive the discriminator into believing it was generated by the other attention model. By integrating the attention mechanism and adversarial learning, the learned multi-modal embeddings are more effective for image-text matching. Extensive experiments on the benchmark Flickr30K and MSCOCO datasets demonstrate the superiority of the proposed approach over state-of-the-art image-text retrieval methods.
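For context, below is a minimal PyTorch sketch of the GAN-style interplay the abstract describes: two attention-based embedding models act as generators and a discriminator tries to tell which of them produced a given joint embedding. This is not the authors' implementation; the module names (AttentiveEmbedder, Discriminator), feature dimensions, attention form, and binary cross-entropy losses are illustrative assumptions.

```python
# Hypothetical sketch (not the AAMEL code): two attention models produce joint
# image-text embeddings; a discriminator tries to tell them apart; adversarial
# training pushes each model to produce embeddings that could pass for the other's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveEmbedder(nn.Module):
    """One attention model: attends over image regions conditioned on the text."""
    def __init__(self, img_dim=2048, txt_dim=1024, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)
        self.out = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, img_regions, txt_feat):
        # img_regions: (B, R, img_dim); txt_feat: (B, txt_dim)
        img = self.img_proj(img_regions)                   # (B, R, emb_dim)
        txt = self.txt_proj(txt_feat)                      # (B, emb_dim)
        attn = torch.softmax(torch.bmm(img, txt.unsqueeze(2)).squeeze(2), dim=1)  # (B, R)
        attended_img = (attn.unsqueeze(2) * img).sum(dim=1)                       # (B, emb_dim)
        return self.out(torch.cat([attended_img, txt], dim=1))                    # joint embedding

class Discriminator(nn.Module):
    """Predicts which attention model an embedding came from."""
    def __init__(self, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb):
        return self.net(emb)  # logit: >0 means "from model A"

# One illustrative adversarial step on random tensors standing in for features.
emb_a, emb_b, disc = AttentiveEmbedder(), AttentiveEmbedder(), Discriminator()
opt_g = torch.optim.Adam(list(emb_a.parameters()) + list(emb_b.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

img_regions = torch.randn(8, 36, 2048)  # e.g. 36 detected regions per image
txt_feat = torch.randn(8, 1024)         # sentence-level text feature

# Discriminator step: separate embeddings of model A (label 1) from model B (label 0).
za, zb = emb_a(img_regions, txt_feat), emb_b(img_regions, txt_feat)
d_loss = F.binary_cross_entropy_with_logits(disc(za.detach()), torch.ones(8, 1)) + \
         F.binary_cross_entropy_with_logits(disc(zb.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: each attention model tries to make its embedding look like the other's.
za, zb = emb_a(img_regions, txt_feat), emb_b(img_regions, txt_feat)
g_loss = F.binary_cross_entropy_with_logits(disc(za), torch.zeros(8, 1)) + \
         F.binary_cross_entropy_with_logits(disc(zb), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In the paper, such embeddings would additionally be trained with a matching objective on paired image-text data; only the adversarial interplay is sketched here.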

Bibliographic Details
Main Authors: Kaimin Wei, Zhibo Zhou
Format: Article
Language: English
Published: IEEE, 2020-01-01
Series: IEEE Access
Subjects: Adversarial learning; attention; multi-modal; embedding; image-text matching
Online Access:https://ieeexplore.ieee.org/document/9097848/
DOAJ Record ID: doaj-5509786042714c80b3c750bb7dac6a14
Author Affiliations: Kaimin Wei, College of Information Science and Technology, Jinan University, Guangzhou, China; Zhibo Zhou (https://orcid.org/0000-0003-1527-4146), School of Computer Science and Engineering, Beihang University, Beijing, China
ISSN: 2169-3536
Volume: 8, Pages: 96237-96248
DOI: 10.1109/ACCESS.2020.2996407
Article Number: 9097848