Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Research related to fashion and e-commerce domains is gaining attention in computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently-proposed and under-explored ch...

Bibliographic Details
Published in: Sensors
Main Authors: Nicholas Moratelli, Manuele Barraco, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Format: Article
Language: English
Published: MDPI AG 2023-01-01
Subjects: image captioning; fashion captioning; knowledge retrieval; vision-and-language
Online Access: https://www.mdpi.com/1424-8220/23/3/1286
author Nicholas Moratelli
Manuele Barraco
Davide Morelli
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
collection DOAJ
container_title Sensors
description Research related to the fashion and e-commerce domains is gaining attention in the computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed that integrates an external textual memory accessed through k-nearest neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and can tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
format Article
id doaj-art-822bf3ff80cf417e9e9116e2f2c38bdf
institution Directory of Open Access Journals
issn 1424-8220
language English
publishDate 2023-01-01
publisher MDPI AG
record_format Article
spelling doaj-art-822bf3ff80cf417e9e9116e2f2c38bdf; 2025-08-19T22:43:23Z; eng; MDPI AG; Sensors; ISSN 1424-8220; 2023-01-01; vol. 23, no. 3, art. 1286; doi:10.3390/s23031286
Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates
Nicholas Moratelli [0]; Manuele Barraco [1]; Davide Morelli [2]; Marcella Cornia [3]; Lorenzo Baraldi [4]; Rita Cucchiara [5]
[0] Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, 41125 Modena, Italy
[1] Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, 41125 Modena, Italy
[2] Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, 41125 Modena, Italy
[3] Department of Education and Humanities, University of Modena and Reggio Emilia, 42121 Reggio Emilia, Italy
[4] Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, 41125 Modena, Italy
[5] Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, 41125 Modena, Italy
https://www.mdpi.com/1424-8220/23/3/1286
image captioning; fashion captioning; knowledge retrieval; vision-and-language
title Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates
topic image captioning
fashion captioning
knowledge retrieval
vision-and-language
url https://www.mdpi.com/1424-8220/23/3/1286