Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation
Main Authors: | Jisu Hwang, Incheol Kim |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-02-01 |
Series: | Sensors |
Subjects: | multimodal embedding; natural language instruction; panoramic image; vision-and-language navigation task; deep neural network; pretrained model |
Online Access: | https://www.mdpi.com/1424-8220/21/3/1012 |
id |
doaj-456be7cc055a404384e1f1aa1284f5bd |
---|---|
record_format |
Article |
spelling |
Jisu Hwang and Incheol Kim (Department of Computer Science, Kyonggi University, Suwon-si 16227, Korea). "Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation." Sensors, vol. 21, no. 3, article 1012, MDPI AG, 2021-02-01. ISSN 1424-8220. doi:10.3390/s21031012. https://www.mdpi.com/1424-8220/21/3/1012 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Jisu Hwang, Incheol Kim |
title |
Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation |
publisher |
MDPI AG |
series |
Sensors |
issn |
1424-8220 |
publishDate |
2021-02-01 |
description |
Due to recent advances in computer vision and natural language processing, there has been growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data, such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data to enable real-time perception of the task status from panoramic images and natural language instructions. This study proposes JMEBS, a novel deep neural network model with joint multimodal embedding and backtracking search for VLN tasks. The proposed JMEBS model uses a transformer-based joint multimodal embedding module that exploits both multimodal context and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path based on local and global scores of candidate actions. A novel global scoring method further improves performance by comparing the partial trajectories searched thus far against a plurality of natural language instructions. The performance of the proposed model on various operations was experimentally demonstrated and compared with other models using the Matterport3D Simulator and the room-to-room (R2R) benchmark dataset. |
topic |
multimodal embedding; natural language instruction; panoramic image; vision-and-language navigation task; deep neural network; pretrained model |
url |
https://www.mdpi.com/1424-8220/21/3/1012 |
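
The abstract above only sketches how backtracking-enabled greedy local search (BGLS) combines local and global scores over candidate actions. The following is a minimal, hypothetical Python sketch of that idea; the state representation, scoring callbacks, and the simple frontier-based backtracking policy are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a backtracking-enabled greedy local search (BGLS).
# Scoring functions and data structures are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """A navigation viewpoint with its partial trajectory and combined score."""
    viewpoint: str
    trajectory: List[str]
    score: float


def bgls(
    start: str,
    candidates_fn: Callable[[str], List[str]],           # viewpoint -> reachable viewpoints
    local_score_fn: Callable[[List[str], str], float],   # (trajectory, candidate) -> local score
    global_score_fn: Callable[[List[str]], float],       # partial trajectory vs. instructions
    is_goal_fn: Callable[[str], bool],
    max_steps: int = 30,
) -> Optional[List[str]]:
    """Greedy local search that shelves unexplored alternatives on a frontier
    and backtracks to the best of them when they outscore the greedy choice."""
    frontier: List[Node] = []                 # alternatives kept for backtracking
    current = Node(start, [start], 0.0)

    for _ in range(max_steps):
        if is_goal_fn(current.viewpoint):
            return current.trajectory

        # Score every candidate action with a mix of local and global evidence.
        scored = []
        for cand in candidates_fn(current.viewpoint):
            traj = current.trajectory + [cand]
            s = local_score_fn(current.trajectory, cand) + global_score_fn(traj)
            scored.append(Node(cand, traj, s))
        if not scored:
            return None

        scored.sort(key=lambda n: n.score, reverse=True)
        best, rest = scored[0], scored[1:]
        frontier.extend(rest)

        # Backtrack if a previously shelved alternative now looks better.
        if frontier:
            frontier.sort(key=lambda n: n.score, reverse=True)
            if frontier[0].score > best.score:
                backtrack = frontier.pop(0)
                frontier.append(best)         # shelve the greedy choice for later
                best = backtrack

        current = best

    return None
```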