Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation
Main Authors: | Jisu Hwang, Incheol Kim |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-02-01 |
Series: | Sensors |
Subjects: | multimodal embedding; natural language instruction; panoramic image; vision-and-language navigation task; deep neural network; pretrained model |
Online Access: | https://www.mdpi.com/1424-8220/21/3/1012 |
id |
doaj-456be7cc055a404384e1f1aa1284f5bd |
---|---|
record_format |
Article |
spelling |
Jisu Hwang and Incheol Kim (Department of Computer Science, Kyonggi University, Suwon-si 16227, Korea). "Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation." Sensors, vol. 21, no. 3, article 1012, MDPI AG, 2021-02-01. ISSN 1424-8220. doi:10.3390/s21031012. https://www.mdpi.com/1424-8220/21/3/1012 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Jisu Hwang, Incheol Kim |
title |
Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation |
publisher |
MDPI AG |
series |
Sensors |
issn |
1424-8220 |
publishDate |
2021-02-01 |
description |
Due to recent advances in computer vision and natural language processing, there has been growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data, such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data to enable real-time perception of the task status from panoramic images and natural language instructions. This study proposes JMEBS, a novel deep neural network model with joint multimodal embedding and backtracking search for VLN tasks. The proposed JMEBS model uses a transformer-based joint multimodal embedding module that exploits both multimodal context and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path based on local and global scores of candidate actions. A novel global scoring method further improves performance by comparing the partial trajectories searched thus far against a plurality of natural language instructions. The performance of the proposed model on various operations was experimentally demonstrated and compared with other models using the Matterport3D Simulator and the room-to-room (R2R) benchmark dataset. |
topic |
multimodal embedding; natural language instruction; panoramic image; vision-and-language navigation task; deep neural network; pretrained model |
url |
https://www.mdpi.com/1424-8220/21/3/1012 |
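
The abstract above only sketches how backtracking-enabled greedy local search (BGLS) combines local and global scores over candidate actions. The following is a minimal, hypothetical Python sketch of that idea; the state representation, scoring callbacks, and the simple frontier-based backtracking policy are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a backtracking-enabled greedy local search (BGLS).
# Scoring functions and data structures are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """A navigation viewpoint with its partial trajectory and combined score."""
    viewpoint: str
    trajectory: List[str]
    score: float


def bgls(
    start: str,
    candidates_fn: Callable[[str], List[str]],           # viewpoint -> reachable viewpoints
    local_score_fn: Callable[[List[str], str], float],   # (trajectory, candidate) -> local score
    global_score_fn: Callable[[List[str]], float],       # partial trajectory vs. instructions
    is_goal_fn: Callable[[str], bool],
    max_steps: int = 30,
) -> Optional[List[str]]:
    """Greedy local search that shelves unexplored alternatives on a frontier
    and backtracks to the best of them when they outscore the greedy choice."""
    frontier: List[Node] = []                 # alternatives kept for backtracking
    current = Node(start, [start], 0.0)

    for _ in range(max_steps):
        if is_goal_fn(current.viewpoint):
            return current.trajectory

        # Score every candidate action with a mix of local and global evidence.
        scored = []
        for cand in candidates_fn(current.viewpoint):
            traj = current.trajectory + [cand]
            s = local_score_fn(current.trajectory, cand) + global_score_fn(traj)
            scored.append(Node(cand, traj, s))
        if not scored:
            return None

        scored.sort(key=lambda n: n.score, reverse=True)
        best, rest = scored[0], scored[1:]
        frontier.extend(rest)

        # Backtrack if a previously shelved alternative now looks better.
        if frontier:
            frontier.sort(key=lambda n: n.score, reverse=True)
            if frontier[0].score > best.score:
                backtrack = frontier.pop(0)
                frontier.append(best)         # shelve the greedy choice for later
                best = backtrack

        current = best

    return None
```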