Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space

Continuous Sign Language Recognition (CSLR) refers to the challenging problem of recognizing sign language glosses and their temporal boundaries from weakly annotated video sequences. Previous methods focus mostly on visual feature extraction neglecting text information and failing to effectively mo...

Bibliographic Details
Main Authors:	Ilias Papastratis, Kosmas Dimitropoulos, Dimitrios Konstantinidis, Petros Daras
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Computer vision; continuous sign language recognition; cross-modal learning; deep-learning; joint latent space
Online Access:	https://ieeexplore.ieee.org/document/9090828/

id	doaj-670962fc1cf947efa1a47fd7b92d156d
record_format	Article
spelling	doaj-670962fc1cf947efa1a47fd7b92d156d | 2021-03-30T02:42:28Z | eng | IEEE | IEEE Access | 2169-3536 | 2020-01-01 | 8 | 91170 | 91180 | 10.1109/ACCESS.2020.2993650 | 9090828 | Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space | Ilias Papastratis (https://orcid.org/0000-0003-4664-2626); Kosmas Dimitropoulos (https://orcid.org/0000-0003-1584-7047); Dimitrios Konstantinidis; Petros Daras (https://orcid.org/0000-0003-3814-6710) | Affiliation (all four authors): Visual Computing Lab, Centre for Research and Technology Hellas-Information Technologies Institute, Thessaloniki, Greece | Continuous Sign Language Recognition (CSLR) refers to the challenging problem of recognizing sign language glosses and their temporal boundaries from weakly annotated video sequences. Previous methods focus mostly on visual feature extraction neglecting text information and failing to effectively model the intra-gloss dependencies. In this work, a cross-modal learning approach that leverages text information to improve vision-based CSLR is proposed. To this end, two powerful encoding networks are initially used to produce video and text embeddings prior to their mapping and alignment into a joint latent representation. The purpose of the proposed cross-modal alignment is the modelling of intra-gloss dependencies and the creation of more descriptive video-based latent representations for CSLR. The proposed method is trained jointly with video and text latent representations. Finally, the aligned video latent representations are classified using a jointly trained decoder. Extensive experiments on three well-known sign language recognition datasets and comparison with state-of-the-art approaches demonstrate the great potential of the proposed approach. | https://ieeexplore.ieee.org/document/9090828/ | Computer vision; continuous sign language recognition; cross-modal learning; deep-learning; joint latent space
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Ilias Papastratis; Kosmas Dimitropoulos; Dimitrios Konstantinidis; Petros Daras
spellingShingle	Ilias Papastratis; Kosmas Dimitropoulos; Dimitrios Konstantinidis; Petros Daras | Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space | IEEE Access | Computer vision; continuous sign language recognition; cross-modal learning; deep-learning; joint latent space
author_facet	Ilias Papastratis; Kosmas Dimitropoulos; Dimitrios Konstantinidis; Petros Daras
author_sort	Ilias Papastratis
title	Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space
title_short	Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space
title_full	Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space
title_fullStr	Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space
title_full_unstemmed	Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space
title_sort	continuous sign language recognition through cross-modal alignment of video and text embeddings in a joint-latent space
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2020-01-01
description	Continuous Sign Language Recognition (CSLR) refers to the challenging problem of recognizing sign language glosses and their temporal boundaries from weakly annotated video sequences. Previous methods focus mostly on visual feature extraction neglecting text information and failing to effectively model the intra-gloss dependencies. In this work, a cross-modal learning approach that leverages text information to improve vision-based CSLR is proposed. To this end, two powerful encoding networks are initially used to produce video and text embeddings prior to their mapping and alignment into a joint latent representation. The purpose of the proposed cross-modal alignment is the modelling of intra-gloss dependencies and the creation of more descriptive video-based latent representations for CSLR. The proposed method is trained jointly with video and text latent representations. Finally, the aligned video latent representations are classified using a jointly trained decoder. Extensive experiments on three well-known sign language recognition datasets and comparison with state-of-the-art approaches demonstrate the great potential of the proposed approach.
topic	Computer vision; continuous sign language recognition; cross-modal learning; deep-learning; joint latent space
url	https://ieeexplore.ieee.org/document/9090828/
work_keys_str_mv	AT iliaspapastratis continuoussignlanguagerecognitionthroughcrossmodalalignmentofvideoandtextembeddingsinajointlatentspace AT kosmasdimitropoulos continuoussignlanguagerecognitionthroughcrossmodalalignmentofvideoandtextembeddingsinajointlatentspace AT dimitrioskonstantinidis continuoussignlanguagerecognitionthroughcrossmodalalignmentofvideoandtextembeddingsinajointlatentspace AT petrosdaras continuoussignlanguagerecognitionthroughcrossmodalalignmentofvideoandtextembeddingsinajointlatentspace
_version_	1724184738082586624
