EgoCom: A Multi-Person Multi-Modal Egocentric Communications Dataset

Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio, egocentric video with 240,000 ground-truth, time-stamped word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines to predict turn-taking within 5 percent of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79 percent relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to augment performance of embodied AI tasks.

Bibliographic Details
Main Authors: Lovegrove, S. (Author), Newcombe, R. (Author), Northcutt, C.G. (Author), Zha, S. (Author)
Format: Article
Language: English
Published: NLM (Medline) 2023
Subjects: Algorithms, Artificial Intelligence, Bayes Theorem, Communication, Humans, Interpersonal Communication, Machine Learning
Online Access: View Fulltext in Publisher
View in Scopus
LEADER 02529nam a2200337Ia 4500
001 10.1109-TPAMI.2020.3025105
008 230529s2023 CNT 000 0 und d
020 |a 19393539 (ISSN) 
245 1 0 |a EgoCom: A Multi-Person Multi-Modal Egocentric Communications Dataset 
260 0 |b NLM (Medline)  |c 2023 
300 |a 11 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1109/TPAMI.2020.3025105 
856 |z View in Scopus  |u https://www.scopus.com/inward/record.uri?eid=2-s2.0-85159552347&doi=10.1109%2fTPAMI.2020.3025105&partnerID=40&md5=64d5e80dad532a71a12f9689cf723385 
520 3 |a Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio, egocentric video with 240,000 ground-truth, time-stamped word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines to predict turn-taking within 5 percent of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79 percent relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to augment performance of embodied AI tasks. 
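The abstract above mentions combining Google speech-to-text outputs from multiple egocentric perspectives to improve multi-speaker transcription. The sketch below is a minimal Python illustration of one plausible confidence-based fusion of word-level hypotheses; the (start_sec, word, confidence) layout, the time-bin size, and the fusion rule are assumptions for illustration, not the authors' actual method.

# Illustrative sketch only: fuse word-level ASR hypotheses from several
# egocentric perspectives by keeping, per time bin, the word with the
# highest confidence. Data layout and fusion rule are assumptions,
# not the method described in the paper.
def fuse_transcripts(perspectives, bin_size=0.5):
    """perspectives: list of per-perspective lists of (start_sec, word, confidence)."""
    best = {}  # time bin index -> (confidence, word)
    for hypotheses in perspectives:
        for start, word, conf in hypotheses:
            b = int(start / bin_size)
            if b not in best or conf > best[b][0]:
                best[b] = (conf, word)
    return [best[b][1] for b in sorted(best)]

if __name__ == "__main__":
    # Two hypothetical egocentric perspectives transcribing the same moment.
    ego_a = [(0.0, "hello", 0.9), (0.6, "there", 0.4)]
    ego_b = [(0.0, "hello", 0.7), (0.6, "their", 0.8)]
    print(fuse_transcripts([ego_a, ego_b]))  # ['hello', 'their']

The design choice here is simply that, when perspectives disagree, the hypothesis from the capture closest to (and presumably most confident about) the active speaker wins; any real system would also need alignment and speaker labels, which EgoCom's synchronized, time-stamped annotations provide.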
650 0 4 |a algorithm 
650 0 4 |a Algorithms 
650 0 4 |a artificial intelligence 
650 0 4 |a Artificial Intelligence 
650 0 4 |a Bayes theorem 
650 0 4 |a Bayes Theorem 
650 0 4 |a Communication 
650 0 4 |a human 
650 0 4 |a Humans 
650 0 4 |a interpersonal communication 
650 0 4 |a machine learning 
650 0 4 |a Machine Learning 
700 1 0 |a Lovegrove, S.  |e author 
700 1 0 |a Newcombe, R.  |e author 
700 1 0 |a Northcutt, C.G.  |e author 
700 1 0 |a Zha, S.  |e author 
773 |t IEEE transactions on pattern analysis and machine intelligence