EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset

Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio, egocentric video with 240,000 ground-truth, time-stamped word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines to predict turn-taking within 5% of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79% relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to augment performance of embodied AI tasks.


Bibliographic Details
Main Authors: Northcutt, Curtis George (Contributor), Zha, Shengxin (Author), Lovegrove, Steven (Author), Newcombe, Richard (Author)
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Contributor)
Format: Article
Language: English
Published: Institute of Electrical and Electronics Engineers (IEEE), 2021-06-07T14:17:15Z.
Subjects:
Online Access: Get fulltext (https://hdl.handle.net/1721.1/130907)
LEADER 02123 am a22001933u 4500
001 130907
042 |a dc 
100 1 0 |a Northcutt, Curtis George  |e author 
100 1 0 |a Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science  |e contributor 
100 1 0 |a Northcutt, Curtis George  |e contributor 
700 1 0 |a Zha, Shengxin  |e author 
700 1 0 |a Lovegrove, Steven  |e author 
700 1 0 |a Newcombe, Richard  |e author 
245 0 0 |a EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset 
260 |b Institute of Electrical and Electronics Engineers (IEEE),   |c 2021-06-07T14:17:15Z. 
856 |z Get fulltext  |u https://hdl.handle.net/1721.1/130907 
520 |a Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio, egocentric video with 240,000 ground-truth, time-stamped word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines to predict turn-taking within 5% of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79% relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to augment performance of embodied AI tasks. 
655 7 |a Article 
773 |t IEEE Transactions on Pattern Analysis and Machine Intelligence
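Illustrative note on the multi-speaker transcription application mentioned in the abstract: the record only states that Google speech-to-text outputs from the simultaneous egocentric perspectives are combined to improve global transcription. The minimal sketch below shows one plausible combination scheme (keeping, per time slot, the word hypothesis with the highest confidence). The Word structure, the merge_perspectives function, the 0.5 s time slot, and the max-confidence rule are all assumptions made here for illustration; they are not taken from the EgoCom paper or this record.

# Minimal sketch, assuming per-perspective word-level ASR hypotheses
# with start times and confidences (hypothetical format, not EgoCom's).
from dataclasses import dataclass

@dataclass
class Word:
    start: float       # start time in seconds
    text: str
    confidence: float   # 0..1, as reported by the ASR engine

def merge_perspectives(hypotheses, slot=0.5):
    """Merge word lists from several egocentric perspectives by keeping,
    for each time slot, the word with the highest confidence."""
    best = {}  # time-slot index -> highest-confidence Word seen so far
    for words in hypotheses:
        for w in words:
            key = round(w.start / slot)
            if key not in best or w.confidence > best[key].confidence:
                best[key] = w
    return [best[k].text for k in sorted(best)]

# Toy usage: two wearers hear the same utterance with different reliability.
p1 = [Word(0.0, "hello", 0.9), Word(0.6, "world", 0.4)]
p2 = [Word(0.0, "hollow", 0.3), Word(0.6, "world", 0.8)]
print(merge_perspectives([p1, p2]))  # ['hello', 'world']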