Visual recognition of human communication
Main Author: Chung, Joon Son
Other Authors: Zisserman, Andrew
Published: University of Oxford, 2017
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.740921
id: ndltd-bl.uk-oai-ethos.bl.uk-740921
record_format: oai_dc
spelling: Chung, Joon Son. Visual recognition of human communication. University of Oxford, 2017. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.740921 https://ora.ox.ac.uk/objects/uuid:ac44ec7c-20e8-4b04-8d80-66687bd8e881 Electronic Thesis or Dissertation
collection: NDLTD
sources: NDLTD
description:
The objective of this work is the visual recognition of speech and gestures. Solving this problem opens up a host of applications, such as transcribing archival silent films or resolving multi-talker simultaneous speech, but most importantly it helps to advance the state of the art in speech recognition by enabling machines to take advantage of the multi-modal nature of human communication. However, visual recognition of speech and gestures is a challenging problem, in part due to the lack of annotations and datasets, but also due to inter- and intra-personal variations and, in the case of visual speech, ambiguities arising from homophones. Training a deep learning algorithm requires a large amount of training data. We propose a method to automatically collect, process and generate a large-scale audio-visual corpus from television videos temporally aligned with the transcript. To build such a dataset, it is essential to know 'who' is speaking 'when'. We develop a ConvNet model that learns a joint embedding of the sound and the mouth images from unlabelled data, and apply this network to the tasks of audio-to-video synchronisation and active speaker detection. Not only does this play a crucial role in building the dataset that forms the basis of much of the research in this thesis, but the method also learns powerful representations of the visual and auditory inputs, which can be used for related tasks such as lip reading. We also show that the methods developed here can be extended to the problem of generating talking faces from audio and still images. We then propose a number of deep learning models that are able to recognise visual speech at the word and sentence level. In both scenarios, we demonstrate recognition performance that exceeds the state of the art on public datasets; in the case of the latter, the lip reading performance beats a professional lip reader on videos from BBC television. We also demonstrate that, if audio is available, visual information helps to improve speech recognition performance. Next, we present a method to recognise and localise short temporal signals in image time series, where strong supervision is not available for training. We propose image encodings and ConvNet-based architectures to first recognise the signal, and then to localise it using back-propagation. The method is demonstrated for localising spoken words in audio and for localising signed gestures in British Sign Language (BSL) videos. Finally, we explore the problem of speaker recognition. Whereas previous works on speaker identification have been limited to constrained conditions, here we build a new large-scale speaker recognition dataset collected from 'in the wild' videos using an automated pipeline. We propose a number of ConvNet architectures that outperform traditional baselines on this dataset.
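The joint audio-visual embedding described in the abstract can be pictured, very loosely, as two encoders mapping sound and mouth images into a shared space, trained with a contrastive objective so that synchronised pairs lie close together and mismatched pairs lie apart. The following is a toy NumPy sketch of that idea only, not the thesis's actual model: the real network uses convolutional encoders over mouth-crop sequences and audio features, whereas here the encoders are plain linear maps and all dimensions are made-up assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, w):
    """Toy 'encoder': a linear map followed by L2 normalisation."""
    e = x @ w
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

def contrastive_loss(audio_emb, video_emb, same, margin=0.5):
    """Pull genuinely synchronised pairs (same=1) together; push
    mismatched pairs (same=0) at least `margin` apart."""
    d = np.linalg.norm(audio_emb - video_emb, axis=-1)
    return np.mean(same * d**2 + (1 - same) * np.maximum(margin - d, 0.0)**2)

# Hypothetical dimensions: 40-dim audio features and 64-dim mouth-crop
# features, both mapped into a shared 16-dim embedding space.
Wa = rng.normal(size=(40, 16))
Wv = rng.normal(size=(64, 16))

audio = rng.normal(size=(8, 40))             # batch of audio snippets
video = rng.normal(size=(8, 64))             # batch of mouth-region crops
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = in sync, 0 = mismatched

loss = contrastive_loss(embed(audio, Wa), embed(video, Wv), labels)
print(float(loss))
```

Once trained, the same distance can be slid over candidate audio-video offsets to pick the best-synchronised alignment, which is how an embedding of this kind doubles as a synchronisation and active-speaker-detection tool.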
author2: Zisserman, Andrew
author_facet: Zisserman, Andrew; Chung, Joon Son
author: Chung, Joon Son
spellingShingle: Chung, Joon Son; Visual recognition of human communication
author_sort: Chung, Joon Son
title: Visual recognition of human communication
title_short: Visual recognition of human communication
title_full: Visual recognition of human communication
title_fullStr: Visual recognition of human communication
title_full_unstemmed: Visual recognition of human communication
title_sort: visual recognition of human communication
publisher: University of Oxford
publishDate: 2017
url: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.740921
work_keys_str_mv: AT chungjoonson visualrecognitionofhumancommunication
_version_: 1718806746056622080