Learning to recognise visual content from textual annotation

Bibliographic Details
Main Author: Marter, Matthew John
Other Authors: Bowden, Richard ; Hadfield, Simon
Published: University of Surrey 2019
Online Access:https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.767000
Description
Summary: This thesis explores how machine learning can be applied to the task of learning to recognise visual content from different forms of textual annotation, bringing together computer vision and natural language processing. The data used in the thesis is taken from real-world sources, including broadcast television and photographs harvested from the internet. This places very few constraints on the data, meaning there can be large variations in lighting, facial expression, visual properties of objects and camera angles. These sources provide the volume of data required by modern machine learning approaches; however, annotation and ground truth are not readily available and are potentially expensive to obtain. This work therefore employs weak textual annotation in the form of subtitles, scripts, captions and tags. The use of weak textual annotation also means that different techniques are required to handle the natural language used to describe the visual content.

Character identification requires a different approach because of the similarities shared between all faces. As with location recognition, the script is aligned with the video using the subtitles. Faces are detected with a face detector and facial landmarks are regressed; the landmarks are used to build a descriptor for each face. Multiple techniques are used to assign identities from the script to the detected faces. In the first, the facial descriptors are clustered into as many groups as there are characters, and the cluster sizes are matched to each character's screen time. In the second, a random forest is trained to differentiate between faces, and its splitting criteria are used to reduce the dimensionality of the facial features. The reduced dimensionality allows a distribution of facial features to be built per scene, and rules then separate scenes and identify the distributions belonging to individual characters. In addition, data harvested from the internet is used to learn the appearance of the actors, which is then matched to the characters. These techniques give a character-labelling accuracy of up to 82.75% with a SIFT-based descriptor and up to 96.82% with a state-of-the-art descriptor.

Automatic caption generation for images is a relatively new and complex topic, as it requires both an understanding of the visual content of an image and the formation of natural language. Deep learning is powerful for object recognition and provides excellent performance on image-recognition datasets. Pretrained convolutional neural networks (CNNs) are fine-tuned using the parts of speech (POS) extracted from the natural-language captions. A probabilistic language model can be created from the captions in the training data and used to generate new sentences for unseen images. To better model more complex language rules, a recurrent neural network (RNN) is used to generate sentences directly from features extracted by a CNN. An RNN that uses attention to look at different parts of an image can also utilise the final layers of a CNN to provide context for the whole image. Location recognition, character identification and RNNs are combined to automatically generate descriptions for broadcast television using character and location names, creating a full pipeline for automatically labelling an unseen episode of a television series.
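As a rough illustration of the first character-identification technique described in the summary, the sketch below clusters pre-computed face descriptors into one group per scripted character and matches cluster sizes to script-derived screen-time estimates. It is a minimal sketch under stated assumptions, not the thesis implementation: the descriptor extraction, the screen-time estimates, the use of k-means and the Hungarian algorithm, and all function and parameter names are assumptions introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment


def assign_identities(face_descriptors, screen_time_by_character, seed=0):
    """Cluster face descriptors and match clusters to characters by screen time.

    face_descriptors: (n_faces, d) array of per-face feature vectors.
    screen_time_by_character: dict mapping character name -> estimated share
        of screen time derived from the aligned script (values sum to 1).
    Returns a character label for every detected face.
    """
    names = list(screen_time_by_character)
    k = len(names)

    # One cluster per scripted character.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(face_descriptors)
    cluster_share = np.bincount(km.labels_, minlength=k) / len(face_descriptors)

    # Cost of pairing cluster i with character j: mismatch between the
    # cluster's share of detections and the character's share of screen time.
    expected = np.array([screen_time_by_character[n] for n in names])
    cost = np.abs(cluster_share[:, None] - expected[None, :])

    # A one-to-one assignment with minimal total mismatch.
    cluster_idx, char_idx = linear_sum_assignment(cost)
    cluster_to_name = {c: names[j] for c, j in zip(cluster_idx, char_idx)}

    return [cluster_to_name[c] for c in km.labels_]
```

A one-to-one matching is just one simple way to realise the size-to-screen-time matching the summary describes; the thesis may use a different matching rule.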
Compared with ground-truth location and character input, only a small drop in performance occurs when the labels are instead predicted by computer vision and machine learning techniques: using ground truth, a CIDEr score of 1.585 is achieved, compared with 1.343 for a fully predicted system. Data providing emotional context for words and images allows the RNN to manipulate the emotional context of the generated image captions. Subjective testing shows that the output captions are judged more emotive than captions generated without emotional context 74.85% of the time, and are rated almost equal to human-written captions. Adjusting the emotional context is shown to generate captions whose content changes to reflect the chosen emotion. The fusion of computer vision and natural language processing through machine learning represents an important step for both fields.
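The emotional-context manipulation mentioned above could, in principle, be realised by conditioning a recurrent caption decoder on an emotion signal alongside the image feature. The sketch below is a hypothetical PyTorch illustration of that idea, not the architecture used in the thesis; the layer sizes, the emotion vocabulary and all class and parameter names are assumptions.

```python
import torch
import torch.nn as nn


class EmotiveCaptioner(nn.Module):
    """Toy caption decoder conditioned on a CNN image feature and an emotion id."""

    def __init__(self, vocab_size, image_dim=2048, emotion_count=6,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.emotion_embed = nn.Embedding(emotion_count, embed_dim)
        # Project the image feature + emotion embedding into the initial LSTM state.
        self.init_state = nn.Linear(image_dim + embed_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, emotion_ids, caption_tokens):
        # image_features: (B, image_dim); emotion_ids: (B,); caption_tokens: (B, T)
        context = torch.cat([image_features, self.emotion_embed(emotion_ids)], dim=1)
        h0 = torch.tanh(self.init_state(context)).unsqueeze(0)  # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.word_embed(caption_tokens), (h0, c0))
        return self.to_vocab(out)  # per-step vocabulary logits, (B, T, vocab_size)
```

At generation time, feeding the same image feature with different emotion_ids would steer the wording of the caption towards different emotional tones, which is the behaviour the subjective tests above evaluate.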