Learning relational event models from videos


Bibliographic Details
Main Author: Dubba, Krishna Sandeep Reddy
Published: University of Leeds, 2012
Subjects:
Online Access:http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.590428
Description
Summary:Learning event models from videos has applications ranging from abnormal event detection to content-based video retrieval. When multiple agents are involved in the events, characterizing events naturally suggests encoding interactions as relations. This can be realized by tracking the objects using computer vision algorithms and encoding the interactions using qualitative spatial and temporal relations. Learning event models from this kind of relational spatio-temporal data is particularly challenging because of the presence of multiple objects, uncertainty from tracking and, especially, the time component, as this increases the size of the relational data (the number of temporal relational facts grows quadratically with the number of intervals present). Relational learning techniques such as Inductive Logic Programming (ILP) hold promise for building models from this kind of data, but have not been successfully applied to the very large datasets which result from video data. In this thesis, we present a novel supervised learning framework to learn relational event models from large video datasets (several million frames) using ILP. Efficiency is achieved via the learning from interpretations setting and using a typing system that exploits the type hierarchy of objects in a domain. We also present a type-refining operator and prove that it is optimal. Positive and negative examples are extracted using domain experts' minimal event annotations (termed deictic supervision), which are used for learning relational event models. These models can be used for recognizing events from unseen videos. Since input data from sensors is prone to noise, we present extensions to the original framework that integrate abduction, as well as an extension based on Markov Logic Networks, to obtain robust probabilistic models that improve event recognition performance.
The experimental results on video data from two challenging real-world domains (an airport domain, which has events such as loading, unloading, passenger-bridge parking, etc., and a verbs domain, which has verbs like exchange, pick-up, etc.) suggest that the techniques are suitable for real-world scenarios.
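The abstract's point about qualitative temporal relations and the quadratic growth of the relational data can be illustrated with a minimal sketch (not taken from the thesis itself): each pair of tracked activity intervals is labelled with one of Allen's interval relations, so n intervals yield on the order of n² relational facts. The object names and frame spans below are hypothetical examples.

```python
# Illustrative sketch: encode interactions between tracked objects as
# qualitative temporal relations (Allen's interval algebra). Every pair of
# intervals produces one relational fact, so the fact count grows
# quadratically with the number of intervals.
from itertools import combinations

def allen_relation(a, b):
    """Return Allen's interval relation between a=(start, end) and b=(start, end)."""
    (as_, ae), (bs, be) = a, b
    if ae < bs: return "before"
    if be < as_: return "after"
    if ae == bs: return "meets"
    if be == as_: return "met-by"
    if as_ == bs and ae == be: return "equal"
    if as_ == bs: return "starts" if ae < be else "started-by"
    if ae == be: return "finishes" if as_ > bs else "finished-by"
    if as_ > bs and ae < be: return "during"
    if as_ < bs and ae > be: return "contains"
    return "overlaps" if as_ < bs else "overlapped-by"

# Hypothetical tracker output: object id -> (start frame, end frame).
intervals = {"plane": (0, 100), "bridge": (20, 60), "truck": (50, 120)}

# One qualitative fact per unordered pair: n*(n-1)/2 facts for n intervals.
facts = [(x, y, allen_relation(intervals[x], intervals[y]))
         for x, y in combinations(intervals, 2)]
# e.g. ("plane", "bridge", "contains"), ("bridge", "truck", "overlaps")
```

Facts of this form, over both spatial and temporal relations, are the relational input from which an ILP system would then generalize event models.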