Bayesian Nonparametric Modeling of Temporal Coherence for Entity-driven Video Analytics


Abstract:

With the advent of video-sharing sites like YouTube, online user-generated video content is growing rapidly. To simplify the search for meaningful information in such a huge volume of content, computer vision researchers have begun to work on problems such as video summarization and scene discovery. People understand videos in terms of high-level semantic concepts, but most current research in video analytics relies on low-level features and descriptors, which may lack semantic interpretation. We aim to fill this gap by modeling implicit structural information about videos, such as spatio-temporal properties. We represent a video as a collection of semantic visual concepts, which we call “entities”, such as the persons appearing in a movie. To aid these tasks, we model the important property of “temporal coherence”: adjacent frames are likely to have similar visual features and to contain the same set of entities. Bayesian nonparametrics offers a natural way to model all of these properties, but it also gives rise to the need for new models and algorithms.

A tracklet is a spatio-temporal fragment of a video: a set of spatial regions in a short sequence of consecutive frames, each of which encloses a particular entity. We first seek a representation of tracklets, to aid tracking entities across videos, using region descriptors such as covariance matrices of spatial features that exploit temporal coherence.

Next, we model temporal coherence at a semantic level. Each tracklet is associated with an entity. Spatio-temporally close but non-overlapping tracklets are likely to belong to the same entity, while tracklets that overlap in time can never belong to the same entity. The aim is to cluster the tracklets by their associated entities, thereby discovering the entities in the video along with all their occurrences. We represent each entity by a mixture component and propose a temporally coherent version of the Chinese Restaurant Process (TC-CRP) that encodes these constraints easily. TC-CRP shows excellent performance on person discovery from TV-series videos. We also discuss semantic video summarization based on entity discovery.

Finally, we consider entity-driven temporal segmentation of the video into scenes, where each scene is modeled as a sparse distribution over entities. We propose EntScene, a generative model for videos based on entities and scenes, along with an inference algorithm based on Dynamically Blocked Gibbs Sampling. Experimentally, we find significant improvements in segmentation and scene discovery compared to alternative inference algorithms.
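To make the TC-CRP idea concrete, the following is a minimal sketch (not the thesis's actual inference procedure) of sequential CRP-style seating in which a tracklet may never join a cluster that already contains a temporally overlapping tracklet. Tracklets are represented here simply as (start, end) frame intervals; the "spatio-temporally close tracklets are likely the same entity" prior and the observation likelihoods are omitted for brevity, and all names and the concentration parameter are illustrative.

```python
import random

def overlaps(a, b):
    """True if two tracklets, given as (start, end) frame intervals, overlap in time."""
    return a[0] <= b[1] and b[0] <= a[1]

def tc_crp_assign(tracklets, alpha=1.0, seed=0):
    """Sequentially seat tracklets at CRP tables (clusters = entities),
    forbidding any table that already holds a temporally overlapping
    tracklet: the cannot-link constraint of temporal coherence."""
    rng = random.Random(seed)
    tables = []       # each table is a list of tracklet indices
    assignment = []   # cluster label per tracklet
    for i, t in enumerate(tracklets):
        options, weights = [], []
        for k, members in enumerate(tables):
            # Constraint: tracklets overlapping in time can never share an entity.
            if any(overlaps(t, tracklets[j]) for j in members):
                continue
            options.append(k)
            weights.append(len(members))   # standard CRP: popular tables attract
        options.append(len(tables))        # open a new table (new entity)
        weights.append(alpha)
        k = rng.choices(options, weights=weights)[0]
        if k == len(tables):
            tables.append([])
        tables[k].append(i)
        assignment.append(k)
    return assignment
```

In a full model, the seating probabilities would additionally be weighted by how well each tracklet's features fit each entity's mixture component; the sketch shows only how the hard temporal constraint restricts the set of admissible tables.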