Towards Efficient and Scalable Visual Processing in Images and Videos

Mihir Jain

The amount of multimedia content produced and made available on the Internet and in professional and personal collections is constantly growing. The need for efficient and effective ways to manage it is growing equally fast. This has led to a great deal of research into content-based retrieval and visual recognition. In this thesis, we focus on efficient visual content analysis in images and videos. Efficiency has emerged as one of the key issues as the quantity of data increases. Understanding visual content has several aspects. One can concentrate on recognizing the inherent characteristics of an image (standalone or taken from a video), such as objects, scene and context. Other aspects include searching for a sequence of images based on similarity, or characterizing a video based on its visual content.

We investigate three different approaches to visual content analysis in this thesis. In the first, we target the detection and classification of different object and scene classes in images and videos. The task of classification is to predict the presence of an object or a specific scene of interest in the test image. Object detection further involves localizing each instance of the object present. We experiment extensively on very large and challenging datasets with a large number of object and scene categories. Both our detection and our classification are based on Random Forests used with combinations of different visual features describing shape, appearance and color. We exploit the computational efficiency of Random Forests in both training and testing, along with their other properties, for detection and classification. We also propose enhancements over our baseline object detector. Our main contribution here is fast object detection with accuracy comparable to the state of the art.

The second approach is based on processing a continuous video stream to detect video segments of interest. Our method is example-based: the visual content to be detected or filtered is characterized by a set of examples available a priori. We approach the problem of video processing in a manner complementary to that of video retrieval. We begin with a set of examples (which would be used as queries in retrieval) and index them in the database. The larger video collection, which needs to be processed, is unseen during the off-line indexing phase. We propose an architecture based on the trie data structure and the bag-of-words model to simultaneously match multiple example videos in the database against the large input video stream. We demonstrate the application of our architecture to the task of content-based copy detection (CBCD).
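The idea of indexing the examples in a trie and matching them all against the stream in a single pass can be illustrated with a minimal sketch. It is not the architecture from the thesis: it assumes each frame has already been reduced to a bag of visual-word IDs, collapses that bag into a hashable signature (its k most frequent words, an assumption made only for this illustration), and matches signatures exactly. The names frame_signature, build_trie and scan_stream are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple


def frame_signature(visual_words: Iterable[int], k: int = 2) -> Tuple[int, ...]:
    """Collapse a frame's bag of visual words into its k most frequent words.

    Illustrative assumption only; ties are broken by the smaller word ID.
    """
    counts = Counter(visual_words)
    top = sorted(counts, key=lambda w: (-counts[w], w))[:k]
    return tuple(sorted(top))


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[Hashable, "TrieNode"] = {}
        self.example_ids: List[str] = []  # examples whose sequence ends here


def build_trie(examples: Dict[str, Sequence[Hashable]]) -> TrieNode:
    """Off-line phase: index every example sequence in one shared trie."""
    root = TrieNode()
    for example_id, symbols in examples.items():
        node = root
        for symbol in symbols:
            node = node.children.setdefault(symbol, TrieNode())
        node.example_ids.append(example_id)
    return root


def scan_stream(root: TrieNode, stream: Iterable[Hashable]) -> List[Tuple[str, int, int]]:
    """On-line phase: one pass over the stream, tracking all partial matches.

    Returns (example_id, start_index, end_index) for every exact occurrence.
    """
    matches: List[Tuple[str, int, int]] = []
    active: List[Tuple[TrieNode, int]] = []  # (trie node reached, match start)
    for pos, symbol in enumerate(stream):
        next_active: List[Tuple[TrieNode, int]] = []
        # Extend every partial match, and also try starting a new one here.
        for node, start in active + [(root, pos)]:
            child = node.children.get(symbol)
            if child is None:
                continue
            for example_id in child.example_ids:
                matches.append((example_id, start, pos))
            next_active.append((child, start))
        active = next_active
    return matches
```

For example, with examples {"ad": ["A", "B"], "logo": ["B", "C"]} and the stream ["X", "A", "B", "C", "X"], scan_stream reports both examples even though they overlap. A practical copy detector would also need to tolerate noisy signatures, which this exact-match sketch does not attempt.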

In our third and final approach, we apply pattern mining algorithms to videos to characterize their visual content. These algorithms are derived from data mining schemes for efficient analysis of the content in video databases. Two different video mining schemes are employed, both aimed at detecting frequent and representative patterns. The first uses an efficient frequent pattern mining algorithm over a quantized feature space. The second uses a random forest to represent video data as sequences, and mines the frequent sequences. We experiment on broadcast news videos to detect what we define as video stop-words and to extract more important content such as breaking news. We are also able to characterize movies by automatically identifying their characteristic scenes and main actors.
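To make frequent pattern mining over a quantized feature space concrete, the sketch below treats each video shot as a set of visual-word IDs and finds the word sets that occur in at least min_support shots. The abstract does not name the mining algorithm used in the thesis, so a plain Apriori-style search is used here purely as a stand-in. The function name frequent_itemsets and its parameters are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def frequent_itemsets(
    transactions: Iterable[Iterable[int]],
    min_support: int,
    max_size: int = 3,
) -> Dict[FrozenSet[int], int]:
    """Apriori-style mining: return each frequent itemset with its support.

    Each transaction stands for one shot, given as visual-word IDs
    (an assumption made for this illustration).
    """
    shots: List[FrozenSet[int]] = [frozenset(t) for t in transactions]

    # Level 1: count individual visual words.
    counts: Dict[FrozenSet[int], int] = {}
    for shot in shots:
        for item in shot:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    size = 1
    while current and size < max_size:
        size += 1
        # Candidates are unions of two frequent sets from the previous level
        # whose every (size - 1)-subset is itself frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in current for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        # Keep only the candidates that occur in enough shots.
        current = {}
        for cand in candidates:
            support = sum(1 for shot in shots if cand <= shot)
            if support >= min_support:
                current[cand] = support
        result.update(current)
    return result
```

In this toy setting, patterns with very high support across a news archive would play the role of the video stop-words mentioned above, although how the thesis actually defines and thresholds them is not stated in this abstract.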

The ideas proposed in the thesis have been implemented and validated with extensive experimental results. We demonstrate the accuracy, efficiency and scalability of all three approaches on large, standard datasets such as PASCAL VOC, TRECVID and MUSCLE-VCD, as well as movie and news datasets.


Year of completion: 2010
Advisor: C. V. Jawahar

Related Publications

  • Mihir Jain, Sreekanth Vempati, Chandrika Pulla and C. V. Jawahar - Example Based Video Filters, Proceedings of the 8th International Conference on Image and Video Retrieval (CIVR 2009), July 8-10, 2009, Santorini, Greece.

  • Sreekanth Vempati, Mihir Jain, Omkar Parkhi and C. V. Jawahar - Andrea Vedaldi, Marcin Marszalek and Andrew Zisserman, In Proceedings of the TREC Video Retrieval (TRECVID) Workshop organized by NIST, Gaithersburg, USA, 2009.

  • James Philbin, Manuel Marin-Jimenez, Siddharth Srinivasan and Andrew Zisserman, Mihir Jain, Sreekanth Vempati, Pramod Sankar and C. V. Jawahar, In Proceedings of the TREC Video Retrieval (TRECVID) Workshop organized by NIST, Gaithersburg, USA, 2008.