A Probabilistic Learning Approach for Modelling Human Activities

Ravi Kiran Sarvadevabhatla

The search for methods that give computers a human-like ability to learn motivates one of the most active disciplines of computer science: Machine Learning. Machine Learning draws on concepts and results from many fields, such as artificial intelligence, statistics and information theory. One discipline which has benefited from machine learning techniques is Computer Vision. Computer Vision deals with the conversion of images into descriptions. Quite often, statistical methods are used to understand image data using models constructed with the aid of geometry, physics and learning theory.

Real-world images are often characterized by uncertainty ('visual noise') that accompanies the data. Probability theory provides a proper framework to model this uncertainty. An attractive way to view Computer Vision systems is as models which "learn" descriptions while addressing uncertainty using probability. Therefore, machine learning algorithms such as Maximum-Likelihood estimation, Mixture Modelling and Expectation-Maximization, which are rooted in probabilistic reasoning, have become popular in Computer Vision. An emerging trend is towards graph-based representations called Graphical Models. These representations model the dependencies embedded among visual data and address uncertainty via probabilistic inference.

The automatic deduction of the structure of a possibly dynamic 3-D world from 2-D images is an important problem in Computer Vision, particularly when the 2-D images contain people. There has been considerable progress in areas of Computer Vision such as object recognition, image understanding and scene reconstruction from images. This progress, coupled with improvements in computational power, has prompted a new research focus on making machines that can see people, recognize them and interpret their activities. This has spawned applications in various domains, including surveillance, sign language recognition and Human-Computer Interaction (HCI).

In this thesis, a new framework to solve the problem of recognizing dynamic activities is presented, in the spirit of the aforementioned probabilistic learning. An activity is defined as a finite spatio-temporal change in the state of an entity (e.g. a human being waving hands). Activities, in turn, are composed of spatio-temporal units called actions (e.g. 'sitting' and 'standing up' are two predominant actions within the activity 'squatting'). Many human activities share common actions, which makes the image data highly correlated and redundant. Two-dimensional image redundancies are routinely exploited in image processing algorithms. In video, an additional temporal redundancy exists due to the smooth variation of the visual scene over time. Given a set of human activity videos, these redundancies are utilized to learn a compact representation for various actions and, subsequently, activities. The spatial correlations are captured with a Mixture of Factor Analyzers (MFA) model. This is essentially a reduced-dimensionality mixture-of-Gaussians graphical model which clusters the actions in the activity data in a low-dimensional space. The probabilistic structure underlying the activities is inferred using the Expectation-Maximization algorithm. The temporal aspect of each activity is stored as an action transition matrix. Given a hitherto unseen activity video, the embedded actions are extracted and recognition is performed using probabilistic inference.
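The role of the action transition matrix can be illustrated with a minimal sketch. It assumes that each video frame has already been assigned an action index (for instance, the most probable mixture component of a fitted MFA model) and that recognition picks the activity whose transition matrix gives the observed action sequence the highest log-likelihood. The abstract does not spell out the scoring rule, so this rule, the additive smoothing, the function names and the toy sequences are all illustrative assumptions rather than the thesis's implementation.

```python
import numpy as np

def transition_matrix(sequences, n_actions, smoothing=1.0):
    """Estimate a row-stochastic action transition matrix from label sequences.

    `smoothing` is an assumed additive prior so unseen transitions keep
    a small non-zero probability.
    """
    counts = np.full((n_actions, n_actions), smoothing, dtype=float)
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1.0
    # Normalize each row so that it sums to one.
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(seq, T):
    """Log-probability of the transitions in `seq` under matrix `T`."""
    return float(sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:])))

def recognize(seq, models):
    """Return the activity name whose matrix scores `seq` highest."""
    return max(models, key=lambda name: log_likelihood(seq, models[name]))

# Hypothetical toy data: action indices 0 = standing, 1 = sitting, 2 = arm raised.
training = {
    "squatting": [[0, 0, 1, 1, 0, 0, 1, 1, 0]],
    "waving": [[0, 2, 0, 2, 0, 2, 0]],
}
models = {
    name: transition_matrix(seqs, n_actions=3)
    for name, seqs in training.items()
}

# Classify a previously unseen action sequence.
print(recognize([0, 1, 1, 0, 0, 1], models))
```

Keeping one small matrix per activity means that recognition reduces to a handful of table lookups per frame, which is one plausible reading of why such a representation could suit real-time use.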

This new representation for human activities is intuitive, simple to learn and presents computational and theoretical advantages. Results on the recognition of various human activities are presented in this thesis. These highlight the suitability of the developed framework for applications involving whole-body activity recognition, including real-time applications. However, the framework is not limited to whole-body activities and can also be used for recognizing hand gestures and other human activities. A discussion of future directions for the proposed framework is also presented.


Year of completion: 2004
Advisor: C. V. Jawahar

Related Publications

  • S. S. Ravi Kiran, Karteek Alahari and C. V. Jawahar, Recognizing Human Activities from Constituent Actions, Proceedings of the National Conference on Communications (NCC), Jan. 2005, Kharagpur, India, pp. 351-355.

  • C. V. Jawahar, MNSSK Pavan Kumar and S. S. Ravikiran, A Bilingual OCR system for Hindi-Telugu Documents and its Applications, Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Aug. 2003, Edinburgh, Scotland, pp. 408-413.

  • MNSSK Pavan Kumar, S. S. Ravikiran, Abhishek Nayani, C. V. Jawahar and P. J. Narayanan, Tools for Developing OCRs for Indian Scripts, Proceedings of the Workshop on Document Image Analysis and Retrieval (DIAR:CVPR'03), Jun. 2003, Madison, WI.