Human Pose Estimation: Extension and Application


Digvijay Singh (Homepage)

Abstract

Understanding human appearance in images and videos is one of the most fundamental and widely explored problems in computer vision. Describing human appearance can be broken down into smaller, more fundamental aspects such as posture, gesture and outlook; by combining them, we try to build a holistic understanding from semantically lower-level information. The motivation for understanding human appearance and its related aspects is the industrial demand for applications that analyze humans and their interaction with their surroundings. This thesis tackles two such aspects: (i) the more fundamental problem of human pose estimation, and (ii) a deeper level of understanding, cloth parsing, built on pose estimation.

Human pose estimation is the problem of determining the locations and configuration of human body joints. In this work we address human pose estimation in video sequences, exploiting the redundant information that video data provides. The proposed method is iterative. The first iteration parses the data with a generic base model. Each following iteration runs a three-step pipeline: selecting confidently positive detections from the previous iteration using our novel selection criteria, fine-tuning the parameters external to the base model to the local distribution by synthesizing exemplars from the selected detections, and enforcing the learned information through an updated combined model. The resulting pipeline propagates correct estimates across temporal neighborhoods of a video sequence. Previous methods that use the same base models have relied more on tracking strategies. In the unbiased experiments conducted, our approach proved much more robust and performed better overall.

In the second half, we pursue a deeper understanding of human appearance, namely cloth parsing: predicting the types of clothing worn by people and the corresponding segmented regions in images. Our work focuses on making a previously formulated method more robust. The likely regions for each cloth type depend on the underlying pose skeleton of the person; for example, a hat is worn on the head. Pose information is therefore key to cloth parsing, but incorrect body part estimates can easily lead to false cloth detections and segmentations. The previous method uses pose information from a pictorial-structure-based model, whereas we update the formulation to incorporate information from more robust part detectors. The modified model performs better in unconstrained outdoor settings. However, the performance of both the available and the proposed methods remains below par and appears non-viable for practical applications. To investigate this, we report a set of experiments conducted from different perspectives, along with our observations.

 

Year of completion:  January 2017
 Advisor : Prof. C.V. Jawahar

Related Publications

  • Digvijay Singh, Vineeth Balasubramanian, C. V. Jawahar - Fine-Tuning Human Pose Estimations in Videos - Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2016. [PDF]

  • Nataraj Jammalamadaka, Ayush Minocha, Digvijay Singh and C V Jawahar - Parsing Clothes in Unrestricted Images - Proceedings of the 24th British Machine Vision Conference, 09-13 Sep. 2013, Bristol, UK. [PDF]


Downloads

thesis

 ppt

Script and Language Identification for Document Images and Scene Texts


Ajeet Kumar Singh

Abstract

In recent times, there has been an increase in Optical Character Recognition (OCR) solutions for recognizing text in scanned document images and in scene text captured with mobile devices. Many of these solutions work very well for an individual script or language. But in a multilingual environment such as India, where a document image or scene image may contain more than one language, these individual OCRs fail significantly. Hence, to recognize text in a multilingual document image or scene image, we need to manually specify the script or language of each text block. The corresponding script- or language-specific OCR is then applied to recognize the text. This manual step prevents us from moving towards fully automated multilingual OCR.

This thesis presents two effective solutions for automatically identifying the script and language of document images and scene text. Even though recognition of scene text has been heavily researched, script identification in this setting is relatively new. We therefore present an approach that represents scene-text images using mid-level stroke-based features, which are pooled from densely computed local features. These features are then classified into languages using an off-the-shelf classifier. This approach is efficient and requires very little labeled data for script identification. The approach has been evaluated on the recently introduced video script dataset (CVSI). We also introduce and benchmark a more challenging Indian Language Scene Text (ILST) dataset for evaluating the performance of our method.

For script and language identification in documents, we investigate the utility of Recurrent Neural Networks (RNNs). These problems have been attempted in the past with representations computed from the distribution of connected components or characters (e.g. texture, n-grams) over a larger segment (a paragraph or a page). We argue that one can predict the script or language very accurately from minimal evidence (e.g. given only a word or a line) with the help of a pre-trained RNN. We propose a simple and generic solution for script and language identification that requires no special tuning. This approach has been verified on a large corpus of more than 15.03M words across 55K documents comprising 15 scripts and languages.

The thesis aims to provide better recognition solutions for documents and scene text by providing two simple but effective solutions for script and language identification. The proposed algorithms can be used in multilingual settings, where the identification module first identifies the script or language of an incoming document or scene text before sending it to the corresponding script- or language-specific recognition module.

Year of completion:  January 2017
 Advisor : Prof. C.V. Jawahar

Related Publications

  • Minesh Mathew, Ajeet Kumar Singh and C V Jawahar - Multilingual OCR for Indic Scripts - Proceedings of 12th IAPR International Workshop on Document Analysis Systems (DAS'16), 11-14 April, 2016, Santorini, Greece. [PDF]

  • Ajeet Kumar Singh, Anand Mishra, Pranav Dabral and C V Jawahar - A Simple and Effective Solution for Script Identification in the Wild - Proceedings of 12th IAPR International Workshop on Document Analysis Systems (DAS'16), 11-14 April, 2016, Santorini, Greece. [PDF]

  • Ajeet Kumar Singh, C. V. Jawahar - Can RNNs Reliably Separate Script and Language at Word and Line Level - Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015, Nancy, France. [PDF]


Downloads

thesis

 ppt

A Sketch-based Approach for Multimedia Retrieval


Koustav Ghosal (homepage)

Abstract

A hand-drawn sketch is a convenient way to search for an image or a video in a database when examples are unavailable or textual queries are too difficult to articulate. In this thesis, we propose solutions to some problems in sketch-based multimedia retrieval. In the case of image search, the queries can be approximate binary outlines of the actual objects. In the case of videos, we consider the setting where the user specifies the motion trajectory with a sketch, which is provided as the query.

However, there are multiple problems associated with this paradigm. Firstly, different users sketch the same query differently, according to their own perception of reality. Secondly, sketches are sparse and abstract representations of images, and the two modalities cannot be compared directly. Thirdly, compared to images, datasets of sketches are rare. It is very difficult, if not impossible, to train a system with sketches of every possible category. The features should therefore be robust enough to retrieve classes that were not part of training.

In this thesis, the work can be broadly divided into three parts. First, we develop a motion-trajectory based video retrieval strategy and propose a representation for sketches that aims to reduce the perceptual variability among different users. We also propose a novel retrieval strategy, which combines multiple feature representations for a final result using a cumulative scoring mechanism.
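The abstract does not spell out the cumulative scoring mechanism. As a rough illustration only, one plausible reading is that every feature representation produces a similarity score per video, and the final ranking sums those scores after normalisation. The function name `cumulative_ranking`, the min-max normalisation and the sample scores below are all assumptions, not details taken from the thesis.

```python
from typing import Dict, List


def cumulative_ranking(per_feature_scores: List[Dict[str, float]]) -> List[str]:
    """Rank items by the sum of min-max normalised scores across features.

    Hypothetical helper: each dict maps an item id to a similarity score
    (higher is better) under one feature representation.
    """
    totals: Dict[str, float] = {}
    for scores in per_feature_scores:
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        for item, value in scores.items():
            # Normalise to [0, 1] so that no single feature dominates the sum.
            norm = (value - lo) / span if span > 0 else 0.0
            totals[item] = totals.get(item, 0.0) + norm
    # Highest cumulative score first; ties broken by item id for determinism.
    return sorted(totals, key=lambda item: (-totals[item], item))


if __name__ == "__main__":
    # Made-up scores for three videos under two assumed feature types.
    shape = {"v1": 0.9, "v2": 0.4, "v3": 0.1}
    direction = {"v1": 2.0, "v2": 8.0, "v3": 5.0}
    print(cumulative_ranking([shape, direction]))
```

Normalising before summing is a design choice made here so that features on different scales contribute comparably; a weighted sum or a rank-based fusion would be equally valid variants.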

In order to tackle the problem of multiple modalities, we propose a sketch-based image retrieval strategy by mapping the two modalities into a lower dimensional sub-space where they are maximally correlated. We use Cluster Canonical Correlation Analysis (c-CCA), a modified version of standard CCA, for the mapping.
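For reference, the objective of standard CCA can be written as follows. The notation is an assumption introduced here: $x$ and $y$ denote sketch and image features, $C_{xx}$ and $C_{yy}$ their covariance matrices, and $C_{xy}$ their cross-covariance.

```latex
\[
(w_x^{*}, w_y^{*}) \;=\; \arg\max_{w_x,\, w_y}\;
\frac{w_x^{\top} C_{xy}\, w_y}
     {\sqrt{w_x^{\top} C_{xx}\, w_x}\;\sqrt{w_y^{\top} C_{yy}\, w_y}}
\]
```

Standard CCA estimates $C_{xy}$ from one-to-one paired samples. Cluster-CCA, as we understand it, instead estimates the covariances from cross-modal pairs formed within each class, which suits data where sketches and images share class labels but have no natural one-to-one pairing. The exact formulation used in the thesis may differ.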

Finally, we investigate the use of semantic features derived from a Convolutional Neural Network, and extend the idea of sketch-based image retrieval to the task of zero-shot learning or unknown class retrieval. We define an objective function for the network such that, while training, a close miss is penalized less than a distant miss. Our training encodes semantic similarity among the different classes. We perform experiments to evaluate our algorithms on well-known datasets, and our results show that our features perform reasonably well in challenging scenarios.
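The exact objective function is not given in the abstract. A minimal sketch of the general idea, assuming the loss is the expected semantic distance between the true class and the predicted class distribution, could look like this. The function names, the distance matrix and the three-class example are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distance_weighted_loss(logits: Sequence[float],
                           true_class: int,
                           class_distance: Sequence[Sequence[float]]) -> float:
    """Expected semantic distance between the true class and the prediction.

    class_distance[i][j] is an assumed semantic distance between classes i
    and j (0 on the diagonal). Probability mass placed on a semantically
    close class costs less than mass placed on a distant one.
    """
    probs = softmax(logits)
    return sum(p * class_distance[true_class][k] for k, p in enumerate(probs))


if __name__ == "__main__":
    # Hypothetical classes: 0 = cat, 1 = dog, 2 = car.
    dist = [[0.0, 0.2, 1.0],
            [0.2, 0.0, 1.0],
            [1.0, 1.0, 0.0]]
    close_miss = distance_weighted_loss([0.0, 3.0, 0.0], 0, dist)    # predicts dog
    distant_miss = distance_weighted_loss([0.0, 0.0, 3.0], 0, dist)  # predicts car
    print(close_miss < distant_miss)
```

With a plain cross-entropy loss, both misses above would be penalized equally; weighting by class distance is what lets the training signal reflect semantic similarity.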

 

Year of completion:  December 2016
 Advisor : Prof. Anoop M Namboodiri

Related Publications

  • Koustav Ghosal, Ameya Prabhu, Riddhiman Dasgupta, Anoop M. Namboodiri - Learning Clustered Sub-spaces for Sketch-based Image Retrieval - Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition, 03-06 Nov 2015, Kuala Lumpur, Malaysia. [PDF]

  • Koustav Ghosal, Anoop M. Namboodiri - A Sketch-Based Approach to Video Retrieval using Qualitative Features - Proceedings of the Ninth Indian Conference on Computer Vision, Graphics and Image Processing, 14-17 Dec 2014, Bangalore, India. [PDF]

 


Downloads

thesis

 ppt

Automatic Analysis of Cricket And Soccer Broadcast Videos


Rahul Anand Sharma (homepage)

Abstract

In recent years, there has been a growing need to understand the semantics of sports games, and the use of technology to analyze player movements and understand the action on a sports field has been growing. Most systems today rely on tracking devices worn by players or on markers with sensors placed around the play area. These trackers or markers are electronic devices that communicate with the cameras or cameramen. Other technologies, such as the goal-line technology popularly used in soccer, help referees make accurate decisions in situations that are often misjudged by human perception alone. The primary challenges for these techniques are cost-effectiveness and ease of installation and use. It is not convenient to set up markers and sensors around the playing field, or to make players wear recording or communication devices without affecting their natural style of play. Placing a sensor in the game ball also poses the tricky problem of not altering the physical properties of the ball. Sports recorders and broadcasters are now looking for simple yet effective solutions to extract semantic information from a sports game. The big question is: can we get sufficient important data from a video capture alone, just as a human would, without relying on external aids such as markers and sensors? With advances in computer vision algorithms and techniques, the goal for the future is to analyze everything from captured video. This kind of solution is more attractive to broadcasting and game recording companies, as they do not need to set up extra equipment or persuade the authorities to change the match ball or the players' outfits. We propose a set of algorithms for the automatic analysis of broadcast videos of cricket and soccer.
Using our approach, we can automatically detect salient events in soccer, temporally align cricket videos with the corresponding text commentaries, and localize/register a soccer image, among other tasks. We also compare our algorithms extensively with other state-of-the-art approaches on different datasets for a variety of tasks.

 

Year of completion:  December 2016
 Advisor : Prof. C.V. Jawahar

Related Publications

  • Rahul Anand Sharma, Vineet Gandhi, Visesh Chari and C. V. Jawahar - Automatic analysis of broadcast football videos using contextual priors - Signal, Image and Video Processing (SIVP 2016), Volume 10, Issue 5, July, 2016. [PDF]


Downloads

thesis

 ppt

Text Recognition and Retrieval in Natural Scene Images


Udit Roy (homepage)

Abstract

In the past few years, text in natural scene images has gained the potential to be a key feature for content-based retrieval. It can be extracted and used in search engines, providing relevant information about the images. Robust and efficient techniques from the document analysis and vision communities have been borrowed to solve the challenge of digitizing text in such images in the wild. In this thesis, we address the common challenges in scene text analysis by proposing novel solutions for the recognition and retrieval settings. We develop end-to-end pipelines that detect and recognize text, the two core challenges of scene text analysis.

For the detection task, we first study and categorize all major publications since 2000 based on their architecture. To broaden the scope of a detection method, we propose a fusion of two complementary styles of detection. The first method classifies MSER clusters as text or non-text using an AdaBoost classifier. This method outperforms the other publicly available implementations on the standard ICDAR 2011 and MRRC datasets. The second method generates text region proposals with high recall using a CNN-based text/non-text classifier. We compare this method with other object region proposal algorithms on the ICDAR datasets and analyse our results. Leveraging the high recall of the proposals, we fuse the two detection methods to obtain a flexible detection scheme.

For the recognition task, we propose a conditional random field based framework for recognizing word images. We model the character locations as nodes and the bigram interactions as the pairwise potentials. Observing that the interaction potentials computed from a large lexicon are less effective than those computed in the small-lexicon setting, we propose an iterative method that alternates between finding the most likely solution and refining the interaction potentials. We evaluate our method on public datasets and obtain nearly 15% improvement in recognition accuracy over baseline methods on the IIIT-5K word dataset with a large lexicon containing 0.5 million words. We also propose a text-query-based retrieval task for word images and evaluate retrieval performance in various settings.
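To make the CRF formulation concrete, here is a minimal sketch of inference under a simplifying assumption: the character locations form a left-to-right chain, so the most likely word can be found with Viterbi decoding. The function name, the log-score values and the `default_pair` penalty for unseen bigrams are hypothetical, and the iterative refinement of the potentials described above is not shown.

```python
from typing import Dict, List, Tuple


def viterbi_decode(unary: List[Dict[str, float]],
                   bigram: Dict[Tuple[str, str], float],
                   default_pair: float = -5.0) -> str:
    """Return the highest-scoring character sequence on a chain.

    unary[i][c]    : log-score of character c at location i (node potential).
    bigram[(a, b)] : log-score of a followed by b (pairwise potential),
                     e.g. estimated from a lexicon; unseen pairs receive
                     default_pair.
    """
    if not unary:
        return ""
    # best[c] = (score of the best path ending in c, that path)
    best = {c: (s, c) for c, s in unary[0].items()}
    for node in unary[1:]:
        new_best = {}
        for c, s in node.items():
            # Choose the predecessor that maximises path score plus transition.
            prev_score, prev_path = max(
                ((ps + bigram.get((pc, c), default_pair), path)
                 for pc, (ps, path) in best.items()),
                key=lambda t: t[0])
            new_best[c] = (prev_score + s, prev_path + c)
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]


if __name__ == "__main__":
    # Made-up potentials: the middle character detector slightly prefers "o".
    unary = [{"c": -0.1, "e": -2.0},
             {"a": -0.9, "o": -0.7},
             {"t": -0.2, "l": -1.5}]
    bigram = {("c", "a"): -0.5, ("a", "t"): -0.3, ("c", "o"): -0.6,
              ("o", "t"): -1.5, ("o", "l"): -1.0, ("a", "l"): -1.2}
    print(viterbi_decode(unary, bigram))
```

In this toy example the pairwise terms override a slightly wrong character detection, which is the role the bigram potentials play in the framework. A larger lexicon supports more bigrams and so constrains the word less, which is consistent with the observation that motivates refining the potentials.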

Finally, we present two contrasting end-to-end recognition frameworks for scene text analysis on scene images. The first framework consists of text segmentation followed by a standard printed-text OCR: the text-segmented image is fed to Tesseract to obtain word regions and labels. This case-sensitive and lexicon-free approach performs on par with the other successful pipelines of the decade on the ICDAR 2003 dataset. The second framework combines the CNN-based region proposal method with the CRF-based recognizer, using various lexicon sizes. Additionally, we use the latter to retrieve scene images with text queries.

 

Year of completion:  October 2016
 Advisor : Prof. C.V. Jawahar

Related Publications

  • Udit Roy, Anand Mishra, Karteek Alahari, C.V. Jawahar - Scene Text Recognition and Retrieval for Large Lexicons - Proceedings of the 12th Asian Conference on Computer Vision, 01-05 Nov 2014, Singapore. [PDF] [Abstract] [Poster] [Lexicons] [bibtex]


Downloads

thesis

 ppt
