Fine-Grained Descriptions for Domain Specific Videos




In this work, we attempt to describe videos from a specific domain: broadcast videos of lawn tennis matches. Given a video shot from a tennis match, we intend to generate a textual commentary similar to what a human expert would write on a sports website. Unlike many recent works that focus on generating short captions, we are interested in generating semantically richer descriptions. This demands a detailed low-level analysis of the video content, especially the actions and interactions among subjects. We address this by limiting our domain to the game of lawn tennis. Rich descriptions are generated by leveraging a large corpus of human-created descriptions harvested from the Internet. We evaluate our method on a newly created tennis video dataset. Extensive analysis demonstrates that our approach addresses both the semantic correctness and the readability aspects of the task. We demonstrate the utility of the simultaneous use of vision, language and machine learning techniques in a domain-specific environment to produce semantically rich and human-like descriptions. The proposed method can be readily adapted to situations where activities occur in a limited context and the linguistic diversity is confined.







Supplementary Video



Related Publications


Associated People

Fine-Tuning Human Pose Estimation in Videos

Digvijay Singh     Vineeth Balasubramanian     C. V. Jawahar


We propose a semi-supervised self-training method for fine-tuning human pose estimations in videos that provides accurate estimations even for complex sequences. We surpass the state of the art on most of the datasets used and also show a gain over the baseline on our new dataset of unrestricted sports videos. The self-training model presented has two components: a static Pictorial Structure based model and a dynamic ensemble of exemplars. We present a pose quality criterion that is primarily used for batch selection and automatic parameter selection. The same criterion works as a low-level pose evaluator used in post-processing. We set a new challenge by introducing CVIT-SPORTS, a dataset of complex videos from the sports domain with full human body-part annotations. The strength of our method is demonstrated by adapting to videos of complex activities such as cricket bowling, cricket batting and football, as well as to available standard datasets.

Here we release our MATLAB implementation of [1]. To read more about the method, see the PDF on the left.



Filename | Description | Size
fine_tuning_pose.tar.gz | Matlab code for fine-tuning human pose estimation in videos. | 94 MB
README | Description on running the code and other info. | 4.0 KB
cvit_sports_videos.tar.gz | CVIT-SPORTS-Videos dataset of 11 video sequences from the cricket domain. | 66 MB


[1] D. Singh, V. Balasubramanian, C. V. Jawahar. Fine-Tuning Human Pose Estimations in Videos. WACV 2016.

[2] Y. Yang, D. Ramanan. Articulated Pose Estimation using Flexible Mixtures of Parts. CVPR 2011.

[3] A. Cherian, J. Mairal, K. Alahari, C. Schmid. Mixing Body-Part Sequences for Human Pose Estimation. CVPR 2014.

[4] B. Sapp, D. Weiss, B. Taskar. Parsing Human Motion with Stretchable Models. CVPR 2011.

[5] T. Malisiewicz, A. Gupta, A. Efros. Ensemble of Exemplar-SVMs for Object Detection and Beyond. ICCV 2011.


Fine-Grain Annotation of Cricket Videos



A sample video taken from the IPL channel on YouTube.

The recognition of human activities is one of the key problems in video understanding. Action recognition is challenging even for specific categories of videos, such as sports, that contain only a small set of actions. Interestingly, sports videos are accompanied by detailed commentaries available online, which could be used to perform action annotation in a weakly-supervised setting.

For the specific case of Cricket videos, we address the challenge of temporal segmentation and annotation of actions with semantic descriptions. Our solution consists of two stages. In the first stage, the video is segmented into “scenes” by utilizing the scene category information extracted from the text commentary. The second stage consists of classifying video shots as well as the phrases in the textual description into various categories. The relevant phrases are then mapped to the corresponding video shots. The novel aspect of this work is the fine temporal scale at which semantic information is assigned to the video. As a result of our approach, we enable retrieval of specific actions that last only a few seconds from several hours of video. This solution yields a large number of labelled exemplars, with no manual effort, that could be used by machine learning algorithms to learn complex actions.


In this paper, we present a solution that enables rich semantic annotation of Cricket videos at a fine temporal scale. Our approach circumvents technical challenges in visual recognition by utilizing information from online text commentaries. We obtain a high annotation accuracy, as evaluated over a large video collection. The annotated videos shall be made available to the community for benchmarking, since such a rich dataset is not yet publicly available. In future work, the obtained labelled datasets could be used to learn classifiers for fine-grain activity recognition and understanding.



In the first stage, the goal is to align the two modalities at a “scene” level. This stage consists of a joint synchronisation and segmentation of the video with the text commentary.

Scene Segmentation:

A typical scene in a Cricket match follows the sequence of events depicted in figure


Model Learning

It was observed that the visual-temporal patterns of the scenes are conditioned on the outcome of the event. In other words, a 1-Run outcome is visually different from a 4-Run outcome.

Figure: Four Run Model and One Run Model.


While the visual model described above could be useful in recognizing the scene category for a given video segment, it cannot be immediately used to spot the scene in a full-length video. Conversely, the temporal segmentation of the video is not possible without using an appropriate model for the individual scenes themselves. This chicken-and-egg problem can be solved by utilizing the scene-category information from the parallel text-commentary.
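The role of the commentary in breaking this circular dependency can be illustrated with a small sketch. It is not the authors' implementation: it assumes the commentary supplies the ordered list of scene categories, that a per-category scoring function is available, and it uses a plain dynamic program to place the scene boundaries. The names `segment_with_known_categories`, `toy_score` and `TYPICAL_LENGTH`, and the toy durations, are hypothetical.

```python
def segment_with_known_categories(num_units, categories, score):
    """Split units [0, num_units) into len(categories) contiguous scenes.

    The category order is assumed to come from the text commentary.
    score(category, start, end) rates how well units [start, end) fit
    the category; the split with the highest total score is returned.
    """
    k = len(categories)
    if k == 0 or num_units < k:
        raise ValueError("need at least one unit per scene")
    neg_inf = float("-inf")
    # best[i][end]: best total score for the first i scenes covering [0, end)
    best = [[neg_inf] * (num_units + 1) for _ in range(k + 1)]
    back = [[0] * (num_units + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for i in range(1, k + 1):
        for end in range(i, num_units + 1):
            for start in range(i - 1, end):
                if best[i - 1][start] == neg_inf:
                    continue
                candidate = best[i - 1][start] + score(categories[i - 1], start, end)
                if candidate > best[i][end]:
                    best[i][end] = candidate
                    back[i][end] = start
    # Walk the back-pointers from the last scene to the first.
    scenes = []
    end = num_units
    for i in range(k, 0, -1):
        start = back[i][end]
        scenes.append((start, end))
        end = start
    scenes.reverse()
    return scenes


# Toy stand-in for a learned scene model: each outcome has a typical duration.
TYPICAL_LENGTH = {"1-run": 3, "4-run": 5}  # hypothetical values


def toy_score(category, start, end):
    return -abs((end - start) - TYPICAL_LENGTH[category])


scenes = segment_with_known_categories(11, ["1-run", "4-run", "1-run"], toy_score)
print(scenes)
```

Because the category of every scene is fixed in advance, the search only has to decide where the boundaries fall, which is what makes the segmentation tractable in this sketch.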


Our dataset is collected from the YouTube channel of the Indian Premier League (IPL) tournament.
The commentary data is collected from CricInfo by web crawling.

Name | Matches | No. of Phrases | Role
IPL Video Dataset | 4 Matches (20 Hrs) | 960 Phrases | Video/Shot Recognition
CricInfo Dataset | 300 Matches | 1500 Bowler Phrases and 6000 Batsmen Phrases | Text Classification






R | Bowler Shot | Batsman Shot
2 | 22.15 | 39.4
4 | 43.37 | 47.6
6 | 69.09 | 69.6
8 | 79.94 | 80.8
10 | 87.87 | 88.95

R is the neighborhood searched for the correct shot.

Kernel | Vocab: 300 | Vocab: 1000
Linear | 78.02 | 82.25
Polynomial | 80.15 | 81.16
RBF | 81 | 82.15
Sigmoid | 77.88 | 80.53

Vocab denotes the visual vocabulary size.
Results are after applying the CRF.
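One plausible reading of the neighborhood R is that a phrase counts as correctly placed when its true shot lies within R shots of the predicted one. The sketch below computes accuracy under that assumption; the function name and the shot indices are made up for illustration and are not taken from the released code.

```python
def accuracy_within_neighborhood(predicted, expected, radius):
    """Percentage of phrases whose predicted shot index lies within
    `radius` shots of the expected shot index (an assumed definition)."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        return 0.0
    hits = sum(1 for p, e in zip(predicted, expected) if abs(p - e) <= radius)
    return 100.0 * hits / len(predicted)


# Hypothetical shot indices for four commentary phrases.
predicted_shots = [10, 25, 40, 58]
expected_shots = [12, 25, 47, 60]

for radius in (1, 2, 8):
    print(radius, accuracy_within_neighborhood(predicted_shots, expected_shots, radius))
```

Under this definition the accuracy can only grow as R grows, which matches the trend in the first table.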








Related Publications:

Rahul Anand Sharma, Pramod Sankar, C. V. Jawahar - Fine-Grain Annotation of Cricket Videos Proceedings of the Third Asian Conference on Pattern Recognition 3-6 Nov 2015, Kuala Lumpur, Malaysia. [PDF]

Code and Dataset:

  • The code can be downloaded here.
  • The CricInfo commentary dataset can be downloaded here.
  • The IPL dataset is available on IPL's official YouTube channel.
  • Divide each match into 10-over chunks and run the above code. Refer to the README inside the code directory.
  • In case of any queries or doubts, please contact the authors.

Associated People:

Fine Pose Estimation of Known Objects from a Single Scene Image


Understanding the precise 3D structure of an environment is one of the fundamental goals of computer vision and is challenging due to a variety of factors such as appearance variation, illumination, pose, noise, occlusion and scene clutter. A generic solution to the problem is ill-posed due to the loss of depth information during imaging. In this paper, we consider a specific but common situation, where the scene contains known objects. Given 3D models of a set of known objects and a cluttered scene image, we try to detect these objects in the image, and align 3D models to their images to find their exact pose. We develop an approach that poses this as a 3D-to-2D alignment problem. We also deal with pose estimation of 3D articulated objects in images. We evaluate our proposed method on BigBird dataset and our own tabletop dataset, and present experimental comparisons with state-of-the-art methods.


The question is whether, given a single cluttered scene image, it is possible to find the pose of the objects in it. As observed, our method is robust to objects of various shapes, sizes and textures, and to object-object occlusion.


We build on previous approaches like Lim et al. [2] and Aubry et al. [3]. Lim et al. start from a user-driven set of 2D-to-3D correspondences, extract geometric cues to refine the correspondence set, and obtain the pose using ICP. Aubry et al., on the other hand, use whitened HoG as discriminative features to select the closest matching image from a set of rendered viewpoints as the final pose. Hence, the final pose is dependent on the sampling of the camera viewpoints. Our goal differs from the above primarily in its assumptions about the objects and their models. We try to find the pose of common tabletop objects that could potentially be low-lying, and the models are assumed to be without any texture, making the alignment problem hard. Our major contributions include: 1) an ensemble of shape features that work well for aligning textureless 3D models, 2) a two-stage alignment scheme that is efficient and accurate, and 3) an extension of the proposed approach to handle articulated objects. We demonstrate our results on a variety of tabletop objects including transparent ones and scene images with occlusion and background clutter. Note that textureless 3D models are used to generalize our proposed method to objects that are very similar in shape and size, but vary in texture. Experimental results show that the proposed method outperforms the state-of-the-art method for pose estimation.
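The dependence of retrieval-style pose estimation on viewpoint sampling can be seen in a toy setting. The sketch below assumes an idealized matcher that always picks the rendered azimuth closest to the true one, and considers azimuth only; the helper names are hypothetical and the numbers are illustrative, not results from the paper.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def nearest_sampled_azimuth(true_azimuth, step):
    """Rendered azimuth (multiples of `step`) closest to the true azimuth.

    Stands in for an ideal retrieval step; a real matcher compares image
    features and can do no better than this."""
    candidates = range(0, 360, step)
    return min(candidates, key=lambda c: angular_difference(c, true_azimuth))


true_azimuth = 47
for step in (30, 10):
    picked = nearest_sampled_azimuth(true_azimuth, step)
    print(step, picked, angular_difference(picked, true_azimuth))
```

Even with perfect matching, the residual error can be as large as half the sampling step, which is why a refinement stage after the initial hypothesis is useful.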

Method Overview




We evaluate our proposed method on the BigBird [4] dataset and our own dataset of common household objects, TableTop. Both BigBird and TableTop contain RGB images of a single object at various orientations, captured in a controlled environment. Additionally, the TableTop dataset contains cluttered scene images with multiple objects that occlude one another. The dataset statistics are given below:

For each object, images are captured at an azimuth interval of 3° and elevation interval of 18°, making a total of 120 images per elevation angle.
Number of objects: 15
Number of images per object: 600
Total: 9000

For each object, images are captured at an azimuth interval of 10° and at elevations of 0°, 15°, 30°, 45°, and 60°, and one image from 90° elevation.
Number of objects: 50
Number of images per object: 181
Total: 9050
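The image counts above follow from the capture geometry, and the arithmetic can be checked with a short sketch. It assumes five elevation angles for the first set (inferred from 600 / 120) and reads the azimuth interval of the second set as 10°; the function name is hypothetical.

```python
def images_per_object(azimuth_step_deg, num_elevations, extra_views=0):
    """Number of views per object for a turntable-style capture."""
    if 360 % azimuth_step_deg != 0:
        raise ValueError("azimuth step must divide 360 evenly")
    views_per_elevation = 360 // azimuth_step_deg
    return views_per_elevation * num_elevations + extra_views


# First set: 3 degree azimuth steps, 5 elevations (assumed), 15 objects.
first_per_object = images_per_object(3, 5)
first_total = 15 * first_per_object

# Second set: 10 degree azimuth steps, 5 elevations, one extra view
# from 90 degrees elevation, 50 objects.
second_per_object = images_per_object(10, 5, extra_views=1)
second_total = 50 * second_per_object

print(first_per_object, first_total)
print(second_per_object, second_total)
```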

Results and Comparisons

Qualitative Results:


Top row: input scene images. Bottom row: scene images with 3D model poses. Models corresponding to the objects are fetched from the repository and superimposed with their correct pose.


Quantitative Results:


Left: Error in initial hypothesis (rough pose). Right: Error in refined pose (fine pose). As observed, our method refines the pose to within 6° of the ground truth for 80% of the examples.


Qualitative Comparison:


Qualitative comparison of the proposed method with S3DC. As observed from the figure, the proposed method is able to provide a more refined pose compared to S3DC.

Quantitative Comparison:

Object | S3DC Accuracy (%) | S3DC MPE | Ours Accuracy (%) | Ours MPE
Transparent Bottle | 31.11 | 0.147 | 83.52 | 0.0016
Sand Clock | 83.33 | 0.086 | 86.81 | 0.033
Pen Stand | 91.11 | 0.093 | 83.52 | 0.002
Scissors | 6.67 | 0.4 | 73.63 | 0.086
Spectacles Case | 44.44 | 0.016 | 54.65 | 0.121

We compare the classification accuracy and mean pose error (MPE) with S3DC, for some of the objects from our dataset with varying complexity in terms of shape, size and material.


Dataset | S3DC Accuracy (%) | S3DC MPE | Ours Accuracy (%) | Ours MPE
BigBird | 34.5 | 0.013 | 49.7 | 0.008
TableTop | 45.7 | 0.044 | 67.3 | 0.021

We compare the classification accuracy and mean pose error (MPE) with S3DC on the BigBird and TableTop datasets.

Related Publications

  • Sudipto Banerjee, Sanchit Aggarwal, Anoop M. Namboodiri - Fine Pose Estimation of Known Objects in Cluttered Scene Images Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition, 03-06 Nov 2015, Kuala Lumpur, Malaysia. [PDF]

  • M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic - Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models Computer Vision and Pattern Recognition, 2014. [PDF]
  • J. J. Lim, H. Pirsiavash, and A. Torralba - Parsing IKEA objects: Fine pose estimation International Conference on Computer Vision, 2013. [PDF]
  • A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel - BigBird: A large-scale 3D database of object instances International Conference on Robotics and Automation, 2014. [PDF]

Code and Dataset

  • Code: To be updated soon
  • BigBird dataset can be downloaded from here
  • TableTop dataset can be downloaded here

Associated People



Online Handwriting Recognition using Depth Sensors




In this work, we propose an online handwriting solution in which the data is captured with the help of depth sensors. The user writes in the air and our method recognizes the writing online, in real time, using a novel feature representation. Our method uses an efficient fingertip tracking approach and reduces the need for pen-up/pen-down switching. We validate our method on two depth sensors, Kinect and Leap Motion Controller, and use state-of-the-art classifiers to recognize characters. On a sample dataset collected from 20 users, we report 97.59% accuracy for character recognition. We also demonstrate how this system can be extended to lexicon recognition with 100% accuracy. We have prepared a dataset containing 1,560 characters and 400 words with the intention of providing a common benchmark for air handwriting character recognition and allied research.



Character samples from Dataset

To evaluate the performance of our approach, we created a dataset, 'Dataset for AIR Handwriting' (DAIR). The dataset was created using 20 subjects, where each user stands straight in front of the sensor and writes in the air with one finger out. Users are allowed to write at their own speed and in their own writing style. The dataset contains two sections, DAIR I and DAIR II. DAIR I consists of 1248 character samples from 16 users, with 3 samples per character per user. DAIR II consists of words from a lexicon of 40 words. Words in the lexicon are taken from the names of the most populous cities and vary in length from 3 to 5 characters. It contains 400 words, which total 1490 characters.
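The page does not say how lexicon recognition is carried out. One common way to exploit a small, closed lexicon is to snap the recognized character string to the closest lexicon entry by edit distance; the sketch below illustrates that idea only. The word list is a made-up stand-in for the DAIR II lexicon, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_word(recognized, lexicon):
    """Lexicon entry with the smallest edit distance to the recognized string."""
    return min(lexicon, key=lambda word: edit_distance(recognized, word))


# Hypothetical lexicon of city names with 3 to 5 letters.
lexicon = ["TOKYO", "DELHI", "CAIRO", "LIMA", "ROME", "PARIS"]
print(closest_word("DELNI", lexicon))
```

With a lexicon this small, a single misrecognized character rarely moves a word closer to a different entry, which is consistent with word-level accuracy exceeding character-level accuracy.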

Our datasets can be downloaded from the following links:

Please mail us at {sirnam.swetha@research, rajat.aggarwal@students} for any queries.



Labels represent the predicted and expected characters/words in each pair. Each pair has the input trajectory and trajectory after normalization respectively. (a) samples correctly classified, (b) mis-classified samples, (c) correctly classified words.
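The normalization applied to the trajectories is not described on this page. A minimal sketch of one typical choice, assumed here purely for illustration, translates the trajectory to the origin and scales it uniformly so that its longer side has unit length; the function name is hypothetical.

```python
def normalize_trajectory(points):
    """Translate a 2D trajectory to the origin and scale it uniformly
    so the larger bounding-box side becomes 1 (aspect ratio preserved)."""
    if not points:
        return []
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    extent = max(max(xs) - min_x, max(ys) - min_y)
    if extent == 0:
        # All points coincide; map them to the origin.
        return [(0.0, 0.0) for _ in points]
    return [((x - min_x) / extent, (y - min_y) / extent) for x, y in points]


raw = [(10, 20), (30, 20), (30, 60)]
print(normalize_trajectory(raw))
```

Uniform scaling keeps the shape of the character intact, so that the same letter written large or small, near or far from the sensor, maps to a comparable trajectory.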



Accuracy comparison of characters using Kinect and Leap Motion Controller

Related Publications

Rajat Aggarwal, Sirnam Swetha, Anoop M. Namboodiri, Jayanthi Sivaswamy, C. V. Jawahar - Online Handwriting Recognition using Depth Sensors Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015 Nancy, France. [PDF]

Associated People