Learning to Hash-tag Videos with Tag2Vec




User-given tags or labels are valuable resources for semantic understanding of visual media such as images and videos. Recently, a new type of labeling mechanism known as hash-tags has become increasingly popular on social media sites. In this paper, we study the problem of generating relevant and useful hash-tags for short video clips. Traditional data-driven approaches for tag enrichment and recommendation use direct visual similarity for label transfer and propagation. We attempt to learn a direct low-cost mapping from video to hash-tags using two-step training. We first employ a natural language processing (NLP) technique, skip-gram models with neural network training, to learn a low-dimensional vector representation of hash-tags (Tag2Vec) using a corpus of ∼10 million hash-tags. We then train an embedding function to map video features to the low-dimensional Tag2Vec space. We learn this embedding for 29 categories of short video clips with hash-tags. A query video without any tag information can then be directly mapped to the vector space of tags using the learned embedding, and relevant tags can be found by performing a simple nearest-neighbor retrieval in the Tag2Vec space. We validate the relevance of the tags suggested by our system qualitatively and quantitatively with a user study.
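The retrieval step described above can be illustrated with a minimal sketch. It assumes the query video has already been mapped into the tag space, and it uses cosine similarity as the nearest-neighbor measure; the similarity measure, the function names and the toy 3-dimensional vectors are all assumptions made for illustration, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def suggest_tags(video_vec, tag_vectors, k=3):
    """Return the k tags whose vectors are most similar to the mapped video."""
    ranked = sorted(
        tag_vectors,
        key=lambda tag: cosine(video_vec, tag_vectors[tag]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical tag embeddings (made-up values, 3 dimensions for readability).
tag_vectors = {
    "#dog": [0.9, 0.1, 0.0],
    "#puppy": [0.8, 0.2, 0.1],
    "#skate": [0.0, 0.9, 0.3],
    "#food": [0.1, 0.0, 0.95],
}

# Stand-in for the output of the learned video-to-tag embedding function.
query_vec = [0.85, 0.15, 0.05]
top_tags = suggest_tags(query_vec, tag_vectors, k=2)
print(top_tags)
```

A brute-force scan like this is enough for a small tag vocabulary; a larger vocabulary would likely call for an approximate nearest-neighbor index instead.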


The videos we target come from vine.co. They are recorded by users in unconstrained environments and contain significant camera shake, lighting variability, abrupt shot changes, etc. We use the folksonomies supplied by the uploaders to create a vector space. This Tag2Vec space is trained using ~2 million hash-tags downloaded for ~15000 categories. The main motivation is to create a plug-and-use system for categorising vines.



  • We create a system for easily tagging vines
  • We provide training sentences comprised of hashtags, and also vines+hash tags for test categories


Name Link Description
Code.tar Link Code for running the main classification along with other utility programs
Vine.tar Link Download link for vines
HashTags.tar Link Training Hashtags
Paper.pdf Link GCPR submission
Supplementary Link Supplementary Submission




Qualitative Results

We conduct a user study in which a user is presented with a video and 15 tags suggested by our system. The user marks the relevant tags; at the end, we compute the average number of relevant tags across classes.
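As a rough illustration of this statistic, the sketch below averages the number of tags marked relevant within each class and then averages those class means. Whether the study averages per class first or pools all responses is not stated here, so the two-stage averaging, the function name and the toy responses are assumptions.

```python
from collections import defaultdict

def mean_relevant_per_class(responses):
    """responses: list of (class_name, number_of_tags_marked_relevant)."""
    per_class = defaultdict(list)
    for class_name, n_relevant in responses:
        per_class[class_name].append(n_relevant)
    # Mean number of relevant tags for each class.
    class_means = {c: sum(v) / len(v) for c, v in per_class.items()}
    # Unweighted mean of the class means.
    overall = sum(class_means.values()) / len(class_means)
    return class_means, overall

# Made-up responses: each entry is one video shown with 15 suggested tags.
responses = [("dog", 6), ("dog", 8), ("skate", 4), ("skate", 5), ("skate", 3)]
class_means, overall = mean_relevant_per_class(responses)
print(class_means, overall)
```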




Associated People

The IIIT-CFW dataset




IIIT-CFW is a database of cartoon faces in the wild, harvested from Google image search. Query words such as Obama + cartoon, Modi + cartoon, and so on were used to collect cartoon images of 100 public figures. The dataset contains 8928 annotated cartoon faces of famous personalities from around the world and from varying professions. Additionally, we provide 1000 real faces of the public figures to study cross-modal retrieval tasks such as Photo2Cartoon retrieval. IIIT-CFW can be used to study a spectrum of problems, as discussed in our paper.

Keywords: Cartoon faces, face synthesis, sketches, heterogeneous face recognition, cross modal retrieval, caricature.


IIIT-CFW (133.5 MB)
Task 1 -- cartoon face classification
Task 2 -- Photo2Cartoon


Older version:



Related Publications


  • Ashutosh Mishra, Shyam Nandan Rai, Anand Mishra and C. V. Jawahar, IIIT-CFW: A Benchmark Database of Cartoon Faces in the Wild, 1st workshop on visual analysis and sketch (ECCVW) 2016. [PDF]


If you use this dataset, please cite:

  author    = "Mishra, A., Nandan Rai, S., Mishra, A. and Jawahar, C.~V.",
  title     = "IIIT-CFW: A Benchmark Database of Cartoon Faces in the Wild",
  booktitle = "VASE ECCVW",
  year      = "2016",


For any queries about the dataset, feel free to contact Anand Mishra.

Deep Feature Embedding for Accurate Recognition and Retrieval of Handwritten Text


We propose a deep convolutional feature representation that achieves superior performance for word spotting and recognition on handwritten images. We focus on: (i) enhancing the discriminative ability of the convolutional features using a reduced feature representation that can scale to large datasets, and (ii) enabling query-by-string by learning a common subspace for image and text using the embedded attribute framework. We present our results on popular datasets such as the IAM corpus and historical document collections from the Bentham and George Washington pages. On the challenging IAM dataset, we achieve a state-of-the-art mAP of 91.58% on word spotting using textual queries and a mean word error rate of 6.69% for the word recognition task.
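For readers unfamiliar with the mAP figure quoted above, the following sketch shows one common way to compute mean average precision from ranked retrieval lists. It assumes every relevant item appears somewhere in the ranked list; the exact evaluation protocol used for the reported numbers may differ, and the function names and toy queries are illustrative only.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance holds booleans in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_queries):
    """Mean of the per-query AP values."""
    return sum(average_precision(q) for q in all_queries) / len(all_queries)

# Two made-up queries: True marks a retrieved word image that matches the query.
queries = [
    [True, False, True],
    [False, True],
]
map_score = mean_average_precision(queries)
print(round(map_score, 4))
```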



Major Contributions

  • Improving the discriminative ability of deep CNN features using HWNet [1] and achieving state-of-the-art results for handwritten word spotting and recognition tasks on IAM, George Washington and Bentham handwritten pages.
  • Learning a reduced feature representation that can scale to large datasets.
  • Enabling query-by-string by learning a common subspace for image and text using the embedded attribute framework.

Qualitative Results


Related Publications

  • Praveen Krishnan,  Kartik Dutta and C.V Jawahar - Deep Feature Embedding for Accurate Recognition and Retrieval of Handwritten Text, 15th International Conference on Frontiers in Handwriting Recognition, Shenzhen, China (ICFHR), 2016. [PDF]





Pre Trained CNN Models

Pre Trained Attribute Models


[1] Praveen Krishnan and C.V Jawahar, "Matching Handwritten Document Images", ECCV 2016.

[2] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word spotting and recognition with embedded attributes", PAMI, 2014.

[3] Praveen Krishnan and C.V Jawahar, "Generating Synthetic Data for Text Recognition", arXiv:1608.04224, 2016.


Associated People

  • Praveen Krishnan *
  • Kartik Dutta *
  • C.V. Jawahar

* Equal Contribution

Visual Aesthetic Analysis for Handwritten Document Images


We present an approach for analyzing the visual aesthetic property of a handwritten document page in a way that matches human perception. We formulate the problem at two independent levels: (i) a coarse level, which deals with the overall layout and the use of space between lines, words and margins, and (ii) a fine level, which analyses the construction of each word and deals with the aesthetic properties of writing styles. We present our observations on multiple local and global features which can extract the aesthetic cues present in handwritten documents.


Major Contributions

  • A novel approach for determining the aesthetic value of Handwritten Document Images.
  • Identification of relevant features that help quantify the human notion of aesthetics in handwriting.
  • Dataset of 275 pages of handwritten document images along with their neatness annotations.


Qualitative Results


Qualitative analysis of the proposed method in predicting the aesthetic measure of the document. (a) The top scoring document images, having properly aligned and beautiful handwritten text. (b) The lowest scoring document images where one can observe inconsistent word spacing, skew and highly irregular word formation. (c) Sample pairs of word images from the human verification experiment where the words in the first column are predicted better than the words in the second column whereas the third column denotes whether the prediction agrees with human judgment.

Related Publications


  • Anshuman Majumdar, Praveen Krishnan and C.V. Jawahar - Visual Aesthetic Analysis for Handwritten Document Images, 15th International Conference on Frontiers in Handwriting Recognition, Shenzhen, China (ICFHR), 2016. [PDF]


Coming Soon...

Associated People

  • Anshuman Majumdar
  • Praveen Krishnan
  • C.V. Jawahar

Information Retrieval from Large Document Image Collections





We focus on the challenging problem of information retrieval from large document image collections. We propose to develop algorithms and approaches that are scalable to large datasets. We use and extend ideas from machine learning (ML), information retrieval (IR) and computer vision (CV) for this task. Our results are expected to impact the way retrieval is carried out from document images (documents that have textual content but are stored in image format). Effective retrieval systems built over textual content (often crawled from the web) have changed the way we look at multimedia collections. Since we work on images, traditional IR solutions are not directly applicable. A popular approach is to recognize the images (e.g. with an OCR) and build a textual representation. However, recognizers can be brittle and produce noisy outputs in many practical settings (e.g. historic documents, handwritten documents, Indian language documents, etc.). We design representations that can scale seamlessly to millions of document images from a small corpus of annotated datasets.

Word Image Retrieval using Bag of Visual Words





In this work, we present a Bag of Visual Words (BoVW) based approach to retrieve similar word images from a large database, efficiently and accurately. We show that a text retrieval system can be adapted to build a word image retrieval solution. This helps in achieving scalability. We demonstrate the method on more than 1 Million word images with a sub-second retrieval time. We validate the method on four Indian languages, and report a mean average precision of more than 0.75. To address the lack of spatial structure in the BoVW representation, we re-rank the retrieved list.
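The idea of reusing text-retrieval machinery for word images can be sketched with an inverted index over visual-word ids. The sketch assumes each word image has already been converted to a list of quantized visual-word ids, and it scores matches with a simple smoothed TF-IDF weight; the weighting, the function names and the toy data are assumptions, and the spatial re-ranking step mentioned above is not shown.

```python
import math
from collections import Counter, defaultdict

def build_index(images):
    """images: dict of image_id -> list of visual-word ids."""
    index = defaultdict(dict)  # visual word -> {image_id: term frequency}
    for image_id, words in images.items():
        for word, count in Counter(words).items():
            index[word][image_id] = count
    return index

def search(query_words, index, n_images):
    """Rank indexed images by a TF-IDF style score against the query."""
    scores = defaultdict(float)
    for word, q_count in Counter(query_words).items():
        postings = index.get(word, {})
        if not postings:
            continue
        # Smoothed inverse document frequency (one common variant).
        idf = math.log(1.0 + n_images / len(postings))
        for image_id, tf in postings.items():
            scores[image_id] += (q_count * idf) * (tf * idf)
    return sorted(scores, key=scores.get, reverse=True)

# Made-up database of three word images described by visual-word ids.
images = {
    "w1": [3, 3, 7, 9],
    "w2": [3, 7, 7, 9],
    "w3": [1, 2, 4],
}
index = build_index(images)
results = search([3, 3, 7, 9], index, n_images=len(images))
print(results)
```

Only images that share at least one visual word with the query are touched, which is what lets an inverted index keep query time low as the collection grows.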


Key Features

  • Language-independent system: demonstrated on 5 different languages.
  • Scalable to huge datasets: demonstrated on 1 million images.
  • Handles noisy document images: demonstrated on a dataset for which commercial OCRs fail.

Related Publications

  • Ravi Sekhar and C V Jawahar - Word Image Retrieval Using Bag of Visual Words, Proceedings of the 10th IAPR International Workshop on Document Analysis Systems, 27-29 Mar. 2012, ISBN 978-1-4673-0868-7, pp. 297-301, Queensland, Australia. [PDF] [Poster] [bibtex]



Content Level Access to Digital Library of India Pages


In this work, we propose a framework for content level access to the scanned pages of Digital Library of India (DLI). We propose a search scheme which fuses noisy OCR output and holistic visual features for content level access to the DLI pages. Visual content is captured using Bag of Visual Words (BoVW) approach. The fusion scheme improves over the individual methods in terms of mean Average Precision (mAP) and mean precision at 10 (mPrec@10). We exploit the fact that OCR has a high precision while BoVW has a high recall.
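To make the fusion and the mPrec@10 metric concrete, the sketch below combines two per-page score lists with a weighted sum and then measures precision at k. The actual fusion scheme is not described on this page, so the weighted sum, the `alpha` parameter, the function names and the toy scores are all assumptions chosen for illustration.

```python
def fuse_scores(ocr_scores, bovw_scores, alpha=0.5):
    """Rank pages by a weighted sum of OCR and BoVW scores.

    Pages missing from one list get a score of 0 from that list.
    """
    pages = set(ocr_scores) | set(bovw_scores)
    fused = {
        p: alpha * ocr_scores.get(p, 0.0) + (1.0 - alpha) * bovw_scores.get(p, 0.0)
        for p in pages
    }
    # Sort by descending score; break ties by page id for determinism.
    return sorted(fused, key=lambda p: (-fused[p], p))

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked pages that are relevant."""
    return sum(1 for p in ranked[:k] if p in relevant) / k

# Made-up scores: OCR search returns few pages, BoVW search returns more.
ocr_scores = {"p1": 1.0, "p2": 0.9}
bovw_scores = {"p1": 0.6, "p2": 0.2, "p3": 0.7, "p4": 0.5}
ranked = fuse_scores(ocr_scores, bovw_scores, alpha=0.5)
relevant = {"p1", "p3", "p4"}
print(ranked, precision_at_k(ranked, relevant, k=4))
```

In this toy setup the BoVW list contributes pages that the OCR search missed, which mirrors the high-recall role the text attributes to BoVW.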

Digital Library of India

The Digital Library of India (DLI) has emerged as one of the largest collections of document images in Indian scripts. DLI, as a part of the Million Book Project (MBP), has contributed to free access to knowledge for billions of people. In addition, it has helped in digitally archiving rare and precious books in many Indian languages. All of this digital content is stored as scanned images of printed documents. A major challenge presently faced by the DLI is the lack of content level access to the individual pages.






Associated People