Understanding Short Social Video Clips using Visual-Semantic Joint Embedding

Aditya Singh 


 The number of videos recorded and shared on the internet has grown massively in the past decade. Most of this growth is due to the wide availability of cheap mobile camera phones and easy access to social media websites and their mobile applications. Applications such as Instagram, Vine, and Snapchat allow users to record and share their content in a matter of seconds. These three are not the only media sharing platforms available, but their monthly active user counts of 600, 200, and 150 million respectively indicate the interest people have in recording, sharing, and viewing content [1,3]. The number of photos and videos collectively shared on Instagram alone exceeds 40 billion [1]. Vine contains approximately 40 million user-created videos [3], which are played 1 billion times a day. This cheaply available data can empower many learning tasks that require huge amounts of curated data. The videos also contain novel viewpoints and reflect real-world dynamics. Unlike the content on older, established websites such as YouTube, the content shared here is short (typically a few seconds) and comes with a description and associated hash-tags. Hash-tags can be thought of as keywords assigned by the user to highlight the contextual aspect of the shared media. However, unlike English words, they do not have a definite meaning, as their interpretation relies heavily on the content alongside which they are used. To decipher the meaning of a hash-tag, one needs the associated media. Hence, hash-tags are more ambiguous and more difficult to categorise than English words. In this thesis, we attempt to shed some light on the applicability and utility of videos shared on the popular social media website vine.co. The videos shared there are called vines and are typically 6 seconds long. Each comes with a description composed of a mixture of English words and hash-tags.
We address two tasks: recognising actions in vines and recommending hash-tags for an unseen vine, utilising the visual and semantic content of vines for the former and their hash-tags for the latter. Through these tasks we try to show how this untapped resource, a popular social media format, can benefit resource-intensive tasks that require huge amounts of curated data. Action recognition deals with assigning the action performed in a video to one of the seen categories. With recent developments, considerable precision has been achieved on established datasets. However, in an open-world scenario these approaches fail, as the conditions are unpredictable. We show that vines are a much more difficult category of videos than those currently used for such tasks. Manually annotating vines would defeat the purpose of using easily available data, so we develop a semi-supervised bootstrapping approach instead. We iteratively build an efficient classifier that leverages an existing dataset for 7 action categories together with the visual and semantic information present in the vines. The existing dataset forms the source domain and the vines form the target domain. We use the semantic word2vec space as a common subspace in which to embed video features from both the labeled source domain and the unlabeled target domain. Our method incrementally augments the labeled source with target samples and iteratively modifies the embedding function to bring the source and target distributions together. Additionally, we use a multi-modal representation that incorporates the noisy semantic information available in the form of hash-tags. Hash-tags form an integral part of vines. Adding more relevant hash-tags can expand the categories for which a vine can be selected, which enhances the utility of vines by providing missing tags and widening their scope. We design a hash-tag recommendation system that assigns tags to an unseen vine from 29 categories.
This system uses only a vine's visual content, after accumulating knowledge gathered in an unsupervised fashion. We build a Tag2Vec space using skip-grams on a corpus of 10 million hash-tags. We then train an embedding function to map video features to the low-dimensional Tag2Vec space. We learn this embedding for 29 categories of short video clips with hash-tags. A query video without any tag information can then be mapped directly to the vector space of tags using the learned embedding, and relevant tags can be found by a simple nearest-neighbor retrieval in the Tag2Vec space. We validate the relevance of the tags suggested by our system qualitatively and quantitatively with a user study.
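
The retrieval step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not state the form of the embedding function, so a linear map fitted by ridge regression is assumed here, cosine similarity is assumed as the nearest-neighbor measure, and the names fit_embedding and recommend_tags are hypothetical.

```python
import numpy as np


def fit_embedding(features, targets, lam=1e-3):
    """Fit a linear map W (assumed form) so that W @ feature ~ target tag vector.

    features: (n, d) video features; targets: (n, m) tag-space vectors.
    Uses the closed-form ridge regression solution.
    """
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)
    # Solve (X^T X + lam I) B = X^T Y, then W = B^T has shape (m, d).
    return np.linalg.solve(gram, features.T @ targets).T


def recommend_tags(video_feature, W, tag_vectors, tag_names, k=3):
    """Embed a video feature into the tag space and return its k nearest tags."""
    query = W @ video_feature
    query = query / np.linalg.norm(query)
    tags = tag_vectors / np.linalg.norm(tag_vectors, axis=1, keepdims=True)
    scores = tags @ query              # cosine similarity to every tag
    top = np.argsort(-scores)[:k]      # indices of the highest scores first
    return [tag_names[i] for i in top]
```

In this sketch, tag_vectors would hold one row per hash-tag from a skip-gram model trained on the hash-tag corpus, and features would come from any video descriptor; both are supplied by the caller, so the sketch stays independent of how they are computed.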


Year of completion: December 2017
Advisor: P J Narayanan

Related Publications

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P J Narayanan - Learning to hash-tag videos with Tag2Vec. Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing. ACM, 2016. [PDF]

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P J Narayanan - From Traditional to Modern: Domain Adaptation for Action Classification in Short Social Video Clips. 38th German Conference on Pattern Recognition (GCPR 2016), Hannover, Germany, September 12-15, 2016. [PDF]

  • Aditya Deshpande, Siddharth Choudhary, P J Narayanan, Krishna Kumar Singh, Kaustav Kundu, Aditya Singh, Apurva Kumar - Geometry Directed Browser for Personal Photographs. Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing, 16-19 Dec. 2012, Bombay, India. [PDF]