
Understanding Short Social Video Clips using Visual-Semantic Joint Embedding


Aditya Singh 

Abstract

The amount of video recorded and shared on the internet has grown massively in the past decade, largely due to the wide availability of cheap mobile camera phones and easy access to social media websites and their mobile applications. Applications such as Instagram, Vine and Snapchat allow users to record and share content in a matter of seconds. These are far from the only media-sharing platforms available, but their monthly active user counts of roughly 600, 200 and 150 million respectively indicate the interest people have in recording, sharing and viewing such content [1,3]. The number of photos and videos shared on Instagram alone exceeds 40 billion [1], and Vine hosts approximately 40 million user-created videos [3] that are played around a billion times a day. This cheaply available data can empower many learning tasks that require large amounts of curated data; the videos also contain novel viewpoints and reflect real-world dynamics. Unlike the content on older, established websites such as YouTube, the content shared here is shorter (typically a few seconds) and comes with a description and associated hash-tags. Hash-tags can be thought of as keywords assigned by the user to highlight the contextual aspects of the shared media. However, unlike English words they have no definite meaning of their own: their meaning depends heavily on the content alongside which they are used, and deciphering a hash-tag generally requires the associated media. Hash-tags are therefore more ambiguous and harder to categorise than English words.

In this thesis, we attempt to shed some light on the applicability and utility of videos shared on the popular social media website vine.co. The videos shared there, called vines, are typically 6 seconds long and carry a description composed of a mixture of English words and hash-tags. We attempt to recognise actions in, and recommend hash-tags for, an unseen vine by exploiting the visual content, the semantic content and the hash-tags provided with the vines. In doing so, we try to show how this untapped resource, a popular social media format, can benefit resource-intensive tasks that require large amounts of curated data.

Action recognition deals with categorising the action performed in a video into one of a set of seen categories. With recent developments, considerable precision has been achieved on established datasets; in an open-world scenario, however, these approaches fail because the conditions are unpredictable. We show that vines are a much more difficult category of videos than those currently in circulation for such tasks. To avoid manual annotation of vines, which would defeat the purpose of their easy availability, we develop a semi-supervised bootstrapping approach. We iteratively build an efficient classifier that leverages an existing dataset for 7 action categories as well as the visual and semantic information present in the vines. The existing dataset forms the source domain and the vines compose the target domain. We use the semantic word2vec space as a common subspace in which to embed video features from both the labeled source domain and the unlabeled target domain. Our method incrementally augments the labeled source with target samples and iteratively modifies the embedding function to bring the source and target distributions together.
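To make the bootstrapping idea concrete, the following is a minimal sketch: video features and word2vec label vectors are assumed to be precomputed, a ridge regressor stands in for the embedding function, and the most confident target predictions are folded back into the training pool each round. All names and parameters are illustrative and not taken from the thesis code.

```python
# Sketch of semi-supervised bootstrapping in a shared word2vec space (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

def bootstrap_embedding(src_X, src_label_ids, tgt_X, label_vecs,
                        n_rounds=5, keep_frac=0.1):
    """Iteratively map video features into the word2vec label space and
    augment the labeled source pool with the most confident target samples."""
    X, Y = src_X, label_vecs[src_label_ids]              # labeled source pool
    for _ in range(n_rounds):
        reg = Ridge(alpha=1.0).fit(X, Y)                  # embedding: feature -> word2vec
        tgt_emb = reg.predict(tgt_X)
        # cosine similarity of each embedded target video to every label vector
        sims = (tgt_emb @ label_vecs.T) / (
            np.linalg.norm(tgt_emb, axis=1, keepdims=True)
            * np.linalg.norm(label_vecs, axis=1) + 1e-8)
        pred, conf = sims.argmax(1), sims.max(1)
        # keep the most confident target predictions as pseudo-labels
        top = np.argsort(-conf)[: max(1, int(keep_frac * len(tgt_X)))]
        X = np.vstack([src_X, tgt_X[top]])
        Y = np.vstack([label_vecs[src_label_ids], label_vecs[pred[top]]])
    return reg
```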
Additionally, we use a multi-modal representation that incorporates the noisy semantic information available in the form of hash-tags. Hash-tags form an integral part of vines: adding more, relevant hash-tags expands the set of categories for which a vine can be retrieved, enhancing its utility by supplying missing tags and widening its scope. We design a hash-tag recommendation system that assigns tags to an unseen vine from 29 categories. The system uses only the vine's visual content, after accumulating knowledge gathered in an unsupervised fashion. We build a Tag2Vec space from a corpus of 10 million hash-tags using skip-grams, and then train an embedding function that maps video features into this low-dimensional Tag2Vec space. We learn the embedding from 29 categories of short video clips with hash-tags. A query video without any tag information can then be mapped directly into the vector space of tags using the learned embedding, and relevant tags can be found by a simple nearest-neighbour retrieval in the Tag2Vec space. We validate the relevance of the tags suggested by our system qualitatively and quantitatively with a user study.
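The tag-recommendation step can be illustrated with a small sketch: a skip-gram model trained on per-vine hash-tag lists stands in for the Tag2Vec space, and video_to_tag2vec is a placeholder for the learned video-to-Tag2Vec embedding. This is an assumption-laden illustration, not the system built in the thesis.

```python
# Sketch of tag recommendation by nearest-neighbour lookup in a Tag2Vec space.
import numpy as np
from gensim.models import Word2Vec

# Each "sentence" is the list of hash-tags attached to one vine (toy corpus here).
tag_corpus = [["#dog", "#puppy", "#cute"], ["#soccer", "#goal", "#sports"]]
tag2vec = Word2Vec(sentences=tag_corpus, vector_size=100, sg=1, min_count=1)

def recommend_tags(video_feature, video_to_tag2vec, topn=5):
    """Map a video feature into Tag2Vec space and return the nearest tags."""
    query = video_to_tag2vec(video_feature)               # learned embedding function
    return tag2vec.wv.similar_by_vector(
        np.asarray(query, dtype=np.float32), topn=topn)
```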

 

Year of completion: December 2017
Advisor: P J Narayanan

Related Publications

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P. J. Narayanan - Learning to Hash-Tag Videos with Tag2Vec, Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing. ACM, 2016. [PDF]

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P J Narayanan - From Traditional to Modern: Domain Adaptation for Action Classification in Short Social Video Clips, 38th German Conference on Pattern Recognition (GCPR 2016), Hannover, Germany, September 12-15, 2016. [PDF]

  • Aditya Deshpande, Siddharth Choudhary, P J Narayanan, Krishna Kumar Singh, Kaustav Kundu, Aditya Singh, Apurva Kumar - Geometry Directed Browser for Personal Photographs, Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing (ICVGIP), 16-19 Dec 2012, Bombay, India. [PDF]


Downloads

thesis

Fine Pose Estimation and Region Proposals from a Single Image


Sudipto Banerjee (Home Page)

Abstract

Understanding the precise 3D structure of an environment is one of the fundamental goals of computer vision, and it is challenging due to factors such as appearance variation, illumination, pose, noise, occlusion and scene clutter. A generic solution to the problem is ill-posed because depth information is lost during imaging. In this thesis, we consider a specific but common situation in which the scene contains known objects. Given 3D models of a set of known objects and a cluttered scene image, we try to detect these objects in the image and align the 3D models to their images to find their exact pose. We develop an approach that formulates this as a 3D-to-2D alignment problem, and we also deal with pose estimation of 3D articulated objects in images. We evaluate the proposed method on the BigBird dataset and on our own tabletop dataset, and present experimental comparisons with state-of-the-art methods. To find the pose of an object, we use a hierarchical approach in which we first obtain an initial estimate of the pose and then refine it using a robust algorithm; obtaining a good initial estimate is crucial, as the refinement depends entirely on it. Estimating object (region) proposals from an image is a well-known but difficult task, and the complexity of the problem intensifies in the presence of object-object interaction and background clutter. We tackle this with a robust Convolutional Neural Network based method that learns object proposals in a supervised manner. As we need region proposals at the object level, we solve the problem of instance-level semantic segmentation, where each pixel in the image is classified into one of the known classes and two pixels are labelled differently if they belong to two different instances of the same class. We show quantitative and qualitative comparisons of our proposed network models with previous approaches, and show our results on the challenging PASCAL VOC dataset.
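As a rough illustration of the 3D-to-2D alignment step (recovering an initial pose estimate from correspondences between 3D model points and their 2D image projections), the sketch below uses OpenCV's RANSAC PnP solver as a stand-in; the thesis's actual detection, alignment and refinement pipeline is more involved.

```python
# Sketch of 3D-to-2D pose alignment from known correspondences (illustrative only).
import numpy as np
import cv2

def estimate_pose(model_pts_3d, image_pts_2d, camera_matrix):
    """Return the object rotation (Rodrigues vector) and translation."""
    dist_coeffs = np.zeros(4)                       # assume an undistorted image
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        model_pts_3d.astype(np.float32),            # (N, 3) points on the 3D model
        image_pts_2d.astype(np.float32),            # (N, 2) detected image points
        camera_matrix, dist_coeffs)
    if not ok:
        raise RuntimeError("pose estimation failed")
    return rvec, tvec
```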


Year of completion: March 2018
Advisor: Anoop M Namboodiri

Related Publications

  • Sudipto Banerjee, Sanchit Aggarwal, Anoop M. Namboodiri - Fine Pose Estimation of Known Objects in Cluttered Scene Images, Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition, 03-06 Nov 2015, Kuala Lumpur, Malaysia. [PDF]


Downloads

thesis

Efficient Annotation of Objects for Video Analysis


Swetha Sirnam (Home Page)

Abstract

The field of computer vision is rapidly expanding, with significantly more processing power and memory available today than in previous decades. Video has become one of the most popular visual media for communication and entertainment, and automatic analysis and understanding of video content is one of the long-standing goals of computer vision. A fundamental problem is modelling the appearance and behaviour of the objects in videos. Such models depend mainly on the problem definition; typically, a change in the problem statement is followed by changes in the annotation and its complexity. Creating large-scale datasets in this setting through manual annotation is monotonous, time-consuming and non-scalable. Although the vision community has advanced in problem solving, data generation and annotation remain difficult: annotation is expensive, tedious and involves a great deal of human effort, and even after annotation it is essential to validate the quality of the annotations, which is again a tiresome process. To address this challenge and move towards practical large-scale annotated video datasets, we investigate methods to autonomously learn and adapt object models using temporal information in videos, which involves learning robust representations of the video. The aim of this thesis is two-fold: first, to propose solutions for efficient and accurate object annotation in video sequences, and second, to raise awareness in the community about the importance of this problem and the attention it deserves. As our first contribution, we propose an efficient, scalable and accurate object bounding-box annotation method for large-scale, complex video datasets. We focus on minimising annotation effort while simultaneously increasing the annotation-propagation accuracy, to obtain a precise, tight bounding box around the object of interest. Using a self-training approach, we combine a semi-automatic initialisation method with an energy-minimisation framework to propagate the annotations; using energy minimisation for segmentation yields accurate, tight bounding boxes around the object. We validate the results quantitatively and qualitatively on publicly available datasets. In the second half, we propose an annotation scheme for human pose in video sequences. The proposed model starts from a fully automatic initialisation produced by any generic state-of-the-art method, but this initialisation is prone to error due to the challenges inherent in video data. We exploit the redundant information available across frames and build the model on the assumption of temporal smoothness in videos. We formulate the task as a sequence-to-sequence learning problem whose architecture uses a Long Short-Term Memory encoder-decoder to encode the temporal context and annotate the pose. We show results on state-of-the-art datasets.
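A minimal PyTorch sketch of the sequence-to-sequence formulation is given below: an LSTM encoder summarises the noisy pose sequence and an LSTM decoder, conditioned on that context, emits the corrected poses. The joint count, layer sizes and tensor shapes are illustrative assumptions, not the thesis's architecture.

```python
# Sketch of an LSTM encoder-decoder for pose correction (illustrative sizes).
import torch
import torch.nn as nn

class PoseSeq2Seq(nn.Module):
    def __init__(self, n_joints=14, hidden=256):
        super().__init__()
        d = n_joints * 2                                  # (x, y) per joint
        self.encoder = nn.LSTM(d, hidden, batch_first=True)
        self.decoder = nn.LSTM(d, hidden, batch_first=True)
        self.out = nn.Linear(hidden, d)

    def forward(self, noisy_poses):                       # (B, T, n_joints*2)
        _, state = self.encoder(noisy_poses)              # temporal context
        dec_out, _ = self.decoder(noisy_poses, state)     # condition on context
        return self.out(dec_out)                          # corrected pose sequence

model = PoseSeq2Seq()
corrected = model(torch.randn(2, 30, 28))                 # 2 clips, 30 frames each
```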


Year of completion: June 2018
Advisors: C V Jawahar, Vineeth Balasubramanian

Related Publications

  • Sirnam Swetha, Vineeth N Balasubramanian and C. V. Jawahar - Sequence-to-Sequence Learning for Human Pose Correction in Videos, 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017. [PDF]

  • Sirnam Swetha, Anand Mishra, Guruprasad M. Hegde and C. V. Jawahar - Efficient Object Annotation for Surveillance and Automotive Applications - Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshop (WACVW 2016), March 7-9, 2016. [PDF]

  • Rajat Aggarwal, Sirnam Swetha, Anoop M. Namboodiri, Jayanthi Sivaswamy, C. V. Jawahar - Online Handwriting Recognition using Depth Sensors, Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015, Nancy, France. [PDF]


Downloads

thesis

Landmark Detection in Retinal Images


Gaurav Mittal (Home Page)

Abstract

Advances in medicine and imaging systems have resulted in a series of devices that sense, record, transform and process digital data. For the human eye, this digital data takes the form of fundus images, which depict the back part of the retina. Automatic analysis of these images is required to process large amounts of data and to help doctors make the final diagnosis. Retinal images have three major visible landmarks: the optic disk (OD), the macula and the blood vessels. In retinal images, the OD appears as a bright elliptical structure, the macula as a small dark region, and the blood vessels as dark, tree-branch-like structures. In this thesis, we propose methods for detecting these retinal landmarks. Accurate detection of the OD and macula is important because computer-assisted diagnosis systems use the locations of these landmarks to understand the retinal image and to exploit clinical facts about the retina for improving diagnosis. Landmark detection also aids in assessing the severity of diseases based on the locations of abnormalities relative to these landmarks. We first used a retinal atlas for OD and macula detection, an idea inspired by 3D brain atlases [34]. We create two retinal atlases, an intensity atlas and a probability atlas, by annotating public datasets locally. We use the probabilistic atlas for OD and macula detection, but the detection rates and accuracy of this system are low. To achieve better detection, we then used Generalized Motion Patterns (GMP) [14][23] for OD and macula detection. A GMP is derived by inducing motion in an image, which smooths out unwanted information while highlighting the structures of interest. Our GMP-based detection is fully unsupervised, outperforms all other unsupervised methods, and produces results comparable to those of supervised methods; the proposed system is completely parallelizable and handles illumination differences efficiently. Blood vessels are another important retinal landmark, and we find that current research uses evaluation measures such as sensitivity, specificity, accuracy, area under the curve and the Matthews correlation coefficient to evaluate vessel segmentation performance. We identify several gaps in these measures and propose local accuracy, an extension of [39]. We show that local accuracy is especially useful in settings where segmentation of weak vessels and accurate estimation of vessel width are required.
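The GMP idea can be illustrated with a simplified sketch: copies of the image are "moved" (here, rotated about the centre) and the stack is coalesced pixel-wise, smearing out motion-variant detail while the structures of interest survive. The choice of motion, angles and coalescing function below are illustrative assumptions, not the exact formulation of [14][23].

```python
# Simplified generalized-motion-pattern sketch (illustrative parameters only).
import numpy as np
from scipy.ndimage import rotate

def gmp(image, angles=range(-10, 11, 2), coalesce=np.max):
    """Coalesce copies of the image rotated about its centre.

    np.max tends to preserve bright structures (e.g. the optic disc) while
    smearing out thin vessels; np.min would instead favour dark structures.
    """
    stack = np.stack([rotate(image, a, reshape=False, mode='nearest')
                      for a in angles], axis=0)
    return coalesce(stack, axis=0)
```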


Year of completion: June 2018
Advisor: Jayanthi Sivaswamy

Related Publications

  • Gaurav Mittal, Jayanthi Sivaswamy - Optic Disk and Macula Detection from Retinal Images using Generalized Motion Pattern, Proceedings of the Fifth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2015), 16-19 Dec 2015, Patna, India. [PDF]


Downloads

thesis

Multimodal Emotion Recognition from Advertisements with Application to Computational Advertising


Abhinav Shukla (Home Page)

Abstract

Advertisements (ads) are often filled with strong affective content covering a gamut of emotions, intended to capture viewer attention and convey an effective message. However, most approaches to computationally analysing the emotion in ads are based on the text modality, and only a limited amount of work has addressed affective understanding of advertisement videos from content-centric and user-centric perspectives. This work brings together recent advances in deep learning (especially visual recognition) and affective computing, and uses them to estimate the affect of advertisements. We first create a dataset of 100 ads, annotated by 5 experts and evenly distributed over the valence-arousal plane. We then perform content-based affect recognition via a transfer learning approach, estimating the affective content of this dataset using prior affective knowledge gained from a large annotated movie dataset. We employ both visual features from video frames and audio features from spectrograms to train our deep neural networks; this approach vastly outperforms the existing benchmark. It is also interesting to see how human physiological signals, such as Electroencephalography (EEG) recordings, provide useful affective insights into the content from a user-centric perspective. Using this time series of the brain's electrical activity, we train models that classify the emotional dimensions of the data. This lets us benchmark the user-centric performance against the content-centric deep learning models; we find that the user-centric models outperform the content-centric ones and set the state of the art in ad affect recognition. We also combine the two kinds of modalities (audiovisual and EEG) using decision fusion and find that the fused performance exceeds either single modality, showing that human physiological signals and the audiovisual content carry complementary affective information. We further use multi-task learning (MTL) on top of the features of each kind to exploit the intrinsic relatedness of the data and boost performance. Lastly, we validate the hypothesis that better affect estimation can enhance a real-world application: we supply the affective values computed by our methods to a computational advertising framework to obtain a video programme sequence with ads inserted at emotionally relevant points, chosen according to the affective relevance between the programme content and the ads. Multiple user studies find that our methods significantly outperform existing algorithms and come close to (and sometimes exceed) human-level performance, achieving much more emotionally relevant and non-disruptive advertisement insertion into a programme video stream.
In summary, this work (1) compiles an affective ad dataset capable of evoking coherent emotions across users; (2) explores the efficacy of content-centric convolutional neural network (CNN) features for affect recognition (AR), finding that CNN features outperform low-level audio-visual descriptors; (3) studies user-centric ad affect recognition from Electroencephalogram (EEG) responses acquired while viewing the content (with conventional classifiers as well as a novel CNN architecture for EEG), which outperform content descriptors; (4) examines a multi-task learning framework based on CNN and EEG features which provides state-of-the-art AR for ads; and (5) demonstrates how better affect predictions facilitate more effective computational advertising in a real-world application.
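As an illustration of the decision-fusion step, the sketch below combines per-modality class probabilities with a weighted sum; the weight and the two-class setup are assumptions made for the example, not values from the thesis.

```python
# Sketch of late (decision-level) fusion of audiovisual and EEG classifier outputs.
import numpy as np

def fuse_decisions(p_audiovisual, p_eeg, alpha=0.5):
    """Weighted late fusion of two modality-wise probability vectors."""
    p = alpha * np.asarray(p_audiovisual) + (1 - alpha) * np.asarray(p_eeg)
    p = p / p.sum()                      # renormalise fused scores
    return p, int(np.argmax(p))          # fused probabilities and predicted class

probs, label = fuse_decisions([0.3, 0.7], [0.6, 0.4], alpha=0.6)
```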


Year of completion: June 2018
Advisor: Ramanathan Subramanian

Related Publications

  • Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli and Ramanathan Subramanian - Evaluating Content-centric vs User-centric Ad Affect Recognition, In Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI'17), 2017. [PDF]

  • Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli and Ramanathan Subramanian - Affect Recognition in Ads with Application to Computational Advertising, In Proceedings of the 25th ACM International Conference on Multimedia (ACM MM '17), 2017. [PDF]


Downloads

thesis
