
Skyline Segmentation Using Shape-constrained MRFs


Rashmi Tone Vilas (homepage)

MRF energy minimization has been used for image segmentation in a wide range of applications. Standard MRF energy minimization techniques are computationally expensive, and incorporating higher-order priors such as shape (and its associated parameters) is either very complex, computationally expensive, or requires prior information such as the shape's location. Furthermore, a pure MRF formulation does not provide semantic understanding: information about the structure of a skyline, such as depth, cannot be recovered from its output. Standard semantic segmentation methods that use geometric context are restricted to very few geometric classes, and those that exploit a specific “tiered” structure are exponential in the number of labels.
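
For reference, the pairwise MRF energy alluded to above has the standard form (the notation here is generic, not specific to this thesis):

    E(\mathbf{x}) = \sum_{i \in \mathcal{V}} \theta_i(x_i) + \sum_{(i,j) \in \mathcal{E}} \theta_{ij}(x_i, x_j)

where the unary terms \theta_i encode per-pixel data costs, the pairwise terms \theta_{ij} encourage smoothness between neighbouring pixels, and shape priors enter as additional higher-order terms over larger groups of variables.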

Our aim is to extract the detailed structure of a skyline, i.e. the individual buildings and their depth. In this setting there is no restriction on the number of labels. The problem is challenging for several reasons: complex occlusion patterns, a large number of labels, and intra-region color and texture variations. We propose an approach for segmenting the individual buildings in typical skylines. Our approach is based on a Markov Random Field (MRF) formulation that exploits the fact that such images contain overlapping objects of similar shapes exhibiting a “tiered” structure. Our contributions are the following:

  • We introduce Skyline-12, a dataset of 120 skyline images from 12 cities around the world. All images are manually annotated and augmented with meta-data such as initial boundaries and seeds.
  • We analyze and integrate low-level features such as color, texture and shape that are useful for segmenting skylines.
  • We propose a fast, accurate and robust method to extract the individual buildings of a skyline by exploiting the “tiered” structure of skylines and incorporating a rectangular shape prior in the MRF formulation.

For simple shapes such as rectangles, our formulation is significantly faster to optimize than a standard MRF approach, while also being more accurate. We experimentally evaluate various MRF formulations and demonstrate the effectiveness of our approach in segmenting skyline images.
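
To make the “tiered” idea concrete, the sketch below solves a heavily simplified two-tier version of the problem: a single sky/building boundary per column, found by dynamic programming over columns. It only illustrates why a column-wise tiered structure makes optimization fast; the thesis formulation handles many overlapping rectangular buildings and their depth ordering, which this toy code does not. The smoothness weight and the unary costs are placeholders.

    import numpy as np

    def two_tier_skyline_dp(unary_sky, unary_bld, smooth=1.0):
        # unary_sky[r, c] / unary_bld[r, c]: per-pixel cost of labelling pixel
        # (r, c) as sky / building (e.g. negative log-likelihoods from colour).
        # For each column we pick a boundary row h: rows [0, h) are sky and
        # rows [h, H) are building; a linear penalty keeps the boundary smooth
        # across columns.  Runs in O(W * H^2).
        H, W = unary_sky.shape
        sky_cum = np.vstack([np.zeros((1, W)), np.cumsum(unary_sky, axis=0)])
        bld_cum = np.vstack([np.zeros((1, W)), np.cumsum(unary_bld, axis=0)])
        col_cost = np.empty((W, H + 1))
        for h in range(H + 1):
            col_cost[:, h] = sky_cum[h] + (bld_cum[H] - bld_cum[h])

        heights = np.arange(H + 1)
        pairwise = smooth * np.abs(heights[:, None] - heights[None, :])

        dp = col_cost[0].copy()                  # best cost ending at column 0
        back = np.zeros((W, H + 1), dtype=int)
        for c in range(1, W):
            trans = dp[:, None] + pairwise       # [previous h, current h]
            back[c] = trans.argmin(axis=0)
            dp = trans.min(axis=0) + col_cost[c]

        h = int(dp.argmin())                     # backtrack the best boundary
        boundary = [h]
        for c in range(W - 1, 0, -1):
            h = back[c, h]
            boundary.append(h)
        return boundary[::-1]                    # skyline height per column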

We propose both interactive and automatic methods for segmenting skylines. The interactive setting gives accurate output and a fast way to segment skylines given input seeds from the user, while the automatic setting provides about a 25% improvement over state-of-the-art low-level automatic segmentation methods. Our approach can be generalized to other shapes, and the detailed structure of a skyline can be used in many applications such as 3D reconstruction of a skyline from a single image.

 

Year of completion: January 2015
Advisor: Prof. C. V. Jawahar


Enhancing Bag of Words Image Representations

Vinay Garg (Homepage)

The bag of visual words model, inspired by text classification, has been used extensively for computer vision tasks such as image classification, retrieval and recognition. In text classification the vocabulary is a fixed, finite set of words of a particular language; this is not the case in the visual domain. There is no fixed set of visual words, and in fact no inherent concept of a word in images: even a small change such as rotation, translation, a change of viewing angle or lighting drastically alters the information a machine perceives. Changes that make no difference to a human observer produce, for a machine, entirely different images. To overcome this, the vision community introduced the concept of visual words, which are analogous to textual words. Visual words are, however, not as well defined, because the visual domain is far larger than the textual domain with its finite set of words. Using these visual words, an image is represented by the frequencies of the visual words occurring in it, and these representations are then used for various vision tasks.
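
A minimal sketch of this traditional pipeline follows, assuming local descriptors (e.g. SIFT) have already been extracted per image; the vocabulary size and clustering settings are placeholders, not the values used in the thesis.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(descriptor_sets, k=1000):
        # Cluster local descriptors pooled from the training images into k
        # visual words; the cluster centres form the visual vocabulary.
        return KMeans(n_clusters=k, n_init=4).fit(np.vstack(descriptor_sets))

    def bow_histogram(descriptors, vocab):
        # Hard assignment: each descriptor votes for its nearest visual word;
        # the image is represented by the L1-normalised word frequencies.
        words = vocab.predict(descriptors)
        hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)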

In this thesis we aim to improve these image representations, since the accuracy and performance of vision models depend directly on the quality of the image representations given to them as input. We start from the traditional bag of visual words, study various practical issues and drawbacks of that approach, and refine the steps of its pipeline one at a time. In doing so we devise novel strategies to overcome some of the issues observed while studying the traditional approaches. The approaches we apply involve parameters that need fine tuning; we discuss the effect of each parameter in detail, with empirical results to support our hypotheses, and conclude that our representations are better than the traditional approaches currently in use.

To address the information loss caused by hard assignments in the traditional bag of words, we analyze various soft assignment techniques. Replacing hard assignments with soft assignments improves classification results drastically. Comparing different soft assignment techniques against each other, we find that absolute soft assignments perform better than relative soft assignments. We demonstrate the superiority of our approaches on several popular datasets.
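
One common form of kernel-based soft assignment is sketched below: each descriptor spreads its vote over all visual words with a Gaussian kernel on its distance to each centre. This is a generic illustration, not the thesis's specific absolute/relative variants, and the kernel width sigma is a placeholder.

    import numpy as np

    def soft_histogram(descriptors, centres, sigma=1.0):
        # Soft assignment: weight each descriptor's contribution to every
        # visual word by a Gaussian kernel on the descriptor-centre distance,
        # instead of voting only for the single nearest word.
        d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2.0 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)        # normalise per descriptor
        hist = w.sum(axis=0)
        return hist / hist.sum()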

Recently the vision community has shown that Fisher vector image representations outperform bag of words representations. This boost in performance comes, firstly, from the use of soft assignments and, secondly, from reduced information loss, since Fisher vectors capture the deviation of each visual feature from the mean. Like any other approach, however, they have their drawbacks: Fisher vector representations are very large, and they are not inherently discriminative. We therefore introduce sparseness to reduce the effective size of the representations, and add class information to make them discriminative. These additions reduce the high storage requirements while the added class information also improves performance. We test this on several datasets, which support our claim.
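
For context, the standard Fisher vector encodes, for each Gaussian k of a mixture model fit to the local descriptors, the normalized gradients with respect to its mean and standard deviation (this is the standard formulation from the literature, not the thesis's sparse discriminative variant):

    \mathcal{G}^{X}_{\mu_k} = \frac{1}{N\sqrt{w_k}} \sum_{n=1}^{N} \gamma_n(k)\,\frac{x_n - \mu_k}{\sigma_k},
    \qquad
    \mathcal{G}^{X}_{\sigma_k} = \frac{1}{N\sqrt{2 w_k}} \sum_{n=1}^{N} \gamma_n(k)\left[\frac{(x_n - \mu_k)^2}{\sigma_k^2} - 1\right]

where \gamma_n(k) is the posterior (soft assignment) of descriptor x_n to Gaussian k, and w_k, \mu_k, \sigma_k are the mixture weight, mean and standard deviation. Concatenating these gradients over all k gives the Fisher vector, whose dimensionality is 2kd for d-dimensional descriptors, which explains the storage concern mentioned above.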

Driven by the hypothesis that improving individual steps of the image representation pipeline improves the final representation, we try various techniques to refine these steps and thereby improve the performance of our model. After improving the final step of creating image representations from the visual words, we turn to improving the set of visual words (the vocabulary) itself. We find that many of the visual words used to build the image representations are not actually useful, and that the visual vocabulary contains many redundant words. To further improve our representations, we devise a novel technique that combines visual words from different types of vocabularies (each capturing a different type of information from a given set of images) and selects the best of them to form a final global vocabulary, which is then used to compute the final image representations. Again, we use benchmark datasets to demonstrate that our hypothesis holds.
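
A rough sketch of combining vocabularies is given below, assuming all vocabularies are built over the same descriptor space; the pruning rule used here (a minimum document frequency) is only a stand-in, since the thesis's actual selection criterion is not reproduced in this abstract.

    import numpy as np

    def combine_vocabularies(vocabs, train_descriptor_sets, min_df=5):
        # vocabs: fitted clustering models (anything exposing cluster_centers_)
        # built over the same descriptor space.  Merge all their centres, then
        # keep only the visual words that are the nearest centre for some
        # descriptor in at least `min_df` training images.
        centres = np.vstack([v.cluster_centers_ for v in vocabs])
        df = np.zeros(len(centres), dtype=int)
        for desc in train_descriptor_sets:
            d2 = ((desc[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
            df[np.unique(d2.argmin(axis=1))] += 1
        return centres[df >= min_df]             # pruned global vocabulary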

In this thesis we attempt to solve the classification task in a better way. Although we do not solve these research tasks perfectly (i.e. reach a perfect score), we hope that our findings will at least provide a starting point for new directions that lead towards the ultimate goal of replicating human vision.

     

Year of completion: May 2015
Advisor: Prof. C. V. Jawahar



Related Publications

• Vinay Garg, Siddhartha Chandra and C. V. Jawahar - Sparse Discriminative Fisher Vectors in Visual Classification. Proceedings of the 8th Indian Conference on Computer Vision, Graphics and Image Processing, 16-19 Dec. 2012, Bombay, India. [PDF]

• Vinay Garg, Sreekanth Vempati and C. V. Jawahar - Bag of Visual Words: A Soft Clustering Based Exposition. Proceedings of the 3rd National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, ISBN 978-0-7695-4599-8, pp. 37-40, 15-17 Dec. 2011, Hubli, India. [PDF]

     


