Enhancing Bag of Words Image Representations

Vinay Garg (Homepage)

The bag of visual words model, inspired by text classification, has been used extensively for computer vision tasks such as image classification, image retrieval, and image recognition. In text classification, the vocabulary is a fixed, finite set of words from a particular language. The visual domain has no such fixed set; in fact, images have no inherent concept of words. The visual domain is so complex that even a small change in rotation, translation, viewing angle, or lighting can have a huge impact on the information a machine perceives in an image. Such changes make little difference to a human observer, but to a machine the results are all different images. To overcome this problem, the vision community has defined visual words, which are analogous to textual words. Visual words are not as well defined as textual words, however, because visual data spans a far larger domain than text, which has a finite number of words. Using these visual words, we represent an image by the frequency of each visual word in it, and in turn use these representations for various vision tasks.
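
The frequency-based representation described above can be illustrated with a minimal sketch. It assumes that a vocabulary of visual words (for example, cluster centres of local descriptors) is already available; the function names and the toy 2-D data are hypothetical and chosen only for illustration, not taken from the thesis.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to the descriptor."""
    distances = [math.dist(descriptor, word) for word in vocabulary]
    return distances.index(min(distances))

def bow_histogram(descriptors, vocabulary):
    """Count how often each visual word is the nearest one (hard assignment)."""
    histogram = [0] * len(vocabulary)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, vocabulary)] += 1
    return histogram

# Toy 2-D data (assumption); real local descriptors are much higher dimensional.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
descriptors = [(0.1, 0.2), (0.9, 1.2), (3.8, 0.1), (0.2, -0.1)]

print(bow_histogram(descriptors, vocabulary))  # [2, 1, 1]
```

Each descriptor casts exactly one vote, so the histogram entries sum to the number of descriptors extracted from the image.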

In this thesis, we aim to improve these image representations, since the accuracy and performance of vision models depend directly on the quality of the image representations given to them as input. We started with the traditional bag of visual words, studied the practical issues and drawbacks of that approach, and refined the pipeline one step at a time. In doing so, we devised novel strategies to overcome some of the issues we encountered with the traditional approaches. These strategies involve several parameters that need fine-tuning; we discuss the effect of each parameter in detail, with empirical results to support our hypotheses, and conclude that our representations are better than the traditional approaches currently in use.

To address the information loss caused by hard assignments in the traditional bag of words, we analyzed various soft assignment techniques. Replacing hard assignments with soft assignments improved classification results drastically. Comparing the soft assignment techniques among themselves, we found that absolute soft assignments perform better than relative soft assignments. We demonstrate the superiority of our approaches on various popular datasets.
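
To make the difference between hard and soft assignment concrete, the following sketch spreads each descriptor's vote over all visual words using a Gaussian kernel on the distance. This is a generic kernel-based soft assignment, not the specific absolute or relative variants compared in the thesis; the function names, the `sigma` parameter, and the toy data are assumptions made for illustration.

```python
import math

def soft_weights(descriptor, vocabulary, sigma=1.0):
    """Gaussian-kernel weights of one descriptor over all visual words.

    The weights are normalised to sum to 1. sigma is a hypothetical
    smoothing parameter that would need tuning on real data.
    """
    scores = [
        math.exp(-math.dist(descriptor, word) ** 2 / (2 * sigma ** 2))
        for word in vocabulary
    ]
    # Assumes the descriptor is not so far from every word that all scores underflow to 0.
    total = sum(scores)
    return [score / total for score in scores]

def soft_bow_histogram(descriptors, vocabulary, sigma=1.0):
    """Accumulate fractional votes instead of one whole vote per descriptor."""
    histogram = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        for index, weight in enumerate(soft_weights(descriptor, vocabulary, sigma)):
            histogram[index] += weight
    return histogram

# Toy 2-D data (assumption). The first descriptor lies exactly halfway
# between words 0 and 1, so hard assignment would discard half the picture.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
descriptors = [(0.5, 0.5), (3.9, 0.1)]

histogram = soft_bow_histogram(descriptors, vocabulary)
print([round(value, 3) for value in histogram])
```

A descriptor that is equidistant from two words contributes equally to both, which is the information a hard assignment loses when it picks a single nearest word.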

Recently, the vision community showed that Fisher vector image representations outperform bag of words representations. This boost in performance arises because, first, Fisher vectors use soft assignments and, second, they reduce information loss by capturing the deviation of each visual feature from the mean. However, like any other approach, they have their share of drawbacks: Fisher vector image representations are huge, and they are not inherently discriminative. We therefore introduced sparseness to reduce the effective size of the representations, and added class information to make them discriminative. Sparseness reduces the high storage requirements, while the added class information also improves performance. We tested these additions on various datasets, and the results supported our claims.
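
The two properties mentioned above, soft assignment and deviation from the mean, can be sketched as follows. The sketch assumes a diagonal-covariance Gaussian mixture model with hypothetical toy parameters and computes only the mean-gradient part of a Fisher vector in its commonly used form; it does not include the sparsity or class information introduced in the thesis.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a diagonal-covariance Gaussian at x."""
    density = 1.0
    for xi, mi, si in zip(x, mean, std):
        density *= math.exp(-0.5 * ((xi - mi) / si) ** 2) / (si * math.sqrt(2 * math.pi))
    return density

def posteriors(x, weights, means, stds):
    """Soft assignment of x to each mixture component."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

def fisher_vector_means(descriptors, weights, means, stds):
    """Mean-gradient part of a Fisher vector: one D-dimensional block per component."""
    n = len(descriptors)
    dim = len(means[0])
    blocks = [[0.0] * dim for _ in means]
    for x in descriptors:
        gamma = posteriors(x, weights, means, stds)
        for k, (mean, std) in enumerate(zip(means, stds)):
            for d in range(dim):
                # Posterior-weighted, variance-normalised deviation from the mean.
                blocks[k][d] += gamma[k] * (x[d] - mean[d]) / std[d]
    return [
        value / (n * math.sqrt(weights[k]))
        for k in range(len(means))
        for value in blocks[k]
    ]

# Hypothetical two-component mixture over 2-D descriptors (assumption).
weights = [0.5, 0.5]
means = [(0.0, 0.0), (4.0, 0.0)]
stds = [(1.0, 1.0), (1.0, 1.0)]
descriptors = [(0.5, 0.2), (3.6, -0.1)]

fv = fisher_vector_means(descriptors, weights, means, stds)
print(len(fv))  # K * D = 2 * 2 = 4
```

Even this partial representation has K × D entries, compared with K for a bag of words histogram over the same K components, which illustrates why the size of Fisher vectors becomes a concern.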

Driven by the hypothesis that improving the individual steps of an image representation pipeline will improve the final image representation, we tried various techniques to refine these steps and thereby improve the performance of our model. After improving the final step, which creates image representations from the visual words, we turned to improving the set of visual words (the vocabulary) itself. We found that most of the visual words used for building image representations are not actually useful, and that the visual vocabulary contains many redundant words. To further improve our representations, we devised a novel technique that draws visual words from different types of vocabularies (each capturing a different type of information from a given set of images) and combines the best of them into a final global vocabulary, which is then used to build the final image representations. Again, we used benchmark datasets to demonstrate that our hypothesis is correct.

In this thesis, we have tried to solve the classification task in a better way. Although we have not solved the research tasks perfectly (i.e., reached a perfect score), we hope that our findings will at least provide a starting point for new directions that lead toward the ultimate goal of replicating human vision.


Year of completion: May 2015
Advisor: Prof. C. V. Jawahar


Related Publications

  • Vinay Garg, Siddhartha Chandra and C. V. Jawahar - Sparse Discriminative Fisher Vectors in Visual Classification, Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing, 16-19 Dec. 2012, Bombay, India. [PDF]

  • Vinay Garg, Sreekanth Vempati and C. V. Jawahar - Bag of visual words: A soft clustering based exposition, Proceedings of the 3rd National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, ISBN 978-0-7695-4599-8, pp. 37-40, 15-17 Dec. 2011, Hubli, India. [PDF]