On Compact Deep Neural Networks for Visual Place Recognition, Object Recognition and Visual Localization


Soham Saha


The use of Deep Neural Networks has increased immensely in recent times, driven by the availability of more data and greater computing power. With their recent success, they are increasingly used in real-time applications. However, the size of deep models can make them unusable on memory-constrained devices. In this thesis, we explore several neural network compression techniques for three separate tasks, namely i) visual place recognition, ii) object recognition and iii) visual localization. We explore explicit compression methods for the visual place recognition task and the object recognition task, achieved by making modifications to the learned weight matrices. Furthermore, we look at compression attained through architectural modifications in the network itself, proposing novel training procedures and new loss functions for object recognition and visual localization.
The task of visual place recognition requires us to correctly identify a place given its image, by finding images of the same place in the world (dataset). Performing this on low-memory devices such as mobile phones and robotics systems is a challenging problem. The state-of-the-art models for this task use deep learning architectures with close to 100 million parameters, which take over 400 MB of memory. This makes these models infeasible to deploy on low-memory devices and gives rise to the need to compress them. Hence, we study the effectiveness of explicit model compression techniques, such as trained quantization and pruning, on one of the most effective visual place recognition models. We show that a compressed network can be created by starting with a pre-trained model and then fine-tuning it via trained pruning and quantization. Through this training method, the compressed model is able to produce the same mAP as the original uncompressed network.
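As a rough illustration of the two explicit compression steps named above, the following sketch applies one-shot magnitude pruning and then uniform quantization to a random weight matrix. It is a simplified stand-in under stated assumptions, not the thesis procedure: the fine-tuning loop is omitted, uniform quantization replaces whatever trained quantization scheme the thesis uses, and the function names `magnitude_prune` and `uniform_quantize` as well as the layer shape are hypothetical.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero out (at least) the given fraction of smallest-magnitude weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # k-th smallest magnitude; weights tied with it are pruned as well
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask


def uniform_quantize(values, bits):
    """Snap values to 2**bits evenly spaced levels (assumed scheme)."""
    levels = 2 ** bits - 1
    v_min, v_max = float(values.min()), float(values.max())
    scale = (v_max - v_min) / levels if v_max > v_min else 1.0
    codes = np.round((values - v_min) / scale).astype(np.int32)
    return codes * scale + v_min, codes, scale


# Hypothetical layer: random weights standing in for a trained matrix.
rng = np.random.default_rng(0)
layer = rng.normal(size=(256, 128)).astype(np.float32)

pruned, mask = magnitude_prune(layer, sparsity=0.5)

# Quantize only the surviving weights so pruned entries stay exactly zero.
dequantized, codes, scale = uniform_quantize(pruned[mask], bits=8)
compressed = np.zeros_like(pruned)
compressed[mask] = dequantized

print("kept weights:", int(mask.sum()), "of", layer.size)
print("max quantization error:", float(np.abs(compressed - pruned).max()))
```

In a trained setting, the mask would be held fixed and the surviving weights fine-tuned after each pruning or quantization step; the sketch only shows the storage side, where each kept weight needs an 8-bit code instead of a 32-bit float.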
We achieve almost 50% parameter reduction through pruning with no loss in mAP, and 70% reduction with close to a 2% reduction in mAP, while also performing trained 8-bit quantization. Furthermore, with 5-bit quantization, we achieve about 50% parameter reduction through pruning with only about a 3% reduction in mAP. The resulting compressed networks have sizes of around 30 MB and 65 MB, which makes them easily usable on memory-constrained devices.
We next move on to compression through low-rank approximation for the task of image classification. Traditional compression algorithms for deep networks perform low-rank approximations on the learned weight matrices after training has been completed. We propose to perform low-rank approximation during training itself and to make the parameters of the approximated matrix learnable as well, using a suitable loss function. We show that our method can compress a base model with 89% accuracy by 10x, with some loss in performance: trained with our compression-based procedure, the compressed model achieves an accuracy of about 84%.
Next, we focus on developing compressed models for the object recognition task and propose a novel architecture for it. Deep neural networks for image classification typically consist of a convolutional feature extractor followed by a fully connected classifier network. The predicted and the ground-truth labels are represented as one-hot vectors. Such a representation assumes that all classes are equally dissimilar. However, classes have visual similarities and often form a hierarchy. We propose an alternate architecture for the classifier network, called the Latent Hierarchy Classifier, which can discover a latent hierarchy of the classes while simultaneously reducing the number of parameters used in the original classifier.
We show that, for some of the best-performing architectures on the CIFAR and ImageNet datasets, our proposed alternate classifier and training procedure recover the accuracy of the original classifier. Our proposed method also significantly reduces the parameter complexity of the classifier: we reduce the number of parameters of the classification layer by 98% for CIFAR-100 and 41% for the ImageNet-1K dataset. We also verify that many visually similar classes are grouped together under the learnt hierarchy.
Finally, we address the problem of visual localization, where the task is to predict the camera orientation and pose of the given input scene. We propose an anchor point classification based solution for this task using single camera images only. Our proposed three-way branching of the feature extractor into an Anchor Point Classifier, a Relative Offset Regressor and an Absolute Regressor is able to achieve <2 m translation localization and <5° pose localization on the Cambridge Landmarks dataset, while also obtaining state-of-the-art in median distance localization for orientation for all the 6 scenes. Our method not only uses fewer parameters than previous deep learning based methods but also improves on memory footprint as well as test time over nearest-neighbour based approaches.
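The abstract does not spell out how the outputs of the localization branches are combined, so the sketch below shows only one plausible reading, assumed for illustration: the Anchor Point Classifier picks the most likely anchor, and the Relative Offset Regressor's offset for that anchor is added to the anchor's known position. The Absolute Regressor branch is left out, and the function name `localize`, the anchor coordinates, logits and offsets are all hypothetical values, not data from the thesis.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = np.exp(logits - np.max(logits))
    return shifted / shifted.sum()


def localize(anchor_positions, anchor_logits, relative_offsets):
    """Assumed combination rule: best anchor position + its predicted offset."""
    probs = softmax(anchor_logits)
    best = int(np.argmax(probs))  # index of the most confident anchor
    position = anchor_positions[best] + relative_offsets[best]
    return position, best, probs


# Hypothetical scene with three anchor points on a line (units: metres).
anchor_positions = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])

# Hypothetical network outputs for one query image.
anchor_logits = np.array([0.1, 2.0, 0.3])      # classifier branch
relative_offsets = np.array([[1.0, 1.0],        # offset regressor branch,
                             [-1.5, 0.5],       # one offset per anchor
                             [0.0, 0.0]])

position, best, probs = localize(anchor_positions, anchor_logits, relative_offsets)
print("chosen anchor:", best)          # 1
print("estimated position:", position) # [8.5 0.5]
```

Under this reading, the network only has to regress a small offset around a discrete anchor rather than an absolute position over the whole scene, and no database of reference images needs to be searched at test time.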


Year of completion: April 2019
Advisor: C V Jawahar, Girish Varma

Related Publications