Towards Understanding and Improving the Generalization Performance of Neural Networks

sarath sivaprasad


The widespread popularity of over-parameterized deep neural networks (NNs) is backed by its ‘unreasonable’ performance on unseen data that is Independent, Identically Distributed with respect to the train data. The generalization of NNs cannot be reasoned with the traditional machine learning wisdom that the increase in the number of parameters leads to overfitting on train samples and subsequently reduced generalization. Various generalization measures have been proposed in recent times to explain the generalization property of deep learning networks. Despite some promising investigations, there is little consensus on how we can explain the generalization of NNs. Furthermore, the ability of neural networks to fit any train data for any random configuration of labels makes it more challenging to explain their generalization performance. Despite this ability to completely fit any given data, the neural network seems to be able to ‘cleverly’ learn a generalizing solution. We hypothesize that the ‘simple’ solution lies in a constrained subspace of the hypothesis space. We propose a constraint formulation of neural networks to close the generalization gap. We show that, through a principled constraint, we can achieve comparable train test performance for neural networks. We propose a way to constrain each output of the neural network as the convex combination of its inputs to ensure certain desirable geometry of the decision boundaries. This document covers two major aspects. Firstly, we show the improved generalization of neural networks using convex constraints. The second section goes beyond the IID setting and investigates the generalization of neural networks on the Out Of Distribution test sets.In the first section of the document, we investigate the constrained formulation of neural networks where the output is a convex function of the input. We show that the convexity constraints can be enforced on both fully connected and convolutional layers, making them applicable to most architectures. The convexity constraints include restricting the weights (for all but the first layer) to be non-negative and using a non-decreasing convex activation function. Albeit simple, these constraints have profound implications on the generalization abilities of the network. We draw three valuable insights: (a) Input Output Convex Neural Networks (IOC-NNs) self regularize and significantly reduce the problem of overfitting; (b) Although heavily constrained, they outperform the base multi-layer perceptrons and achieve similar performance as compared to base convolutional architectures and (c) IOCNNs show robustness to noise in train labels. We demonstrate the efficacy of the proposed idea using thorough experiments and ablation studies on six commonly used image classification datasets with three different neural network architectures.In the second section, we revisit the ability of networks to completely fit any given data and yet ‘cleverly’ learn the generalizing hypothesis from the large number of variances that can explain the train data. In accordance with concurrent findings, our explorations show that neural networks learn the most ‘low-lying’ variance in the data. They learn the features that easily correlate with the label and do no further exploration after finding such a solution. With this insight, we reinvent the need to understand the generalization of neural networks and improve them. We go beyond traditional IID and OOD evaluation benchmarks to further our understanding of learning in deep networks. Through our explorations, we give a possible explanation as to why neural networks can do well in certain benchmarks and why other inventive methods fail to give any consistent improvement over a simple neural network. Domain Generalization (DG) requires a model to learn a hypothesis from multiple distributions that generalizes to an unseen distribution. DG has been perceived as a front face of OOD generalization. We present empirical evidence to show that the primary reason for generalization in DG is the presence of multiple domains while training. Furthermore, we show that methods for generalization in IID are equally important for generalization in DG. Tailored methods fail to add performance gains in the Traditional DG (TDG) evaluation. Our experiments prompt if TDG has outlived its usefulness in evaluating OOD generalization? To further strengthen our investigation, we propose a novel evaluation strategy, ClassWise DG (CWDG), where for each class, we randomly select one of the domains and keep it aside for testing. We argue that this benchmarking is closer to human learning and relevant in real-world scenarios. Counter-intuitively, despite being exposed to all domains during training, CWDG is more challenging than TDG evaluation. While explaining the observations, our work makes a case for more fundamental analysis around the DG problem before exploring new ideas to tackle it Keywords – generalization of deep networks, constrained formulation, input-output-convex neural network, robust generalization bounds, explainable decision boundaries, mixture of experts

Year of completion:  May 2022
 Advisor : Vineet Gandhi

Related Publications