September 2021
Volume 21, Issue 9
Open Access
Vision Sciences Society Annual Meeting Abstract | September 2021
Hierarchically Local Tasks and Deep Convolutional Networks
Author Affiliations
  • Arturo Deza
    Massachusetts Institute of Technology
  • Andrzej Banburski
    Massachusetts Institute of Technology
  • Qianli Liao
    Massachusetts Institute of Technology
  • Tomaso Poggio
    Massachusetts Institute of Technology
Journal of Vision September 2021, Vol. 21, 2465.
Arturo Deza, Andrzej Banburski, Qianli Liao, Tomaso Poggio; Hierarchically Local Tasks and Deep Convolutional Networks. Journal of Vision 2021;21(9):2465.

© ARVO (1962-2015); The Authors (2016-present)
The main success stories of deep learning in visual perception tasks, starting with ImageNet, have relied on convolutional neural networks, which on certain tasks perform significantly better than traditional shallow classifiers such as support vector machines. Is there something special about deep convolutional networks that other learning machines do not possess? Recent results in approximation theory have shown an exponential advantage of deep convolutional networks (DCNs) over fully connected networks (FCNs) in approximating functions with hierarchical locality in their compositional structure. These mathematical results, however, do not say which tasks are expected to have input-output functions with hierarchical locality. Here we explore a set of hierarchical and non-hierarchical visual tasks, studying how network performance is affected by disrupting locality through scrambling in the input image space. In particular, our experiments use two networks that differ in their computation: a fully connected network with no locality prior, and a deep convolutional network with a locality prior. These networks performed three tasks: color estimation (which does not require locality), object classification, and scene gist recognition. We verify that fully connected networks, which possess no locality prior, remain stable (albeit weaker) in performance across all three tasks even as the images are scrambled; the exception is the color estimation task, where performance likewise does not change but FCNs achieve performance superior to that of DCNs. However, when the images are fully scrambled, we find that deep convolutional networks can perform worse than fully connected networks across all three tasks, with little variation for a non-hierarchical task.
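The scrambling manipulation used to disrupt locality can be sketched in a few lines. This is a hedged illustration only — the abstract does not specify the procedure, so this assumes non-overlapping square patches of a numpy image array shuffled uniformly at random; it is not the authors' experimental code:

```python
import numpy as np

def scramble_patches(image, patch_size, rng=None):
    """Shuffle non-overlapping square patches of an image.

    Smaller patch_size disrupts locality at finer scales; a
    patch_size equal to the image side leaves the image intact.
    (Illustrative sketch; not the procedure from the paper.)
    """
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    assert h % patch_size == 0 and w % patch_size == 0
    ph, pw = h // patch_size, w // patch_size
    # Cut the image into a list of ph * pw patches, row-major.
    patches = [
        image[i * patch_size:(i + 1) * patch_size,
              j * patch_size:(j + 1) * patch_size]
        for i in range(ph) for j in range(pw)
    ]
    order = rng.permutation(len(patches))
    # Reassemble the grid with the patches in random order.
    rows = [
        np.concatenate([patches[order[i * pw + j]] for j in range(pw)],
                       axis=1)
        for i in range(ph)
    ]
    return np.concatenate(rows, axis=0)
```

Note that scrambling of this kind permutes pixels but preserves the global pixel histogram, which is consistent with a color estimation task being unaffected by it.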
Finally, we show that small departures from the locality bias grow during learning with gradient descent, even on a hierarchical task, suggesting that this bias cannot be learned purely from data without additional constraints.
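The growth of small departures from a locality (weight-sharing) bias can be illustrated with a toy linear model — a speculative sketch, not the networks from the abstract. If all weights start near one shared value (exact sharing plus a tiny perturbation) and the target map is not shift-invariant, plain gradient descent pulls the tied positions apart:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200                       # input dimension, training examples

w_true = rng.normal(size=d)          # target linear map, NOT shift-invariant
X = rng.normal(size=(n, d))
y = X @ w_true

# "Locality bias" initialization: one shared weight value at every
# position, plus a tiny perturbation.
w = np.full(d, w_true.mean()) + 1e-3 * rng.normal(size=d)

lr, spread = 0.01, []
for _ in range(500):
    grad = X.T @ (X @ w - y) / n     # gradient of (half) mean squared error
    w -= lr * grad
    spread.append(w.std())           # deviation from exact weight sharing
```

Here `spread` grows by orders of magnitude over training: the sharing structure is not a fixed point of gradient descent unless it is enforced architecturally, as in a DCN.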

