Abstract
The performance of convolutional neural networks (CNNs) as representational models of visual cortex is thought to be associated with their optimization on ethologically relevant tasks. Here, we show that this view is incorrect and that architectural and statistical factors, rather than task optimization, primarily account for their performance. We show this by developing a novel statistically inspired neural network that yields accurate predictions of cortical image representation without the need for optimization on supervised or self-supervised tasks. Our architecture is characterized by a core module of convolutions and max pooling, which can be stacked in a deep hierarchy. An important characteristic of our model is its use of thousands of random filters to sample the high-dimensional space of natural image statistics. These filters can be mapped to cortical responses through a simple linear-regression procedure, which we validate on held-out test data. This statistical-mapping procedure provides an unbiased approach for exploring the tuning properties of higher-level visual neurons without restricting the space of possible filters to those learned on a specific pre-training dataset and task. Remarkably, we find that the model competes with standard supervised CNNs at predicting image-evoked responses in visual cortex in both monkey electrophysiology and human fMRI data, yet it requires no pre-training, making it orders of magnitude more data-efficient than standard CNNs trained on massive image datasets. Together, our findings reveal a surprisingly parsimonious prescription for the design of high-performance neural network models of cortical representation, and they suggest the intriguing possibility that the computational architecture of the visual cortex could emerge from the replication and elaboration of a core canonical module.
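As a rough illustration of the pipeline summarized above, the following is a minimal sketch, not the authors' implementation: a single core module of random convolution filters and max pooling produces image features without any training, and a ridge regression fit on those features is evaluated on held-out data. The filter counts, image sizes, synthetic stand-in data, and the use of SciPy and scikit-learn are all illustrative assumptions; the full model uses thousands of filters stacked in a deep hierarchy.

```python
import numpy as np
from scipy.signal import fftconvolve
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

def core_module(images, n_filters=64, ksize=7, pool=4):
    """One core module: random convolution filters followed by max pooling.

    images: (n_images, H, W) array. Returns an (n_images, n_features) matrix.
    A small filter count is used here purely to keep the sketch fast.
    """
    filters = rng.standard_normal((n_filters, ksize, ksize))
    features = []
    for img in images:
        per_image = []
        for filt in filters:
            fmap = fftconvolve(img, filt, mode="valid")          # random convolution
            h = fmap.shape[0] - fmap.shape[0] % pool
            w = fmap.shape[1] - fmap.shape[1] % pool
            pooled = (fmap[:h, :w]
                      .reshape(h // pool, pool, w // pool, pool)
                      .max(axis=(1, 3)))                         # max pooling
            per_image.append(pooled.ravel())
        features.append(np.concatenate(per_image))
    return np.asarray(features)

# Stand-in data: in practice these would be natural images and recorded
# cortical responses (e.g., electrode firing rates or fMRI voxel betas).
images = rng.standard_normal((100, 64, 64))
responses = rng.standard_normal((100, 10))

X = core_module(images)                      # model features, no training involved
X_train, X_test = X[:80], X[80:]
y_train, y_test = responses[:80], responses[80:]

# Linear (ridge) regression maps model features to measured responses;
# predictive accuracy is then assessed on held-out images.
mapping = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_train, y_train)
print("held-out R^2:", mapping.score(X_test, y_test))
```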