Abstract
The performance of convolutional neural networks as models of visual cortex relies on pre-training millions of parameters, optimizing them for a specific classification task. This process not only requires massive computational resources, but also yields learned features whose effectiveness is limited by the richness of the training dataset. Furthermore, the time- and resource-intensive nature of this training discourages iterative parameter studies, further reducing the interpretability of high-performing models of visual cortex. Here we propose a theoretically grounded convolutional architecture in which the training process is limited to learning linear combinations of pre-defined wavelet filters. This simplified model is based on an iterative process of expanding and subsequently reducing dimensionality in a deep hierarchy of modules, where each module consists of a filtering operation followed by a non-linearity and channel mixing. We show that this model rivals a conventional pre-trained CNN in explaining stimulus-evoked neural responses to natural scenes in the human visual cortex. Our model generates a useful set of features that can be combined to extract information from a wide range of stimuli, and it reduces the number of learned parameters by orders of magnitude. This model can enable neuroscientists to perform in-silico analyses and controlled rearing experiments on deep learning models more efficiently. Moreover, its simple organization and reduced dependence on training can also give insight into how visual computation occurs in the brain.
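To make the module structure concrete, the following is a minimal, illustrative PyTorch sketch of a single module: a frozen, pre-defined wavelet filter bank applied depthwise (expanding dimensionality), a pointwise non-linearity, and a learned 1x1 convolution that mixes channels (reducing dimensionality). The class name `WaveletModule` and the toy random filter bank are our own illustrative assumptions, not the exact construction used in the paper.

```python
import torch
import torch.nn as nn

class WaveletModule(nn.Module):
    """One module of the hierarchy: fixed wavelet filtering, a pointwise
    non-linearity, then a learned 1x1 convolution for channel mixing.
    Only the channel-mixing weights are trained."""

    def __init__(self, in_channels, out_channels, wavelet_bank):
        super().__init__()
        # wavelet_bank: (n_filters, 1, k, k) tensor of pre-defined filters,
        # applied to every input channel and kept frozen (not trained).
        n_filters, _, k, _ = wavelet_bank.shape
        self.filter = nn.Conv2d(in_channels, in_channels * n_filters,
                                kernel_size=k, padding=k // 2,
                                groups=in_channels, bias=False)
        self.filter.weight.data = wavelet_bank.repeat(in_channels, 1, 1, 1)
        self.filter.weight.requires_grad = False  # filters stay fixed

        self.nonlinearity = nn.ReLU()
        # The only learned parameters: linear combinations of filter outputs.
        self.mix = nn.Conv2d(in_channels * n_filters, out_channels,
                             kernel_size=1, bias=False)

    def forward(self, x):
        return self.mix(self.nonlinearity(self.filter(x)))

# Example: a toy bank of 4 random 5x5 "wavelet" filters (a real Gabor or
# Morlet construction is omitted here for brevity).
bank = torch.randn(4, 1, 5, 5)
module = WaveletModule(in_channels=3, out_channels=16, wavelet_bank=bank)
y = module(torch.randn(1, 3, 32, 32))  # -> shape (1, 16, 32, 32)
```

Stacking such modules yields the deep hierarchy described above; since the filter bank is frozen, the trainable parameter count is limited to the 1x1 mixing layers, which is what reduces learned parameters by orders of magnitude.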