Abstract
Pretrained task-optimized convolutional neural networks are commonly used to predict brain responses to visual stimuli. Yet, they contain biases introduced by their training dataset and task objective (e.g. classification). Recent large-scale visual neuroimaging datasets have paved the way toward training modern convolutional neural networks with the objective of directly predicting brain responses measured with human neuroimaging data, allowing these biases to be overcome. Here, we used the THINGS and the Natural Scenes Dataset – both massive functional MRI datasets acquired during the presentation of object photographs – to identify, from a set of candidate architectures from the machine learning community (ResNet50, VGG-16, CORnet-S, and others), a neural network architecture suitable for predicting responses of individual regions in high-level visual cortex. Careful optimization of these networks yielded voxel-wise encoding models with high correlations, significantly surpassing the state-of-the-art encoding performance of task-optimized models based on the same architectures. Treating these brain-optimized networks as in silico models of ROIs in visual cortex, a sensitivity analysis based on passing millions of images through the network and a GAN-based synthesis of preferred images revealed the expected sensitivity of FFA, PPA, and EBA to faces, places, and body parts, respectively. Our results furthermore revealed novel selectivity, such as close-range pebble patterns in FFA and horizontal and perspectival lines in PPA. Together, these findings demonstrate the feasibility of training common neural network architectures on available massive neuroimaging datasets and provide novel insights into the representations underlying human vision.