Abstract
In both monkey neurophysiology and human fMRI studies, neural responses to a pair of unrelated objects can be well approximated by the average of the responses to each constituent object shown in isolation. This shows that, at higher levels of visual processing, the whole is equal to the average of its parts. Recent convolutional neural networks (CNNs) have achieved human-like object categorization performance, leading some to propose that CNNs are the current best models of the primate visual system. Does the same averaging principle hold in CNNs trained for object classification? Here we re-examined a previous fMRI dataset in which human participants viewed object pairs and their constituent objects shown in isolation. We also examined the activations of five CNNs pre-trained for object categorization to the same images shown to the human participants. The CNNs varied in architecture and included shallower networks (AlexNet and VGG-19), deeper networks (GoogLeNet and ResNet-50), and a recurrent network with a shallower structure designed to capture the recurrent processing in macaque inferotemporal (IT) cortex (CORnet-S). While responses to object pairs could be fully predicted by responses to single objects in the human lateral occipital cortex, this held in neither the lower nor the higher layers of any of the CNNs tested. The whole is thus not equal to the average of its parts in CNNs. This indicates interactions between the individual objects in a pair that are not present in the human brain, potentially rendering these objects less accessible at higher levels of visual processing in CNNs than they are in the human brain. These results reveal an important representational difference between the human brain and CNNs.
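To make the averaging test concrete, the sketch below illustrates one way it could be applied to a CNN: record layer activations to a pair image and to each constituent object alone, then correlate the pair response with the average of the two single-object responses. This is a minimal illustration, not the authors' actual analysis pipeline; it assumes PyTorch with torchvision's pre-trained AlexNet, the hooked layers are chosen arbitrarily for illustration, and random tensors stand in for the real preprocessed stimulus images.

```python
import torch
from torchvision.models import alexnet, AlexNet_Weights

# Pre-trained AlexNet in evaluation mode (no gradient updates needed).
model = alexnet(weights=AlexNet_Weights.IMAGENET1K_V1).eval()

# Capture activations from one lower and one higher layer via forward hooks
# (the specific layers chosen here are illustrative, not the paper's choice).
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().flatten()
    return hook

model.features[2].register_forward_hook(make_hook("lower"))
model.classifier[5].register_forward_hook(make_hook("higher"))

def layer_responses(img):
    """Run one image through the network; return a copy of hooked activations."""
    with torch.no_grad():
        model(img)
    return {name: act.clone() for name, act in activations.items()}

# Random stand-ins for the preprocessed stimuli: an object pair and its two
# constituent objects shown in isolation (shape: batch x channels x H x W).
pair_img, obj1_img, obj2_img = (torch.rand(1, 3, 224, 224) for _ in range(3))

pair = layer_responses(pair_img)
single1 = layer_responses(obj1_img)
single2 = layer_responses(obj2_img)

for layer in ("lower", "higher"):
    # The averaging principle predicts the pair response from the mean of
    # the two single-object responses.
    predicted = (single1[layer] + single2[layer]) / 2
    r = torch.corrcoef(torch.stack([pair[layer], predicted]))[0, 1]
    print(f"{layer}: r(pair vs. averaged singles) = {r.item():.3f}")
```

Under the averaging principle reported for the human lateral occipital cortex, the correlation between the pair response and the averaged single-object responses would approach the reliability ceiling of the measurements; the abstract's claim is that CNN layers fall short of this prediction.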