Abstract
Background. Deep convolutional neural networks (DCNNs) trained to classify objects can perform at human levels and are predictive of brain responses in both human and non-human primates. However, some studies suggest that DCNN models are less sensitive to global configural relationships than humans, relying instead on ‘bags’ of local features (Brendel & Bethge, 2019). Here we employ a novel method to compare human and DCNN reliance on configural features for object recognition.

Methods. We constructed a dataset of 640 ImageNet images from 8 object classes (80 images per class). We partitioned each image into square blocks to create four levels of configural disruption: 1) No disruption - intact images; 2) Occlusion - alternate blocks painted mid-gray; 3) Scrambled - blocks randomly permuted; 4) Woven - alternate blocks replaced with random blocks from a distractor image of a different category. We then assessed human and VGG-16 object recognition performance at each level of disruption for 4×4, 8×8, 16×16, and 32×32 block partitions.

Results. While block scrambling lowered both human and network performance, humans were far less affected by occlusion than was the network. Moreover, whereas humans performed as well as or better in the occlusion condition than in the scrambled condition, the network consistently performed better in the scrambled condition than in the occlusion condition. In the woven condition, neither humans nor the network could reliably discriminate the coherent from the scrambled image, but fine-tuning the network on woven images to report the class of the coherent image led to human levels of performance on the occlusion task.

Implications. Both humans and the network were found to rely to some degree on configural processing. While humans may handle occlusion better than standard ImageNet-trained networks, training on woven imagery yields human-like robustness to occlusion.
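
As a concrete illustration of the stimulus construction described in Methods, the sketch below generates the occluded, scrambled, and woven conditions from an image partitioned into a grid of blocks. It is not the authors' code and rests on assumptions the abstract leaves unstated: that "n×n" refers to the number of blocks per side, that "alternate blocks" follow a checkerboard pattern, that mid-gray is the pixel value 128, and that images are square arrays whose side length is divisible by n. All function names are illustrative.

```python
# Minimal block-disruption sketch (assumptions noted above; not the study's code).
import numpy as np


def to_blocks(img, n):
    """Split an (H, W, C) image into an n x n grid of blocks -> (n, n, bh, bw, C)."""
    h, w, c = img.shape
    bh, bw = h // n, w // n
    return img.reshape(n, bh, n, bw, c).swapaxes(1, 2)


def from_blocks(blocks):
    """Reassemble an (n, n, bh, bw, C) block grid into an (H, W, C) image."""
    n, m, bh, bw, c = blocks.shape
    return blocks.swapaxes(1, 2).reshape(n * bh, m * bw, c)


def occlude(img, n, gray=128):
    """Occlusion condition: paint alternate (checkerboard) blocks mid-gray."""
    blocks = to_blocks(img, n).copy()
    rows, cols = np.indices((n, n))
    blocks[(rows + cols) % 2 == 1] = gray
    return from_blocks(blocks)


def scramble(img, n, rng):
    """Scrambled condition: randomly permute the positions of all blocks."""
    blocks = to_blocks(img, n)
    flat = blocks.reshape(n * n, *blocks.shape[2:])
    flat = flat[rng.permutation(n * n)]
    return from_blocks(flat.reshape(blocks.shape))


def weave(target, distractor, n, rng):
    """Woven condition: replace alternate blocks of the target with random
    blocks drawn from a distractor image of a different category."""
    t = to_blocks(target, n).copy()
    d = to_blocks(distractor, n).reshape(n * n, *t.shape[2:])
    for i in range(n):
        for j in range(n):
            if (i + j) % 2 == 1:
                t[i, j] = d[rng.integers(n * n)]
    return from_blocks(t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)         # stand-in target image
    distractor = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)  # stand-in distractor
    for n in (4, 8, 16, 32):  # the four partition granularities assessed above
        for out in (occlude(img, n), scramble(img, n, rng), weave(img, distractor, n, rng)):
            assert out.shape == img.shape
```

Under these assumptions, applying the functions at n = 4, 8, 16, and 32 corresponds to the four block-partition levels at which human and VGG-16 performance were assessed; the resulting images could then be fed to any ImageNet classifier for evaluation.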