September 2021
Volume 21, Issue 9
Open Access
Vision Sciences Society Annual Meeting Abstract
Configural processing in humans and deep convolutional neural networks
Author Affiliations & Notes
  • Shaiyan Keshvari
    York University
  • Xingye Fan
    York University
  • James H. Elder
    York University
  • Acknowledgements: Natural Sciences and Engineering Research Council (NSERC) of Canada; York University Vision: Science to Applications (VISTA) program
Journal of Vision September 2021, Vol.21, 2887. doi:
Shaiyan Keshvari, Xingye Fan, James H. Elder; Configural processing in humans and deep convolutional neural networks. Journal of Vision 2021;21(9):2887.

© ARVO (1962-2015); The Authors (2016-present)


Background. Deep convolutional neural networks (DCNNs) trained to classify objects can perform at human levels and are predictive of brain responses in both human and non-human primates. However, some studies suggest that DCNN models are less sensitive to global configural relationships than humans, relying instead on 'bags' of local features (Brendel & Bethge, 2019). Here we employ a novel method to compare human and DCNN reliance on configural features for object recognition.

Methods. We constructed a dataset of 640 ImageNet images from 8 object classes (80 images per class). We partitioned each image into square blocks to create four levels of configural disruption: 1) No disruption: intact images; 2) Occlusion: alternate blocks painted mid-gray; 3) Scrambled: blocks randomly permuted; 4) Woven: alternate blocks replaced with random blocks from a distractor image of a different category. We then assessed human and VGG-16 object recognition performance at each level of disruption for 4x4, 8x8, 16x16, and 32x32 block partitions.

Results. While block scrambling lowered both human and network performance, humans were far less affected by occlusion than the network model. Moreover, while humans performed as well as or better in the occlusion condition than in the scrambled condition, the network consistently performed better in the scrambled condition than in the occlusion condition. In the woven condition, neither humans nor the network could reliably discriminate the coherent from the scrambled images, but we found that fine-tuning the network to report the class of the coherent image led to human levels of performance on the occlusion task.

Implications. Both humans and the network were found to rely to some degree on configural processing. While humans may handle occlusion better than standard ImageNet-trained networks, training on woven imagery leads to human-like robustness to occlusion.
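The four disruption conditions in the Methods can be sketched in NumPy. This is a minimal illustration, not the authors' code: the checkerboard interpretation of "alternate blocks", the mid-gray value of 128, and the function names `to_blocks`/`disrupt` are our assumptions.

```python
import numpy as np

def to_blocks(img, n):
    """Split an image (H, W, C) into an n x n grid of equal blocks."""
    h, w = img.shape[0] // n, img.shape[1] // n
    return [[img[i*h:(i+1)*h, j*w:(j+1)*w].copy() for j in range(n)]
            for i in range(n)]

def from_blocks(blocks):
    """Reassemble an n x n grid of blocks into a single image."""
    return np.concatenate([np.concatenate(row, axis=1) for row in blocks],
                          axis=0)

def disrupt(img, n, mode, distractor=None, seed=0):
    """Apply one of the four configural-disruption conditions."""
    rng = np.random.default_rng(seed)
    blocks = to_blocks(img, n)
    if mode == "occluded":
        for i in range(n):              # paint alternate (checkerboard)
            for j in range(n):          # blocks mid-gray
                if (i + j) % 2:
                    blocks[i][j][:] = 128
    elif mode == "scrambled":
        flat = [b for row in blocks for b in row]
        order = rng.permutation(len(flat))  # random permutation of positions
        blocks = [[flat[order[i*n + j]] for j in range(n)] for i in range(n)]
    elif mode == "woven":
        dblocks = [b for row in to_blocks(distractor, n) for b in row]
        for i in range(n):              # alternate blocks drawn at random
            for j in range(n):          # from a distractor image
                if (i + j) % 2:
                    blocks[i][j] = dblocks[rng.integers(len(dblocks))]
    return from_blocks(blocks)          # mode "intact" falls through unchanged
```

For the experiments described above, this would be applied to each of the 640 images at each of the four partition sizes (n = 4, 8, 16, 32), with the distractor drawn from a different object category in the woven condition.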
