Abstract
Human observers can readily perceive and recognize visual objects even when occluding stimuli obscure much of the object from view. By contrast, state-of-the-art convolutional neural network (CNN) models of primate vision perform poorly at classifying occluded objects (Coggan and Tong, VSS, 2022). A key difference between biological and artificial visual systems is how they learn from visual examples. CNNs are typically trained with supervised methods to classify images by object category based on labelled data. Humans, by contrast, learn about objects with a broader range of learning objectives and fewer opportunities for supervised feedback. Here, we asked whether a more naturalistic approach to training CNNs might yield more occlusion-robust models that better predict human neural and behavioural responses to occluded objects. To address this question, we trained an array of CORnet-S model instances with either supervised classification or unsupervised contrastive learning. We also augmented the standard ImageNet dataset by superimposing artificial occluders onto the images. The contrastive learning objective was to produce similar unit activations in the highest layer for differently occluded instances of the same underlying object image. Once training was complete, each model was tested for occlusion robustness, and its responses were compared with human behavioural and neural responses to occluded objects. For the supervised models, we found that training on the occluded dataset led to substantial improvements in classification accuracy for novel occluded objects, relative to training on the standard dataset. Despite these improvements, the occlusion-trained models performed worse at predicting both human behavioural and neural responses to occluded objects, suggesting that these supervised models learned a different type of occlusion-robustness mechanism than humans do. By contrast, the layer-wise activity patterns of the unsupervised, contrastively trained models exhibited stronger occlusion robustness and greater human-likeness than those of any other model, suggesting that human robustness to occlusion may be attributable in part to a natural, unsupervised visual learning environment.
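The abstract does not report implementation details, but the contrastive objective it describes (similar highest-layer activations for differently occluded views of the same image) resembles standard view-invariance losses. The sketch below illustrates one way such training could be set up in PyTorch; the occluder function `apply_random_occluder`, the NT-Xent loss formulation, the temperature value, and the assumption that `model` returns highest-layer activations are illustrative assumptions, not details taken from the study.

```python
import torch
import torch.nn.functional as F

def apply_random_occluder(images, coverage=0.5):
    """Hypothetical occluder augmentation: zero out a random rectangular
    region covering roughly `coverage` of each image. The actual occluder
    shapes, textures, and coverage levels used in the study may differ."""
    b, c, h, w = images.shape
    occluded = images.clone()
    oh, ow = int(h * coverage ** 0.5), int(w * coverage ** 0.5)
    for i in range(b):
        top = torch.randint(0, h - oh + 1, (1,)).item()
        left = torch.randint(0, w - ow + 1, (1,)).item()
        occluded[i, :, top:top + oh, left:left + ow] = 0.0
    return occluded

def contrastive_loss(z1, z2, temperature=0.1):
    """NT-Xent-style loss (an assumed formulation): pulls together the
    highest-layer activations z1, z2 ([batch, dim]) of two differently
    occluded views of the same images, while pushing apart views of
    different images in the batch."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                    # [2B, dim]
    sim = z @ z.t() / temperature                     # pairwise cosine similarities
    n = z1.size(0)
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(self_mask, float('-inf'))        # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n),      # positive of view 1 is view 2
                         torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def training_step(model, images, optimizer):
    """One unsupervised step: `model` stands in for a CORnet-S instance whose
    forward pass is assumed to return its highest-layer activations
    (e.g., exposed via a feature hook)."""
    v1 = apply_random_occluder(images)
    v2 = apply_random_occluder(images)
    loss = contrastive_loss(model(v1), model(v2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under these assumptions, the supervised counterpart would replace `contrastive_loss` with a cross-entropy loss on ImageNet category labels while keeping the same occluder augmentation, so that any difference in occlusion robustness or human-likeness can be attributed to the learning objective rather than to the training images.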