Abstract
Human observers can readily perceive and recognize visual objects, even when occluding stimuli obscure much of the object from view. By contrast, state-of-the-art convolutional neural networks (CNNs) perform poorly at classifying occluded objects. In previous work, we evaluated 30 humans and various CNNs on an occluded object benchmark containing nine occluder types (e.g., mud-splashes, bars, polka dots) and six visibility levels. We showed that augmenting CNN training datasets with artificial occluders led to higher classification accuracy, but a less human-like pattern of accuracy across the different occluder types. In the present study, we explored whether a human-like form of occlusion robustness could be acquired through more naturalistic modifications to the learning environment that better reflect human visual experience. First, we trained three CNNs with identical architectures to classify images from differently augmented versions of ImageNet: unaltered (baseline), occluded by uniformly coloured, computer-generated shapes (artificial), and occluded by other objects extracted from photographs (natural). After training each model, we measured classification accuracy and human-likeness using the occluded object benchmark. Human-likeness was measured both through image-wise error consistency and by correlating the profile of accuracies across conditions. Both types of occlusion training increased accuracy compared to baseline. However, natural occlusion training increased human-likeness on both measures, while artificial occlusion training showed mixed results (higher image-wise, lower condition-wise). Improvements in both accuracy and human-likeness from natural occlusion training were partially reduced when the natural occluder texture was replaced by uniform colour during training, suggesting that both the shape and texture of natural occluders play a role in human occlusion robustness. Finally, for each dataset augmentation, substituting supervised classification training with a more naturalistic, self-supervised task (contrastive learning) led to equal-or-better human-likeness. Taken together, these results indicate that occlusion-robust object recognition in humans emerges in part from unsupervised engagement with the specific forms of occlusion that occur in nature.
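To make the two human-likeness measures concrete, the following is a minimal sketch rather than the analysis code used in the study. It assumes that image-wise error consistency takes the common Cohen's kappa form (agreement on correct/incorrect responses beyond what two independent observers with the same accuracies would show) and that condition-wise similarity is a Pearson correlation between per-condition accuracies. The abstract does not specify either formula, and the function names `error_consistency` and `profile_correlation` are hypothetical.

```python
from math import sqrt


def error_consistency(correct_a, correct_b):
    """Cohen's kappa on trial-wise correctness (assumed form of the measure).

    Each argument is a sequence of booleans (or 0/1) marking, for the same
    images in the same order, whether an observer classified the image
    correctly.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_a)
    # Observed agreement: both correct or both incorrect on the same image.
    observed = sum(bool(x) == bool(y) for x, y in zip(correct_a, correct_b)) / n
    p_a = sum(bool(x) for x in correct_a) / n
    p_b = sum(bool(y) for y in correct_b) / n
    # Agreement expected by chance from two independent observers
    # with accuracies p_a and p_b.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


def profile_correlation(acc_a, acc_b):
    """Pearson correlation between two per-condition accuracy profiles."""
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(acc_a)
    mean_a = sum(acc_a) / n
    mean_b = sum(acc_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(acc_a, acc_b))
    var_a = sum((x - mean_a) ** 2 for x in acc_a)
    var_b = sum((y - mean_b) ** 2 for y in acc_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_a * var_b)
```

Under these assumptions, a model compared against a human observer would contribute one correctness vector per observer to `error_consistency` and one accuracy value per benchmark condition (occluder type by visibility level) to `profile_correlation`. A kappa of 0 means the model errs on the same images as the human no more often than expected from their accuracies alone.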