Abstract
While deep neural networks are state-of-the-art models of many parts of the human visual system, here we show that they fail to process global information in a humanlike manner. First, using visual crowding as a probe into global visual information processing, we found that regardless of architecture, feedforward deep networks successfully model an elementary version of crowding but cannot exhibit its global counterpart (“uncrowding”). It is not yet well understood whether this limitation could be ameliorated by substantially larger and more naturalistic training regimes, or by attentional mechanisms. To investigate this, we studied models trained with the CLIP (Contrastive Language-Image Pretraining) procedure, which yields attention-based models intended for zero-shot image classification. CLIP models are trained by self-supervised pairing of text captions with images on a composite dataset of approximately 400 million image–text pairs. As a result of this training, CLIP models have been shown to exhibit highly abstract representations, to achieve state-of-the-art zero-shot classification performance, and to make classification errors that align more closely with human errors than those of previous models. Despite these advances, we show, by fitting logistic regression models to the activations of individual layers in CLIP models, that neither the training procedure, architectural differences, nor the size of the training dataset ameliorates feedforward networks’ inability to reproduce humanlike global information processing in an uncrowding task. This highlights an important aspect of visual information processing: feedforward computations alone cannot explain how humans combine visual information globally.
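The linear-probe methodology mentioned above (fitting a logistic regression classifier to a layer's activations to read out task information) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the "activations" are synthetic random features with a weak class-dependent signal standing in for real CLIP layer activations, and the two classes are purely hypothetical stimulus conditions.

```python
# Minimal sketch of a logistic-regression probe on layer activations.
# The activations are synthetic stand-ins (random features plus a weak
# class-dependent signal), not real CLIP features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_per_class, n_features = 200, 64
# Hypothetical class-dependent direction in activation space.
signal = rng.normal(size=n_features)
X0 = rng.normal(size=(n_per_class, n_features)) + 0.5 * signal
X1 = rng.normal(size=(n_per_class, n_features)) - 0.5 * signal
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Fit the probe on held-out splits and report decoding accuracy.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
print(f"probe decoding accuracy: {accuracy:.2f}")
```

In the actual study, `X` would be the activations of one CLIP layer in response to crowding/uncrowding stimuli, and the probe's held-out accuracy would measure how well that layer's representation supports the task.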