Abstract
A critical disconnect between human and machine vision is that people have a remarkable ability to represent and understand the shapes of objects, whereas deep convolutional neural networks (CNNs) operate more as local texture analyzers with no propensity to perceive global form. To isolate and study the representation of global form, we developed a novel stimulus manipulation that enables us to probe global form similarity in a way that is completely dissociated from local similarity. Specifically, our method renders natural images as compositions of Gabor elements at different orientations and scales. From this basis, we generate shape metamers: image pairs that differ at each local Gabor but share the same global form information and are nearly perceptually indistinguishable to humans. We also generate anti-metamers: image pairs that differ locally by the same amount but disrupt global form information and look nothing alike to humans. Leveraging these stimuli as a litmus test for global form perception, we find that CNNs trained for object categorization show little sensitivity to global form information: their feature spaces encode the shape metamers and anti-metamers as equally similar. In contrast, both vision transformer models with self-attention layers and our simpler custom models with hand-designed “association fields” can learn longer-range relationships between local features and show increased sensitivity to global form. These findings highlight that CNNs lack sufficient inductive biases to learn global form information, and that self-attention and association-field mechanisms may serve as key precursor operations that amplify the relevant local features. We propose that this multiplicative operation is critical for enabling downstream mechanisms to encode relationships among the local features that primarily define shape, en route to a more explicit representation of global shape.
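To make the stimulus logic concrete, the sketch below illustrates one plausible construction, not the paper's actual pipeline: a toy "image" is rendered as a field of Gabor elements whose orientations trace a circular contour; a hypothetical metamer pair flips each element's phase (a large local change that leaves the orientation structure, and hence the global form, intact), while a hypothetical anti-metamer pair jitters each element's orientation, a comparable local change that destroys the contour. All names and parameter choices (element size, wavelength, the phase-flip and orientation-jitter manipulations) are illustrative assumptions rather than the authors' method.

import numpy as np

def gabor_patch(size, wavelength, theta, phase, sigma):
    # A single Gabor element: cosine carrier under a Gaussian envelope.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_theta / wavelength + phase)
    return envelope * carrier

def render(canvas_size, elements):
    # Sum Gabor elements (dicts of position plus Gabor parameters) onto a blank canvas.
    img = np.zeros((canvas_size, canvas_size))
    for el in elements:
        half = el["size"] // 2
        patch = gabor_patch(el["size"], el["wavelength"], el["theta"], el["phase"], el["sigma"])
        img[el["row"] - half:el["row"] + half + 1,
            el["col"] - half:el["col"] + half + 1] += patch
    return img

rng = np.random.default_rng(0)

# Toy "global form": 24 Gabor elements placed on a circle, with each element's
# bars tangent to the circle, so local orientations trace the global contour.
base = []
for k in range(24):
    angle = 2.0 * np.pi * k / 24
    base.append({
        "row": int(round(64 + 40 * np.sin(angle))),
        "col": int(round(64 + 40 * np.cos(angle))),
        "size": 15, "wavelength": 6.0, "sigma": 3.0,
        "theta": angle, "phase": 0.0,
    })

# Hypothetical "metamer": flip every element's phase, a large change at each
# local Gabor that leaves the orientation structure (global form) intact.
metamer = [dict(el, phase=el["phase"] + np.pi) for el in base]

# Hypothetical "anti-metamer": randomize every element's orientation, a
# comparable local change that destroys the global contour.
anti_metamer = [dict(el, theta=rng.uniform(0.0, np.pi)) for el in base]

images = {name: render(128, els)
          for name, els in [("base", base), ("metamer", metamer), ("anti-metamer", anti_metamer)]}
print({name: img.shape for name, img in images.items()})

The two manipulations here are meant only to mirror the abstract's contrast between local changes that spare versus disrupt global form; how the paper actually equates the amount of local difference across metamer and anti-metamer pairs is not specified in this sketch.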