Abstract
Background: While humans are highly sensitive to global shape information, deep neural network models (DNNs) trained on ImageNet seem to favor local shape features. In the Fourier descriptor (shape frequency) domain, this manifests as much higher human sensitivity to low shape frequencies. Here we ask how this differential sensitivity depends on the amplitude versus phase structure of these Fourier shape components. Methods: Human observers (n = 68) classified animal silhouettes into nine categories. The shapes were lowpass filtered in the shape frequency domain, over a range of frequency cutoffs, using two filtering methods. In method 1, Fourier components beyond the cutoff were zeroed. In method 2, phases were randomized but amplitudes were preserved. We compared human performance against three representative networks: a convolutional model (ResNet-50) and two transformer models (ViT and SWIN). Results: While switching from filtering method 1 to method 2 produced a slight decline in human performance, it led to a significant improvement for the networks. What could explain this improvement? One possibility is that the networks were simply confused by the overly smooth shapes produced by method 1. To assess this possibility, we retested the networks using a third filtering method in which phases were randomized and amplitudes were set to normative, uninformative values. While network performance with these more realistic shape stimuli improved relative to method 1, for the two transformer models (ViT and SWIN) it remained below the levels seen with method 2, indicating that these networks, unlike humans, can make effective use of the amplitude structure of low-frequency shape components even when phases are randomized. Conclusions: While humans use low-frequency shape information more effectively than DNNs, they depend critically on the phase structure of these low-frequency shape components. 
In contrast, transformer networks exploit the texture-like amplitude structure of these components even when phase is randomized.
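The three filtering manipulations described in the Methods can be illustrated with a minimal sketch (not the authors' actual pipeline), assuming each silhouette is a closed contour represented by complex Fourier descriptors z = x + iy. The reading that methods 2 and 3 operate on the retained low-frequency components after lowpass filtering, and all function names and parameters (e.g. `norm_amp`), are assumptions for illustration only.

```python
# Hedged sketch of the three Fourier-descriptor filtering methods.
# Assumes an (N, 2) closed contour; cutoff is in harmonics per contour.
import numpy as np

def fourier_descriptors(contour_xy):
    """Complex Fourier descriptors of an N-point closed contour."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]
    return np.fft.fft(z)

def filter_descriptors(F, cutoff, method, rng=None, norm_amp=1.0):
    """Lowpass a descriptor spectrum with one of three manipulations.

    method 1: zero every component beyond the cutoff frequency
    method 2: also randomize phases of the retained components,
              preserving their amplitudes
    method 3: also randomize phases AND set retained amplitudes to a
              fixed "normative" value (norm_amp, an assumed stand-in)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(F)
    k = np.abs(np.fft.fftfreq(n, d=1.0 / n))  # |harmonic number| per bin
    out = F.copy()
    out[k > cutoff] = 0.0                     # lowpass in all three methods
    low = (k <= cutoff) & (k > 0)             # retained components (skip DC)
    if method >= 2:
        phases = rng.uniform(0.0, 2.0 * np.pi, low.sum())
        amps = np.abs(F[low]) if method == 2 else norm_amp
        out[low] = amps * np.exp(1j * phases)
    return out

def reconstruct(F):
    """Back to an (N, 2) contour via the inverse FFT."""
    z = np.fft.ifft(F)
    return np.column_stack([z.real, z.imag])
```

Under this interpretation, method 2 stimuli retain the low-frequency amplitude spectrum (texture-like energy per shape frequency) while destroying its phase alignment, which is exactly the information the transformer models appear to exploit.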