Abstract
BACKGROUND. The statistics of coarse and fine shape features can be analyzed using a Fourier descriptor projection. Natural shapes are known to be lowpass: coarse shape features (low shape frequencies) typically have higher amplitudes than fine shape features (high shape frequencies). Prior work suggests that human shape sensitivity is even more biased toward low shape frequencies than is optimal for natural lowpass shapes; however, this was demonstrated only for a simple binary shape discrimination task within a linear classification framework. Deep networks are reported to be more sensitive to local shape features, suggesting a high-frequency bias, but these demonstrations have primarily used simple artificial stimuli. Here we employ a novel Fourier method to assess the processing of coarse and fine shape features of natural shapes by humans and deep networks in a more realistic object classification task. METHOD. Human observers (n = 11) classified frequency-filtered animal silhouettes into one of nine animal categories. To assess sensitivity to shape frequencies, the stimuli were high-pass filtered to progressively remove the lowest shape frequencies, with cutoffs ranging from the 2nd to the 8th harmonic. Two representative deep networks were also evaluated on the same stimuli: a convolutional network (ResNet-50) and a transformer network (ViT). RESULTS. Both human and deep network performance declined rapidly as low shape frequencies were progressively eliminated. Trial-by-trial analysis revealed that ViT is more predictive of human responses than ResNet-50. Interestingly, the proportion of explainable human variance accounted for by ViT increased from 29% to 57% as more of the low frequencies were eliminated, suggesting that while this transformer model captures some aspects of human selectivity for higher shape frequencies, it struggles to account for human processing of the lower shape frequencies that largely determine human shape judgements.
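As an illustration of the high-pass manipulation described in METHOD, the following is a minimal sketch of filtering a closed contour in a standard complex Fourier descriptor representation. It is not the authors' implementation: the function name `highpass_contour`, the uniform sampling of the contour, the choice to keep the zero-frequency (position) term, and the reading of the cutoff as the lowest harmonic retained are all assumptions made for the example.

```python
import numpy as np

def highpass_contour(points, cutoff):
    """High-pass filter a closed contour via its Fourier descriptors.

    points : (N, 2) array of x, y samples along a closed contour
             (assumed uniformly sampled; hypothetical input format).
    cutoff : lowest harmonic retained; harmonics 1..cutoff-1 are removed.
             This reading of "cutoff" is an assumption.
    """
    n = len(points)
    # Represent the contour as a complex sequence z = x + iy.
    z = points[:, 0] + 1j * points[:, 1]
    coeffs = np.fft.fft(z)

    # Absolute harmonic number of each FFT bin (bins k and n-k share |k|).
    k = np.arange(n)
    harmonic = np.minimum(k, n - k)

    # Keep the zero-frequency term (contour position) and every harmonic
    # at or above the cutoff; zero out the low shape frequencies.
    keep = (harmonic == 0) | (harmonic >= cutoff)
    filtered = np.fft.ifft(np.where(keep, coeffs, 0))
    return np.column_stack([filtered.real, filtered.imag])
```

Under these assumptions, calling `highpass_contour(points, c)` for `c` from 2 to 8 would produce a stimulus series of the kind described, with progressively more of the coarse shape structure removed.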