Abstract
We compare how well human observers and neural networks recognize ImageNet images in filtered noise, using the critical-band masking paradigm (Fletcher, 1940). We assess the spatial-frequency tuning of object recognition by measuring its noise tolerance at various frequencies. Sixteen human observers and four neural networks performed 16-way categorization of 1050 ImageNet images perturbed by band-limited Gaussian noise of five strengths centered at seven spatial frequencies. We find that the noise sensitivity of object recognition is an inverted-U-shaped function of spatial frequency. Human performance is severely impaired by noise within an octave-wide band. Such octave-wide selectivity has appeared frequently in the vision literature, where it is called a "channel" (Solomon & Pelli, 1994). Here, this channel is revealed in heatmaps of human and network performance, and a demo presents images perturbed with noise at various frequencies and strengths. Together with previous work (Majaj et al., 2002), this shows that the bandwidth of the human channel is conserved across diverse target kinds: real-world objects, gratings, and letters. We find the classic octave-wide frequency band for humans and a two-octave-wide band for machines, both centered at ~28 cycles per image. At the peak, humans tolerate four times as much noise variance as networks do. When recognizing objects, networks are known to rely on texture cues, whereas humans rely on shape (Geirhos et al., 2018); data augmentation helps bridge this gap (Hermann et al., 2020; Geirhos et al., 2021; Muttenthaler et al., 2022). Our results suggest that the popular notion of a texture-vs-shape bias may simply reflect the width of the spatial-frequency channel.
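To make the stimulus manipulation concrete, here is a minimal sketch of band-limited Gaussian noise perturbation, the core operation described above. This is not the authors' code: the function name band_limited_noise, the parameters center_freq (cycles per image) and variance, the hard-edged one-octave band, and the example image size are all illustrative assumptions; the paper's actual filter shape and noise strengths may differ.

```python
# Minimal sketch (assumptions noted above, not the authors' implementation):
# white Gaussian noise is filtered in the Fourier domain to a one-octave
# band centered at center_freq cycles/image, rescaled to a target variance,
# and added to a grayscale image with values in [0, 1].
import numpy as np

def band_limited_noise(size, center_freq, variance, seed=None):
    """Gaussian noise restricted to a one-octave spatial-frequency band."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.fft2(rng.standard_normal((size, size)))
    # Radial frequency (cycles per image) of each FFT coefficient.
    f = np.fft.fftfreq(size) * size
    radius = np.hypot(*np.meshgrid(f, f, indexing="ij"))
    # One octave centered at center_freq: [center/sqrt(2), center*sqrt(2)].
    band = (radius >= center_freq / np.sqrt(2)) & (radius <= center_freq * np.sqrt(2))
    noise = np.real(np.fft.ifft2(spectrum * band))
    # Rescale the filtered noise to the requested pixel variance.
    return noise * np.sqrt(variance) / noise.std()

# Usage: perturb an image at the ~28 cycles/image channel peak.
image = np.zeros((224, 224))  # placeholder for a grayscale ImageNet image
noisy = np.clip(image + band_limited_noise(224, center_freq=28, variance=0.04), 0, 1)
```

Sweeping center_freq across the seven bands and variance across the five strengths, then measuring categorization accuracy at each combination, yields the performance heatmaps from which the channel's center and bandwidth are read off.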