Abstract
In vision science, recent efforts have sought to compare the performance of artificial neural network models against human vision. A common methodology uses benchmarks that deliberately perturb images, measuring noise sensitivity to reveal which features matter for object recognition. Recent studies employing critical frequency band masking have shown that neural networks exploit a wider and less stable frequency channel than the one-octave channel of human vision. In this work, we extend this inquiry to a diverse set of modern computer vision models and find that many recently developed models surpass human performance in the presence of frequency noise. This advantage is not attributable solely to conventional techniques such as input image data augmentation; it also stems from exploiting semantic information in large-scale datasets, coupled with rigorous model scaling. Viewing the semantic information gained from multimodal training as a form of output augmentation, we posit that augmenting both input images and labels can push artificial neural networks beyond human performance on current benchmarks. These advantages suggest that such models can serve as complementary agents for humans, particularly under challenging conditions. Despite this progress, we must recognize a limitation of computer vision benchmarks: they do not comprehensively quantify human vision. We therefore emphasize the need for vision science-inspired datasets that measure the alignment between models and human vision.