Journal of Vision, September 2024, Volume 24, Issue 10
Open Access
Vision Sciences Society Annual Meeting Abstract
When Machines Outshine Humans in Object Recognition, Benchmarking Dilemma
Author Affiliations & Notes
  • Mohammad Javad Darvishi Bayazi
    University of Montreal
    Mila - Québec AI Institute
    Faubert Lab
  • Md Rifat Arefin
    University of Montreal
    Mila - Québec AI Institute
  • Jocelyn Faubert
    University of Montreal
    Faubert Lab
  • Irina Rish
    University of Montreal
    Mila - Québec AI Institute
  • Footnotes
    Acknowledgements  This work was funded by the Canada CIFAR AI Chair Program, the Canada Excellence Research Chairs (CERC) program, and NSERC Discovery Grant RGPIN-2022-05122. We thank the Digital Research Alliance of Canada and Mila for providing computational resources.
Journal of Vision September 2024, Vol.24, 1523. doi:https://doi.org/10.1167/jov.24.10.1523
© ARVO (1962-2015); The Authors (2016-present)
Abstract

In vision science, recent work has sought to benchmark the performance of artificial neural network models against human vision. A common methodology perturbs images in controlled ways and measures noise sensitivity to reveal which features matter for object recognition. Recent studies employing critical frequency band masking have argued that neural networks exploit a wider and less stable frequency channel than the one-octave band characteristic of human vision. In this work, we extend the inquiry to a diverse set of modern computer vision models and find that many recently developed models outperform humans in the presence of frequency noise. This advantage is not merely attributable to conventional techniques such as input-image data augmentation; it also stems crucially from exploiting semantic information in large-scale datasets, coupled with rigorous model scaling. Viewing semantic information from multimodal training as a variant of output augmentation, we posit that augmenting both input images and labels can push artificial neural networks beyond human performance on current benchmarks. These advantages suggest that such models can serve as complementary agents for humans, particularly in challenging conditions. Despite this progress, we must recognize a limitation of computer vision benchmarks: they do not comprehensively quantify human vision. We therefore emphasize the need for vision-science-inspired datasets that measure the alignment between models and human vision.
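The frequency-band masking procedure described above can be illustrated with a minimal sketch: Gaussian noise is shaped in the Fourier domain to occupy a single one-octave band of spatial frequencies, rescaled to a target RMS contrast, and added to the image; model accuracy is then measured as a function of which band carries the noise. The function names, the one-octave band convention (`[base_cpi * 2**k, base_cpi * 2**(k+1))` cycles per image), and the specific contrast values are illustrative assumptions, not the benchmark's exact implementation.

```python
import numpy as np

def bandpass_noise(shape, low_cpi, high_cpi, rms=0.1, rng=None):
    """Gaussian noise restricted to one spatial-frequency band.

    Frequencies are in cycles per image (cpi). The noise spectrum is
    shaped with an annular (ring) mask in the Fourier domain, then the
    result is rescaled to the requested RMS contrast.
    """
    rng = np.random.default_rng(rng)
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None] * h   # vertical frequency, cycles/image
    fx = np.fft.fftfreq(w)[None, :] * w   # horizontal frequency, cycles/image
    radius = np.hypot(fy, fx)             # radial frequency of each FFT bin
    ring = (radius >= low_cpi) & (radius < high_cpi)
    spectrum = np.fft.fft2(rng.standard_normal(shape)) * ring
    noise = np.real(np.fft.ifft2(spectrum))
    return noise * (rms / noise.std())

def mask_image(img, octave, rms=0.2, base_cpi=1.0, rng=None):
    """Add one-octave band-limited noise to a [0, 1] grayscale image.

    Band k spans [base_cpi * 2**k, base_cpi * 2**(k+1)) cycles/image,
    so each successive band doubles in center frequency.
    """
    low = base_cpi * 2.0 ** octave
    noise = bandpass_noise(img.shape, low, 2.0 * low, rms=rms, rng=rng)
    return np.clip(img + noise, 0.0, 1.0)
```

A benchmark sweep would then loop `octave` over the bands covering the image's frequency range and record each observer's (human or model) recognition accuracy per band; the band where accuracy drops most identifies the critical channel.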
