Vision Sciences Society Annual Meeting Abstract  |  December 2022
Journal of Vision, Volume 22, Issue 14  |  Open Access
The bittersweet lesson: data-rich models narrow the behavioural gap to human vision
Author Affiliations & Notes
  • Robert Geirhos
    University of Tübingen
    International Max Planck Research School for Intelligent Systems
  • Kantharaju Narayanappa
    University of Tübingen
  • Benjamin Mitzkus
    University of Tübingen
  • Tizian Thieringer
    University of Tübingen
  • Matthias Bethge
    University of Tübingen
  • Felix A. Wichmann
    University of Tübingen
  • Wieland Brendel
    University of Tübingen
  • Footnotes
Acknowledgements  This work was supported by the IMPRS-IS, the Collaborative Research Center (276693517), the German Federal Ministry of Education and Research (01IS18039A), the Machine Learning Cluster of Excellence (2064/1, 390727645), and the German Research Foundation (BR 6382/1-1).
Journal of Vision December 2022, Vol.22, 3273. doi:https://doi.org/10.1167/jov.22.14.3273
Citation: Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A. Wichmann, Wieland Brendel; The bittersweet lesson: data-rich models narrow the behavioural gap to human vision. Journal of Vision 2022;22(14):3273. https://doi.org/10.1167/jov.22.14.3273.

© ARVO (1962-2015); The Authors (2016-present)
Abstract

A major obstacle to understanding human visual object recognition is our lack of behaviourally faithful models. Even the best models based on deep learning classifiers deviate strikingly from human perception in many ways. To study this deviation in more detail, we collected a massive set of human psychophysical classification data under highly controlled conditions (17 datasets, 85K trials across 90 observers). We have made these data publicly available as an open-source Python toolkit and behavioural benchmark called "model-vs-human", which we use to investigate the latest generation of models. In terms of robustness, standard machine vision models make many more errors than humans on distorted images, and in terms of image-level consistency, the errors they make differ substantially from human errors. Excitingly, however, a number of recent models make substantial progress towards closing this behavioural gap: "simply" training models on large-scale datasets (between one and three orders of magnitude larger than standard ImageNet) is sufficient, first, to reach or surpass human-level distortion robustness and, second, to improve image-level error consistency between models and humans. This is significant given that none of those models is particularly biologically faithful at the implementational level; in fact, large-scale training appears much more effective than, e.g., biologically motivated self-supervised learning. In light of these findings, it is hard to avoid drawing parallels to the "bitter lesson" formulated by Rich Sutton, who argued that "building in how we think we think does not work in the long run" and that, ultimately, scale is all that matters. While human-level distortion robustness and improved behavioural consistency with human decisions through large-scale training are certainly a sweet surprise, this leaves us with a nagging question: Should we, perhaps, worry less about biologically faithful implementations and more about the algorithmic similarities between human and machine vision induced by training on large-scale datasets?
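To make the image-level consistency measure above concrete, the following is a minimal sketch of how a kappa-style error consistency score can be computed from trial-by-trial correctness data: the observed agreement between two decision makers (e.g., a human observer and a model judging the same images) is compared against the agreement expected by chance given only their accuracies. The function name and interface are illustrative assumptions, not the model-vs-human toolkit's API.

import numpy as np

def error_consistency(correct_a, correct_b):
    # correct_a, correct_b: boolean sequences, one entry per trial,
    # True where the response was correct (illustrative sketch only).
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)

    # Observed consistency: fraction of trials on which both decision
    # makers are simultaneously correct or simultaneously wrong.
    c_obs = np.mean(correct_a == correct_b)

    # Expected consistency if errors were independent, given only the
    # two accuracies p_a and p_b.
    p_a, p_b = correct_a.mean(), correct_b.mean()
    c_exp = p_a * p_b + (1.0 - p_a) * (1.0 - p_b)

    # Kappa-style score: 0 means no more consistent than chance,
    # 1 means the errors fall on exactly the same trials.
    return (c_obs - c_exp) / (1.0 - c_exp)

# Hypothetical example: human vs. model correctness on the same 8 trials.
human = [True, True, False, True, False, True, True, False]
model = [True, False, False, True, False, True, True, True]
print(error_consistency(human, model))  # ~0.47

A score near 0 indicates that the two decision makers share no more errors than expected by chance, whereas a score near 1 indicates that they fail on the same images; improving image-level error consistency between models and humans corresponds to this score increasing.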
