Vision Sciences Society Annual Meeting Abstract  |  September 2024
Open Access
Learning to discriminate by learning to generate: zero-shot generative models increase human object recognition alignment
Author Affiliations
  • Robert Geirhos
    Google DeepMind
  • Kevin Clark
    Google DeepMind
  • Priyank Jaini
    Google DeepMind
Journal of Vision September 2024, Vol. 24(10):132. https://doi.org/10.1167/jov.24.10.132
Abstract

How does the human visual system recognize objects: through discriminative inference (fast but potentially unreliable) or via a generative model of the world (slow but potentially more robust)? The question of how the brain combines the best of both worlds to achieve fast and robust inference has been termed "the deep mystery of vision" (Kriegeskorte, 2015). Yet most of today's leading computational models of human vision are based purely on discriminative inference, such as convolutional neural networks or vision transformers trained on object recognition. In contrast, here we revisit the concept of vision as generative inference. This idea dates back to the notion of vision as unconscious inference proposed by Helmholtz (1867), who hypothesized that the brain uses a generative model of the world to infer the probable causes of sensory input. To build a generative model capable of recognizing objects, we take some of the world's most powerful generative text-to-image models (Stable Diffusion, Imagen, and Parti) and turn them into zero-shot image classifiers using Bayesian inference. We then compare these generative classifiers against a broad range of discriminative classifiers and against human psychophysical object recognition data from the "model-vs-human" toolbox (Geirhos et al., 2021). We discover four emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level accuracy on challenging distorted images, and state-of-the-art alignment with human classification errors. Last but not least, generative classifiers understand certain perceptual illusions, such as the famous bistable rabbit-duck illusion or Giuseppe Arcimboldo's portrait of a man's face composed entirely of vegetables, speaking to their ability to discern ambiguous input and to distinguish local from global information. Taken together, our results indicate that while the dominant paradigm for modeling human object recognition is currently discriminative inference, zero-shot generative models approximate human object recognition data remarkably well.
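The abstract names the key technical move (turning conditional text-to-image models into zero-shot classifiers via Bayesian inference) without spelling out the decision rule, so a brief sketch may help. With a uniform prior over classes c, Bayes' rule reduces classification to comparing class-conditional likelihoods:

    p(c | x) = p(x | c) p(c) / Σ_c' p(x | c') p(c'),   hence   argmax_c p(c | x) = argmax_c p(x | c).

For a diffusion model, log p(x | c) is intractable, but the evidence lower bound (ELBO) bounds it, and the dominant ELBO term is the expected noise-prediction error. The following PyTorch sketch illustrates this general diffusion-classifier recipe; it is not the authors' implementation, and the names eps_model (a pretrained class-conditional denoiser ε_θ(x_t, t, c)) and alpha_bar (cumulative noise-schedule products ᾱ_t) are hypothetical stand-ins for whatever a given model exposes.

    import torch

    def generative_classify(x, class_embeddings, eps_model, alpha_bar, n_samples=64):
        """Pick the class whose conditional denoising loss on image x is lowest."""
        # Pre-sample timesteps and noise once and reuse them for every class,
        # so the per-class losses differ only through the conditioning
        # (a common variance-reduction choice, not a requirement of the method).
        ts = torch.randint(0, len(alpha_bar), (n_samples,))
        noises = [torch.randn_like(x) for _ in range(n_samples)]

        losses = []
        for c in class_embeddings:
            total = 0.0
            for t, eps in zip(ts, noises):
                a = alpha_bar[t].sqrt()          # signal scale at step t
                s = (1.0 - alpha_bar[t]).sqrt()  # noise scale at step t
                x_t = a * x + s * eps            # forward-diffused image
                eps_hat = eps_model(x_t, t, c)   # conditional noise prediction
                total += torch.mean((eps - eps_hat) ** 2).item()
            # Monte Carlo estimate of the (unweighted) ELBO denoising term:
            losses.append(total / n_samples)

        # Lowest denoising loss ~ highest log p(x | c); with a uniform prior
        # this is the Bayes decision argmax_c p(c | x).
        return int(torch.tensor(losses).argmin())

Autoregressive models such as Parti admit the same decision rule even more directly, since they expose exact per-token log-likelihoods of the image tokens that can be summed to score log p(x | c) for each candidate class.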
