Abstract
How does the human visual system recognize objects---through discriminative inference (fast but potentially unreliable) or through generative inference with a model of the world (slow but potentially more robust)? The question of how the brain combines the best of both worlds to achieve fast and robust inference has been termed "the deep mystery of vision" (Kriegeskorte 2015). Yet most of today's leading computational models of human vision are based purely on discriminative inference, such as convolutional neural networks or vision transformers trained on object recognition. In contrast, here we revisit the concept of vision as generative inference. This idea dates back to the notion of vision as unconscious inference proposed by Helmholtz (1867), who hypothesized that the brain uses a generative model of the world to infer the probable causes of sensory input. To build generative models capable of recognizing objects, we take some of the world's most powerful generative text-to-image models (Stable Diffusion, Imagen, and Parti) and turn them into zero-shot image classifiers using Bayesian inference. We then compare these generative classifiers against a broad range of discriminative classifiers and against human psychophysical object recognition data from the "model-vs-human" toolbox (Geirhos et al. 2021). We discover four emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level accuracy on challenging distorted images, and state-of-the-art alignment with human classification errors. Last but not least, generative classifiers understand certain perceptual illusions, such as the famous bistable rabbit-duck illusion or Giuseppe Arcimboldo's portrait of a man's face composed entirely of vegetables, speaking to their ability to interpret ambiguous input and to distinguish local from global information. Taken together, our results indicate that while the currently dominant paradigm for modeling human object recognition is discriminative inference, zero-shot generative models approximate human object recognition data remarkably well.
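As a minimal sketch of the Bayesian recipe (assuming a uniform class prior; the abstract does not specify the exact likelihood estimator used for each model), the generative classifier assigns an image $\mathbf{x}$ to the class $c$ whose class-conditional likelihood under the text-to-image model, conditioned on a class-specific prompt, is highest:
\[
\hat{y}(\mathbf{x}) \;=\; \arg\max_{c}\; p(c \mid \mathbf{x}) \;=\; \arg\max_{c}\; \frac{p(\mathbf{x} \mid c)\, p(c)}{\sum_{c'} p(\mathbf{x} \mid c')\, p(c')} \;=\; \arg\max_{c}\; \log p\big(\mathbf{x} \mid \mathrm{prompt}(c)\big) \quad \text{(uniform } p(c)\text{)}.
\]
For an autoregressive model such as Parti, $\log p(\mathbf{x} \mid \mathrm{prompt}(c))$ can in principle be evaluated directly; for diffusion models such as Stable Diffusion and Imagen, one standard choice (illustrative here, not necessarily the exact procedure of this work) is a variational lower bound given by the negative expected weighted denoising error under the class-specific prompt,
\[
\log p\big(\mathbf{x} \mid \mathrm{prompt}(c)\big) \;\geq\; -\,\mathbb{E}_{t,\,\boldsymbol{\epsilon}}\!\left[\, w_t \,\big\lVert \boldsymbol{\epsilon}_\theta\big(\mathbf{x}_t,\, t,\, \mathrm{prompt}(c)\big) - \boldsymbol{\epsilon} \big\rVert_2^2 \,\right] + \mathrm{const},
\]
where the prompt template $\mathrm{prompt}(c)$ and the timestep weighting $w_t$ are assumptions of this sketch.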