Abstract
Deep neural network models are powerful visual representation learners, transforming natural image input into usefully formatted latent spaces. As such, these models give us new inferential purchase on arguments about what is learnable from experienced visual input, given the inductive biases of different architectural connections and the pressures of different task objectives. I will present our current efforts to gather models from the machine learning community for opportunistic controlled-rearing experiments, comparing hundreds of models to human brain responses to thousands of images using billions of regressions. Surprisingly, we find that many models have a similar capacity for brain predictivity, including fully self-supervised visual systems with no specialized architectures, which learn only from the structure of the visual input. These results thus provide computational plausibility for an origin story in which domain-general, experience-dependent learning mechanisms guide visual representation, without requiring specialized architectures or domain-specialized category learning mechanisms. At the same time, no model captures all the signatures of the data, inviting testable speculation about what is missing, specified in terms of architectural inductive biases, functional objectives, and distributions of visual experience. Taken together, this empirical-computational enterprise brings exciting new leverage on the origins of our ability to recognize objects in the world.
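The brain-predictivity comparisons referenced above are typically operationalized as cross-validated encoding models: model-layer activations are regressed onto measured brain responses, and predictions are scored on held-out images. The sketch below is a minimal illustration of that general recipe only; the feature and voxel arrays are synthetic stand-ins, and the specific choices (ridge penalties, fold count, correlation scoring) are assumptions for the example rather than the authors' actual pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical shapes: in practice `features` would be model-layer activations
# for each image, and `voxels` would be measured brain responses (e.g., fMRI
# betas) to the same images. Here both are synthetic stand-ins.
n_images, n_features, n_voxels = 1000, 512, 200
features = rng.standard_normal((n_images, n_features))
true_weights = rng.standard_normal((n_features, n_voxels))
voxels = features @ true_weights + rng.standard_normal((n_images, n_voxels))

def brain_predictivity(features, voxels, n_splits=5):
    """Cross-validated encoding score: fit a ridge regression from model
    features to each voxel and report the mean held-out Pearson correlation
    between predicted and measured responses, per voxel."""
    scores = np.zeros(voxels.shape[1])
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kfold.split(features):
        model = RidgeCV(alphas=np.logspace(-2, 4, 7))
        model.fit(features[train_idx], voxels[train_idx])
        pred = model.predict(features[test_idx])
        true = voxels[test_idx]
        # Per-voxel Pearson correlation between predicted and held-out responses.
        pred_z = (pred - pred.mean(0)) / pred.std(0)
        true_z = (true - true.mean(0)) / true.std(0)
        scores += (pred_z * true_z).mean(0)
    return scores / n_splits

print("median cross-validated voxel correlation:",
      np.median(brain_predictivity(features, voxels)))
```

In this framing, "comparing hundreds of models" amounts to repeating the same fit-and-score loop over each candidate model's feature space, which is where the billions of regressions come from.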