Abstract
The many successes of deep neural networks (DNNs) over the past decade have been driven by data and computational scale rather than biological insight. However, as DNNs have continued to improve on benchmarks like ImageNet, they have grown worse as models of biological brains and behavior. For instance, recent DNNs with human-level object classification accuracy are no better at predicting human perception or image-evoked responses in primate inferotemporal (IT) cortex than DNNs from a decade ago (e.g., Linsley et al., 2023). Here, we build better DNN models of biological vision by finding data diets and objective functions that more closely resemble those that shape biological brains. We began by building a platform for searching through naturalistic data diets and objective functions for training a standardized DNN architecture at scale. Each DNN’s data diet was sampled from our rendering engine, which generates lifelike videos of objects in real-world scenes. In parallel, each model’s objective function was sampled from a parametrized space of image-reconstruction objectives, which made it possible to train models to learn combinations of causal and acausal recognition strategies over space, or over space and time. We evaluated the ability of hundreds of DNNs trained on this platform to predict human performance on a novel “Greebles” object recognition task (Ashworth et al., 2008). We found that DNNs trained to capture the causal structure of their data were significantly more predictive of human decisions and reaction times than any other DNN tested. Moreover, these causal DNNs learned strong equivariance to out-of-plane variations in object pose, recapitulating classical theory on the foundations of object constancy (Sinha & Poggio, 1996) despite facing no explicit constraint to do so. Our work identifies key limitations in how DNNs are trained today and introduces a better approach for building DNN-based models of human vision that can ultimately advance perceptual science.