Abstract
Deep neural network models are often taken to be direct models of the hierarchical visual system; under this framework, benchmarking efforts like BrainScore (Schrimpf et al., 2018) often seek a single model with the overall best brain predictivity. However, these models also provide unprecedented experimental opportunities to systematically explore visual representation formation, by manipulating the visual diet, the architectural inductive biases, or the format pressures induced by the task, while holding other factors constant. Here we consider targeted comparisons from >110 models and leverage the Natural Scenes Dataset (NSD), the most extensive fMRI data set collected on visual system responses to date, to explore which factors give rise to more or less brain-like representational formats. The factor that produced the largest variation in brain predictivity across the visual system was the task. Holding both architecture and input constant, object categorization yields a more brain-like representation than other tasks such as autoencoding, segmentation, and depth prediction. Self-supervised tasks (e.g. SimCLR, BarlowTwins, CLIP) showed comparable or improved brain predictivity relative to architecture-matched supervised object categorization networks. In contrast, even extremely diverse architectures (e.g. CNNs, transformers, MLP-mixers), with both the input and the task of object categorization held constant, showed little to no difference in brain predictivity. Notably, the analytical method employed (e.g. RSA with or without voxel-wise feature weighting) also had a dramatic impact on the magnitude of brain predictivity. Each analysis method makes implicit theoretical commitments about the linking hypotheses between artificial neurons, voxel responses, and the structure of population geometry, which warrant deeper consideration.
Broadly, while these results provide a current snapshot of the best-fitting models of the human ventral visual stream, here we also offer controlled model comparison as a paradigm to advance our understanding of the pressures guiding visual representation formation, with the aspiration of building increasingly stable insights with every highly performant model that arrives on the scene.