December 2022
Volume 22, Issue 14
Open Access
Vision Sciences Society Annual Meeting Abstract  |   December 2022
What can 5.17 billion regression fits tell us about the representational format of the high-level human visual system?
Author Affiliations & Notes
  • Talia Konkle
    Harvard University
  • Colin Conwell
    Harvard University
  • Jacob S. Prince
    Harvard University
  • George A. Alvarez
    Harvard University
  • Footnotes
    Acknowledgements  NSF PAC COMP-COG 1946308, NSF CAREER BCS-1942438
Journal of Vision December 2022, Vol.22, 4422. doi:
      © ARVO (1962-2015); The Authors (2016-present)


Deep neural network models are often taken to be direct models of the hierarchical visual system; under this framework, benchmarking efforts like BrainScore (Schrimpf et al., 2018) typically seek a single model with the overall best brain predictivity. However, these models also provide unprecedented experimental opportunities to systematically explore how visual representations form, by manipulating the visual diet, the architectural inductive biases, or the format pressures induced by the task, while holding other factors constant. Here we consider targeted comparisons across >110 models and leverage the most extensive fMRI data set collected on visual system responses to date, the Natural Scenes Dataset (NSD), to explore which factors give rise to more or less brain-like representational formats. The factor that showed the largest variation in brain predictivity was the task. Holding both architecture and input constant, object categorization creates a more brain-like representation than other tasks such as autoencoding, segmentation, and depth prediction. Self-supervised tasks (e.g., SimCLR, BarlowTwins, CLIP) showed comparable or improved brain predictivity relative to architecture-matched supervised object categorization networks. In contrast, even extremely diverse architectures (e.g., CNNs, transformers, MLP-mixers), with both the input and the task of object categorization held constant, showed little to no difference in brain predictivity. Notably, the analytical method employed (e.g., RSA with or without voxel-wise feature weighting) also had a dramatic impact on the magnitude of brain predictivity. Each analysis method makes implicit theoretical commitments about the linking hypotheses between artificial neurons, voxel responses, and the structure of population geometry, which warrant deeper consideration.
Broadly, while these results provide a current snapshot of the best-fitting models of the human ventral visual stream, we also offer controlled model comparison as a paradigm to advance our understanding of the pressures guiding visual representation formation, with the aspiration of building increasingly stable insights with every highly performant model that arrives on the scene.
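
To make the contrast between RSA with and without voxel-wise feature weighting concrete, the following is a minimal sketch on synthetic data, not the authors' pipeline. It assumes a correlation-distance dissimilarity matrix, a Pearson comparison of dissimilarity matrices, and a closed-form ridge regression as the feature-weighting step. All function names, array sizes, the train/test split, and the ridge penalty are hypothetical choices made for illustration.

```python
import numpy as np

def rdm(responses):
    """Dissimilarity matrix: 1 - Pearson r between stimulus rows."""
    return 1.0 - np.corrcoef(responses)

def rdm_similarity(a, b):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(a, k=1)
    return float(np.corrcoef(a[iu], b[iu])[0, 1])

def classical_rsa(features, voxels):
    """Unweighted RSA: every model feature contributes equally."""
    return rdm_similarity(rdm(features), rdm(voxels))

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights mapping features X to voxel responses Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def weighted_rsa(features, voxels, train, test, alpha=1.0):
    """Fit one regression per voxel on training stimuli, then compare
    predicted and measured RDMs on held-out stimuli."""
    mu_x = features[train].mean(axis=0)
    mu_y = voxels[train].mean(axis=0)
    W = ridge_fit(features[train] - mu_x, voxels[train] - mu_y, alpha)
    predicted = (features[test] - mu_x) @ W + mu_y
    return rdm_similarity(rdm(predicted), rdm(voxels[test]))

# Synthetic demo (assumption): voxels depend on only 5 of 50 model features.
rng = np.random.default_rng(0)
n_stimuli, n_features, n_voxels = 200, 50, 30
features = rng.standard_normal((n_stimuli, n_features))
mixing = np.zeros((n_features, n_voxels))
mixing[:5] = rng.standard_normal((5, n_voxels))
voxels = features @ mixing + 0.5 * rng.standard_normal((n_stimuli, n_voxels))

train, test = np.arange(0, 150), np.arange(150, 200)
score_classical = classical_rsa(features[test], voxels[test])
score_weighted = weighted_rsa(features, voxels, train, test)
print(f"classical RSA: {score_classical:.3f}")
print(f"feature-weighted RSA: {score_weighted:.3f}")
```

In this toy setting the two scores differ because the unweighted comparison assumes every model feature matters equally, whereas the weighted variant lets each voxel select the features that predict it. This is one example of how the choice of analysis method encodes a linking hypothesis.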

