September 2024
Volume 24, Issue 10
Open Access
Vision Sciences Society Annual Meeting Abstract  |   September 2024
Is visual cortex really “language-aligned”? Perspectives from Model-to-Brain Comparisons in Humans and Monkeys on the Natural Scenes Dataset
Author Affiliations
  • Colin Conwell
    Johns Hopkins University, Department of Cognitive Science
  • Emalie McMahon
    Johns Hopkins University, Department of Cognitive Science
  • Kasper Vinken
    Harvard Medical School, Department of Neurobiology
  • Jacob S. Prince
    Harvard University, Department of Psychology
  • George Alvarez
    Harvard University, Department of Psychology
  • Talia Konkle
    Harvard University, Department of Psychology
  • Leyla Isik
    Johns Hopkins University, Department of Cognitive Science
  • Margaret Livingstone
    Harvard Medical School, Department of Neurobiology
Journal of Vision September 2024, Vol.24, 1288. doi:https://doi.org/10.1167/jov.24.10.1288
Abstract

Recent advances in multimodal deep learning, and in particular “language-aligned” visual representation learning, have re-ignited longstanding debates about the presence and magnitude of language-like semantic structure in the human visual system. A variety of recent works that map the representations of “language-aligned” vision models (e.g., CLIP) and even pure language models (e.g., GPT, BERT) to activity in the ventral visual stream have claimed that the human visual system itself may be “language-aligned,” much like these recent models. These claims are in part predicated on the surprising finding that pure language models in particular can predict image-evoked activity in the ventral visual stream as well as the best pure vision models (e.g., SimCLR, BarlowTwins). But what would we make of this claim if the same procedures worked in the modeling of visual activity in a species that doesn’t speak language? Here, we deploy controlled comparisons of pure-vision, pure-language, and multimodal vision-language models in predicting both human (N=4) and rhesus macaque (N=6; 5: IT, 1: V1) ventral stream activity evoked by the same set of 1000 captioned natural images (the NSD1000 images). We find (as in humans) that there is effectively no difference in the brain-predictive capacity of pure vision and “language-aligned” vision models in macaque high-level ventral stream (IT). Further (as in humans), pure language models can predict responses in IT with substantial accuracy, but perform poorly in predicting early visual cortex (V1). Unlike in humans, however, we find that pure language models perform slightly worse than pure vision models in macaque IT, a gap potentially explained by differences in recording modality alone (fMRI versus electrophysiology). Together, these results suggest that language-model predictivity of the ventral stream is not necessarily due to language per se, but rather to the statistical structure of the visual world as reflected in language.
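
Model-to-brain comparisons of this kind are typically implemented as encoding models: model features for each image are regressed onto the recorded responses, and predictivity is scored on held-out images. The abstract does not specify the exact mapping procedure, so the sketch below is only an illustration of the general technique, assuming cross-validated ridge regression and using randomly generated stand-in arrays in place of real model activations and NSD/electrophysiology responses.

```python
# Minimal sketch (not the authors' exact pipeline) of a regression-based
# model-to-brain comparison: ridge-regress model features onto neural
# responses and score held-out predictivity per feature set. All arrays
# here are random stand-ins; in practice they would be model activations
# for the NSD1000 images and the recorded ventral-stream responses.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_images, n_units = 1000, 200                           # e.g. NSD1000 images x recorded sites
vision_feats   = rng.standard_normal((n_images, 512))   # stand-in for a vision model's features
language_feats = rng.standard_normal((n_images, 768))   # stand-in for a language model's features
neural         = rng.standard_normal((n_images, n_units))  # stand-in for IT responses

def predictivity(features, responses, n_splits=5):
    """Mean held-out Pearson r between predicted and actual responses."""
    scores = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(features):
        model = RidgeCV(alphas=np.logspace(-2, 4, 7))
        model.fit(features[train], responses[train])
        pred = model.predict(features[test])
        # correlate prediction and data per unit, then average across units
        r = [np.corrcoef(pred[:, i], responses[test][:, i])[0, 1]
             for i in range(responses.shape[1])]
        scores.append(np.nanmean(r))
    return float(np.mean(scores))

print("vision model predictivity:  ", predictivity(vision_feats, neural))
print("language model predictivity:", predictivity(language_feats, neural))
```

In analyses of this kind, the resulting predictivity scores for different feature sets are usually compared against a noise ceiling estimated from the reliability of the neural responses; that step is omitted here for brevity.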
