August 2023
Volume 23, Issue 9
Open Access
Vision Sciences Society Annual Meeting Abstract
Language Models of Visual Cortex: Where do they work? And why do they work so well where they do?
Author Affiliations
  • Colin Conwell
    Harvard University
  • Jacob S. Prince
    Harvard University
  • George A. Alvarez
    Harvard University
  • Talia Konkle
    Harvard University
Journal of Vision August 2023, Vol.23, 5653. doi:
Citation: Colin Conwell, Jacob S. Prince, George A. Alvarez, Talia Konkle; Language Models of Visual Cortex: Where do they work? And why do they work so well where they do? Journal of Vision 2023;23(9):5653.


      © ARVO (1962-2015); The Authors (2016-present)


It’s often taken for granted that the best models of visual cortex are vision models. Recent research into models that learn from various combinations of vision and language, however, has reinvigorated longstanding debates over just how visual our models of visual cortex really need to be. In this work, we characterize where and to what extent unimodal language models and multimodal vision-language models best predict evoked visual activity in the human ventral stream. We do this with a series of controlled modeling experiments on brain responses of 4 subjects viewing 1000 images from the Natural Scenes Dataset (NSD), using both classical and voxel-reweighted RSA (veRSA). Using a series of models trained with pure SimCLR-style visual self-supervision, pure CLIP-style language alignment, or a combination of the two, we first demonstrate that language-aligned models, when controlling for dataset, are in fact no better than unimodal vision models at predicting activity in the ventral stream. We next use captions associated with the NSD images to test the brain predictivity of language embeddings from across the processing hierarchies of unimodal language models (N=24; e.g. SentenceBERT, GPT2), demonstrating that while these embeddings systematically fail to predict activity in early visual cortex, they perform on par with unimodal vision models (N=19) in occipitotemporal cortex (with classical and veRSA scores of up to 43% and 67%, respectively). Finally, in a series of text manipulation experiments (e.g. word scrambling, nouns only), we show that the predictive power of these models is predicated almost entirely on simple nouns in no syntactic order (with veRSA scores of up to 61%). These results qualify recent excitement about language alignment in the ventral stream, and suggest that language models are successful models of high-level vision only to the extent that they capture information about the objects present in an image.
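The classical RSA comparison described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the function names, the 1 − Pearson dissimilarity metric, the tie-free rank-based Spearman, and the toy data are all assumptions made for demonstration.

```python
import numpy as np

def rdm(features):
    # Representational dissimilarity matrix (assumed metric: 1 - Pearson
    # correlation between the response patterns for each pair of stimuli).
    return 1.0 - np.corrcoef(features)

def spearman(a, b):
    # Spearman correlation as Pearson correlation of ranks.
    # (Simplified: does not average tied ranks; fine for continuous data.)
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def classical_rsa(model_features, brain_responses):
    # Classical RSA score: rank-correlate the upper triangles of the
    # model RDM and the brain RDM (diagonal excluded).
    m, b = rdm(model_features), rdm(brain_responses)
    iu = np.triu_indices_from(m, k=1)
    return spearman(m[iu], b[iu])

# Toy example: 10 stimuli, 50 model units, 30 voxels.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))          # model embeddings per stimulus
Y = X @ rng.standard_normal((50, 30))      # voxels linearly mixed from X
print(classical_rsa(X, Y))
```

Voxel-reweighted RSA (veRSA) extends this idea by first fitting per-voxel weights on the model features before computing the brain-side RDM; the sketch above covers only the classical variant.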

