Vision Sciences Society Annual Meeting Abstract  |  September 2024
Volume 24, Issue 10
Open Access
Language model prediction of visual cortex responses to dynamic social scenes
Author Affiliations & Notes
  • Emalie McMahon
    Johns Hopkins University
  • Colin Conwell
    Johns Hopkins University
  • Kathy Garcia
    Johns Hopkins University
  • Michael F. Bonner
    Johns Hopkins University
  • Leyla Isik
    Johns Hopkins University
  • Footnotes
    Acknowledgements  Funded by NIH R01MH132826 awarded to LI.
Journal of Vision September 2024, Vol. 24, 904. doi: https://doi.org/10.1167/jov.24.10.904
Abstract

Recent work has shown that language models based on sentence captions of images are good models of high-level ventral visual cortex, on par with vision models. Text manipulation experiments reveal that this match to the ventral stream depends strongly on the nouns in the image captions, suggesting that language models perform well because they represent the things (i.e., agents and objects) in an image. However, the visual world is much richer than static things: we see people dynamically interacting with objects and other people. Dynamic scenes have been shown to activate visual cortex more strongly, and high-level lateral regions, in particular, respond uniquely to dynamic social content. Can vision and language models predict responses to dynamic social scenes in ventral and lateral visual cortices? To investigate this question, we used a large-scale dataset of three-second clips of social actions and collected sentence captions of each clip. Comparing the predictions of vision and language models, we first find that the two model classes predict responses in ventral visual cortex similarly well, extending prior work with static images to dynamic scenes. In contrast, language models outperform vision models in predicting lateral visual cortex. Next, we performed sentence manipulation experiments in which we selectively removed parts of speech from the sentence captions. First, we replicate prior findings that nouns alone, but not verbs alone, yield high prediction of ventral visual cortex. In contrast, in lateral visual cortex, verbs and nouns are similarly highly predictive, and removing only the verbs impairs performance more than removing only the nouns. Taken together, these results suggest that language models’ match to lateral visual cortex relies on action information and that good models of these regions must contain representations of not just agents and objects but also their actions and interactions.
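
The abstract does not include code, but the two analyses it describes, ablating a part of speech from the captions and fitting a cross-validated encoding model from caption features to voxel responses, can be sketched briefly. The sketch below is not the authors' implementation: the use of spaCy for part-of-speech tagging, ridge regression for the encoding model, and all variable names and data shapes are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the authors' code) of:
# (1) removing one part of speech from a caption, and
# (2) a cross-validated ridge-regression encoding model mapping caption
#     embeddings to fMRI voxel responses.
import numpy as np
import spacy
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

nlp = spacy.load("en_core_web_sm")  # assumed POS tagger for the caption manipulation


def ablate_pos(caption: str, pos_to_remove: str) -> str:
    """Return the caption with all tokens of one part of speech removed
    (e.g. pos_to_remove = 'NOUN' or 'VERB')."""
    return " ".join(tok.text for tok in nlp(caption) if tok.pos_ != pos_to_remove)


def encoding_model_score(features: np.ndarray, voxels: np.ndarray,
                         n_splits: int = 5) -> np.ndarray:
    """Cross-validated prediction accuracy (Pearson r per voxel).

    features: (n_clips, n_features) caption embeddings from a language model
    voxels:   (n_clips, n_voxels) responses to the same clips
    """
    preds = np.zeros_like(voxels, dtype=float)
    for train, test in KFold(n_splits=n_splits, shuffle=True,
                             random_state=0).split(features):
        model = RidgeCV(alphas=np.logspace(-2, 5, 8)).fit(features[train], voxels[train])
        preds[test] = model.predict(features[test])
    # Correlate predicted and observed responses separately for each voxel.
    return np.array([np.corrcoef(preds[:, v], voxels[:, v])[0, 1]
                     for v in range(voxels.shape[1])])
```

Under this setup, the noun- and verb-ablation comparison amounts to embedding the original and ablated captions, scoring each feature set with `encoding_model_score`, and comparing the resulting per-voxel accuracies within ventral and lateral regions of interest.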
