Abstract
Recent work has shown that language models, given sentence captions of images, are good models of high-level ventral visual cortex, on par with vision models. Text manipulation experiments reveal that this match to the ventral stream depends strongly on the nouns in the image captions, suggesting that language models perform well because they represent the things (i.e., agents and objects) in an image. However, the visual world is much richer than static things. We see people dynamically interacting with objects and with other people. Dynamic scenes have been shown to activate visual cortex more strongly, and high-level lateral regions in particular respond uniquely to dynamic social content. Can vision and language models predict responses to dynamic social scenes in ventral and lateral visual cortices? To investigate this question, we used a large-scale dataset of three-second clips of social actions and collected sentence captions for each clip. Comparing the predictions of vision and language models, we first find that the two model classes predict responses in ventral visual cortex similarly well, extending prior work with static images to dynamic scenes. In contrast, language models outperform vision models in predicting lateral visual cortex. Next, we performed sentence manipulation experiments in which we selectively removed parts of speech from the captions. We replicate the prior finding that nouns alone, but not verbs alone, yield high prediction of ventral visual cortex. In lateral visual cortex, by contrast, verbs and nouns are similarly predictive, and removing only the verbs impairs performance more than removing only the nouns. Taken together, these results suggest that language models’ match to lateral visual cortex relies on action information, and that good models of these regions must represent not just agents and objects but also their actions and interactions.
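To make the sentence-manipulation idea concrete, the sketch below removes a chosen part of speech from a caption and re-embeds the edited caption, the kind of ablated feature that could then be compared as a predictor of cortical responses. This is an illustrative sketch, not the authors' pipeline: spaCy, SentenceTransformer, the specific model names, and the example caption are all assumptions made for the sake of a runnable example.

```python
# Minimal sketch (assumed tools, not the paper's code): strip one part of
# speech from a caption using spaCy POS tags, then re-embed the result.
import spacy
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")                   # assumed POS tagger
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed caption encoder


def remove_pos(caption: str, pos: str) -> str:
    """Return the caption with all tokens of the given coarse POS removed
    (e.g. pos="NOUN" or pos="VERB")."""
    doc = nlp(caption)
    kept = [tok.text for tok in doc if tok.pos_ != pos]
    return " ".join(kept)


caption = "Two children chase a dog across the park."
no_nouns = remove_pos(caption, "NOUN")   # e.g. "Two chase a across the ."
no_verbs = remove_pos(caption, "VERB")   # e.g. "Two children a dog across the park ."

# Embeddings of the original and ablated captions; in an encoding analysis
# these would serve as features (e.g., for ridge regression onto responses).
vecs = embedder.encode([caption, no_nouns, no_verbs])
print(vecs.shape)  # (3, embedding_dim)
```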