Abstract
It is often taken for granted that the best models of visual cortex are vision models. Recent research into models that learn from various combinations of vision and language, however, has reinvigorated longstanding debates over just how visual our models of visual cortex really need to be. In this work, we characterize where and to what extent unimodal language models or multimodal vision-language models best predict evoked visual activity in the human ventral stream. We do this with a series of controlled modeling experiments on the brain responses of 4 subjects viewing 1000 images from the Natural Scenes Dataset (NSD), using both classical and voxel-reweighted RSA (veRSA). Using models trained with pure SimCLR-style visual self-supervision, pure CLIP-style language alignment, or a combination of the two, we first demonstrate that language-aligned models, when controlling for training dataset, are in fact no better than unimodal vision models at predicting activity in the ventral stream. We next use the captions associated with the NSD images to test the brain predictivity of language embeddings drawn from across the processing hierarchy of (N=24) unimodal language models (e.g. SentenceBERT, GPT2), demonstrating that while these embeddings systematically fail to predict activity in early visual cortex, they perform on par with unimodal vision models (N=19) in occipitotemporal cortex (with classical and veRSA scores of up to 43% and 67%, respectively). Finally, in a series of text manipulation experiments (e.g. word scrambling, nouns only), we show that the predictive power of these models appears to rest almost entirely on simple nouns presented in no syntactic order (with veRSA scores of up to 61%). These results qualify recent excitement about language alignment in the ventral stream, and suggest that language models are successful models of high-level vision only to the extent that they capture information about the objects present in an image.
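To make the two evaluation metrics concrete, below is a minimal Python sketch (not the authors' code) contrasting classical RSA with a voxel-reweighted variant. The correlation-distance RDMs, the ridge penalty, the train/test split, and the Pearson comparison of condensed RDM entries are illustrative assumptions, not details taken from this paper.

```python
# Minimal sketch of classical RSA vs. a voxel-reweighted variant (veRSA).
# All shapes, the ridge alpha, and the choice of correlation distance and
# Pearson scoring are assumptions made for illustration only.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def rdm(responses):
    """Condensed representational dissimilarity matrix over stimuli (rows),
    using correlation distance between response patterns."""
    return pdist(responses, metric="correlation")

def classical_rsa(model_feats, brain_voxels):
    """Correlate the model RDM directly with the brain RDM."""
    r, _ = pearsonr(rdm(model_feats), rdm(brain_voxels))
    return r

def voxel_reweighted_rsa(model_feats, brain_voxels, train_idx, test_idx, alpha=1.0):
    """Fit per-voxel linear weights on training stimuli, then compare the RDM of
    the predicted voxel patterns with the measured brain RDM on test stimuli."""
    encoder = Ridge(alpha=alpha).fit(model_feats[train_idx], brain_voxels[train_idx])
    predicted = encoder.predict(model_feats[test_idx])
    r, _ = pearsonr(rdm(predicted), rdm(brain_voxels[test_idx]))
    return r

# Toy usage with random data: 1000 stimuli, 512-d model features, 2000 voxels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 512))
voxels = rng.standard_normal((1000, 2000))
train, test = np.arange(800), np.arange(800, 1000)
print(classical_rsa(feats[test], voxels[test]))
print(voxel_reweighted_rsa(feats, voxels, train, test))
```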