Abstract
Recent advances in multimodal deep learning, and in particular “language-aligned” visual representation learning, have re-ignited longstanding debates about the presence and magnitude of language-like semantic structure in the human visual system. A variety of recent works that map the representations of “language-aligned” vision models (e.g. CLIP) and even pure language models (e.g. GPT, BERT) to activity in the ventral visual stream have claimed that the human visual system itself may be “language-aligned”, much like these recent models. These claims are in part predicated on the surprising finding that pure language models in particular can predict image-evoked activity in the ventral visual stream as well as the best pure vision models (e.g. SimCLR, BarlowTwins). But what would we make of this claim if the same procedures worked in the modeling of visual activity in a species that does not speak language? Here, we deploy controlled comparisons of pure-vision, pure-language, and multimodal vision-language models in the prediction of human (N=4) and rhesus macaque (N=6; 5 IT, 1 V1) ventral stream activity evoked by the same set of 1000 captioned natural images (the NSD1000 images). We find (as in humans) that there is effectively no difference in the brain-predictive capacity of pure vision and “language-aligned” vision models in the macaque high-level ventral stream (IT). Further (as in humans), pure language models can predict responses in IT with substantial accuracy, but perform poorly in the prediction of early visual cortex (V1). Unlike in humans, however, we find that pure language models perform slightly worse than pure vision models in macaque IT, a gap potentially explained by differences in neural recording modality alone (fMRI versus electrophysiology). Together, these results suggest that language model predictivity of the ventral stream is not necessarily due to language per se, but rather to the statistical structure of the visual world as reflected in language.
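To make the comparison described above concrete, the sketch below illustrates one common way such model-to-brain mapping is done: cross-validated ridge regression from a model's image (or caption) embeddings to recorded neural responses, scored by held-out prediction accuracy. This is a minimal, hypothetical example, not the paper's exact pipeline; the synthetic feature matrices, regularization grid, and scoring choices are assumptions standing in for real model embeddings and recorded data.

```python
# Illustrative sketch (not the authors' exact pipeline): compare how well different
# model feature spaces linearly predict image-evoked neural responses, using
# cross-validated ridge regression. The feature matrices below are synthetic
# stand-ins for embeddings of the same images from e.g. a pure vision model
# (SimCLR), a vision-language model (CLIP), or a language model applied to captions.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_images, n_units = 1000, 200          # e.g. NSD1000 images x recorded sites/voxels
neural = rng.standard_normal((n_images, n_units))

# Hypothetical feature spaces; in practice these would be model embeddings
# of the 1000 images (or of their captions, for pure language models).
feature_spaces = {
    "pure_vision": rng.standard_normal((n_images, 512)),
    "vision_language": rng.standard_normal((n_images, 512)),
    "pure_language": rng.standard_normal((n_images, 768)),
}

def encoding_score(X, Y, n_splits=5, alphas=np.logspace(-1, 5, 7)):
    """Mean cross-validated Pearson r between predicted and held-out responses,
    averaged over units -- one common summary of encoding-model fit."""
    scores = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = RidgeCV(alphas=alphas).fit(X[train], Y[train])
        pred = model.predict(X[test])
        r = [np.corrcoef(pred[:, i], Y[test][:, i])[0, 1] for i in range(Y.shape[1])]
        scores.append(np.nanmean(r))
    return float(np.mean(scores))

for name, X in feature_spaces.items():
    print(f"{name}: cross-validated encoding score = {encoding_score(X, neural):.3f}")
```

Under this kind of protocol, "no difference in brain-predictive capacity" would correspond to the feature spaces yielding statistically indistinguishable cross-validated scores on the same neural data.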