Abstract
Recognizing social interactions in visual scenes is a crucial human ability; however, the neural computations that enable it remain undetermined. Prior work has shown that two distinct computational models, a bottom-up graph neural network (SocialGNN) based solely on visual information and a generative inverse planning model (SIMPLE) based on mental state inference, each uniquely explain significant variance in human judgements. Here, we compare both models to neural data to understand how the brain combines these two types of computations for social scene understanding. We collected fMRI data from adults while they watched videos of two animated agents interacting. We then compared neural representations with human behavioral judgements and with each computational model of social interaction recognition. Preliminary whole-brain searchlight representational similarity analysis (RSA) showed a significant correlation between neural representational dissimilarity matrices (RDMs) and the behavioral RDM in visual cortex, lateral occipitotemporal cortex (LOTC), and the superior temporal sulcus (STS). With the computational models, we found that SocialGNN exhibited a significantly higher correlation than SIMPLE in more posterior and dorsal regions, including LOTC and the posterior STS, regions previously implicated in social perception. In contrast, SIMPLE demonstrated a significantly higher correlation than SocialGNN in more anterior regions, including the anterior STS and medial prefrontal cortex (mPFC). Further, both SocialGNN and SIMPLE explained significant variance in posterior and mid regions of the STS, suggesting these regions as a potential site of integration of social perception and mental state inference. This work provides a novel framework for testing computational theories of social perception and cognition, as well as preliminary evidence for how the brain combines bottom-up vision and mental state inference during social scene understanding.
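To make the RSA comparison described above concrete, the following is a minimal, illustrative sketch of correlating a model RDM with a neural RDM via Spearman rank correlation. The synthetic data, dimensions, and variable names are assumptions for illustration only, not the study's actual searchlight pipeline.

```python
# Minimal RSA sketch: compare a model RDM against a neural RDM.
# Synthetic data and all dimensions/names below are illustrative assumptions,
# not the study's actual searchlight pipeline.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_videos = 20     # video conditions (hypothetical)
n_voxels = 100    # voxels in one searchlight sphere (hypothetical)
n_features = 50   # dimensionality of a model embedding (hypothetical)

# Condition-by-voxel responses for one searchlight and condition-by-feature
# embeddings from a computational model (e.g., model outputs per video).
neural_patterns = rng.normal(size=(n_videos, n_voxels))
model_embeddings = rng.normal(size=(n_videos, n_features))

# RDMs as condensed vectors of pairwise correlation distances between
# condition patterns (equivalent to the upper triangle of the square RDM).
neural_rdm = pdist(neural_patterns, metric="correlation")
model_rdm = pdist(model_embeddings, metric="correlation")

# Rank-correlate the two RDMs; in a searchlight analysis this value would be
# mapped back to the sphere's center voxel and tested across subjects.
rho, p = spearmanr(neural_rdm, model_rdm)
print(f"model-neural RDM Spearman rho = {rho:.3f} (p = {p:.3f})")
```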