Abstract
Humans are highly adept at detecting and recognizing social interactions. However, the underlying computations that enable us to extract social information from visual scenes are still largely unknown. One theory proposes that humans recognize social relationships by simulating the inferred goals of others; this account has been instantiated in generative inverse planning models. In contrast, recent behavioral and neural evidence has suggested that social interaction perception is a bottom-up, visual process separate from complex mental simulation. Relatedly, recent work has found that a purely visual model with relational inductive biases can successfully model human social interaction judgments, lending computational support to this bottom-up theory. To directly compare these two alternatives, we relate both our purely visual model (SocialGNN) and a generative inverse planning model (SIMPLE) to human ratings of animated shape videos resembling real-life social interactions. Using representational similarity analysis, we found that both SocialGNN and SIMPLE are significantly correlated with human judgments (r = .45 and r = .49, respectively). Interestingly, a significant amount of variance in human judgments is uniquely explained by each model (sr = .30 and sr = .37, respectively), suggesting that humans engage both bottom-up and simulation-based processes to recognize social interactions, with each process possibly representing a different aspect of the stimulus. This work provides important insight into the extent to which humans rely on visual processing versus mental simulation to interpret different social scenes.
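For readers who want a concrete picture of the comparison described above, the sketch below shows one way a representational similarity analysis with semipartial correlations might be set up. It is a minimal illustration only: the random placeholder data, variable names, distance metric, and residualization-based semipartial estimate are all assumptions and do not reproduce the paper's actual stimuli or pipeline.

```python
# Minimal RSA sketch (illustrative assumptions, not the authors' actual code).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_videos, n_features = 20, 8

# Hypothetical per-video representations: human ratings and two model outputs.
human_ratings = rng.random((n_videos, n_features))
socialgnn_embed = rng.random((n_videos, n_features))
simple_embed = rng.random((n_videos, n_features))

def rdm_vector(X):
    """Upper triangle of a representational dissimilarity matrix (correlation distance)."""
    return pdist(X, metric="correlation")

human_rdm = rdm_vector(human_ratings)
gnn_rdm = rdm_vector(socialgnn_embed)
simple_rdm = rdm_vector(simple_embed)

# Zero-order correlations between each model RDM and the human-judgment RDM.
r_gnn, _ = pearsonr(gnn_rdm, human_rdm)
r_simple, _ = pearsonr(simple_rdm, human_rdm)

def semipartial(target, predictor, control):
    """Correlate target with the part of predictor not explained by control."""
    beta = np.polyfit(control, predictor, 1)
    residual = predictor - np.polyval(beta, control)
    return pearsonr(target, residual)[0]

# Unique (semipartial) contribution of each model after removing the other.
sr_gnn = semipartial(human_rdm, gnn_rdm, simple_rdm)
sr_simple = semipartial(human_rdm, simple_rdm, gnn_rdm)

print(f"r(SocialGNN) = {r_gnn:.2f}, r(SIMPLE) = {r_simple:.2f}")
print(f"sr(SocialGNN) = {sr_gnn:.2f}, sr(SIMPLE) = {sr_simple:.2f}")
```

In this sketch, a large semipartial correlation for one model after regressing out the other would indicate that that model captures variance in the human-judgment RDM not shared with its competitor, paralleling the unique-variance result reported in the abstract.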