Abstract
Previous work has shown that humans preferentially allocate attention to animate objects (Lindh et al., 2019) and spontaneously attribute mental states to abstract animations (Heider & Simmel, 1944). In this work, we study whether human similarity judgments of scenes depicting abstract shapes are based on the social traits attributed to these shapes (goals, relationships, and strengths) or on their motion features (e.g., velocities, distances). We designed an experiment in which subjects first watched a reference video of 2D geometric shapes and then judged its similarity to two video prompts: one matched on motion cues and the other matched on social traits. The shapes in the videos were two agents and one object, all of which followed continuous motion trajectories controlled by a hierarchical planner and a physics engine, and could express a variety of goals, relationships, and strengths. We found that human similarity judgments were best predicted by a model based on the inference of social traits, as recovered by our inverse planning algorithm, with the strongest contribution from the most salient trait, relationships. Moreover, our inverse planning model accounts for variance in human responses, quantitatively modeled as uncertainty in the inference. In contrast, a baseline neural network model trained to predict video similarity from motion features (using an independent set of stimuli) failed to predict human responses. A second baseline neural network model, trained on motion features to predict the agents’ social traits for the similarity judgments, likewise failed to predict human judgments of out-of-sample stimuli, showing that social trait information is not available at the level of motion features. Taken together, our results suggest that human similarity judgments of abstract scenes cannot be explained by low-level visual cues alone, but are guided by the observer's theory-based social attributions.