Abstract
Every day, we see people perform many different actions, some of which naturally seem more similar to each other (e.g., running and walking), while others seem more different (e.g., running and cooking). What properties of actions capture this intuitive perception of similarity?
To address this question, we asked 16 participants to watch 60 videos depicting everyday actions and then arrange them so that intuitively similar actions were placed closer together in space (Kriegeskorte & Mur, 2012). We then tested how well a range of features predicted these action similarities: low-level shape features (gist; Oliva & Torralba, 2001), intermediate-level features capturing the body parts involved in each action and the action's targets in the world (Tarhan & Konkle, 2019), and high-level semantics.
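As a minimal sketch of this analysis pipeline (the variable names, coordinate format, and distance metrics below are illustrative assumptions, not the study's exact implementation), behavioral and model dissimilarities can both be expressed as representational dissimilarity matrices (RDMs) built from pairwise distances:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical input: one participant's final arrangement as (60, 2)
# on-screen coordinates, one row per video.
arrangement = np.random.rand(60, 2)  # placeholder coordinates

# Pairwise Euclidean distances between videos give the behavioral
# dissimilarities (the upper triangle of a 60 x 60 RDM).
behavioral_rdm = pdist(arrangement, metric="euclidean")

# A model RDM is built the same way from a (60, n_features) feature
# matrix, e.g. gist, body-part, action-target, or semantic features.
features = np.random.rand(60, 512)  # placeholder feature matrix
model_rdm = pdist(features, metric="correlation")
```

Each feature model can then be scored by how well its RDM ranks the same video pairs as the behavioral RDM.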
To operationalize high-level semantics, we used verbal descriptions to extract each video's embedding in a neural network feature space. First, three observers verbally described each video. Then, transcripts of their descriptions were passed through a deep neural network model trained for natural language processing (BERT; Devlin et al., 2019), producing a 1024-dimensional feature code for each description.
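The sketch below shows one way to obtain such embeddings with the Hugging Face transformers library; the specific model variant (bert-large-uncased, whose hidden states are 1024-dimensional) and the mean-pooling step are assumptions for illustration, not necessarily the procedure used in the study:

```python
import torch
from transformers import BertTokenizer, BertModel

# bert-large-uncased produces 1024-dim hidden states, matching the
# feature dimensionality reported above (assumed variant).
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased")
model.eval()

def embed_description(text: str) -> torch.Tensor:
    """Return a 1024-dim embedding for one verbal description."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden layer over tokens -> shape (1024,)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# Hypothetical usage: average the embeddings of several observers'
# descriptions to get one semantic feature vector per video.
descriptions = [
    "A woman is chopping vegetables at a kitchen counter.",
    "Someone prepares food, cutting carrots on a board.",
    "A person cooks, slicing vegetables for a meal.",
]
video_embedding = torch.stack(
    [embed_description(d) for d in descriptions]
).mean(dim=0)
print(video_embedding.shape)  # torch.Size([1024])
```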
Gist features did not predict action similarity judgments well (mean leave-one-out tau-a = 0.07), while body part and action target features performed better (tau-a = 0.20 and 0.12, respectively). Notably, semantic BERT features performed best (tau-a = 0.25), approaching the noise ceiling. We replicated these results in a second group of participants (N = 20).
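Kendall's tau-a differs from the more common tau-b in that ties are not corrected for, so models that predict many tied dissimilarities are penalized. A minimal sketch of this evaluation is below; the leave-one-participant-out scheme shown (correlating each model RDM with each held-out participant's RDM and averaging) is an assumed, illustrative version of the analysis:

```python
import numpy as np
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n_conc = n_disc = 0
    for i, j in combinations(range(len(x)), 2):
        s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        n_conc += s > 0
        n_disc += s < 0
    n_pairs = len(x) * (len(x) - 1) / 2
    return (n_conc - n_disc) / n_pairs

def loo_model_performance(model_rdm, subject_rdms):
    """Mean tau-a between a model RDM and each subject's RDM.

    subject_rdms: array of shape (n_subjects, n_pairs), one behavioral
    RDM (upper triangle) per held-out participant (hypothetical format).
    """
    return np.mean([kendall_tau_a(model_rdm, s) for s in subject_rdms])
```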
These findings suggest that humans perceive action similarity primarily in terms of semantic properties that can be extracted from natural language. This may be because verbal descriptions capture actions’ larger event structures, going beyond the bodies and objects that make up their elemental components. In addition, our use of verbal descriptions and natural language processing models introduces a tractable way to measure semantic features for real-world, complex videos.