Abstract
Humans can visually recognize many actions performed by others, as well as communicate about them. How are action concepts organized in the mind? Recent work has uncovered shared neural representations of actions across vision and language, yet the semantic structure of these representations is not well understood. To address this, we curated a multimodal, naturalistic action set containing 95 videos of everyday actions from the Moments in Time dataset (Monfort et al., 2019) and 95 naturalistic sentences describing the same actions. We labeled each action with four semantic features: a specific action verb (e.g., chopping); an everyday activity (representing a set of actions, e.g., preparing food); the target of the action (e.g., an object); and a broad action class (e.g., manipulation; Orban et al., 2021). We also annotated the actions with other relevant social and action-related features. We used these features to predict behavioral similarity measured in two multiple arrangement experiments, in which participants arranged the videos (N = 39) or the sentences (N = 32) according to the actions’ similarity in meaning. In both experiments, the action target explained more unique variance in behavior than any other feature. In a cross-modal analysis, we mapped the semantic, action, and social features onto the video similarity judgments to predict the sentence similarity judgments, and vice versa. Our feature set explained approximately 80% of the shared variance across modalities. Of all features, action target and action class were the best cross-modal predictors. Together, our results demonstrate the shared semantic organization of human actions across vision and language. This organization reflects broad semantic features, including action target and action class. These findings challenge commonly used definitions of action categories and open exciting avenues for understanding how action concepts are represented in the mind and brain.