Abstract
Humans rapidly make sense of an ever-changing visual world, extracting information about people’s actions in a wide range of settings. Yet it remains unclear how the brain processes this complex information, from the extraction of perceptual details to the emergence of abstract concepts. To address this, we curated a naturalistic dataset of 95 short videos and sentences depicting everyday human actions. We densely labeled each action with perceptual features like scene setting (indoors/outdoors), action-specific features like tool use, and semantic features categorizing actions at different levels of abstraction, from specific action verbs (e.g. chopping) to broad action classes (e.g. manipulation). To investigate when and where these features are processed in the brain, we leveraged a multimodal approach, collecting EEG and fMRI data while participants viewed the action videos and sentences. We applied temporally and spatially resolved representational similarity analysis and variance partitioning to characterize the neural dynamics of action feature representations. We found that action information is extracted in the brain along a temporal gradient, from early perceptual features to later action-specific and semantic features. We mapped action-specific and semantic features to areas in parietal and lateral occipitotemporal cortices. Using cross-decoding between videos and sentences, we identified a late (~500 ms) modality-invariant neural response. Our results characterize the spatiotemporal dynamics of action understanding in the brain and highlight the shared neural representations of human actions across vision and language.
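
For readers unfamiliar with the core analysis named in the abstract, the sketch below illustrates the general logic of time-resolved representational similarity analysis (RSA): a model representational dissimilarity matrix (RDM) built from one labeled feature is compared with neural RDMs computed from EEG patterns at each time point. This is a minimal illustration on simulated data, not the authors' analysis pipeline; the array shapes, the choice of feature, and the distance metrics are assumptions made for the example.

```python
# Minimal time-resolved RSA sketch on simulated data (not the authors' pipeline).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical dimensions: 95 stimuli, 64 EEG channels, 200 time samples.
n_stimuli, n_channels, n_timepoints = 95, 64, 200

# Simulated EEG patterns (stimulus x channel x time) and a binary feature label
# per stimulus, e.g. scene setting (indoors vs. outdoors).
eeg = rng.standard_normal((n_stimuli, n_channels, n_timepoints))
scene_setting = rng.integers(0, 2, n_stimuli)

# Model RDM: 1 where two stimuli differ on the feature, 0 where they match.
model_rdm = pdist(scene_setting[:, None], metric="hamming")

# Neural RDM at each time point (correlation distance across channels),
# compared to the model RDM with Spearman correlation.
rsa_timecourse = np.empty(n_timepoints)
for t in range(n_timepoints):
    neural_rdm = pdist(eeg[:, :, t], metric="correlation")
    rsa_timecourse[t], _ = spearmanr(neural_rdm, model_rdm)

print(rsa_timecourse.shape)  # one model-brain similarity value per time point
```

With real data, the resulting time course is typically tested against chance across participants, and variance partitioning extends this logic by comparing the unique and shared variance explained by several model RDMs.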