Leyla Tarhan, Talia Konkle; Predicting the Behavioral Similarity Structure of Visual Actions. Journal of Vision 2018;18(10):428. doi: 10.1167/18.10.428.
Our visual worlds are filled with other people's actions – we watch others run, dance, cook, and crawl. How is this repertoire of visual actions organized? To approach this question, we obtained behavioral similarity measures over 60 videos depicting everyday actions sampled from the American Time Use Survey (ATUS). We then tested how well we could predict this structure using both high-level feature models and neural responses along the visual system. To obtain a similarity space of actions, 20 subjects arranged the videos so that similar actions were nearby (Kriegeskorte & Mur, 2012). Participants' representational structures were moderately similar (noise ceiling: r=0.29-0.36). To understand what properties characterize this representational space, we compared a range of models, reflecting high-level category information (ATUS labels, e.g. "fitness," "grooming"), mid-level models reflecting the role of body parts, and low-level models capturing more primitive visual shape features (gist). Cross-validated prediction scores revealed that category information and body part involvement predicted behavioral similarity moderately well compared to visual shape features (mean leave-1-out τA: body parts=0.15, category=0.14, action target=0.09, gist=-0.01). Additionally, visual cortex responses to the same videos measured using fMRI (N=13) did not predict similarities well (τA=0.08), indicating that this behavioral similarity space is not directly represented within visual cortex. These patterns of data were robust across two different video sets of the same 60 actions. These results on action similarity echo recent work in both object and scene domains (Jozwik, 2017; Groen, 2017): human similarity judgments are best predicted by higher-level properties related to items' functions (what they are), rather than by the mid-level visual features driving neural responses in the visual system (how they look).
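The model-comparison procedure described above can be sketched in code. The snippet below is a minimal illustration, not the authors' analysis pipeline: it assumes each subject's arrangement is summarized as a vectorized representational dissimilarity matrix (RDM) over video pairs, implements Kendall's τA (the untied variant of Kendall's τ commonly used for cross-validated RDM comparison), and scores a candidate model RDM by its mean correlation with held-out subjects' RDMs. All function and variable names are hypothetical.

```python
import numpy as np

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    Unlike tau-b, tied pairs are not excluded from the denominator,
    so ties lower the score."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s = 0.0
    for i in range(n - 1):
        # Sign of each pairwise difference; product is +1 for
        # concordant pairs, -1 for discordant, 0 for ties.
        s += np.sum(np.sign(x[i + 1:] - x[i]) * np.sign(y[i + 1:] - y[i]))
    return s / (n * (n - 1) / 2)

def mean_model_score(subject_rdms, model_rdm):
    """Average tau-a between a model RDM and each subject's RDM
    (each subject is effectively a held-out test set)."""
    return float(np.mean([kendall_tau_a(model_rdm, rdm)
                          for rdm in subject_rdms]))

def lower_noise_ceiling(subject_rdms):
    """Lower-bound noise ceiling: correlate each subject's RDM with
    the mean RDM of the remaining subjects, then average."""
    rdms = np.asarray(subject_rdms, float)
    scores = []
    for i in range(len(rdms)):
        others = np.delete(rdms, i, axis=0).mean(axis=0)
        scores.append(kendall_tau_a(rdms[i], others))
    return float(np.mean(scores))
```

A model whose mean score approaches the lower noise ceiling predicts the behavioral similarity structure about as well as can be expected given inter-subject variability.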
Broadly, these findings suggest that explicit similarity judgments may derive from an underlying categorical representation rather than from a common multi-dimensional feature space.
Meeting abstract presented at VSS 2018