Abstract
How do we classify other people's actions, and what information do we use to do it? Ever since Johansson's introduction of Point Light Displays (PLDs), it has been known that human observers can integrate a few moving points into a human form and, in many cases, infer actions, intentions, and emotions. However, it is not clear what kind of representation supports this ability---dynamically changing shape, or just motion flow? Some studies have suggested that motion flow suffices for action classification, but the evidence for this is based on very limited action sets, such as distinguishing walking from noise. This raises questions about whether various kinds of visual information suffice to support the recognition of a broader range of actions. To address this gap, we used OpenPose to generate reduced video stimuli from multiple action videos drawn from naturalistic scenes. This research uses 78 actions, including everyday activities such as brushing teeth and drinking water. For these 78 actions, three distinct types of videos were created: PLDs (consisting of dots at joint locations), Stick Figures (joint positions connected in an anatomically correct body plan), and Motion Flow Videos (flow fields based on the Lucas-Kanade algorithm). Participants were asked to identify the action in each video via a free-text description. We employed a Natural Language Processing model to estimate the semantic similarity of each participant's response to the responses of others, allowing us to automatically estimate intersubjective agreement. Further analysis was conducted using a Hierarchical Bayesian Model, which compared the posterior and predictive distributions of the semantic similarities across the video conditions. Intersubjective agreement was highest for Stick Figures, followed by PLDs, and lowest for Motion Flow videos, suggesting that dynamic pose representations are indeed required for accurate action classification, and that motion flow supports at most a coarser classification.
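To make the agreement measure concrete, the sketch below shows one way the semantic-similarity step could be computed: embed each participant's free-text response for a video and take the mean pairwise cosine similarity as that video's intersubjective agreement. This is an illustrative assumption, not the authors' code; the sentence-transformers library, the "all-MiniLM-L6-v2" model, and the example responses are all hypothetical choices.

```python
# Minimal sketch (assumed pipeline): intersubjective agreement for one video
# as the mean pairwise semantic similarity of participants' free-text responses.
import itertools

from sentence_transformers import SentenceTransformer, util

# Embedding model is an illustrative choice, not necessarily the one used in the study.
model = SentenceTransformer("all-MiniLM-L6-v2")

def intersubjective_agreement(responses: list[str]) -> float:
    """Mean cosine similarity over every pair of participants' descriptions."""
    embeddings = model.encode(responses, convert_to_tensor=True)
    sims = util.cos_sim(embeddings, embeddings)  # (n, n) similarity matrix
    pairs = list(itertools.combinations(range(len(responses)), 2))
    return float(sum(sims[i, j] for i, j in pairs) / len(pairs))

# Hypothetical responses to a "brushing teeth" stick-figure video.
print(intersubjective_agreement([
    "a person brushing their teeth",
    "someone cleaning their teeth with a toothbrush",
    "washing the face",  # a divergent description lowers the agreement score
]))
```

Per-video agreement scores of this kind could then serve as the observations for the condition-level comparison described in the abstract.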