August 2012
Volume 12, Issue 9
Vision Sciences Society Annual Meeting Abstract
Recognizing activities and poses: lessons from computer vision
Author Affiliations
  • Lavanya Sharan
    Disney Research, Pittsburgh, USA
  • Leonid Sigal
    Disney Research, Pittsburgh, USA
  • Jessica Hodgins
    Disney Research, Pittsburgh, USA
    Carnegie Mellon University, USA
Journal of Vision August 2012, Vol.12, 645. doi:

There has been extensive research aimed at understanding how we perceive human bodies and human movements, using methodologies such as psychophysics, neuroimaging, and computational modeling. However, most studies, especially psychophysical and computational studies, have considered only point-light animations (Johansson, 1973). Meanwhile, in the field of computer vision, there has been a dedicated effort to identify human figures in real-world images and videos. Computer vision models that operate on real-world visual inputs can be informative for perceptual investigations. We were inspired by the observation that successful computer vision models of action and pose recognition perform qualitatively different computations: one is driven by holistic features, the other by deformable templates based on the human form. We wanted to understand whether this difference in computational strategy for recognizing actions vs. poses reflects a difference in human perceptual processing as well. We collected photographs of human figures performing one of six common activities (e.g., bending, kicking) and, for each activity, included a variety of poses, backgrounds, lighting conditions, body types, and clothing. We annotated the pose in each photograph (Bourdev & Malik, 2009) to create 3D stick-figure representations. Observers were then asked to recognize either the activity or the pose depicted in each photograph by selecting one of six options (a list of activities or a set of stick figures, respectively). We found that action recognition was faster and more accurate than pose discrimination. Common actions such as walking or sitting can be predicted even when the human figure is occluded, which suggests that real-world cues such as support surfaces and scene context are important for recognizing actions.
Taken together, our findings support the use of divergent strategies in computer vision for recognizing actions (fast, context-driven) and poses (slower, template-based) in real-world images.

Meeting abstract presented at VSS 2012
