Abstract
Visual information about the human body is naturally organized in terms of body parts and their spatial relationships. Previous behavioral studies of body perception have largely focused on the visual interpretation of whole bodies using simplified stimuli. This leaves open the question of how well human observers can recognize individual body parts given limited spatial context and temporal constraints, and given the complex variability of appearance typical of natural images. To answer this question, we measured the accuracy with which human observers recognize individual body parts in static natural images as a function of the amount of spatial information and of stimulus duration. In an online experiment, observers were asked to identify one of eight body joints in natural images seen through small square apertures. The apertures were cropped from 340 images selected from the MPII Human Pose dataset, at widths ranging from 12 to 60 pixels. These test images were presented in random order at durations varying from 50 to 2000 milliseconds, with or without backward masking. We found that recognition accuracy grew approximately linearly with aperture size, from chance level (12.5%) at the smallest size to between 40% and 60%, depending on duration, at the maximum size. At the shortest duration, accuracy increased systematically from chance to 50% as a function of size. As duration increased, accuracy exceeded 50% at 36 pixels and leveled off at about 75% after 500 milliseconds. Additional experiments using the Leeds Sports Pose Dataset largely replicated these findings. Our results suggest that human observers can recognize individual body parts using low- and mid-level cues extracted at a fairly early stage of visual processing. In addition, our data provide a relatively comprehensive benchmark of human performance against which to compare computational models of single body part recognition in natural images.
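To make the aperture stimuli described above concrete, the following minimal Python sketch crops square patches of varying widths centered on annotated joint locations. It is an illustration only: the file name, joint coordinates, and the crop_aperture helper are assumptions for this example, not the authors' actual stimulus-generation pipeline.

```python
# Minimal sketch of aperture-stimulus generation, assuming each trial is
# defined by an image path and an (x, y) joint location in pixel coordinates.
from PIL import Image

APERTURE_WIDTHS = (12, 24, 36, 48, 60)  # pixels, spanning the reported range

def crop_aperture(image_path, joint_xy, width):
    """Return a square patch of the given width centered on the joint."""
    img = Image.open(image_path)
    x, y = joint_xy
    half = width / 2
    # Clamp the crop box to the image bounds so joints near an edge still
    # yield a full-size patch.
    left = max(0, min(img.width - width, int(round(x - half))))
    top = max(0, min(img.height - width, int(round(y - half))))
    return img.crop((left, top, left + width, top + width))

# Hypothetical usage: one patch per aperture size for a single annotated joint.
# patches = [crop_aperture("example.jpg", (620, 394), w) for w in APERTURE_WIDTHS]
```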