Abstract
Perceiving 3D structure in natural images is an immense computational challenge for the visual system. While many previous studies have focused on the perception of rigid 3D objects, we applied a novel method to a common class of non-rigid objects: static images of the human body in the natural world. Perception of body structure is particularly challenging because joints articulate in many ways with differing frequencies of occurrence, and because appearance varies with occlusion, clothing, lighting, and viewpoint. As a result, natural images vary in pose typicality and in how much information a view provides to support body-part parsing. We investigated (1) to what degree humans can interpret 3D body structure across viewpoint rotations about the vertical axis, and (2) to what extent this ability depends on a priori knowledge of 3D pose typicality and the informativeness of viewpoints. Using a two-alternative forced-choice (2AFC) pose-matching task, we tested how well subjects could match a target natural pose image to one of two synthetic comparison body images: one was rendered with the same 3D pose parameters as the target, while the other was a distractor rendered with noise added to its joint angles. Target natural images were drawn from the UP-3D dataset, whereas synthetic images were rendered with constant, predetermined clothing and lighting. Observers picked the synthetic pose that best matched the target despite changes in viewpoint about the vertical axis. We found that matching accuracy decreased as the difference between target and comparison viewpoints increased. When we grouped trials by the typicality of the underlying 3D poses and the informativeness of the viewpoints in the natural images, we found that performance for typical poses was measurably better than for atypical poses; however, we found no significant difference between informative and noninformative viewpoints. Our psychophysical results provide useful benchmarks for future model comparisons.