Abstract
When shown a photograph of a person, humans have a vivid, immediate awareness of that person's 3D pose and a rapid understanding of their subtle body language, personal attributes, and intentionality. How does this happen, and what exactly do humans perceive? How accurate are they? Our aim is to unveil the process and the level of accuracy involved in the 3D perception of people from images by assessing human performance. Our approach to establishing an observation-perception link is to have humans re-enact the 3D pose of another person (for which ground truth is available) shown in a photograph, following a short exposure time of 5 seconds. Our apparatus simultaneously captures the subject's pose and eye movements during the re-enactment. In the process of perceiving and reproducing the pose, subjects attend first to upper-body joints, with a general tendency to focus more on extremities than on internal joints. Although the resulting scanpaths are pose-dependent, they are quite stable across subjects, both spatially and sequentially. Our study reveals that people are, on average, not significantly better at re-enacting 3D poses from visual stimuli than existing computer vision algorithms: errors on the order of 10°-20° in joint angle, or 100 mm in 3D body joint position, are not uncommon. The contributions of our work can be summarized as follows: (1) the construction of an apparatus relating human visual perception to 3D ground truth; (2) the creation of a publicly available dataset collected from 10 subjects, containing 120 images of humans in easy and difficult poses; and (3) a quantitative analysis of human eye movements, 3D pose re-enactment performance, error levels, stability, correlation, and cross-stimulus control, revealing how different 3D configurations relate to subjects' focus on certain image features in the context of the given task.
Meeting abstract presented at VSS 2014
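The abstract reports errors in degrees and millimetres but does not state how they are computed. As a minimal sketch, the Python code below illustrates two standard measures consistent with those units: mean per-joint position error in mm, and the angle in degrees between corresponding limb-direction vectors. The function names, the 17-joint skeleton, and the noise level are illustrative assumptions, not the authors' protocol.

import numpy as np

def mean_per_joint_position_error(pred, gt):
    # Mean Euclidean distance in mm between re-enacted and ground-truth
    # 3D joint positions; both arrays have shape (n_joints, 3).
    return np.linalg.norm(pred - gt, axis=1).mean()

def limb_angle_error_deg(pred_vec, gt_vec):
    # Angle in degrees between a re-enacted and a ground-truth limb
    # direction vector (e.g. elbow-to-wrist).
    u = pred_vec / np.linalg.norm(pred_vec)
    v = gt_vec / np.linalg.norm(gt_vec)
    return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

# Hypothetical example: a 17-joint pose with each joint coordinate
# perturbed by Gaussian noise (sigma = 60 mm), which yields a mean
# per-joint position error near the ~100 mm scale reported above.
rng = np.random.default_rng(0)
gt_pose = rng.uniform(-500, 500, size=(17, 3))         # mm
reenacted = gt_pose + rng.normal(0, 60, size=(17, 3))  # mm
print(mean_per_joint_position_error(reenacted, gt_pose))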