Abstract
Detection and pose estimation of people in images are challenging tasks due to variations in articulation, viewpoint and appearance. Part detectors are a natural way to attack this problem, but identifying good parts remains an open question. Anatomical parts, such as arms and legs, are difficult to detect reliably because parallel lines are common in natural images. In contrast, a visual conjunction such as “half of a frontal face and a left shoulder” may be a perfectly good discriminative visual pattern. Bourdev and Malik [ICCV 2009] introduced new parts, called poselets, which correspond to such discriminative visual patterns. There is a wide variety of poselets – a frontal face, a profile face, a head-and-shoulder configuration, etc. We discover them by choosing a random seed patch from the image of a random person in the training set and finding the “corresponding” patches in images of other people. A corresponding patch is defined as one that has the same spatial configuration of semantic keypoints (such as joints, eyes, nose) as the seed patch. We discriminatively train detectors for these patches. To find people in a test image, we evaluate the poselet detectors at multiple locations and scales and cluster the activations into person hypotheses. The activations within each cluster form a distributed representation of the pose of a person and provide the basis for numerous high-level vision tasks. Our system is the current best performer on the task of people detection and segmentation. We are able to infer attributes (the gender, style of hair, clothes, presence of glasses, hat, etc.) and actions (phoning, running, walking, reading a book, etc.) of people under arbitrary viewpoints and articulations. These ideas extend naturally to other visual categories. Interestingly, receptive fields of neurons in inferotemporal cortex have a variety consistent with that predicted by our model.
Adobe Systems, MURI N00014-06-1-0734, Google.
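The abstract defines a corresponding patch as one with the same spatial configuration of semantic keypoints as the seed patch. The following is a minimal sketch of how such a comparison could be scored; it is not the authors' implementation. It assumes that every keypoint is annotated in both the seed and the candidate, that "same configuration" is measured after removing translation, rotation and scale with a least-squares similarity fit, and that the mean squared residual serves as the distance. The function names `fit_similarity`, `configuration_distance` and `rank_candidates` are hypothetical.

```python
import numpy as np


def fit_similarity(src, dst):
    """Least-squares similarity transform (scale, rotation, translation)
    mapping 2-D points `src` onto `dst`. Both are (n, 2) arrays with rows
    in the same keypoint order."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    n = len(src)
    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / n
    U, S, Vt = np.linalg.svd(cov)
    # Force a proper rotation (no reflection).
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    src_var = (src_c ** 2).sum() / n
    scale = (S * np.diag(D)).sum() / src_var
    t = dst_mean - scale * (R @ src_mean)
    return scale, R, t


def configuration_distance(seed_kp, cand_kp):
    """Mean squared residual after aligning candidate keypoints to the seed.
    Assumption: this residual stands in for 'same spatial configuration'."""
    scale, R, t = fit_similarity(cand_kp, seed_kp)
    aligned = scale * cand_kp @ R.T + t
    return float(((aligned - seed_kp) ** 2).sum(axis=1).mean())


def rank_candidates(seed_kp, candidates):
    """Indices of candidates sorted from most to least similar to the seed."""
    dists = [configuration_distance(seed_kp, c) for c in candidates]
    return np.argsort(dists).tolist()


# Toy keypoints (hypothetical): two shoulders, a neck and a head.
seed = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, 3.0]])

# Same pose seen at a different scale, rotation and position.
theta = 0.3
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
same_pose = 1.5 * seed @ rot.T + np.array([10.0, -4.0])

# Different pose: the last keypoint has moved relative to the others.
other_pose = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [3.0, 1.0]])

ranking = rank_candidates(seed, [other_pose, same_pose])
```

In this toy setup the candidate that differs from the seed only by scale, rotation and translation gets a residual near zero and is ranked first. A practical version would also need to handle missing or occluded keypoints, which this sketch ignores.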