Abstract
An open question in human object recognition is the role of local features. Is the processing of local features necessary for recognition of whole objects? Are local features processed independently?
Here, we tested human subjects' detection performance on whole-object and partial-object images containing 1, 2 or 3 features near perceptual threshold. For all image types, half of the stimuli contained gray-level object images to which white Gaussian noise was added, and the other half revealed the same image regions but contained only noise. The noise variance was adjusted in inverse proportion to the image area revealed such that the effective stimulus (Etot ∼ area/σnoise) was constant across image types. Subjects were asked to judge whether each image contained something (an object or object part) or only noise. Partial images contained either semantic or computer-generated features. Two object categories (faces and cars) were tested (12 subjects each) with both types of feature sets.
For cars, we found that detection accuracy on partial images was higher than on whole images, except at the maximal Etot, where subjects attained ceiling performance. For faces, performance on partial images was greater than or equal to performance on whole images at low Etot. At higher Etot, however, the pattern reversed: performance on whole images was similar to or better than performance on partial images. When the features in partial images with 2 or 3 features were spatially rearranged, detection accuracy decreased for faces, but not for cars.
Overall, our results suggest that object detection near perceptual threshold is limited by the detection of local features. Face detection, however, appears to rely to a greater degree on global information enabled by non-independent processing of several local features.