Abstract
Object representations (geometric models, component-based descriptions, view-based representations, etc.) pay little or no attention to contextual features. Traditional approaches to object detection and recognition in computer vision treat an image as a collection of patches or regions to be classified. These techniques can be fragile and slow, requiring exhaustive scanning of the image (across locations and scales), and each object is recognized independently. Contextual information is known to have a strong influence on object recognition by humans. Identifying the scene view provides strong priors on object identities, locations, and viewpoints. Even in unfamiliar environments, categorizing a scene (a street, an indoor scene, etc.) constrains the presence and location of objects in the image. We show that scene features (obtained by pooling low-level features across the whole image) can be used to prime the presence or absence of objects in the scene and to predict their location, scale, and appearance before exploring the image. We show that global image features can predict with 80% accuracy the presence or absence of animals and people in scenes without applying any object detection mechanism. In this scheme, visual context information is used early in the visual processing chain, providing an efficient shortcut for object detection and recognition.
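The abstract describes pooling low-level features across the whole image to form a holistic scene descriptor. The sketch below is a minimal illustration of that idea, assuming gradient-orientation energies pooled over a coarse spatial grid (a gist-like descriptor); the grid size, number of orientation bins, and normalization are illustrative choices, not the paper's exact feature.

```python
import numpy as np

def global_scene_features(image, n_orient=4, grid=4):
    """Pool low-level features (gradient-orientation energy) over a
    coarse grid to build a holistic scene descriptor.

    Illustrative sketch: grid=4 and n_orient=4 are assumed defaults,
    not parameters taken from the paper.
    """
    # Low-level features: gradient magnitude and orientation per pixel.
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)

    # Pool orientation energy within each cell of a coarse spatial grid,
    # discarding fine spatial detail while keeping global layout.
    h, w = image.shape
    feat = np.zeros((grid, grid, n_orient))
    for i in range(grid):
        for j in range(grid):
            rs = slice(i * h // grid, (i + 1) * h // grid)
            cs = slice(j * w // grid, (j + 1) * w // grid)
            for o in range(n_orient):
                feat[i, j, o] = mag[rs, cs][bins[rs, cs] == o].sum()

    # L2-normalize so the descriptor is invariant to global contrast.
    v = feat.ravel()
    return v / (np.linalg.norm(v) + 1e-8)
```

Such a descriptor can then feed any standard classifier (e.g. logistic regression) to predict the presence or absence of an object category before any local detector is run, which is the priming step the abstract refers to.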