Abstract
Perceiving natural scenes with multiple objects presents a serious challenge to any visual system. Selective visual attention provides a mechanism for serializing visual information so that one object can be perceived at a time. If no prior knowledge or expectation about the scene is available, attention is guided from the bottom up by salient image features, computed from low-level image properties. In previous models of saliency-based visual attention from our lab (Koch & Ullman 1985; Itti, Koch & Niebur 1998), attention was guided to salient image locations by a winner-take-all (WTA) neural network operating on a saliency map. However, to attend to objects one at a time, their approximate size and extent must be known before they are recognized. We developed a new version of our model that estimates the approximate extent of attended proto-objects. After the saliency map is computed and the most salient location is selected, feedback connections in the hierarchy of maps leading to the saliency map identify the feature that contributes most strongly to the saliency at the attended location. The corresponding feature map is then segmented around that location using a network of linear threshold units. The model performs well on natural images, and we present results showing that it successfully learns and recognizes individual objects in highly cluttered scenes. The entire saliency code is available to the community as a MATLAB toolbox (http://www.saliencytoolbox.net).
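The pipeline outlined above (feature maps feeding a saliency map, winner-take-all selection of the most salient location, and feedback-driven segmentation of the winning feature map) can be illustrated with a short sketch. The following Python code is a minimal illustration, not the SaliencyToolbox implementation: the single intensity-contrast channel, the thresholded connected-component segmentation standing in for the network of linear threshold units, and all function names and parameter values are assumptions made for brevity.

import numpy as np
from scipy.ndimage import gaussian_filter, label

def center_surround(image, sigma_c=2.0, sigma_s=8.0):
    """One center-surround feature map (intensity contrast only; a full
    model would combine color, intensity, and orientation across scales)."""
    return np.abs(gaussian_filter(image, sigma_c) - gaussian_filter(image, sigma_s))

def attend_and_segment(image, threshold_frac=0.5):
    # Feature map; with a single channel the saliency map equals it.
    feature = center_surround(image)
    saliency = feature

    # Winner-take-all: the location of maximal saliency wins.
    winner = np.unravel_index(np.argmax(saliency), saliency.shape)

    # "Feedback" step: segment the winning feature map around the winner.
    # Threshold relative to the winner's activity, then keep the connected
    # component containing the attended location -- a crude stand-in for
    # the linear-threshold-unit network described in the abstract.
    mask = feature >= threshold_frac * feature[winner]
    labels, _ = label(mask)
    proto_object = labels == labels[winner]
    return winner, proto_object

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((128, 128)) * 0.1
    img[40:60, 70:90] += 1.0  # a bright, salient patch
    winner, mask = attend_and_segment(img)
    print("attended location:", winner, "- proto-object area:", int(mask.sum()))

Here the returned mask plays the role of the proto-object's estimated extent: it can gate the input image before an object is learned or recognized, which is how the attended region is used downstream in the model.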