Abstract
Based on the assumption that a main goal of the visual attention system is to direct attention towards objects of interest, we have derived a probabilistic model of salience. The resulting model, Saliency Using Natural Statistics (SUN), is grounded in probabilities learned through experience with the natural world. These probabilities decompose into three parts: knowledge of which features are rare or novel, the visual appearance of particular objects of interest, and where those objects are likely to occur in a scene. SUN defines bottom-up saliency as rare combinations of features, and an implementation of this component of the model has been shown to achieve state-of-the-art performance in predicting human eye-movement fixations during free viewing of static images (Zhang et al., in press) and video. SUN's bottom-up saliency model also predicts visual search asymmetries that other models of bottom-up salience based only on the current image fail to capture. However, when we interact with the world, we typically do so with a task in mind, and models of visual attention likewise need to be driven by the task at hand. Here we implement the remaining components of SUN: a location prior that guides attention to likely target locations, and a probabilistic appearance model in the spirit of the Guided Search (Wolfe, 1994) and Iconic Search (Rao et al., 1996) models. We evaluate our model on the publicly available dataset of Torralba et al. (2006), which contains eye-tracking data collected from subjects asked to count people, cups, or paintings in indoor or outdoor scenes. We show that the full SUN model achieves superior performance in predicting human fixations, suggesting that learned knowledge of targets' appearance, targets' likely locations, and the rarity of features all play a role in determining where to fixate.
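A minimal sketch of this three-part decomposition, following the notation used in the earlier SUN work (the symbols are assumed here rather than defined in this abstract): z denotes a point in the visual field, F its local features, L its location, and C the presence of a target at that point. The saliency of z can then be written as

\[
s_z \;=\; p(C=1 \mid F=f_z,\, L=l_z) \;\propto\; \frac{1}{p(F=f_z)}\; p(F=f_z \mid C=1)\; p(C=1 \mid L=l_z),
\]

or, in log form,

\[
\log s_z \;=\; \underbrace{-\log p(F=f_z)}_{\text{feature rarity (bottom-up)}} \;+\; \underbrace{\log p(F=f_z \mid C=1)}_{\text{target appearance}} \;+\; \underbrace{\log p(C=1 \mid L=l_z)}_{\text{location prior}}.
\]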
Supported by NIH, NSF, and the James S. McDonnell Foundation.