Many successful computational models for visual saliency have been proposed recently (such as Bruce & Tsotsos, 2009; Cerf, Frady, & Koch, 2009; Gao et al., 2008; Harel, Koch, & Perona, 2007; Itti et al., 1998; Judd et al., 2009; Kadir & Brady, 2001; Kienzle et al., 2009; Renninger, Coughlan, Verghese, & Malik, 2005; Rosenholtz, 1999; Tatler, Baddeley, & Gilchrist, 2005; Zhao & Koch, 2011). These models range from biologically plausible to purely computational, as well as combinations of the two. Itti et al. (1998) proposed a model inspired by the primate visual system. Drawing on what is known to be extracted in early cortical areas, they constructed a saliency map by combining color, contrast, and orientation features at various scales. They implemented a center-surround operation by taking the difference of feature-specific maps at two consecutive scales. The result for each feature is normalized, yielding three conspicuity maps, and the overall saliency map is a linear combination of these. Their influential approach has set a standard in saliency prediction.
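As an illustration, a minimal sketch of such a center-surround scheme on a single intensity channel might look as follows; the pyramid depth, filter width, and normalization here are simplifying assumptions, and the full model processes color and orientation channels in the same way:

```python
# Sketch of center-surround saliency across pyramid scales, assuming a
# single grayscale channel (a simplification of Itti et al., 1998).
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(img, levels=5):
    """Build a Gaussian pyramid by repeated blur-and-downsample."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyr[-1], sigma=1.0)
        pyr.append(blurred[::2, ::2])  # downsample by a factor of two
    return pyr

def center_surround_saliency(img, levels=5):
    """Sum normalized differences between consecutive pyramid scales."""
    pyr = gaussian_pyramid(img, levels)
    h, w = img.shape
    saliency = np.zeros((h, w))
    for c in range(levels - 1):
        # Upsample the coarser (surround) level to the finer (center)
        # level and take their difference: a crude center-surround step.
        center, surround = pyr[c], pyr[c + 1]
        factors = np.array(center.shape) / np.array(surround.shape)
        fmap = np.abs(center - zoom(surround, factors, order=1))
        # Normalize each feature map, then accumulate at full resolution.
        fmap /= fmap.max() + 1e-8
        saliency += zoom(fmap, np.array((h, w)) / np.array(fmap.shape), order=1)
    return saliency / (saliency.max() + 1e-8)
```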
Rosenholtz (1999) suggested that in visual search the saliency of a target depends on the deviation of its feature values from the average statistics of the image; in other words, statistical outliers are salient. Bruce and Tsotsos (2009) formulated this principle in terms of information theory. They proposed a computational model in which saliency is calculated as Shannon self-information. Intuitively, image locations whose content is unexpected in comparison with their surroundings are more informative, and thus salient.
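A hedged sketch of this idea is given below. Note that the actual model of Bruce and Tsotsos estimates the likelihood of sparse feature responses learned from natural images; as a simplification, this sketch estimates the likelihood of raw intensity values from a global histogram:

```python
# Sketch of saliency as Shannon self-information, -log p(x), where rare
# (low-probability) values receive high saliency. The histogram-based
# likelihood here is an assumption made for brevity.
import numpy as np

def self_information_saliency(img, bins=64):
    img = img.astype(float)
    hist, edges = np.histogram(img, bins=bins)
    p = hist / hist.sum()  # empirical probability of each intensity bin
    # Map every pixel to the probability of its bin, then to -log p.
    idx = np.clip(np.digitize(img, edges[1:-1]), 0, bins - 1)
    return -np.log(p[idx] + 1e-12)
```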
Hou and Zhang (2007) examined images in the spectral domain. They demonstrated that statistical singularities in the spectrum correspond to regions of the image that differ from their surroundings; the spectral residual is thus indicative of saliency.
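The algorithm itself is compact. A minimal sketch, assuming a grayscale input and the 3x3 spectral averaging and Gaussian post-smoothing described in the paper, could read:

```python
# Sketch of the spectral residual approach of Hou and Zhang (2007):
# subtract a locally averaged log-amplitude spectrum from the actual
# one and transform the residual back to the image domain.
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(img):
    f = np.fft.fft2(img.astype(float))
    log_amplitude = np.log(np.abs(f) + 1e-12)
    phase = np.angle(f)
    # The residual is the log spectrum minus its local (3x3) average.
    residual = log_amplitude - uniform_filter(log_amplitude, size=3)
    # Reconstruct with the original phase; squared magnitude of the
    # inverse transform gives the saliency map.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(saliency, sigma=3)  # smooth the result
```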
Judd et al. (2009) combined machine learning techniques with successful saliency models and high-level image information. They jointly considered features from Itti et al. (1998), Oliva and Torralba (2001), and Rosenholtz (1999), the steerable pyramid filters (Simoncelli & Freeman, 1995), the location of the horizon, the locations of objects such as people and cars, and position relative to the center of the image. A classifier is then trained on recorded eye movements to combine all features in an optimal way.
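A sketch of such a learned combination is given below; the inputs (feature_maps, fixation_mask) are hypothetical names introduced for illustration, and logistic regression is substituted here for the support vector machine that Judd et al. actually used:

```python
# Sketch of learning a per-pixel combination of feature maps from
# recorded fixations, in the spirit of Judd et al. (2009).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_saliency_combiner(feature_maps, fixation_mask):
    """feature_maps: list of HxW arrays; fixation_mask: HxW binary array."""
    X = np.stack([f.ravel() for f in feature_maps], axis=1)
    y = fixation_mask.ravel().astype(int)
    return LogisticRegression(max_iter=1000).fit(X, y)

def predict_saliency(clf, feature_maps):
    X = np.stack([f.ravel() for f in feature_maps], axis=1)
    # The predicted fixation probability serves as the combined map.
    return clf.predict_proba(X)[:, 1].reshape(feature_maps[0].shape)
```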
We will demonstrate that the proposed method consistently outperforms the approaches of Bruce and Tsotsos (2009) and Hou and Zhang (2007) and reaches the level of Judd et al. (2009), while requiring no learning.