Recent years have seen a rapid increase in the number of computational approaches to visual saliency estimation. Starting from the seminal work of Itti, Koch, and Niebur (1998), most of the proposed saliency models adopt a bottom-up strategy in which a saliency map is extracted in a purely data-driven manner by considering center-surround differences (e.g., Gao & Vasconcelos, 2007; Harel, Koch, & Perona, 2007; Seo & Milanfar, 2009). Some studies instead carry out such computations in the frequency domain (Achanta, Hemami, Estrada, & Susstrunk, 2009; Hou & Zhang, 2007) or make use of natural image statistics (Bruce & Tsotsos, 2006; Zhang, Tong, Marks, Shan, & Cottrell, 2008). Another important line of models integrates low-level cues with task-specific top-down knowledge, such as face and object detectors (Cerf, Harel, Einhaeuser, & Koch, 2007; Goferman, Zelnik-Manor, & Tal, 2010; Judd, Ehinger, Durand, & Torralba, 2009) or global scene context (Torralba, Oliva, Castelhano, & Henderson, 2006), to improve predictions. Lastly, some recent studies pose saliency estimation as a supervised learning problem (Judd et al., 2009; Liu, Sun, & Shum, 2007; Zhao & Koch, 2011, 2012).
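
To make the frequency-domain family of models concrete, the following is a minimal sketch of the spectral residual method of Hou and Zhang (2007), which estimates saliency from the deviation of an image's log-amplitude spectrum from its local average; the function name, the image size, and the smoothing parameters are illustrative choices, not part of the original formulation.

    import numpy as np
    from scipy.ndimage import uniform_filter, gaussian_filter

    def spectral_residual_saliency(gray):
        """Spectral residual saliency (after Hou & Zhang, 2007).

        gray: 2-D float array holding a small grayscale image
        (the original work uses roughly 64x64 pixels).
        Returns a saliency map normalized to [0, 1].
        """
        f = np.fft.fft2(gray)
        log_amplitude = np.log(np.abs(f) + 1e-8)  # small constant avoids log(0)
        phase = np.angle(f)
        # Spectral residual: log spectrum minus its local (3x3) average.
        residual = log_amplitude - uniform_filter(log_amplitude, size=3)
        # Back to the spatial domain; squared magnitude gives the raw map.
        saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
        # Post-smoothing, as in the original formulation (sigma is illustrative).
        saliency = gaussian_filter(saliency, sigma=3)
        return saliency / saliency.max()

Note that, consistent with the bottom-up strategy described above, this computation uses no task-specific knowledge: the map is derived purely from the image's own spectral statistics.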