Humans and other primates have a tremendous ability to rapidly direct their gaze when looking into a static or dynamic scene and to select visual information of interest. This ability enables them to deploy limited processing resources to the most relevant visual information and understand real-world scenes rapidly and accurately. Understanding and simulating this mechanism has both scientific and economic impact (Koch & Ullman,
1985; Ungerleider,
2000; Treue,
2001). A computational model predicting where humans look has broad applicability in tasks relating to human-robot interaction, surveillance, advertising, marketing, entertainment, and so on. One common approach is to take inspiration from the functionality of the human visual system (Milanese,
1993; Tsotsos et al.,
1995; Itti, Koch, & Niebur,
1998; Rosenholtz,
1999), whereas other studies claim that visual attention is attracted to the most informative regions (Bruce & Tsotsos,
2009), the most surprising regions (Itti & Baldi,
2006), or those regions that maximize reward regarding a task (Sprague & Ballard,
2003). Existing work on saliency modeling mainly focuses on pixel-level image attributes, such as contrast (Reinagel & Zador,
1999), edge content (Baddeley & Tatler,
2006), orientation (Itti et al.,
1998), intensity bispectra (Krieger, Rentschler, Hauske, Schill, & Zetzsche,
2000), and color (Itti et al.,
1998; Jost, Ouerhani, von Wartburg, Muri, & Hugli,
2005; Engmann et al.,
2009), despite various recent developments in inference methods (Raj, Geisler, Frazor, & Bovik,
2005; Walther, Serre, Poggio, & Koch,
2005; Gao, Mahadevan, & Vasconcelos,
2007; Harel, Koch, & Perona,
2007; Bruce & Tsotsos,
2009; Seo & Milanfar,
2009; Carbone & Pirri,
2010; Chikkerur, Serre, Tan, & Poggio,
2010; Wang, Wang, Huang, & Gao,
2010; Hou, Harel, & Koch,
2012) to generate a saliency map.
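To make the idea of a pixel-level saliency map concrete, the following Python sketch computes saliency from intensity contrast alone using a simple center-surround difference. This is a rough illustration, not an implementation of any specific cited model: the `box_mean` helper, the window radii, and the min-max normalization are all illustrative choices.

```python
import numpy as np

def box_mean(img, radius):
    """Mean over a (2*radius+1)^2 window, computed via an integral image."""
    pad = np.pad(img, radius, mode="edge")
    # Integral image with a leading row/column of zeros for clean indexing.
    ii = np.zeros((pad.shape[0] + 1, pad.shape[1] + 1))
    ii[1:, 1:] = pad.cumsum(axis=0).cumsum(axis=1)
    k = 2 * radius + 1
    h, w = img.shape
    # Window sum at (i, j) covers pad[i:i+k, j:j+k].
    total = (ii[k:k + h, k:k + w] - ii[:h, k:k + w]
             - ii[k:k + h, :w] + ii[:h, :w])
    return total / (k * k)

def contrast_saliency(img, center_r=1, surround_r=7):
    """Center-surround intensity contrast, normalized to [0, 1].

    A pixel is salient when its small local neighborhood differs
    from a broader surround; radii here are illustrative.
    """
    img = np.asarray(img, dtype=float)
    center = box_mean(img, center_r)
    surround = box_mean(img, surround_r)
    s = np.abs(center - surround)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
```

A single bright pixel on a dark background, for example, yields a saliency peak around that location, while a uniform image yields an all-zero map. Full models in the literature compute analogous maps across several feature channels (intensity, color, orientation) and scales before combining them.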