Since the earliest eye-tracking studies, it has been known that faces attract gaze and capture visual attention more than any other visual feature (Buswell,
1935; Yarbus,
1967). When present in a scene, faces invariably draw gaze, even when observers are explicitly asked to look at a competing object (Bindemann, Burton, Hooge, Jenkins, & de Haan,
2005; Theeuwes & Van der Stigchel,
2006). Many studies have established that face perception is holistic (Boremanse, Norcia, & Rossion,
2013; Farah, Wilson, Drain, & Tanaka,
1998; Hershler & Hochstein,
2005) and pre-attentive (Bindemann, Burton, Langton, Schweinberger, & Doherty,
2007; Crouzet, Kirchner, & Thorpe,
2010), and the brain structures specifically involved in face perception have been identified (Haxby, Hoffman, & Gobbini,
2000; Kanwisher, McDermott, & Chun,
1997). Despite their leading role in attention allocation, faces have rarely been considered in visual attention modeling. Over the past 30 years, numerous computational saliency models have been proposed to predict where gaze lands (see Borji & Itti,
2012, for a taxonomy of 65 models). Most of them are based on Treisman and Gelade's (
1980) Feature Integration Theory, which states that low-level features (edges, intensity, color, etc.) are extracted from the visual scene and combined to direct visual attention (Itti, Koch, & Niebur,
1998; Koch & Ullman,
1985; Le Meur, Le Callet, & Barba,
2007; Marat et al.,
2009). However, these models fail to generalize to many experimental contexts, since they do not take into account the dynamic and social nature of visual perception (Tatler, Hayhoe, Land, & Ballard,
2011). Typical examples in which they fail dramatically are visual scenes involving faces (Birmingham & Kingstone,
2009). More recently, visual saliency models combining face detection with classical low-level feature extraction have been developed and have significantly outperformed the classical ones (Cerf, Harel, Einhäuser, & Koch,
2008; Marat, Rahman, Pellerin, Guyader, & Houzet,
2013).
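
To make the idea of a face-augmented saliency model concrete, the sketch below (Python with OpenCV) builds a toy saliency map in the spirit of such models: crude low-level contrast and color-opponency maps are linearly combined with a face channel obtained from an off-the-shelf Haar-cascade detector. This is an illustrative sketch under stated assumptions, not a reimplementation of the cited models; the particular feature maps, the detector, the per-channel normalization, the test image name, and the face_weight parameter are all choices made here for illustration only.

# Illustrative sketch only: a toy saliency map that adds a face-detection
# channel to simple low-level feature maps. Not a reimplementation of any
# cited model; feature choices, weights, and the Haar-cascade detector are
# assumptions made for this example.
import cv2
import numpy as np

def normalize(m):
    """Rescale a feature map to [0, 1]; return zeros for a flat map."""
    m = m.astype(np.float32)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

def low_level_saliency(bgr):
    """Crude low-level conspicuity: local intensity contrast + color opponency."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    intensity = np.abs(cv2.Laplacian(gray.astype(np.float32), cv2.CV_32F, ksize=5))
    b, g, r = cv2.split(bgr.astype(np.float32))
    red_green = np.abs(r - g)               # rough red-green opponency proxy
    blue_yellow = np.abs(b - (r + g) / 2)    # rough blue-yellow opponency proxy
    return normalize(normalize(intensity) + normalize(red_green) + normalize(blue_yellow))

def face_map(bgr):
    """Face channel: Gaussian blobs centered on Haar-cascade face detections."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    fmap = np.zeros(gray.shape, dtype=np.float32)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        blob = np.zeros_like(fmap)
        blob[y + h // 2, x + w // 2] = 1.0          # mark the face center
        fmap += cv2.GaussianBlur(blob, (0, 0), sigmaX=w / 3.0)
    return normalize(fmap)

def saliency_with_faces(bgr, face_weight=0.5):
    """Linear combination of low-level saliency and the face channel."""
    return normalize((1 - face_weight) * low_level_saliency(bgr)
                     + face_weight * face_map(bgr))

if __name__ == "__main__":
    img = cv2.imread("scene.jpg")  # placeholder name for any test image with faces
    assert img is not None, "provide a test image named scene.jpg"
    cv2.imwrite("saliency.png", (saliency_with_faces(img) * 255).astype(np.uint8))

In this toy version the relative contribution of the face channel is a single free weight; in the published models the combination is learned or tuned against fixation data, which is precisely what allows them to outperform purely low-level predictors on scenes containing faces.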