Scrutinizing long fixations observers made when viewing a picture of a painting, Buswell hypothesized in 1935 that “the main centers of interest, as judged by number of fixations, also receive the fixations which are longest in duration” (p. 90). He thereby linked two fundamental aspects of eye guidance in scene viewing: fixation number or fixation probability (Where do the eyes preferentially fixate?) and fixation duration (When do the eyes proceed to the next location?). In the eight decades since Buswell's seminal study, both aspects of gaze guidance have received considerable research interest, but they have been accounted for separately. This when/where separation goes back to an early suggestion of distinct oculomotor control circuits (van Gisbergen, Gielen, Cox, Bruijns, & Kleine Schaars,
1981). Accordingly, Findlay and Walker (1999) proposed an influential qualitative model of oculomotor control that completely separated the when and where systems. Models of eye-movement control in reading have also adopted this separation (Engbert, Nuthmann, Richter, & Kliegl, 2005; Reichle, Rayner, & Pollatsek, 2003). Empirical and computational research on real-world scene viewing has focused largely on the where decision. The widely used saliency map (Itti, Koch, & Niebur, 1998) and its recent variants have had some success in predicting fixation probability (for a review, see Borji, Sihite, & Itti, 2013a; but see Tatler, Hayhoe, Land, & Ballard,
2011). However, the original implementation of the saliency map failed to achieve realistic fixation durations: Using biophysically realistic time constants, the dwell times of the “focus-of-attention” were too short to be interpreted as fixation durations (Itti et al., 1998); conversely, when model parameters were constrained by search time, the model arrived at unrealistically long fixation durations (Itti & Koch,
2000). Following this lead, more recent developments of salience-type models, which have improved the predictive power for fixation probability (Bruce & Tsotsos, 2009; Erdem & Erdem, 2013; Garcia-Diaz, Leborán, Fdez-Vidal, & Pardo, 2012; Harel, Koch, & Perona, 2007; Lin & Lin, 2014; Xu, Jiang, Wang, Kankanhalli, & Zhao, 2014; Zhang, Tong, Marks, Shan, & Cottrell, 2008), have consistently ignored the issue of fixation duration. Conversely, the CRISP model of fixation durations in scene viewing (Nuthmann & Henderson,
2010) models the control of fixation durations without taking fixation locations into account. To summarize, models of eye guidance in natural scene viewing have addressed either the where or the when decision, but not both.