The human retina provides highly accurate and detailed central vision, but acuity diminishes rapidly with eccentricity. Eye movements bring new locations into central vision, sequentially sampling finer-grained detail from locations that are likely to yield important information, presumably using some combination of peripheral visual signals, inferences based on context, and top-down strategies. Each eye movement during extended search can therefore be informative about how the visual system combines and prioritizes information both within each fixation and across a sequence of fixations.
Much of the research on visual search to date has formalized this general issue by focusing on questions of feature extraction and of strategy. Feature extraction includes both top-down guided search (Wolfe, 2007; Zelinsky, 2008) and stimulus-driven (saliency) effects (Gao, Mahadevan, & Vasconcelos, 2008; Itti & Baldi, 2009; Itti & Koch, 2000). For the abstract and discrete search items commonly used as visual-search stimuli, categorical features such as color, orientation, shape, and size are often used. Simple qualitative comparisons between the search items and the target can be used to model top-down guidance (Pomplun, Shen, & Reingold, 2003; Rutishauser & Koch, 2007). For more complex stimuli, such as a target hidden in image noise or in a photograph of a natural scene, there is no discrete set of items to consider, and more sophisticated image-processing techniques are required (Hwang, Higgins, & Pomplun, 2009; Pomplun, 2007; Rao et al., 2002; Tavassoli, van der Linde, & Bovik, 2009; Zelinsky, 2008). In either case, the output of a feature-extraction mechanism is an activation map: a representation of the visual array in which peaks of activity represent the priority of locations for eye movements.
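As a purely illustrative sketch of this idea, the snippet below builds an activation map for a set of discrete items by scoring each item's feature similarity to the target. The two-feature representation and the Gaussian similarity rule are assumptions made for demonstration, not the mechanism of any of the models cited above.

```python
import numpy as np

# Illustrative sketch only: a top-down activation map for discrete search items.
# The feature representation and the Gaussian similarity rule are assumptions
# for demonstration, not the mechanism of any cited model.

def activation_map(item_features, target_features, sigma=1.0):
    """Return one activation value per item; higher values mark more target-like items."""
    diffs = item_features - target_features           # feature differences per item
    dist_sq = np.sum(diffs ** 2, axis=1)              # squared distance in feature space
    return np.exp(-dist_sq / (2 * sigma ** 2))        # similarity-based activation

# Example: three items described by (color, orientation) feature values.
items = np.array([[0.9, 0.1],    # close to the target's features
                  [0.2, 0.8],    # dissimilar
                  [0.5, 0.5]])   # intermediate
target = np.array([1.0, 0.0])
print(activation_map(items, target))   # largest activation for the first item
```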
Strategy refers to the mechanism for selecting which location to inspect next. While a number of different mechanisms have been put forward, the most commonly implemented has been the maximum a posteriori (MAP) observer. The MAP observer directs saccades to the current maximum of the activation map, and a simple inhibition-of-return (IOR) mechanism prevents the model from returning to previously fixated maxima. Depending on the model, a maximum will represent either a search item or the center of gravity of a number of search items. Because most previous computational models have been concerned primarily with the feature-extraction stage of search, the MAP observer has often been adopted for simplicity (Clarke, Green, & Chantler, 2009; Itti & Koch, 2000; Pomplun et al., 2003; Rao et al., 2002; Rutishauser & Koch, 2007; Zelinsky, 2008).
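The selection rule itself is simple enough to state as a short sketch. The code below implements a generic MAP observer with a crude inhibition-of-return mechanism over a precomputed activation map; it is a minimal illustration rather than a reimplementation of any cited model, and the never-revisit form of IOR is an assumption.

```python
import numpy as np

# Minimal MAP-observer sketch over a precomputed activation map of discrete
# locations. The "suppress forever" IOR rule is an illustrative assumption.

def map_observer(activation, n_fixations):
    """Fixate the current maximum of the map, then suppress it so it is not revisited."""
    activation = activation.astype(float).copy()
    scanpath = []
    for _ in range(n_fixations):
        next_loc = int(np.argmax(activation))   # current peak of the activation map
        scanpath.append(next_loc)
        activation[next_loc] = -np.inf          # simple IOR: never return to this peak
    return scanpath

print(map_observer(np.array([0.2, 0.9, 0.4, 0.7]), n_fixations=3))  # [1, 3, 2]
```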
An alternative to the MAP observer is the ideal observer. Here, eye movements are directed to locations that are likely to yield the most information. An example of an ideal-observer model comes from Najemnik and Geisler (2005), who measured visual sensitivity to a Gabor patch in varying amounts of noise across a range of eccentricities and angles from fixation. From the visual-sensitivity data they could generate a model of optimal eye-movement behavior that selected as the next fixation the location that would maximize the probability of detecting the target, given the amount of background noise and the known visibility of the target at various eccentricities. The number of fixations made during search for the target by the human observers (the two authors) closely matched the optimal model. In a second study (Najemnik & Geisler, 2008) they also measured the fixations generated by an optimal model and found that, when averaged over all trials, the ideal observer matched the human spatial distribution of fixations: Both the model and human observers exhibited a preference for fixating above and below the center of the image. The idea that eye movements during search are near optimal is broadly consistent with studies demonstrating the speed and efficiency with which eye movements can be directed to locations in a naturalistic setting that provide the most task-relevant information. A now-classic example is the demonstration of expert cricket batsmen's ability to shift their eyes rapidly to the anticipated bounce point of the ball based on its trajectory as it leaves the bowler's hand (Land & McLeod, 2000). This is one example among a number of studies demonstrating that eye movements are tightly constrained by task goals and driven to maximize task-relevant information gain (for a recent review, see Hayhoe & Ballard, 2014).
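In highly simplified form, this fixation rule can be sketched as follows: given a posterior over possible target locations and an assumed falloff of target visibility with eccentricity, the next fixation is the one that maximizes the expected chance of detecting the target. The exponential visibility function and its parameters below are illustrative assumptions and omit much of the structure of Najemnik and Geisler's model.

```python
import numpy as np

# Highly simplified sketch of an ideal-observer fixation rule in the spirit of
# Najemnik and Geisler (2005). The exponential visibility falloff and its scale
# are assumptions made for illustration only.

def visibility(fixation, locations, scale=3.0):
    """Assumed detectability of the target at each location, given a fixation point."""
    ecc = np.abs(locations - fixation)       # eccentricity of each candidate location
    return np.exp(-ecc / scale)              # detectability falls off with eccentricity

def next_fixation(posterior, locations):
    """Pick the fixation with the highest expected probability of detecting the target."""
    gains = [np.sum(posterior * visibility(f, locations)) for f in locations]
    return locations[int(np.argmax(gains))]

locations = np.arange(10)                    # candidate positions (1-D for brevity)
posterior = np.full(10, 0.1)                 # flat prior over target location
print(next_fixation(posterior, locations))   # a central fixation maximizes coverage
```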
In contrast with the notion that humans are close to optimal in search behavior is recent evidence of suboptimality in a very similar context. Morvan and Maloney (2012) instructed observers to first make a single eye movement to their choice of one of three squares aligned in a row and then to make a judgment about a dot that could appear in either the left- or the rightmost square. When the squares are closely spaced, the center location is the optimal choice because the dot will be visible whether it appears in the left or the right location. As the distance between the squares increases, a point is reached where the center location is no longer optimal; instead, observers can maximize accuracy by selecting either the left or the right location. A single saccade in this experiment represents a decision very similar to each saccade in the search task of Najemnik and Geisler (2005), in that observers must use knowledge about their own visual acuity to guide their eyes to the location that is likely to yield the most information. Nonetheless, observers in Morvan and Maloney's experiment were far from optimal: Not only did they fail to change strategy at the optimal spacing, they did not adapt their strategy to changes in the spacing of the squares at all, even though they were given a monetary reward for each correct response. This finding has recently been replicated and generalized by Clarke and Hunt (2015), and a similar conclusion was also reached by Verghese (2012), who demonstrated that observers failed to adapt their visual-search strategies to take target-probability information into account, and by Zhang et al. (2012), who found suboptimal eye–hand coordination in a reaching task.
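The logic of this center-versus-side decision can be made concrete with a toy calculation: under an assumed sigmoidal falloff of detection probability with eccentricity (not the observers' measured acuity), fixating the center wins at small spacings, while fixating one of the outer squares wins once the spacing grows.

```python
import numpy as np

# Toy calculation of the center-versus-side choice in a Morvan and Maloney
# (2012) style task. The sigmoidal detection-probability function and all
# parameter values are assumptions for illustration only.

def p_detect(eccentricity, half_point=5.0, slope=1.0, floor=0.5):
    """Assumed probability of judging the dot correctly at a given eccentricity."""
    return floor + (1 - floor) / (1 + np.exp((eccentricity - half_point) / slope))

def expected_accuracy(spacing):
    """Dot appears in the left or right square with equal probability."""
    center = p_detect(spacing)                               # both targets `spacing` away
    side = 0.5 * p_detect(0) + 0.5 * p_detect(2 * spacing)   # one near and one far square
    return center, side

for spacing in (2, 6, 10):
    c, s = expected_accuracy(spacing)
    print(f"spacing={spacing}: center={c:.2f}, side={s:.2f} "
          f"-> best: {'center' if c > s else 'side'}")
```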
How can these demonstrations of suboptimal eye-movement behavior be reconciled with the findings of Najemnik and Geisler (2005, 2008)? It is unlikely that observers would be suboptimal at the level of a single saccade but optimal across multiple saccades. Morvan and Maloney (2012) suggest that observers may adopt heuristics during search that generate sequences of saccades that appear optimal but are not actually based on a fixation-by-fixation computation of the posterior probability of the target's location. General tendencies observed in search scan paths that have been taken to be indicative of optimal behavior may instead be biases in saccade selection related to the scene statistics, the location of the eyes within the scene boundaries, and local mechanisms like IOR and saccadic momentum. For example, Over et al. (2007) have shown that search scan paths exhibit a coarse-to-fine structure: Observers make shorter saccades as search progresses. Over, Hooge, and Erkelens (2003) found that saccade directions are influenced by the edges of the search image and reported a preference for making saccades parallel to the boundaries of the stimuli. Gilchrist and Harvey (2006) argue that the presence of a horizontal bias in saccade directions indicates systematic scanning in visual search. They suggest that these systematic tendencies can be hard to detect in scan paths because of interactions with salience-based object selection.
Chance has also been demonstrated to play a significant role in visual-search performance. Using saccade-amplitude distributions, Motter and Holsapple (2001) calculated the probability of fixating the target by chance under different conditions. While this chance component decreases as the number of distracters increases, it continues to account for a sizable fraction of performance. In natural-scene viewing, spatial biases to move the eyes in particular ways play an important role in fixation selection; indeed, how the eyes move (in terms of saccade amplitude and direction) can provide a better account of fixation selection than what visual information is selected (Tatler & Vincent, 2009). Random walks have been successfully used to model an observer's speed and accuracy in present/absent forced-choice experiments (Reeves, Santhi, & DeCaro, 2005; Stone, 1960). Rather than model the spatial distribution of fixations, these models simulate the observer's decision-making process. The random walk occurs between two boundaries (one for a target-present response and one for a target-absent response) and is governed by a drift and a bias.
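A minimal sketch of such a bounded random walk appears below; the drift, bias, noise level, and boundary values are arbitrary assumptions chosen only to show the structure of the decision process, not fitted parameters from the cited studies.

```python
import numpy as np

# Sketch of a bounded random-walk model of the present/absent decision, in the
# spirit of Stone (1960) and Reeves, Santhi, and DeCaro (2005). All parameter
# values here are arbitrary assumptions used for illustration.

def random_walk_decision(drift=0.1, bias=0.0, noise=1.0,
                         upper=10.0, lower=-10.0,
                         rng=np.random.default_rng(1)):
    """Accumulate noisy evidence from a biased start until a boundary is crossed.

    Returns the response ('present' or 'absent') and the number of steps taken,
    a proxy for response time.
    """
    evidence, steps = bias, 0
    while lower < evidence < upper:
        evidence += drift + noise * rng.standard_normal()   # drifted noisy step
        steps += 1
    response = "present" if evidence >= upper else "absent"
    return response, steps

print(random_walk_decision())   # decision and step count for this random seed
```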
A plausible alternative to the optimal account is therefore that natural search behavior is stochastic but constrained by both scene statistics and heuristics. Here we directly compare the performance of a random-walk model to human eye movements during search of a textured surface for an indentation (for an example, see Clarke et al., 2008, figure 1). We used textured surfaces because they appear naturalistic but, unlike photographs of natural scenes, are fully controlled and parameterized. Our stochastic model randomly selects the next saccade in the sequence from the total set of saccades made from that region of the search array. This model captures the global biases of saccade programming during search that we have already reviewed, but unlike the optimal model, it does not take into account previous fixations or the visibility of the target given the roughness of the surface texture. The results demonstrate that the stochastic model closely matches the number of fixations required to detect the target in the human data.
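For concreteness, a minimal sketch of such a stochastic scanpath model might look like the following. The grid-based definition of a "region", the fallback to the pooled saccade set, and the example data are assumptions about implementation details rather than a description of the actual model reported here.

```python
import numpy as np

# Minimal sketch of a stochastic scanpath model: each new saccade is drawn at
# random from the pool of observed saccades launched from the current region of
# the display. Region definition, fallback rule, and example data are assumptions.

def region_of(fixation, cell=100.0):
    """Assumed region definition: coarse grid cells over the image."""
    return (int(fixation[0] // cell), int(fixation[1] // cell))

def simulate_scanpath(saccade_pool, start, n_fixations,
                      rng=np.random.default_rng(0)):
    """saccade_pool maps a region label to an array of observed saccade vectors."""
    fixations = [np.asarray(start, dtype=float)]
    all_saccades = np.concatenate(list(saccade_pool.values()))
    for _ in range(n_fixations - 1):
        pool = saccade_pool.get(region_of(fixations[-1]), all_saccades)
        saccade = pool[rng.integers(len(pool))]       # sample one observed saccade
        fixations.append(fixations[-1] + saccade)
    return np.array(fixations)

# Example with a made-up pool of saccade vectors for a 2 x 2 grid of regions.
pool = {(0, 0): np.array([[120.0, 10.0], [80.0, -30.0]]),
        (1, 0): np.array([[-90.0, 40.0], [60.0, 20.0]]),
        (0, 1): np.array([[50.0, -60.0]]),
        (1, 1): np.array([[-100.0, -20.0]])}
print(simulate_scanpath(pool, start=(50.0, 50.0), n_fixations=4))
```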