Article  |   August 2011
Object co-occurrence serves as a contextual cue to guide and facilitate visual search in a natural viewing environment
Author Affiliations
Journal of Vision August 2011, Vol.11, 9. doi:10.1167/11.9.9

      Stephen C. Mack, Miguel P. Eckstein; Object co-occurrence serves as a contextual cue to guide and facilitate visual search in a natural viewing environment. Journal of Vision 2011;11(9):9. doi: 10.1167/11.9.9.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

There is accumulating evidence that scene context can guide and facilitate visual search (e.g., A. Torralba, A. Oliva, M. S. Castelhano, & J. M. Henderson, 2006). Previous studies utilized stimuli of restricted size, a fixed head position, and context defined by the global spatial configuration of the scene. Thus, it is unknown whether similar effects generalize to natural viewing environments and to context defined by local object co-occurrence. Here, with a mobile eye tracker, we investigated the effects of object co-occurrence on search performance under naturalistic conditions. Observers searched for low-visibility target objects on tables cluttered with everyday objects. Targets were either located adjacent to larger, more visible “cue” objects with which they regularly co-occur in natural scenes (expected condition) or elsewhere in the display, surrounded by unrelated objects (unexpected condition). Mean search times were shorter for targets at expected locations as compared to unexpected locations. Additionally, context guided eye movements, as more fixations were directed toward cue objects than other non-target objects, particularly when the cue was contextually relevant to the current search target. These results could not be accounted for by image saliency models. Thus, we conclude that object co-occurrence can serve as a contextual cue to facilitate search and guide eye movements in natural environments.

Introduction
Due to poor visual acuity in the periphery, the process of visual search involves guiding the high-resolution fovea by a series of ballistic eye movements to points of interest in the environment to gather information for a perceptual decision. Since the time available to make these decisions is rarely unlimited, it is important for observers to choose eye movements that maximize the probability of making correct perceptual decisions while minimizing search time. Therefore, the manner in which visual search is carried out is far from random, as observers exploit regularities in their visual environments to promote efficient visual search. 
Two such types of regularities in the visual environment have been the subject of a wealth of previous research regarding eye movement selection during search. As intuition would suggest, much research has concluded that the known or inferred visual properties of the search target are often used to guide observers' eye movements (i.e., saccades are directed toward parts of the scene that resemble the target; Beutter, Eckstein, & Stone, 2003; Castelhano & Heaven, 2010; Eckstein, Beutter, Pham, Shimozaki, & Stone, 2007; Eckstein, Thomas, Palmer, & Shimozaki, 2000; Findlay, 1997; Malcolm & Henderson, 2010; Rajashekar, Bovik, & Cormack, 2006; Rao, Zelinsky, Hayhoe, & Ballard, 2002; Tavassoli, van der Linde, Bovik, & Cormack, 2007; Zelinsky, 2008). Additionally, a similarly large body of research has investigated the effects of saliency on eye movement selection during search. Although saliency has been shown to be at least somewhat predictive of saccadic selection with simplistic stimuli and tasks (e.g., Bruce & Tsotsos, 2009; Itti & Koch, 2000, 2001; Itti, Koch, & Niebur, 1998; Koch & Ullman, 1985; Lamy & Zoaris, 2009; Parkhurst, Law, & Niebur, 2002; Rosenholtz, 1999; Tatler, Baddeley, & Gilchrist, 2006), saliency as a predictor of eye movement selection begins to falter in more naturalistic tasks and environments (e.g., Hayhoe, Shrivastava, Mruczek, & Pelz, 2003; Henderson, Brockmole, Castelhano, & Mack, 2007; Henderson, Malcolm, & Schandl, 2009; Land & Hayhoe, 2001; Tatler, 2009). 
More recently, researchers have begun to demonstrate that a third type of regularity in the visual environment, the spatial relationship between a search target and its surrounding visual context, can also impact eye movement selection as well as facilitate visual search performance (e.g., Brockmole, Castelhano, & Henderson, 2006; Brockmole & Henderson, 2006a, 2006b; Castelhano & Heaven, 2010; Castelhano & Henderson, 2007; Chun, 2000; Chun & Jiang, 1998, 1999; Eckstein, Drescher, & Shimozaki, 2006; Ehinger, Hidalgo-Sotelo, Torralba, & Oliva, 2009; Hollingworth, 2009; Jiang & Wagner, 2004; Neider & Zelinsky, 2006; Oliva, Torralba, Castelhano, & Henderson, 2003; Torralba, Oliva, Castelhano, & Henderson, 2006). 
The ability of scene context to guide and facilitate search has been demonstrated in a wide array of stimulus types, ranging from synthetic stimuli to natural images. In a paradigm deemed contextual cueing, Chun et al. demonstrated that for arbitrarily arranged displays consisting of highly synthetic stimuli (i.e., oriented Ts and Ls), search times to locate targets were significantly shorter for stimulus configurations that were repeated throughout the experiment than novel arrangements of elements (Brady & Chun, 2007; Chun & Jiang, 1998, 1999, 2003; Jiang & Wagner, 2004; Olson & Chun, 2002; for review, see Chun, 2000). Similar improvements in visual search performance have since been demonstrated, as repeated visual context has been shown to lead to decreased search times for synthetic displays (Brockmole, Hambrick, Windisch, & Henderson, 2008; Hollingworth, 2009; Neider & Zelinsky, 2006) and synthetic search targets (a T or L) embedded in natural images (Brockmole et al., 2006; Brockmole & Henderson, 2006a, 2006b; Ehinger & Brockmole, 2008). In addition to search time facilitation, scene context has been shown to guide visual search as evidenced by increased saccadic accuracy for both synthetic displays (Droll, Abbey, & Eckstein, 2009; Hollingworth, 2009; Neider & Zelinsky, 2006; Peterson & Kramer, 2001) and synthetic search targets in natural images (Brockmole & Henderson, 2006a). 
Synthetic stimuli provide a useful starting point in assessing whether scene context can effectively guide and facilitate human search performance. However, the statistics of natural viewing environments are far more complex than arbitrary target–distractor pairings or the simplistic statistical structures of synthetic stimuli and must be learned by observers through normal visual experience. Although the statistics of natural scenes are not fully understood (Chun, 2000), much research has nevertheless focused on how context can guide search in natural images. 
Eckstein et al. (2006) instructed participants to search for target objects embedded in natural images, which could either appear in an expected location (e.g., chimney on the top of a house), an unexpected location (e.g., chimney on the ground), or not appear in the image at all. They demonstrated that, regardless of the target location (including target-absent cases), the first saccade was reliably directed toward the target's expected location. This tendency for observers to guide their visual search toward expected locations in an image for a variety of natural scene types (e.g., looking near the ground for pedestrians in a street scene) has since been well established (Castelhano & Heaven, 2010; Droll & Eckstein, 2008; Ehinger et al., 2009; Hidalgo-Sotelo, Oliva, & Torralba, 2005; Oliva et al., 2003; Torralba et al., 2006). Consistent with research utilizing synthetic stimuli, such results reinforce the idea that the learning of the spatial relationships between objects and their surrounding context can be exploited to promote efficient visual search. In fact, such results have dictated that a variety of contemporary models of visual search in natural images explicitly include a term that emphasizes contextually relevant regions of an image when determining eye movement selection (Droll et al., 2009; Eckstein et al., 2006; Ehinger et al., 2009; Oliva et al., 2003; Torralba, 2003; Torralba et al., 2006). 1  
Despite the growing body of literature suggesting a significant role of context in the guidance of visual search, there have been no demonstrations that these findings directly translate to natural environments and viewing conditions. Although intuition may suggest that the basic pattern of results previously reviewed from computer-based stimuli would readily translate to similar search paradigms in everyday viewing environments, there are notable disparities between such tasks, which would warrant the testing of this assumption. 
First, stimuli presented on computer monitors are greatly restricted in their spatial extent in comparison to observers' full viewing environment. Of the research cited in this article, image sizes ranged from 15.8° × 11.9° (Torralba et al., 2006) to 37.2° × 28.3° (Chun & Jiang, 1998); even the largest images used only subtended a fraction of the visual field. Given mounting evidence for visual search guidance based on retinally peripheral visual information (e.g., Findlay, 1997; Najemnik & Geisler, 2005), there is reason to believe that restricting the spatial extent of visual stimuli so drastically could significantly impact search strategy. 
Additionally, many of the aforementioned studies utilized eye tracking with a fixed head position (e.g., Brockmole & Henderson, 2006a; Ehinger et al., 2009; Neider & Zelinsky, 2006; Torralba et al., 2006), necessitating highly eccentric eye movements in relation to head position to foveate the entire stimulus image. Generally, smaller eye movements (<10°) can be and are completed without an accompanying head movement, but larger shifts of gaze naturally involve coordinated movements of the head and torso to complete (Land, 2004b). At the very least, the ability to move the head, as during everyday visual search, may reduce the necessity of these extremely eccentric eye movements, potentially leading to a divergent pattern of eye movements (Steinman, 2003). In fact, there is evidence that eye movement selection can differ both qualitatively and quantitatively during the performance of the same task depending on whether the head is free to move or not (Kowler et al., 1992). To address these potential issues, we followed the trend of recent work exploring eye movement selection in natural tasks and settings (Gajewski, Pearson, Mack, Bartlett, & Henderson, 2005; Hayhoe & Ballard, 2005; Hayhoe et al., 2003; Land, 2004a; Land & Hayhoe, 2001) and extended the examination of the effects of spatial context on the guidance of visual search to an everyday viewing environment and a simple, natural task. 
In addition to examining visual search strategies in a natural viewing environment, the current research also looked to extend the scope of the literature investigating the effects of scene context on search from global scene context to an alternate conceptualization of context, object co-occurrence. The majority of the previously discussed studies define scene context by the global and configural properties of a given scene; context-constrained targets are more likely to appear in one region of the scene than another as dictated by these general spatial relationships (e.g., blimps most likely appear in the sky (Neider & Zelinsky, 2006), while pedestrians appear lower in images, on the road or sidewalk (Torralba et al., 2006)). 
Although this conceptualization certainly captures one type of information observers can and do learn about natural scenes, much more defines a scene than just its gross spatial layout. In fact, the objects that comprise a scene as well as the spatial relationships between them are vital to our understanding of scene context. What makes a space an office instead of a bedroom is determined primarily by the identity of the objects that it contains. Particular objects are more likely to be found in particular scene types (you are more likely to find a blender in a kitchen than a playroom), and the presence of particular objects is often accompanied by the presence of a select few others (pillows tend to be on or near a bed in a bedroom). Knowledge of which objects tend to appear in which scenes, which objects they appear with, and where they are likely to be in relation to each other provides valuable information that could be used to direct the course of visual search. The idea of defining scene context by object co-occurrence is not new in the literature (e.g., Bar, 2004), but it has been applied sparingly to visual search in natural scenes and environments. Therefore, the present study also focused on whether information about contextually learned object co-occurrence could be exploited by observers to guide and facilitate their visual search in a manner similar to that seen in previous studies manipulating the relationship between a search target and the overall spatial structure of the scene. 
To test whether object co-occurrence can serve as a contextual cue to guide search in natural viewing conditions, participants completed a simple search task while their eye movements and search times were monitored by a mobile eye tracking setup that allowed free eye and head movements. The task consisted of locating small, low-visibility target objects (e.g., plastic fork, straw) that were paired in the display with larger, more visible “cue” objects that the target objects regularly co-occur with in natural scenes (e.g., plate with fork). This replication of real-life object co-occurrence served to introduce structured context into the search display. Context was manipulated by varying the proximity of target objects and their paired cue objects, thus modifying how effective an indicator of target location the paired cue objects were. 
Since the cue objects were chosen to be highly visible to provide landmarks in guiding search, it is possible that any search guidance observed as a result of these cue objects may reflect their low-level visual properties (i.e., salience) rather than their contextual relation to the search at hand per se. To examine this possibility, the effects of low-level image saliency on eye movement selection were evaluated by analyzing photographs of the search display with two prominent models of visual saliency (Walther & Koch, 2006; Zhang, Tong, Marks, Shan, & Cottrell, 2008). The output of these models was compared with human fixation selection to determine if human search performance and eye movement selection could be better explained by low-level scene properties than our contextual manipulation. 
Methods
Participants
Participants were 24 undergraduate and graduate students (ages 18–25) at the University of California, Santa Barbara. All participants had normal or corrected-to-normal vision. Participants were paid or received course credit for their participation. 
Stimuli
Stimuli consisted of four search targets, four paired cue objects, and 33 distractor objects arranged on four elongated tables (see Figure 1). All stimuli were common office and household objects (e.g., stapler, book, CD). The four search targets were chosen for their small size and neutral colors (which matched the gray tabletops): fork, straw, headphones, and credit card. The paired cue objects, selected to be easily visible in the display, consisted of a plate, cup, iPod, and wallet, respectively (refer to Figure 2 for target–cue pairings). 
Figure 1
 
Layout of the search display from the viewpoint of the participant turning their head to either side. A search target (headphones) is shown in both its expected and unexpected locations. In the expected condition, the headphones are adjacent to their cue object (iPod; indicated as “Cue”), while in the unexpected condition the headphones are in an eccentricity-matched location across the midline, surrounded by unrelated distractors. For each participant, the display was static and each target only appeared in one of the two locations.
Figure 2
 
Search targets and their paired cue objects. Target and cue object pairs were chosen for their consistent spatial co-occurrence in natural scenes.
Each cue object and all non-target distractors had a stationary position in the display. One cue object appeared on each table. Search targets, however, could appear in one of two positions mirrored across the midline of the display. These two positions determined whether the target was in an expected or unexpected location (i.e., adjacent to the cue or surrounded by unrelated objects; see Figure 1 for an example). The search targets appeared at a range of eccentricities with respect to the midline of the display (straw, 16°; fork, 27°; headphones, 27°; credit card, 51°). Participants viewed the display at a distance of approximately 5 ft from the center of the front two tables and 8 ft from the center of the back two tables. 
Search task
Participants were greeted by the experimenter outside of the laboratory and told that they would be taking part in a visual search study in which they were to locate objects of interest on tabletops. There was no mention of scene context or object co-occurrence. Upon entering the classroom in which the experiment took place, participants were fitted with the eye tracker and performed a calibration procedure facing away from the search display. After successful calibration, participants were informed that they would be performing a small number of trials, each of which consisted of locating a single object on one of four elongated tables. 
Four trials of visual search were performed. For each participant, the target object in two trials appeared adjacent to its respective cue (expected condition), while the target in the remaining two trials was located in an eccentricity-matched location across the display midline surrounded by unrelated distractors (unexpected condition; see Figure 1). Target location and trial order were selected pseudorandomly to ensure that all target objects were the relevant target in all trial positions (1–4) and context conditions (expected and unexpected) approximately equally. The display was arranged prior to the arrival of each participant and remained static through the four search trials. 
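The counterbalancing constraints described above can be illustrated with a minimal sketch. The text states only that assignment was pseudorandom and approximately balanced, so the scheme below is an assumption and not the authors' procedure: trial order rotates through a Latin square, and the pair of targets placed in expected locations cycles through all six possible pairs. The function name `build_schedule` is hypothetical.

```python
from itertools import combinations

TARGETS = ["fork", "straw", "headphones", "credit card"]


def build_schedule(n_participants=24):
    """Return one list of (target, condition) trials per participant.

    Assumed scheme (not taken from the article): trial order follows the
    rows of a 4x4 Latin square, and the two 'expected' targets cycle
    through all 6 possible pairs, so 24 participants cover each
    order-by-pair combination exactly once.
    """
    expected_pairs = list(combinations(TARGETS, 2))  # 6 ways to pick 2 of 4
    schedule = []
    for p in range(n_participants):
        shift = p % 4
        order = TARGETS[shift:] + TARGETS[:shift]  # one Latin-square row
        expected = set(expected_pairs[(p // 4) % len(expected_pairs)])
        trials = [
            (target, "expected" if target in expected else "unexpected")
            for target in order
        ]
        schedule.append(trials)
    return schedule


schedule = build_schedule()
```

Under this assumed scheme every participant receives two expected and two unexpected trials, and each target occupies each trial position in each condition equally often.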
To reduce memorization of the display across trials, all trials started in the dark with participants fixating on the beam of a laser pointer on the wall in front of them. Participants were instructed to hold fixation as long as the lights remained off. The beam was positioned such that initial gaze position was approximately head height and aligned with the midline of the display. While fixating the laser, participants were verbally informed what the search target would be on the upcoming trial and were asked to repeat it back to ensure the name of the target was heard correctly. After the name of the search target was repeated, the lights were turned on. When the experimenter said “go,” the participant was free to search the display. Participants were instructed to stand still while performing the search but were encouraged to move their head and eyes freely and naturally. After locating the target, participants fixated the object directly and said “I found it.” At that point, the lights were extinguished and a new trial started. An example of the trial structure can be seen in Movie 1.
Upon completion of the session, participants were probed to see if they had happened to memorize the location of a target object before it was to be located. Any affirmative indication of memorization of a target location led to those trials being excluded from further analysis. Three trials from three different participants were discarded for this reason. 
Assessing saliency
To quantify image saliency, two prominent models of visual saliency (Walther & Koch, 2006; Zhang et al., 2008) were used to analyze full color photographs of the search display. Fourteen photographs of the display were taken from the participants' viewing distance at approximately head height with a Canon Digital Rebel XT camera. As the entire display could not be captured in one frame, photographs were taken of each side of the display and then merged together in Photoshop to create an image of the entire search display. To ensure that the models' assessment of salience was robust to changes in stimulus configuration and overall illumination, photographs of the display were taken for all experimentally used stimulus configurations and at multiple exposure times (ISO 100, f/3.5, exposure time: 1/10–1/40 s). 
The two salience models can be used to generate predicted eye movements in the search display based solely on stimulus properties, without knowledge of task demands. Walther and Koch's (2006) saliency model is in the public domain and can be found online (http://www.saliencytoolbox.net). This model identifies regions of high contrast, chromaticity, or edge density. Zhang et al.'s (2008) implementation of their free-viewing Saliency Using Natural statistics (SUN) model identifies salient regions based on statistics derived from natural images rather than the particular image being viewed, thus maximizing information sampling. It is important to note that both of these models fail to incorporate any mechanisms to simulate the foveated nature of the visual system and process all parts of stimulus images at the same level of acuity. 
Model eye movement predictions were generated by locating the top 5 most salient objects in the scene. To do so, the author (SM) hand coded which objects the top 5 most salient regions fell upon across all 14 images for each model. If one of the predicted points did not fall on one of the objects in the search display (e.g., a high-contrast segment of the wall), the number of high salience points predicted by the model was expanded until five objects were clearly selected. There was considerable agreement within each model across all images as to which objects were the most salient, suggesting that the models' predictions of saliency were robust to manipulations of stimulus configuration and overall luminance of the image. 
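The selection rule described above (walk down the ranked salient points, skip any point that does not land on an object, and stop once five distinct objects are selected) can be sketched as follows. In the study this step was hand coded; the automated version below is only an illustration, and the bounding-box representation, the function name `top_salient_objects`, and the example coordinates are all hypothetical.

```python
def top_salient_objects(ranked_points, objects, k=5):
    """Pick the k most salient objects from a ranked list of salient points.

    ranked_points: [(x, y), ...] ordered from most to least salient.
    objects: {name: (x_min, y_min, x_max, y_max)} hypothetical bounding boxes.
    Points that miss every object (e.g., a high-contrast patch of wall) are
    skipped, so the list of points is effectively expanded until k distinct
    objects have been selected.
    """
    selected = []
    for x, y in ranked_points:
        hit = None
        for name, (x0, y0, x1, y1) in objects.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                hit = name  # first object whose box contains the point
                break
        if hit is not None and hit not in selected:
            selected.append(hit)
        if len(selected) == k:
            break
    return selected


# Illustrative coordinates only; not measurements from the study.
boxes = {"plate": (0, 0, 10, 10), "cup": (20, 0, 30, 10)}
points = [(5, 5), (100, 100), (6, 6), (25, 5)]
chosen = top_salient_objects(points, boxes, k=2)
```

Here the second point falls on no object and the third repeats the plate, so both are passed over before the cup is selected.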
Monitoring gaze
Eye movements and search times were recorded using an Applied Science Laboratories (ASL) mobile eye tracker. The tracker, mounted on a pair of goggles, consists of two cameras: a scene camera coinciding with the observer's line of sight as well as a camera that records infrared corneal reflection for monocular right eye tracking at a sampling frequency of 30 Hz. The resultant track has previously been reported to have a maximum precision of approximately 1° of visual angle (Droll & Eckstein, 2009). 
Calibration of the eye tracker was accomplished by instructing each participant to fixate upon a series of objects designated by the experimenter while maintaining a stationary head position. Calibration was verified by instructing the participant to choose, without telling the experimenter, one of 15 small panels of a poster to fixate. If the experimenter was able to correctly guess which panel the participant was fixating, the calibration was deemed successful and the study proceeded. 
Videos from the scene camera overlaid with crosshairs indicating eye position were collected for each participant and stored for subsequent analysis (see Figure 3). Movie 1 provides an example of typical data collected from each participant. 
Figure 3
 
Sample screenshot of participant data video obtained from the mobile eye tracker. Red crosshairs indicate gaze position in the scene.
Data analysis
Search times and fixation locations were hand coded on a frame-by-frame basis from the output video of the mobile eye tracker. Search times were determined by extracting the time stamps, as provided by the eye tracker, from the start and end of the search. Visual search was operationalized as beginning with the frame before the first displacement from fixation and ending at the first frame of the final fixation of the target. For all participants, two mean search times were calculated, one for trials in which the target appeared in the expected location and one for targets in the unexpected location. 
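The operational definition of search time can be made concrete with a short sketch. It assumes each video frame has already been hand coded with a gaze label, that coding for a trial ends on the final fixation of the target, and that frames are evenly spaced at the tracker's 30 Hz rate. The study used the tracker's time stamps instead of frame counts, and the label names and function name here are hypothetical.

```python
def search_time(labels, start_label="laser", target_label="target", fps=30.0):
    """Search time in seconds from per-frame gaze labels.

    Start: the frame before gaze first leaves the initial fixation point.
    End: the first frame of the final run of frames on the target.
    Assumes the coded sequence ends while the target is fixated.
    """
    if not labels or labels[-1] != target_label:
        raise ValueError("coded frames must end on the target fixation")
    # First frame on which gaze is no longer at the start position.
    first_move = next(i for i, lab in enumerate(labels) if lab != start_label)
    start = max(first_move - 1, 0)
    # Walk backward to the first frame of the final target fixation.
    end = len(labels) - 1
    while end > 0 and labels[end - 1] == target_label:
        end -= 1
    return (end - start) / fps


# Illustrative sequence: 3 frames at the laser, 29 on a distractor, 5 on target.
frames = ["laser"] * 3 + ["cup"] * 29 + ["target"] * 5
duration = search_time(frames)
```

In this illustrative sequence the search spans 30 frames, which corresponds to one second at 30 Hz.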
The coding of individual fixations was performed by two research assistants (BB and ZF) naive to the experimental conditions and goals of the study. Eye movement data were only analyzed for trials in which the search target was in an unexpected location. Due to the close proximity of the target and cue object in the expected location, eye movements directed toward the cue object in the expected condition were difficult to disentangle from eye movements that were directed to the target itself. In contrast, the spatial separation of the target and cue objects in the unexpected condition provided a case in which preferential fixation of the cue object would result in eye movements directed to the side of the display opposite the target location, clearly dissociating cue and target-driven fixation patterns. 
For each unexpected trial, fixations were tabulated on an object-by-object basis. Fixations were defined as consisting of two consecutive frames (66-ms duration) of gaze at a stationary location. Fixation counts were averaged across participants' unexpected trials to yield five values that corresponded to the mean number of fixations per trial for: (1) the relevant cue (e.g., fixations on wallet when credit card was the search target), (2) the irrelevant cues (e.g., fixations on iPod/plate/cup when credit card was search target), (3) all other distractors in the display, and (4–5) the most fixated object of the top 5 most salient objects predicted by Walther and Koch's (2006) and Zhang et al.'s (2008) models. The values of these five calculations can be understood as the mean number of fixations per trial for a single object in that category (i.e., calculations account for the number of objects within a given category). Results reported below regarding fixation data were consistent across the two raters in all but one comparison. The actual numerical values reported in the main text are for rater BB, while results from ZF can be found in 1
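The fixation criterion and the per-object normalization can be sketched in a few lines. The sketch assumes per-frame object labels (with `None` for frames in which gaze is on no object) and reads the two-frame criterion as "at least two consecutive frames", which is an interpretation and not something the text states explicitly. Function names and example labels are hypothetical.

```python
from collections import Counter
from itertools import groupby


def count_fixations(labels, min_frames=2):
    """Count fixations per object from per-frame gaze labels.

    A fixation is taken to be a run of at least `min_frames` consecutive
    frames on the same object (2 frames is about 66 ms at 30 Hz).
    """
    counts = Counter()
    for label, run in groupby(labels):
        if label is not None and sum(1 for _ in run) >= min_frames:
            counts[label] += 1
    return counts


def mean_fixations_per_object(trial_counts, category):
    """Mean fixations per trial for a single object in `category`.

    Dividing by the number of objects in the category puts a one-object
    category (relevant cue) and a many-object category (other distractors)
    on the same per-object scale.
    """
    per_trial = [sum(c[obj] for obj in category) / len(category)
                 for c in trial_counts]
    return sum(per_trial) / len(per_trial)


# Illustrative labels only; not data from the study.
trial = ["plate", "plate", None, "cup", "plate", "plate", "plate", "cup", "cup"]
counts = count_fixations(trial)
rate = mean_fixations_per_object([counts], ["plate", "cup", "book"])
```

The single-frame visit to the cup is not counted, so the plate receives two fixations and the cup one.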
Illumination from overhead lights as well as variations in corneal reflectivity sometimes resulted in a loss of a reliable eye trace. If a reliable track was unavailable for a participant or trial, these data were excluded from further analysis. As a result, of the 96 possible trials (24 participants × 4 trials each), 81 were included in the final analyses. 
Results
Search times
Cue objects were selected for their consistent spatial co-occurrence with their respective target object in natural scenes. Therefore, the presence of the cue object provided a strong probabilistic cue as to the location of the target. If participants were to exploit this statistical knowledge to guide their visual search, they should be able to achieve shorter search times for targets in expected locations as compared to unexpected locations. Across search targets, this was the case. Search times were significantly shorter for targets in expected (M = 2.06 s) as compared to unexpected (M = 2.58 s) locations, t(23) = −1.97, p = 0.03. This nearly 20% reduction in search time for targets in expected locations is consistent with the prediction that the cue objects could provide strong guidance toward the location of the target object based on statistical structure of natural scenes. 
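The reported reduction and the form of the test statistic can be checked with a brief sketch. The percent reduction uses the two means reported above. The paired t function shows the standard within-participant computation, which is assumed here from the reported 23 degrees of freedom for 24 participants; per-participant search times are not given in the text, so the example inputs are toy values and not data from the study.

```python
import math
import statistics


def percent_reduction(baseline, improved):
    """Reduction of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


def paired_t(expected, unexpected):
    """Paired t statistic and degrees of freedom (expected minus unexpected)."""
    diffs = [e - u for e, u in zip(expected, unexpected)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Means reported in the text: 2.58 s (unexpected) vs. 2.06 s (expected).
reduction = percent_reduction(2.58, 2.06)

# Toy per-participant values, for illustration only.
t_value, df = paired_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

With the reported means the reduction comes to roughly 20%, and a negative t value indicates shorter search times in the expected condition.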
Eye movements in unexpected condition
Although the decrease in search times for targets in expected locations suggests that the target may be easier to locate when in close spatial proximity to the cue object, it is unclear how participants achieved this search time benefit. Since the cue object provided a probabilistic cue as to the location of the target, we investigated whether participants' patterns of eye movements were biased toward the cue object in an attempt to exploit this statistical cue. 
Consistent with the prediction that the eye movements would be biased toward the cue object, participants directed a larger number of fixations per trial toward the relevant cue object (e.g., plate when the search target was fork; M = 0.63) than other distractor objects (M = 0.18) while searching for a target in an unexpected location, t(23) = 3.21, p < 0.002 (see Figure 4). Of the 327 fixations made by participants over the course of 41 unexpected trials, 27 (8%) were directed to the cue object relevant to the particular trial. That is over 3 times as many fixations as would be expected if eye movements were distributed uniformly across the 41 objects in the display (∼2.4% of total fixations per object). This preferential deployment of eye movements to the cue objects appears to indicate that participants had knowledge of cue–target co-occurrence in natural scenes and were able to effectively use that information to guide the course of visual search. 
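The comparison with a uniform distribution of fixations follows directly from the counts given above, as this short sketch shows. The function name is hypothetical; the numbers are those reported in the text.

```python
def share_vs_uniform(object_fixations, total_fixations, n_objects):
    """Return (observed share, uniform share, ratio) for one object."""
    observed = object_fixations / total_fixations
    uniform = 1.0 / n_objects  # equal share for every object in the display
    return observed, uniform, observed / uniform


# 27 of 327 fixations on the relevant cue; 41 objects in the display.
observed, uniform, ratio = share_vs_uniform(27, 327, 41)
```

The observed share is about 8.3% against a uniform share of about 2.4%, a ratio of roughly 3.4.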
Figure 4
 
Eye movement selection in the unexpected condition. The figure depicts mean number of fixations for the relevant cue object, average of all other distractors in the display, and the distractor that garnered the largest number of fixations (i.e., maximally fixated distractor), separated by search target. While the relevant cue object garnered an above average number of fixations, it is important to notice that it was only the maximally fixated distractor for one out of the four search targets (fork). In fact, the maximally fixated distractor often appeared to share basic features with the search target (i.e., shapes of the credit card and iPod are quite similar). Error bars represent one SEM.
Although participants did often direct eye movements toward the relevant cue in the unexpected condition, the cue object was not necessarily the non-target object that garnered the largest number of fixations. In fact, for 3 out of the 4 search targets, another non-target object in the display received the most fixations. Figure 4 shows that while the relevant cue object received more fixations on average than the mean of all distractors, the distractor object that garnered the most fixations often shared featural properties with the target, such as shape, size, and color (e.g., iPod receiving the largest number of fixations when credit card was the search target). Although this assessment is only qualitative, it appears that participants utilized both scene context and expected target properties to guide their visual search, mirroring the results of recent research (Malcolm & Henderson, 2010). 
Fixation selection modulated by task relevance of cue
Although participants directed significantly more eye movements toward the relevant cue object than to the other distractors as a whole, it is possible that this pattern of results could have arisen due to the high visibility of the cue objects. Thus, the selective fixation of the relevant cue objects could reflect their salience and not necessarily their spatial contextual relationship with the target. To investigate this hypothesis, we calculated the average number of fixations for cue objects when they were contextually relevant (e.g., plate when fork was the search target) compared to when they were contextually irrelevant (e.g., plate when headphones were the search target). Cue objects in the unexpected condition received more fixations per trial when they were contextually relevant (M = 0.63) than when they were contextually irrelevant (M = 0.30), t(23) = 2.15, p = 0.02 (see leftmost two bars of Figure 5). Over the 41 unexpected trials, 38 of the 327 total fixations were directed toward the contextually irrelevant cue objects, as compared to 27 for the contextually relevant cue. Although the contextually irrelevant cues garnered slightly more fixations overall, there were three irrelevant cues to only one relevant cue per trial. As such, contextually irrelevant cues actually received far fewer fixations than the contextually relevant cues (less than half) on a per-object basis. While contextually irrelevant cue objects did garner more fixations than the average of all the other distractors (t(23) = 2.96, p = 0.004), it appears that it was the contextual information in the cue objects and not solely their visibility that led participants to preferentially fixate them during search.
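The per-object comparison in the preceding paragraph can be reproduced from the reported counts. The short Python sketch below is illustrative only: the variable names are ours, and it assumes, as stated above, one relevant and three irrelevant cue objects in each unexpected trial.

```python
# Fixation counts reported for the 41 unexpected trials (rater BB).
relevant_cue_fixations = 27      # fixations on the single relevant cue
irrelevant_cue_fixations = 38    # fixations pooled over the irrelevant cues

# Assumption taken from the text: one relevant and three irrelevant cues per trial.
n_relevant_cues = 1
n_irrelevant_cues = 3

per_object_relevant = relevant_cue_fixations / n_relevant_cues
per_object_irrelevant = irrelevant_cue_fixations / n_irrelevant_cues

# Ratio of irrelevant to relevant fixations on a per-object basis.
ratio = per_object_irrelevant / per_object_relevant

print(f"relevant cue:   {per_object_relevant:.2f} fixations per object")
print(f"irrelevant cue: {per_object_irrelevant:.2f} fixations per object")
print(f"irrelevant / relevant = {ratio:.2f}")
```

Dividing the pooled irrelevant count by three gives fewer than 13 fixations per irrelevant cue object, which is under half of the 27 fixations received by the relevant cue.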
Figure 5
 
Mean number of fixations for cue objects when contextually relevant, cue objects when contextually irrelevant, the average of all of the distractor objects, and, for each saliency model, the object among its top 5 most salient objects that garnered the most fixations. Error bars represent one SEM.
Effects of saliency on fixation selection
Converging evidence that the preferential fixation of the cue objects was due to their contextual relevance and not solely their low-level salience was provided by applying two prominent models of visual saliency (Walther & Koch, 2006; Zhang et al., 2008) to images of the search display to determine the top 5 most salient objects in the display. Of these top 5 objects generated by each saliency model, we selected the object for each that garnered the most fixations in the observer data and compared these to the number of fixations garnered by cue objects when contextually relevant, cue objects when contextually irrelevant, and the average fixations garnered by all of the distractors as a whole. The decision to analyze the object of the top 5 most salient from each model that garnered the most fixations in the observer data (as opposed to the average fixations over all of the top 5 objects or some other analysis) was made to give the saliency models the greatest chance possible of explaining fixation selection. However, as seen in Figure 5, the objects chosen by the two models of visual saliency (Walther & Koch, 2006: M = 0.31; Zhang et al., 2008: M = 0.19) garnered significantly fewer fixations per trial than contextually relevant cue objects (M = 0.63, t(23) = 1.93, p = 0.03; t(23) = 3.22, p = 0.002). In fact, the number of eye movements directed toward each of these two objects across the 41 unexpected trials (Walther & Koch, 2006, top: 13; Zhang et al., 2008, top: 9) did not even reach half the number of fixations received by the relevant cue object over those same trials (27).
Additionally, there were no significant differences between the number of fixations per trial garnered by these two “salient” objects and irrelevant cues (M = 0.30; t(23) = −0.51, p = 0.61; t(23) = 0.14, p = 0.89) or the average of all of the other distractors in the display (M = 0.18; t(23) = 1.70, p = 0.051; t(23) = 0.17, p = 0.87). 2 These results clearly suggest that saliency, by itself, was not an effective predictor of fixation location in the current study. 
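As a rough consistency check on the one-tailed comparisons against the relevant cue reported above, the upper-tail probability of a t statistic with 23 degrees of freedom can be approximated numerically. This is a minimal sketch, not the analysis code used in the study: the function names are hypothetical, and the integration applies Simpson's rule to the standard Student's t density rather than calling any particular statistics package.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def one_tailed_p(t, df, steps=2000):
    """Upper-tail probability P(T > t), using Simpson's rule on [0, |t|]."""
    upper_limit = abs(t)
    h = upper_limit / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper_limit, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3      # P(0 < T < |t|)
    upper = 0.5 - area        # P(T > |t|), by symmetry of the density
    return upper if t >= 0 else 1 - upper


# t values reported in the text for df = 23 (24 observers).
for label, t_value in [
    ("relevant > irrelevant", 2.15),
    ("relevant > Walther & Koch top", 1.93),
    ("relevant > Zhang et al. top", 3.22),
]:
    print(f"{label}: t(23) = {t_value}, one-tailed p ~ {one_tailed_p(t_value, 23):.3f}")
```

The printed values should land close to the reported one-tailed p values of 0.02, 0.03, and 0.002, respectively.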
Discussion
Previous work has established that scene context, defined by the overall spatial structure of the scene, can facilitate visual search in both natural and synthetic images presented on a computer screen, as evidenced by reduced search time (e.g., Brockmole & Henderson, 2006a; Chun & Jiang, 1998; Hidalgo-Sotelo et al., 2005; Hollingworth, 2009) and more efficient eye movements (e.g., Eckstein et al., 2006; Malcolm & Henderson, 2010; Neider & Zelinsky, 2006; Oliva et al., 2003; Torralba et al., 2006). However, it has not yet been demonstrated whether these effects translate to natural search tasks and environments as well as alternative conceptualizations of scene context. Of particular interest was exploring whether the restricted size of the search display or forcing the head to remain stationary during search (as seen in prior research) produced results that were atypical of those found in natural visual search, as both represent unnatural conditions. Additionally, we examined whether contextual information in the form of object co-occurrence could provide search guidance similar to that seen in previous literature in which scene context is defined by the global spatial configuration of the scene. 
In direct accord with previous research, we found a substantial reduction in search time when search targets were located in expected as opposed to unexpected locations. Such facilitation of visual search performance likely reflects the ability of observers to interpret the context of the search display and exploit learned spatial relationships as defined by natural object co-occurrence to efficiently guide eye movements to likely target locations. This interpretation was strengthened by the finding that participants were disproportionately more likely to fixate the relevant cue object (e.g., iPod when headphones were the search target) than the rest of the distractors in the scene while performing search. These results mirror previous work that shows that participants direct their eye movements toward expected locations in a scene (Droll & Eckstein, 2008; Eckstein et al., 2006; Ehinger et al., 2009; Malcolm & Henderson, 2010; Neider & Zelinsky, 2006; Oliva et al., 2003; Torralba et al., 2006). Although the cue objects were selected with the intent of being easily visible in hope that they would be readily available to direct visual search, converging evidence from eye movement data and the assessment of search display saliency suggested that the eye movement guidance afforded by these cue objects was largely due to their contextual relevance to the task at hand and not their low-level salience. 
Quite striking is that these results were obtained with merely four trials per participant and no mention whatsoever of scene context or object co-occurrence. In fact, no instructions were given as to how to carry out the search except that it should be completed as quickly as possible. Additionally, the short exposure to the relatively large display minimized the amount of learning that could be done regarding the spatial positions of the 40+ objects contained within it. Therefore, any information that participants had about the spatial relationships between the objects in the display almost certainly had to be developed prior to participating in the study. Thus, our results seem to reflect a true tendency for observers to extract knowledge about the spatial relationships of objects in everyday scenes and exploit them to promote efficient search. 
Saliency
While the current study focused on the effects of scene context, the potential influence of low-level salience cannot be discounted. Although there are demonstrations of salience being at least somewhat predictive of fixation selection (Bruce & Tsotsos, 2009; Itti & Koch, 2000; Parkhurst et al., 2002; Tatler et al., 2006), it is well documented that utilizing salience as a method of predicting eye movements breaks down in naturalistic environments and everyday tasks (e.g., Hayhoe et al., 2003; Henderson, 2003; Henderson et al., 2007; Land & Hayhoe, 2001; Tatler, 2009). The current work falls in line with the latter research, as model predictions of saliency failed to account for the distribution of fixations during our search task, particularly for cue objects. In a more general sense, however, saliency was sure to play at least some role. For instance, cue objects were chosen to be easily visible in order to be effective landmarks in guiding search. If the cue objects had been as difficult to detect as the targets, we likely would not have been able to demonstrate any facilitation of visual search via the cues, since they could not have been effectively located and exploited. In that regard, an understanding of the bottom-up properties of the objects in the display is critical to exploring why and how these effects may come about.
Target features
Our eye movement analyses also suggested that participants may have directed their eye movements toward the location of the scene that looked like the target, a well-established finding in the literature (Beutter et al., 2003; Eckstein et al., 2007, 2000; Findlay, 1997; Malcolm & Henderson, 2010; Rajashekar et al., 2006; Rao et al., 2002; Tavassoli et al., 2007; Zelinsky, 2008). Specifically, the maximally fixated distractor for three out of the four search targets was not the relevant cue but some other object that seemed to share feature-level properties with the target (see Figure 4). This seems to suggest that participants, in addition to exploiting scene context, were also actively searching for objects that matched the likely appearance of the target. Since exploiting knowledge about the expected appearance of the target provides information to help guide search that scene context largely cannot give, it is reasonable to expect that observers would engage both of these strategies to produce efficient search. In fact, contemporary Bayesian eye movement models contain explicit terms for both target appearance and scene context (Droll et al., 2009; Eckstein et al., 2006; Ehinger et al., 2009; Torralba, 2003; Torralba et al., 2006). Additionally, recent behavioral work illustrates how both of these factors are capable of simultaneously influencing search performance (Castelhano & Heaven, 2010; Malcolm & Henderson, 2010). 
Object co-occurrence as scene context
Defining scene context by local object co-occurrence was an admittedly atypical choice in relation to much of the context literature, although it has been established as a viable conceptualization of scene context (Bar, 2004; Davenport, 2007). Most studies examining the effects of spatial context on search in natural images tend to define context by the overall structure or category of the scene and not the specific object interactions within them (e.g., Castelhano & Heaven, 2010; Droll & Eckstein, 2008; Ehinger et al., 2009; Hidalgo-Sotelo et al., 2005; Oliva et al., 2003; Torralba et al., 2006). Although this is one way of conceptualizing the spatial information contained within a scene, scenes are inherently defined by the objects within them and their spatial and statistical relationships. Object co-occurrence provides information about both the likelihood of an object appearing in a scene as well as its location. In this sense, it appears that the information provided by overall scene structure and object co-occurrence could, in fact, overlap quite a bit. Moreover, depending on how one chooses to define what exactly constitutes an “object,” the line between the two definitions of scene context becomes less clear. Take, for example, the task of locating pedestrians in a street scene. Nearly everyone would direct their search for pedestrians on the street or sidewalk and not on the sides of buildings or the sky. Such results seem to be a direct result of considering the overall structure of the scene. However, is the street an object? Is the sky an object? Did identifying the sidewalk activate the observer's knowledge that pedestrians are likely to be found on sidewalks? While object co-occurrence may seem to convey quite different information about spatial context than the overall layout of the scene, these two definitions of scene context may, in fact, be providing the observer overlapping stochastic knowledge. 
Disentangling semantic association from scene context
We interpreted our current findings as being the result of observers exploiting contextually defined spatial relationships between naturally co-occurring objects to guide search. However, our data may also reflect that our observers' search was being guided by objects that simply share a semantic relationship with the target. While our target and cue objects did tend to co-occur spatially in natural scenes, they were also highly semantically related. To determine if the presence of the cue provided an explicit expectation as to the likely location of the target, 30 observers participated in a brief experiment in which they were shown an image of the search display and asked to click a single location in the scene where they would most likely expect to find each search target. The image of the search display depicted the scene as it was for participants in the main experiment with the search targets having been removed with Photoshop. As seen in Figure 6, participants clearly used the relevant cue object to determine their judgments of expected target location. For each search target, click responses of expected target location were densely concentrated around their respective cue object, with only a few clicks straying elsewhere in the display. 
Figure 6
 
Normalized click densities for expected location of each target object. Each click was convolved with a 2D Gaussian whose variability was determined by the standard deviation of the click response coordinates for each object. Resultant images were normalized with respect to the largest value across all four images. For each search target, click densities clearly peak at the location of their respective cue object (location indicated by white arrow).
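The density computation described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the click coordinates are made up, the function names are hypothetical, the Gaussian width is taken separately for x and y from the standard deviation of each target's clicks, and a floor of one pixel on that width is added here only to avoid division by zero.

```python
import math
from statistics import pstdev


def click_density(clicks, width, height):
    """Sum one 2D Gaussian per click; the widths come from the spread of the clicks."""
    xs = [x for x, _ in clicks]
    ys = [y for _, y in clicks]
    # Assumption: separate x and y standard deviations, floored at 1 pixel.
    sx = max(pstdev(xs), 1.0)
    sy = max(pstdev(ys), 1.0)
    grid = [[0.0] * width for _ in range(height)]
    for cx, cy in clicks:
        for row in range(height):
            for col in range(width):
                dx = (col - cx) / sx
                dy = (row - cy) / sy
                grid[row][col] += math.exp(-0.5 * (dx * dx + dy * dy))
    return grid


def normalize_across(maps):
    """Divide every map by the single largest value found in any of the maps."""
    peak = max(v for m in maps.values() for row in m for v in row)
    return {name: [[v / peak for v in row] for row in m] for name, m in maps.items()}


# Hypothetical click coordinates (x, y) in pixels; NOT data from the study.
clicks = {
    "fork": [(10, 8), (11, 9), (12, 8), (30, 20)],
    "headphones": [(25, 5), (26, 6), (24, 5)],
}
maps = normalize_across({t: click_density(c, 40, 30) for t, c in clicks.items()})
```

Normalizing by the largest value across all maps, rather than within each map, keeps the densities for different targets on a common scale, so a target whose clicks are tightly clustered shows a visibly stronger peak.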
Despite this evidence that the cue objects provided strong spatial information about the target's expected location, the possibility still exists that observers may act as though anything semantically related to the target may predict its location, regardless of whether the two objects have a strong spatial association in natural scenes. As our current data do not allow us to definitively rule out such a hypothesis, future research with controlled manipulations of semantic association and spatial co-occurrence is necessary to dissociate the contributions of these two factors to search performance.
Limitations
The purpose of the current research was to explore whether previously reported findings regarding the effect of scene context on visual search hold in naturalistic viewing conditions and extend to an alternative definition of scene context. We believe our results provide a compelling preliminary indication that these effects do, in fact, reflect search strategies employed in daily visual experience. However, working with physical stimuli imposes unique time and organizational constraints, which unfortunately restricted the scope of what we could examine in this study. 
First, the use of physical stimuli restricted the number of conditions we could efficiently run. In contrast to studies in which stimuli are presented on computer monitors, the search display in our task could not be rearranged within a reasonable amount of time between trials, limiting us to a static search display. Thus, to keep memory effects minimal and subject numbers manageable, we were only able to test four search targets (and thus four trials of visual search per participant), each tied to a fixed eccentricity. Therefore, evaluating the effect of eccentricity on visual search facilitation or examining the magnitude of the effect on a by-object basis was infeasible, as the effects of search target identity and eccentricity were inseparable.
The use of everyday objects as stimuli also forced us to relinquish some of the experimental control that is characteristic of computer-based vision research. While we chose stimuli, especially the targets and cues, with specific goals in mind regarding their size and visibility, there was still a great amount of variability in the appearance of the stimuli that was beyond our experimental control. Additionally, target and cue objects were chosen for their consistent spatial co-occurrence in natural scenes. However, the exact statistics of how often each cue–target pair co-occurs or how well one predicts the location of the other in a scene are, like many statistical properties of natural scenes, still poorly understood. Given that the goal of the study was to focus on the effect of context in a natural environment (using commonly occurring objects), tight control of stimulus properties was knowingly conceded in order to simulate a realistic search task.
Conclusions and future directions
The current study provided evidence that spatial context (in this case, object co-occurrence information) learned through normal visual experience can facilitate the speed of visual search as well as guide eye movement selection in a natural setting for a simple, naturalistic search task. To our knowledge, this is the first such demonstration of these effects in a natural viewing environment and with our particular conceptualization of scene context. As such, this study provides a solid starting point from which to begin exploring how knowledge of object co-occurrence information can modulate the course of visual search in natural environments. Of particular importance in future research is precisely defining the statistics of natural scenes to more fully understand how such information could be utilized by human observers to accomplish efficient visual search. 
Supplementary Materials
Supplementary Movie
Movie 1. The movie contains 3 trials of raw data from the mobile eye tracker. At the beginning of each trial, which starts in the dark, participants fixate a red laser pointer until told the name of the search target to be found in the upcoming trial. When the lights come on and the experimenter says “go,” the participant is free to move their head and eyes until they locate the search target on the tables in front of them. The title before each trial states the search target and whether it is in the expected or unexpected location. The red crosshair indicates gaze position. 
Appendix A
In this section, we present the results regarding the eye movement data for our other rater (ZF). Although the numerical values for ZF differ slightly from those of BB (presented in the main text and here, in parentheses), all statistical tests but one provide the same results (Table A1). 
Table A1
 
Results of eye movement analyses for rater ZF. For comparison, rater BB's values are presented in parentheses (as well as in the main text). Note that all statistical comparisons but one yield the same result.
Fixation counts across all 41 “unexpected” trials
Total fixations | Cue relevant | Cue irrelevant | All other distractors | Walther and Koch (2006) top | Zhang et al. (2008) top
236 (327) | 23 (27) | 34 (38) | 213 (300) | 9 (13) | 11 (9)

Descriptive statistics
Object | n | M | SD
Cue relevant | 24 | 0.54 (0.63) | 0.62 (0.72)
Cue irrelevant | 24 | 0.28 (0.30) | 0.20 (0.24)
All other distractors | 24 | 0.13 (0.18) | 0.07 (0.10)
Walther and Koch (2006) top | 24 | 0.23 (0.31) | 0.36 (0.38)
Zhang et al. (2008) top | 24 | 0.25 (0.19) | 0.36 (0.32)

Statistical comparisons
Comparison | t | df | p (one-tailed)
Relevant > all other distractors | 3.31 (3.21) | 23 | 0.002* (0.002*)
Relevant > irrelevant | 2.20 (2.15) | 23 | 0.02* (0.02*)
Irrelevant > all other distractors | 4.26 (2.96) | 23 | < 0.001* (0.004*)
Relevant > Walther and Koch (2006) top | 2.01 (1.93) | 23 | 0.03* (0.03*)
Relevant > Zhang et al. (2008) top | 2.17 (3.22) | 23 | 0.02* (0.002*)
Walther and Koch (2006) top > irrelevant | −0.51 (0.14) | 23 | 0.61 (0.89)
Zhang et al. (2008) top > irrelevant | −0.38 (−1.53) | 23 | 0.71 (0.93)
Walther and Koch (2006) top > all other distractors | 1.30 (1.70) | 23 | 0.10 (0.051)
Zhang et al. (2008) top > all other distractors | 1.73 (0.17) | 23 | 0.05* (0.87)

Note: *p < 0.05.

Acknowledgments
Support for this research was provided by the National Science Foundation (NSF-0819582). We thank Britt Bender and Zach Flood for their assistance with data collection and analysis. Portions of this work were previously presented at the Vision Sciences Society Annual Meeting (Mack, Schoonveld, & Eckstein, 2009). 
Commercial relationships: none. 
Corresponding author: Stephen C. Mack. 
Email: mack@psych.ucsb.edu. 
Address: Vision and Image Understanding Laboratory, Department of Psychological and Brain Sciences, University of California Santa Barbara, Santa Barbara, CA, 93106, USA. 
Footnotes
1  It is worth noting that the above studies examined the effect of scene context on visual search utilizing expected and unexpected target locations without actively manipulating semantic consistency. While previous research has shown that the semantic relationship between a search target and its surroundings can impact search quite substantially (e.g., Becker, Pashler, & Lubin, 2007; Bonitz & Gordon, 2008; Brockmole & Henderson, 2008; De Graef, Christiaens, & d'Ydewalle, 1990; Henderson, Weeks, & Hollingworth, 1999; Loftus & Mackworth, 1978; Underwood & Foulsham, 2006; Vo & Henderson, 2009), in the studies discussed in the main text, search targets were all semantically consistent with their surrounding context.
2  However, data from rater ZF indicated that the most fixated of the top 5 most salient objects for Zhang et al. (2008) received significantly more fixations per trial than the mean of all other distractors (t(23) = 1.73, p = 0.05).
References
Bar M. (2004). Visual objects in context. Nature Reviews Neuroscience, 5, 617–629.
Becker M. W. Pashler H. Lubin J. (2007). Object-intrinsic oddities draw early saccades. Journal of Experimental Psychology: Human Perception and Performance, 33, 20–30.
Beutter B. R. Eckstein M. P. Stone L. S. (2003). Saccadic and perceptual performance in visual search tasks: I. Contrast detection and discrimination. Journal of the Optical Society of America, 20, 1341–1355.
Bonitz V. S. Gordon R. D. (2008). Attention to smoking-related and incongruous objects during scene viewing. Acta Psychologica, 129, 255–263.
Brady T. F. Chun M. M. (2007). Spatial constraints on learning in visual search: Modeling contextual cuing. Journal of Experimental Psychology: Human Perception and Performance, 33, 798–815.
Brockmole J. R. Castelhano M. S. Henderson J. M. (2006). Contextual cueing in naturalistic scenes: Global and local contexts. Journal of Experimental Psychology, 32, 699–706.
Brockmole J. R. Hambrick D. Z. Windisch D. J. Henderson J. M. (2008). The role of meaning in contextual cueing: Evidence from chess expertise. Quarterly Journal of Experimental Psychology, 61, 1886–1896.
Brockmole J. R. Henderson J. M. (2006a). Recognition and attention guidance during contextual cueing in real-world scenes: Evidence from eye movements. Quarterly Journal of Experimental Psychology, 59, 1177–1187.
Brockmole J. R. Henderson J. M. (2006b). Using real-world scenes as contextual cues for search. Visual Cognition, 13, 99–108.
Brockmole J. R. Henderson J. M. (2008). Prioritizing new objects for eye fixation in real-world scenes: Effects of object-scene consistency. Visual Cognition, 16, 375–390.
Bruce N. D. B. Tsotsos J. K. (2009). Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9, (3):5, 1–24, http://www.journalofvision.org/content/9/3/5, doi:10.1167/9.3.5.
Castelhano M. S. Heaven C. (2010). The relative contribution of scene context and target features to visual search in scenes. Attention, Perception, & Psychophysics, 72, 1283–1297.
Castelhano M. S. Henderson J. M. (2007). Initial scene representations facilitate eye movement guidance in visual search. Journal of Experimental Psychology: Human Perception and Performance, 33, 753–763.
Chun M. M. (2000). Contextual cueing of visual attention. Trends in Cognitive Sciences, 4, 170–178.
Chun M. M. Jiang Y. (1998). Contextual cueing: Implicit learning and memory of visual context guides spatial attention. Cognitive Psychology, 36, 28–71.
Chun M. M. Jiang Y. (1999). Top-down attentional guidance based on implicit learning of visual covariation. Psychological Science, 10, 360–365.
Chun M. M. Jiang Y. (2003). Implicit, long-term spatial contextual memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 224–234.
Davenport J. L. (2007). Consistency effects between objects in scenes. Memory & Cognition, 35, 393.
De Graef C. Christiaens D. d'Ydewalle G. (1990). Perceptual effects of scene context on object identification. Psychological Research, 52, 317–329.
Droll J. A. Abbey C. K. Eckstein M. P. (2009). Learning cue validity through performance feedback. Journal of Vision, 9, (2):18, 1–22, http://www.journalofvision.org/content/9/2/18, doi:10.1167/9.2.18.
Droll J. A. Eckstein M. P. (2008). Expected object position of two hundred fifty observers predicts first fixations of seventy seven separate observers during search [Abstract]. Journal of Vision, 8, (6):320, 320a, http://www.journalofvision.org/content/8/6/320, doi:10.1167/8.6.320.
Droll J. A. Eckstein M. P. (2009). Gaze control, change detection and the selective storage of object information while walking in a real world environment. Visual Cognition, 17, 1159–1184.
Eckstein M. P. Beutter B. R. Pham B. T. Shimozaki S. S. Stone L. S. (2007). Similar neural representations of the target for saccades and perception during search. Journal of Neuroscience, 27, 1266–1270.
Eckstein M. P. Drescher B. A. Shimozaki S. S. (2006). Attentional cues in real scenes, saccadic targeting, and Bayesian priors. Psychological Science, 17, 973–980.
Eckstein M. P. Thomas J. P. Palmer J. Shimozaki S. S. (2000). A signal detection model predicts the effects of set size on visual search accuracy for feature, conjunction, triple conjunction, and disjunction displays. Perception & Psychophysics, 62, 425–451.
Ehinger K. A. Brockmole J. R. (2008). The role of color in visual search in real-world scenes: Evidence from contextual cueing. Perception & Psychophysics, 70, 1366–1378.
Ehinger K. A. Hidalgo-Sotelo B. Torralba A. Oliva A. (2009). Modelling search for people in 900 scenes: A combined source model of eye guidance. Visual Cognition, 17, 945–978.
Findlay J. M. (1997). Saccade target selection during visual search. Vision Research, 37, 617–631.
Gajewski D. A. Pearson A. M. Mack M. L. Bartlett F. N. Henderson J. M. (2005). Human gaze control in real world search. In Paletta L. Tsotsos, J. K. Rome E. Humphreys G. (Eds.), Attention and performance in computational vision (pp. 83–99). Heidelberg, Germany: Springer-Verlag.
Hayhoe M. Ballard D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9, 188–194.
Hayhoe M. M. Shrivastava A. Mruczek R. Pelz J. B. (2003). Visual memory and motor planning in a natural task. Journal of Vision, 3, (1):6, 49–63, http://www.journalofvision.org/content/3/1/6, doi:10.1167/3.1.6.
Henderson J. M. (2003). Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7, 498–504.
Henderson J. M. Brockmole J. R. Castelhano M. S. Mack M. L. (2007). Visual saliency does not account for eye movements during visual search in real-world scenes. In van R. Fischer M. Murray W. Hill R. W. (Eds.), Eye movements: A window on mind and brain (pp. 537–562). Amsterdam, The Netherlands: Elsevier.
Henderson J. M. Malcolm G. L. Schandl C. (2009). Searching in the dark: Cognitive relevance drives attention in real-world scenes. Psychonomic Bulletin & Review, 16, 850.
Henderson J. M. Weeks P. A. Hollingworth A. (1999). The effects of semantic consistency on eye movements during complex scene viewing. Journal of Experimental Psychology, 25, 210–228.
Hidalgo-Sotelo B. Oliva A. Torralba A. (2005). Human learning of contextual priors for object search: Where does the time go? In Proceedings of the 3rd Workshop on Attention and Performance in Computer Vision, Int. CVPR. Washington, DC: IEEE Computer Society.
Hollingworth A. (2009). Two forms of scene memory guide visual search: Memory for scene context and memory for the binding of target object to scene location. Visual Cognition, 17, 273–291.
Itti L. Koch C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489–1506.
Itti L. Koch C. (2001). Computational modelling of visual attention. Nature Reviews Neuroscience, 2, 1–11.
Itti L. Koch C. Niebur E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1254–1259.
Jiang Y. Wagner L. C. (2004). What is learned in spatial contextual cuing—Configuration or individual locations? Perception & Psychophysics, 66, 454–463.
Koch C. Ullman S. (1985). Selecting one among the many: A simple network implementing shifts in selective visual attention. Human Neurobiology, 4, 219–227.
Kowler E. Pizlo Z. Zhu G. Erkelens C. J. Steinman R. M. Collewijn H. (1992). Coordination of head and eyes during the performance of natural (and unnatural) visual tasks. In Berthoz A. Graf W. Vidal P. P. (Eds.), The head–neck sensory motor system. Oxford: Oxford University Press.
Lamy D. Zoaris L. (2009). Task-irrelevant stimulus salience affects visual search. Vision Research, 49, 1472–1480.
Land M. (2004a). Eye movements in daily life. In Chalupa L. Werner J. (Eds.), The visual neurosciences (vol. 2, pp. 1357–1368). Cambridge, MA: MIT Press.
Land M. F. (2004b). The coordination of rotations of the eyes, head and trunk in saccadic turns produced in natural situations. Experimental Brain Research, 159, 151–160.
Land M. F. Hayhoe M. (2001). In what way do eye movements contribute to everyday activities? Vision Research, 41, 3559–3565.
Loftus G. R. Mackworth N. H. (1978). Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology: Human Perception and Performance, 4, 565–572.
Mack S. C. Schoonveld W. Eckstein M. P. (2009). Contextual cues facilitate visual search in real world 3-D environments [Abstract]. Journal of Vision, 9, (8):1215, 1215a, http://www.journalofvision.org/content/9/8/1215, doi:10.1167/9.8.1215.
Malcolm G. L. Henderson J. M. (2010). Combining top-down processes to guide eye movements during real-world scene search. Journal of Vision, 10, (2):4, 1–11, http://www.journalofvision.org/content/10/2/4, doi:10.1167/10.2.4. [PubMed] [Article] [CrossRef]
Najemnik J. Geisler W. S. (2005). Optimal eye movement strategies in visual search. Nature, 434, 387–391. [CrossRef] [PubMed]
Neider M. B. Zelinsky G. J. (2006). Scene context guides eye movements during visual search. Vision Research, 46, 614–621. [CrossRef] [PubMed]
Oliva A. Torralba A. Castelhano M. S. Henderson J. M. (2003). Top-down control of visual attention in object detection. Proceedings of the IEEE International Conference on Image Processing, 1, 253–256.
Olson I. R. Chun M. M. (2002). Perceptual constraints on implicit learning of spatial context. Visual Cognition, 9, 273–302. [CrossRef]
Parkhurst D. Law K. Niebur E. (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42, 107–123. [CrossRef] [PubMed]
Peterson M. S. Kramer A. F. (2001). Attentional guidance of the eyes by contextual information and abrupt onsets. Perception & Psychophysics, 63, 1239–1249. [CrossRef] [PubMed]
Rajashekar U. Bovik A. C. Cormack L. K. (2006). Visual search in noise: Revealing the influence of structural cues by gaze-contingent classification image analysis. Journal of Vision, 6, (4):7, 379–386, http://www.journalofvision.org/content/6/4/7, doi:10.1167/6.4.7. [PubMed] [Article] [CrossRef]
Rao R. P. N. Zelinsky G. J. Hayhoe M. M. Ballard D. H. (2002). Eye movements in iconic visual search. Vision Research, 42, 1447–1463. [CrossRef] [PubMed]
Rosenholtz R. (1999). A simple saliency model predicts a number of motion popout phenomena. Vision Research, 39, 3157–3163. [CrossRef] [PubMed]
Steinman R. M. (2003). Gaze control under natural conditions. In Chalupa L. M. Werner J. S. (Eds.), The visual neurosciences (pp. 1339–1356). Cambridge, MA: MIT Press.
Tatler B. W. (2009). Current understanding of eye guidance. Visual Cognition, 17, 777–789. [CrossRef]
Tatler B. W. Baddeley R. J. Gilchrist I. D. (2006). Visual correlates of fixation selection: Effects of scale and time. Vision Research, 45, 643–659. [CrossRef]
Tavassoli A. van der Linde I. Bovik A. C. Cormack L. K. (2007). An efficient technique for revealing visual search strategies with classification images. Perception & Psychophysics, 69, 103–112. [CrossRef] [PubMed]
Torralba A. (2003). Modeling global scene factors in attention. Journal of the Optical Society of America, 20, 1407–1418. [CrossRef] [PubMed]
Torralba A. Oliva A. Castelhano M. S. Henderson J. M. (2006). Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113, 766–786. [CrossRef] [PubMed]
Underwood G. Foulsham T. (2006). Visual saliency and semantic incongruency influence eye movements when inspecting pictures. Quarterly Journal of Experimental Psychology, 59, 1931–1949. [CrossRef]
Vo M. L. Henderson J. M. (2009). Does gravity matter? Effects of semantic and syntactic inconsistencies on the allocation of attention during scene perception. Journal of Vision, 9, (3):24, 1–15, http://www.journalofvision.org/content/9/3/24, doi:10.1167/9.3.24. [PubMed] [Article] [CrossRef]
Walther D. Koch C. (2006). Saliency Toolbox 2.0. http://saliencytoolbox.net.
Zelinsky G. J. (2008). A theory of eye movements during target acquisition. Psychological Review, 115, 787–835. [CrossRef] [PubMed]
Zhang L. Tong M. H. Marks T. K. Shan H. Cottrell G. W. (2008). SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8, (7):32, 1–20, http://www.journalofvision.org/content/8/7/32, doi:10.1167/8.7.32. [PubMed] [Article] [CrossRef]
Figure 1
 
Layout of the search display from the viewpoint of the participant turning their head to either side. A search target (headphones) is shown in both its expected and unexpected locations. In the expected condition, the headphones are adjacent to their cue object (iPod; indicated as “Cue”), while in the unexpected condition the headphones are in an eccentricity-matched location across the midline, surrounded by unrelated distractors. For each participant, the display was static and each target appeared in only one of the two locations.
Figure 2
 
Search targets and their paired cue objects. Target and cue object pairs were chosen for their consistent spatial co-occurrence in natural scenes.
Figure 3
 
Sample screenshot of participant data video obtained from the mobile eye tracker. Red crosshairs indicate gaze position in the scene.
Figure 4
 
Eye movement selection in the unexpected condition. The figure depicts the mean number of fixations on the relevant cue object, on all other distractors in the display (averaged), and on the distractor that garnered the largest number of fixations (i.e., the maximally fixated distractor), separated by search target. While the relevant cue object garnered an above-average number of fixations, it was the maximally fixated distractor for only one of the four search targets (fork). In fact, the maximally fixated distractor often appeared to share basic features with the search target (e.g., the shapes of the credit card and iPod are quite similar). Error bars represent one SEM.
Figure 5
 
Mean number of fixations for cue objects when contextually relevant, cue objects when contextually irrelevant, the average of all of the distractor objects, and, for each salience model, the object among its top 5 most salient objects that garnered the most fixations. Error bars represent one SEM.
Figure 6
 
Normalized click densities for the expected location of each target object. Each click was convolved with a 2D Gaussian whose spread was determined by the standard deviation of the click response coordinates for each object. The resultant images were normalized with respect to the largest value across all four images. For each search target, click densities clearly peak at the location of the respective cue object (location indicated by white arrow).
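The density computation described in this caption can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' analysis code: the function names `click_density` and `normalize_jointly`, the use of separate per-axis standard deviations, and the one-pixel floor on the kernel width are all assumptions introduced here.

```python
import math
from statistics import pstdev


def click_density(clicks, width, height):
    """Sum an axis-aligned 2D Gaussian centred on every click.

    clicks: list of (x, y) pixel coordinates for one target object.
    Returns a height x width grid (list of rows) of unnormalized density.
    """
    xs = [x for x, _ in clicks]
    ys = [y for _, y in clicks]
    # Assumption: kernel spread is the per-axis SD of the click coordinates,
    # floored at one pixel so that identical clicks do not give a zero width.
    sx = max(pstdev(xs), 1.0)
    sy = max(pstdev(ys), 1.0)
    density = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            density[row][col] = sum(
                math.exp(-0.5 * (((col - cx) / sx) ** 2 + ((row - cy) / sy) ** 2))
                for cx, cy in clicks
            )
    return density


def normalize_jointly(maps):
    """Divide every map by the single largest value found across all maps."""
    peak = max(value for grid in maps for row in grid for value in row)
    return [[[value / peak for value in row] for row in grid] for grid in maps]


# Hypothetical clicks for two objects on a small 11 x 11 grid.
map_a = click_density([(4, 5), (5, 5), (6, 5), (5, 4), (5, 6)], 11, 11)
map_b = click_density([(2, 2)], 11, 11)
norm_a, norm_b = normalize_jointly([map_a, map_b])
```

Normalizing against the maximum across all maps, rather than within each map, keeps the peak heights comparable between target objects, which is what allows the four panels to share one color scale.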
Table A1
 
Results of eye movement analyses for rater ZF. For comparison, rater BB's values are presented in parentheses (as well as in the main text). Note that all statistical comparisons but one yield the same result.
Fixation counts across all 41 “unexpected” trials

| Total fixations | Cue relevant | Cue irrelevant | All other distractors | Walther and Koch (2006) top | Zhang et al. (2008) top |
| --- | --- | --- | --- | --- | --- |
| 236 (327) | 23 (27) | 34 (38) | 213 (300) | 9 (13) | 11 (9) |

Descriptive statistics

| Object category | n | M | SD |
| --- | --- | --- | --- |
| Cue relevant | 24 | 0.54 (0.63) | 0.62 (0.72) |
| Cue irrelevant | 24 | 0.28 (0.30) | 0.20 (0.24) |
| All other distractors | 24 | 0.13 (0.18) | 0.07 (0.10) |
| Walther and Koch (2006) top | 24 | 0.23 (0.31) | 0.36 (0.38) |
| Zhang et al. (2008) top | 24 | 0.25 (0.19) | 0.36 (0.32) |

Statistical comparisons

| Comparison | t | df | p (one-tailed) |
| --- | --- | --- | --- |
| Relevant > all other distractors | 3.31 (3.21) | 23 | 0.002* (0.002*) |
| Relevant > irrelevant | 2.20 (2.15) | 23 | 0.02* (0.02*) |
| Irrelevant > all other distractors | 4.26 (2.96) | 23 | < 0.001* (0.004*) |
| Relevant > Walther and Koch (2006) top | 2.01 (1.93) | 23 | 0.03* (0.03*) |
| Relevant > Zhang et al. (2008) top | 2.17 (3.22) | 23 | 0.02* (0.002*) |
| Walther and Koch (2006) top > irrelevant | −0.51 (0.14) | 23 | 0.61 (0.89) |
| Zhang et al. (2008) top > irrelevant | −0.38 (−1.53) | 23 | 0.71 (0.93) |
| Walther and Koch (2006) top > all other distractors | 1.30 (1.70) | 23 | 0.10 (0.051) |
| Zhang et al. (2008) top > all other distractors | 1.73 (0.17) | 23 | 0.05* (0.87) |

Note: *p < 0.05.
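Each comparison in Table A1 reports 23 degrees of freedom for n = 24, which is consistent with a paired (within-observer) t test, although the table itself does not name the test. Under that assumption, the t statistic and degrees of freedom could be computed along the following lines. The function name `paired_t` and the sample values are hypothetical and are not the study's data.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t, df) for a paired t test of the hypothesis mean(a) > mean(b).

    a, b: per-observer mean fixation counts for two object categories,
    given in the same observer order.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Standard error of the mean difference (sample SD, n - 1 denominator).
    se = stdev(diffs) / math.sqrt(n)
    if se == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / se, n - 1


# Hypothetical per-observer values, for illustration only.
t_value, df = paired_t([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0])
```

The one-tailed p value is then the upper-tail probability of a t distribution with `df` degrees of freedom evaluated at `t_value`; the Python standard library has no t distribution, so a statistics package would be needed for that final step.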

Supplementary Movie