Research Article  |   July 2016
Task and context determine where you look
Constantin A. Rothkopf, Dana H. Ballard, Mary M. Hayhoe
Journal of Vision July 2016, Vol.7, 16. doi:https://doi.org/10.1167/7.14.16

Abstract

The deployment of human gaze has been studied almost exclusively independently of any specific ongoing task and has been limited largely to two-dimensional picture viewing. This contrasts with its use in everyday life, which mostly consists of purposeful tasks in which gaze is crucially involved. To better understand the deployment of gaze under such circumstances, we devised a series of experiments in which subjects navigated along a walkway in a virtual environment and executed combinations of approach and avoidance tasks. The position of the body and the gaze were monitored during the execution of the task combinations, and the dependence of gaze on the ongoing tasks as well as on the visual features of the scene was analyzed. Gaze distributions were compared to a random gaze allocation strategy as well as to a specific “saliency model.” Gaze distributions showed high similarity across subjects. Moreover, the precise fixation locations on the objects depended on the ongoing task, to the point that the specific tasks could be predicted from the subjects' fixation data. By contrast, gaze allocation according to a random or a saliency model predicted neither the executed fixations nor the observed dependence of fixation locations on the specific task.

Introduction
One of the essential properties of the human visual system is its active nature (Ballard, 1991; Ballard, Hayhoe, Pook, & Rao, 1997; Findlay & Gilchrist, 2003; O'Regan & Noë, 2001): Humans shift gaze between locations in the visual world on the order of three times per second, resulting in more than 150,000 gaze shifts a day. This means that the visual system executes an action in order to select a target for perception. To understand vision, it is therefore essential to study gaze deployment: Where is gaze directed in a scene, and what computations are performed at the gaze point?
Gaze control is often described in conjunction with visual attention. Although overt shifts of attention involving eye movements can be dissociated from shifts of covert attention (Posner & Cohen, 1984), it has been shown that voluntary saccadic eye movements involve a preceding shift of attention (Deubel, Shimojo, & Paprotta, 1997; Henderson, 2003; Kowler, Anderson, Dosher, & Blaser, 1995). Therefore, under most circumstances, the direction of gaze reflects ongoing computations and can be used to infer the moment-to-moment cognitive processing that subjects are engaged in (Ballard, Hayhoe, & Pelz, 1995; Liversedge & Findlay, 2000; Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). 
Although it is clear which point in a scene gaze is directed to, what is being processed at each moment in time is not so easily inferred. That is, if a subject fixates a particular object in a scene, it is not clear which features are being processed. Although human self-awareness seems to suggest a continuous perception of object identities and features, considerable evidence demonstrates that this may not be what is in fact represented (Ballard et al., 1995; Droll, Hayhoe, Triesch, & Sullivan, 2005; Paprotta, Deubel, & Schneider, 1999; Simons & Rensink, 2005; Triesch, Ballard, Hayhoe, & Sullivan, 2003).
This fundamental distinction between the location of gaze and the computations being executed while gaze is directed to that location, that is, the difference between looking and seeing, is reflected in two fundamentally different approaches to vision and to gaze control in particular. Explanations of the allocation of gaze in a scene have usually emphasized either bottom-up saliency (Itti & Koch, 2000; Oliva, Torralba, Castelhano, & Henderson, 2003; Parkhurst & Niebur, 2002), which is independent of the computational goal, or top-down cognitive control (Buswell, 1935; Hayhoe & Ballard, 2005; Land, 2004; Yarbus, 1967), in which the behavioral objective is stressed. These two approaches have emphasized different ways of thinking about vision and have therefore produced different experimental paradigms and different explanatory frameworks. Whereas the former is directed more toward a mechanistic implementation, the latter addresses the question of the goal of vision (Marr, 1982).
Bottom-up models
The bottom-up saliency assumption is based on the hypothesis that certain features of the visual scene inherently attract gaze; that is, that vision is essentially reactive and stimulus driven. This view is in part based on psychophysical studies (Krieger, Rentschler, Hauske, Schill, & Zetzsche, 2000; Mannan, Ruddock, & Wooding, 1996; Reinagel & Zador, 1999; Tatler, Baddeley, & Gilchrist, 2005) in which differences in image properties were observed between fixated and randomly chosen locations. In such studies, the initial fixations when viewing a large number of two-dimensional photographic images are recorded, and statistics for several features are compared between the gaze locations chosen by the subjects and randomly chosen parts of the scene. Such studies have produced mixed results regarding which features are significantly different at fixation locations. Whereas some researchers found that luminance contrast was elevated at the point of gaze (Parkhurst & Niebur, 2003), other studies found that edge density was significantly higher at fixation locations (Baddeley & Tatler, 2006; Mannan et al., 1996). Subsequent models have been proposed that relate the bottom-up assumption to neuronal processing in cortical visual areas. Starting from the notion that “early” visual areas represent low-level features such as oriented edges, such models have extracted analogous features from images and proposed methods by which a scalar saliency map could be calculated (Itti, Koch, & Niebur, 1998; Koch & Ullman, 1985). These methods apply different forms of center-surround competitive algorithms in order to find regions of across-scale contrast within single feature dimensions and then combine the multiple feature maps into a single saliency map (Itti, 2000) by some weighting scheme. Such feature-saliency-based models have a large number of free parameters that have to be adjusted in order to obtain meaningful saliency maps: the number of filters, their parameters such as orientations and spatial frequencies, the spatial scales, the normalization functions, the summation rules, and the parameters of the network implementing the spatial competition within the saliency map.
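For concreteness, the following minimal Python sketch illustrates the kind of multi-scale center-surround computation such models perform. It is not the published implementation: it uses only an intensity and an edge-energy channel, arbitrary Gaussian scales, and a much cruder map normalization than the iterative spatial competition described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround(feature, centers=(1, 2), surrounds=(4, 8)):
    """Across-scale contrast: difference between finely and coarsely blurred versions."""
    return [np.abs(gaussian_filter(feature, c) - gaussian_filter(feature, s))
            for c in centers for s in surrounds]

def normalize(m):
    """Crude stand-in for the model's normalization: rescale to [0, 1] and
    weight maps that contain a single dominant peak more strongly."""
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)
    return m * (m.max() - m.mean()) ** 2

def saliency_map(image_gray):
    """Sum normalized center-surround maps over two feature channels
    (intensity and edge energy) into a single conspicuity map."""
    edges = np.hypot(*np.gradient(image_gray))
    total = np.zeros_like(image_gray)
    for feature in (image_gray, edges):
        for m in center_surround(feature):
            total += normalize(m)
    return total / (total.max() + 1e-8)

# Example: location predicted as most salient in a random test image.
img = np.random.default_rng(0).random((240, 320))
sal = saliency_map(img)
print("predicted fixation (row, col):", np.unravel_index(np.argmax(sal), sal.shape))
```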
Bottom-up models have recently incorporated contextual and top-down effects (Itti, 2000; Navalpakkam & Itti, 2005; Oliva et al., 2003; Torralba, Oliva, Castelhano, & Henderson, 2006), but these effects have been described as modulating the saliency map and have been restricted to the tasks of object detection or object recognition. Navalpakkam and Itti (2005) proposed such a model for object detection: starting from key words specifying a task, their saliency map is biased toward known image features of the corresponding target object. Torralba et al. (2006) also considered an object detection task and proposed a spatial modulation of the saliency map by an additional map representing the likely positions of target objects in a scene. Such a map was obtained by training a supervised algorithm on a database of images in which the likely positions of certain objects had been hand-labeled. Feature vectors were calculated by convolving the image with oriented filters at multiple spatial scales and then reducing the dimensionality of these feature vectors. The motivation for this approach is that the statistics of image features can provide a clue to the likely positions of particular objects. The rationale is that locations in which a particular target is likely to be found are similar in certain feature dimensions and that humans extract such regularities through experience. Experimental work has assessed how well such bottom-up models can predict human gaze. Most saliency models extracting features such as orientation, luminance, and color opponency have been assessed with the so-called free-viewing task (Itti, 2005; Parkhurst & Niebur, 2003; Tatler et al., 2005), in which subjects examine a scene without further instructions. The free-viewing task is unfortunately very uncontrolled: it is not clear what the subject is looking for or what the subject assumes about the setup that the experimenter designed. Given that most of these studies have used different metrics for comparing the saliency maps with human fixations, it is difficult to assess how well the models describe human gaze. In summary, the first few gaze locations selected by humans when free-viewing fractal images or finding likely target locations in a set of similar images correlate somewhat with the predictions of bottom-up saliency models. However, Einhäuser and König (2003) randomly altered the luminance contrast of parts of natural scenes and repeated the analysis of features at fixation locations. Their study showed that the bias to fixate regions of high contrast disappeared after such manipulations. Thus, increased contrast at fixation locations may be correlational in nature and not causal. Similarly, in a recent study by Henderson, Brockmole, Castelhano, and Mack (2006), the locations selected by gaze correlated not only with high local contrast but also with semantic content related to the task, so that the authors concluded that the relation between gaze and bottom-up saliency is correlational and not causal. All these reservations about bottom-up modeling serve to motivate the study of top-down gaze control, which itself has a significant history.
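The contextual-guidance idea can likewise be reduced to a few lines: a bottom-up saliency map is reweighted pointwise by a spatial prior over likely target locations. In the sketch below the prior is a hypothetical hand-specified Gaussian band, not the learned, feature-conditioned prior of Torralba et al. (2006).

```python
import numpy as np

def contextual_guidance(saliency, spatial_prior):
    """Reweight a bottom-up saliency map by a prior over likely target locations
    (pointwise multiplication is a common simplification of the full model)."""
    combined = saliency * spatial_prior
    return combined / combined.sum()

# Hypothetical prior: the target tends to appear in a horizontal band of the image.
h, w = 240, 320
rows = np.arange(h, dtype=float)[:, None]
prior = np.exp(-0.5 * ((rows - 150.0) / 20.0) ** 2) * np.ones((h, w))

posterior = contextual_guidance(np.random.default_rng(1).random((h, w)), prior)
print("most likely target location:", np.unravel_index(np.argmax(posterior), posterior.shape))
```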
Top-down models
The influence of tasks on gaze control has been described since the experiments of Buswell (1935) and Yarbus (1967). These effects become even more prominent in tasks that are not based on picture viewing but instead study subjects actively interacting with the environment while executing goal-directed behavior (Land & Hayhoe, 2001). Task-based models are required to describe which features are in fact more likely to be fixated depending on the particular task at hand and how semantic content influences the gaze strategy. Indeed, it is well established that the ongoing task influences the gaze strategy. As soon as the visual scenes become meaningful, the objects are related by semantic meaning, or prior knowledge can be used to find likely locations for targets, these factors become significantly better predictors of the direction of gaze (Henderson & Hollingworth, 1999). The allocation of gaze has been studied in tasks such as copying arrangements of blocks (Ballard et al., 1995), making tea (Land, Mennie, & Rusted, 1999), making sandwiches (Hayhoe, Shrivastava, Mruczek, & Pelz, 2003), driving (Land & Lee, 1994), and other goal-directed behavior; see Hayhoe and Ballard (2005) and Land (2004) for reviews.
A recurrent theme of task-based studies is the functional relation of gaze to the ongoing behavioral sequence (Hayhoe, 2000): the direction of gaze during the execution of such a task can be predicted significantly better by the phase of the action sequence than by local features of the visual scene. In the game of cricket (Land & McLeod, 2000), gaze is temporally linked to the sequence of events relevant to hitting the approaching ball. Under such circumstances, gaze may be directed to parts of the scene that are not distinguishable from other parts by any feature other than that the subject is expecting the ball to move toward that location. In a block-sorting task in which subjects watched a person stacking a set of blocks, gaze was predictively directed toward expected points of interaction (Flanagan & Johansson, 2003). In experiments carried out by Johansson, Westling, Bäckström, and Flanagan (2001), subjects fixated the spatial locations relevant to the ongoing hand movements with a high temporal repeatability that depended on the features of the manual interaction. In another block-copying experiment (Ballard et al., 1995), subjects arranged colored blocks from a resource area according to a pattern shown in a model area. The observed eye movements showed a regular sequential pattern within the evolution of the task that could be interpreted in terms of momentary information-processing needs: fixations were directed to the model to obtain a block's color, then to a corresponding block of that color in the resource area, then back to the model to get its position in the pattern, followed by a fixation toward the location in the work area to which the block was subsequently moved.
Gaze has also been functionally interpreted in simple visuomotor tasks such as pointing and grasping. One commonly observed gaze strategy in picture viewing is to direct the initial fixation toward the center of extended objects (He & Kowler, 1991) and, similarly, to target the “center of gravity” in cluttered scenes. In pointing, gaze and hand are often tightly correlated and are directed toward the point of contact (Frens & Erkelens, 1991). Recently, Brouwer, Franz, and Gegenfurtner (in press) investigated the difference between where subjects look when viewing an object and where they look when grasping it and found that subjects tended to fixate the future contact point of the index finger. Johansson et al. (2001) described a complex gaze behavior in a more extended task in which subjects grasped objects, moved them to a target location, and sometimes avoided obstacles along the way. All these results point toward the fact that in extended behavior in which subjects execute actions, gaze is tightly linked to the ongoing demands of the executed task.
In addition to the psychophysical results, an abundance of recent neurophysiological results demonstrates that cortical areas representing signals for the planning and execution of eye movements are closely linked to the expected temporal delivery of reward within the task (Glimcher, 2003; Schultz, 2000). Moreover, the functionality of so-called “early” perceptual areas has been demonstrated to reflect the behavioral state of the organism. Two recent reviews (Kayser, Körding, & König, 2004; Olshausen & Field, 2005) collect a number of arguments and empirical results demonstrating that characterizations of V1 cells with task-neutral white noise stimuli do not adequately predict their responses to natural stimuli in awake, behaving animals. In addition, several results show effects of the task on the activity of neurons as early as area V1. For example, in a series of experiments by Li, Piech, and Gilbert (2004), monkeys were trained on a bisection task and a vernier acuity task. The stimulus was the same in both tasks, but the monkey was cued as to which task to execute. The activity of V1 neurons, as assessed from their tuning curves, depended on the task the monkey was engaged in. Moreover, the tuning curves switched with the task on the time scale of individual trials. This is relevant to the modeling of biological mechanisms of visual attention because it demonstrates that the processing of visual stimuli in cortical regions often described as “early” stages is active and dependent on the current behavioral goal of the agent.
In view of this evidence for the central role of the ongoing task in the allocation of gaze, it is necessary to quantify the influence of tasks on vision and to develop a theoretical framework describing where gaze is directed in complex extended tasks. A goal of such task-related models is to formulate cost functions that can direct gaze in complex tasks. Trommershäuser, Maloney, and Landy (2003) demonstrated that such cost functions can accurately describe human visuomotor behavior in a manual reaching task in which the reward structure was made explicit through monetary rewards. Nelson and Cottrell (2007) showed how gaze is directed to the most disambiguating parts of an object in a shape-learning task. Sprague and Ballard (2003) devised a model of gaze control for an extended visuomotor task with multiple competing goals. They propose that control of gaze can be understood as a minimum-loss strategy implemented using reinforcement learning. Starting from the notion that the visual system actively seeks to extract specific information from the visual array, the model gives an account of why eye movements are directed toward certain parts of the scene, given the task demands. The agent is faced with a representation of the visual scene that enables it to extract relevant information about the state of the objects in the world. The model incorporates specific representations of the locations of the relevant objects in the scene, and the eyes are directed toward relevant parts of the environment in order to update these internal representations. The crucial point is that concurrent demands are imposed on the system because uncertainty about the environment increases with time if the relevant information is not updated. In the proposed model, the eyes are therefore moved toward parts of the scene where information is available to minimize the loss of reward.
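To make the scheduling idea concrete, the toy sketch below (our illustration, not the published model) lets the uncertainty about each task-relevant state grow while that task is unattended and directs gaze to the task whose unresolved uncertainty would cost the most expected reward; the reward weights, growth rate, and reset value are invented.

```python
# Toy gaze scheduler in the spirit of Sprague and Ballard (2003); all numbers are invented.
tasks = ["avoid_obstacle", "pickup_litter", "stay_on_walkway"]
reward_weight = {"avoid_obstacle": 3.0, "pickup_litter": 2.0, "stay_on_walkway": 1.0}
variance = {t: 0.1 for t in tasks}   # uncertainty of each task-relevant state estimate
GROWTH = 0.05                        # variance added per step while a task is unattended

def expected_loss(task):
    """Expected cost of leaving this task unobserved for another step."""
    return reward_weight[task] * variance[task]

for step in range(10):
    target = max(tasks, key=expected_loss)   # fixate the task that is costliest to ignore
    for t in tasks:
        variance[t] = 0.02 if t == target else variance[t] + GROWTH
    print(step, target)
```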
Quantifying task dependence
Gaze location does not uniquely specify the information being extracted. Attempts to consider task effects in the context of saliency models do not distinguish the ongoing computations, whereas task-based models explicitly represent the difference between directing gaze toward a particular part of the scene and the information being extracted there. Thus, there is a need to clarify and quantify the degree to which these task-based models are accurate descriptions of human behavior and to study how human gaze behavior depends on the ongoing tasks. If vision is an active process that depends on the behavioral goal of the organism, vision needs to be studied while humans are engaged in purposeful, goal-directed behavior. Naturalistic tasks are therefore appropriate for addressing the question of how vision is used. Accordingly, two major challenges have to be addressed: (1) the design of goal-directed experiments in natural environments and (2) the techniques for measuring and describing goal-directed extended behavior.
First, “natural task” is a loose description for tasks that fulfill several criteria. It is important that subjects are involved in such activities in everyday life and are not confronted with the task for the first time when executing an experiment in a laboratory. In a visual search task, the stimuli may be artificial in order to probe feature searches (Wolfe, 1998) or may be more naturalistic, that is, photographs of natural scenes (Krieger et al., 2000; Mannan et al., 1996; Reinagel & Zador, 1999; Tatler et al., 2005). But the task itself can be natural, too, as in the previously mentioned studies of gaze during tea making (Land et al., 1999) or sandwich making (Hayhoe et al., 2003). In an analogous fashion, human subjects can be highly engaged in computer games that use only very crude geometric representations of the world. Another criterion is that natural tasks are often practiced over a long time and therefore do not require a high degree of learning when subjects start an experimental session. By using virtual environments, it is possible to balance the required complexity of the scene against enough control over it to change it parametrically.
The second major challenge is how to quantify and analyze natural behavior. Complex environments such as a kitchen or a sports court contain numerous objects and visual stimuli that change over time. Additionally, subjects may be interacting with and changing the environment in real time. The dimensionality of the stimulus, or of the state space describing the environment, is therefore very large. Similarly, complex behavior such as movement through the world, gaze behavior, and interactions with objects in the scene is inherently high dimensional as well. Accordingly, tools have to be developed in order to represent and analyze these high-dimensional data.
Specific aims
This article presents psychophysical results that address both of the above issues. The tasks considered here are extended visuomotor tasks that require goal-directed interaction with the environment. Human subjects navigated through a virtual reality environment, approaching and avoiding objects while executing sequences of approximately 500 fixations per trial on average. By having subjects execute the single tasks as well as the task combination, the differences in the allocation of gaze as a function of the respective task can be observed. These differences are then quantified more precisely.
The experiments presented here were devised to compare how well task-based models and bottom-up saliency models can explain the distribution of fixations observed in extended tasks that cannot be conceptualized as search or object detection tasks. Relevant parameters of behavior that can give clues about the ongoing processing include fixation location, fractional gaze allocation, and fixation durations. Related to this question is how fixations depend on object identities, visual features, and task features. Specifically, the experiments address whether the location of a fixation on an object depends on the task being executed, how prominent a role distractors play during extended visuomotor tasks, how fixation duration on objects depends on the task, and how global scene context affects fixation distributions. Moreover, by examining single tasks and their compositions, the question of how behavior in the combined task is related to the component tasks can be addressed.
Methods
Experimental setup
Subjects were immersed in a virtual reality environment consisting of a cityscape (Performer Town, created by SGI). They wore a Virtual Research V8 head-mounted binocular display. The resolution of the stereo LCD screens in the headset was 640 × 480 pixels, corresponding to a horizontal field of view of 52°. The helmet also provided monocular eye tracking using an Applied Science Laboratory (ASL) 501 video-based eye tracker. Eye position was calibrated before each trial using a 9-point calibration target; given the average trial duration across subjects, this calibration was carried out every 108 s. This frequent calibration was crucial for maintaining accuracy significantly below 1° of visual angle. In addition, the rotational and translational degrees of freedom of head movements were monitored with a HiBall-3000 tracker. The head tracker had a latency of a few milliseconds, so that the frame update in the HMD was between 30 and 50 ms. The scene was rendered on a Silicon Graphics Onyx 2 computer at a rate of 60 Hz.
Three data streams were recorded simultaneously. First, the current position and orientation of the head-mounted display, the gaze position relative to the display, and the time code at which each sample was taken were written to a file at a sampling frequency of 60 Hz. The second data stream contained Hi-8 video recordings of the scene as seen by the subject, with a superimposed crosshair representing the point of gaze, together with an image of the monitored eye. The third data stream consisted of a digital video recording of the scene as seen by the subject. The videos and the data stream were synchronized at the beginning and end of each trial by inserting visual markers into the video streams and tag data into the data file. Given that all systems used the same clock, single frames of the video could be related directly to the corresponding data obtained from the eye tracker.
One problem in this environment was that the linear track of the path in the cityscape was many times longer than the 7-m width of the laboratory. Our solution to this discrepancy was to map a curved path in motor space onto a linear path in visual space. That is, in order to experience a linear path in visual space, subjects had to walk a circular path in the laboratory. The path that a subject walked during a single trial amounted to about four laps in the laboratory space. A similar approach was used by Razzaque, Swapp, Slater, Whiton, and Steed (2002). Subjects were given enough practice in this environment until they reported being comfortable with the mapping before starting the experimental trials. With practice, subjects perceived the walkway track as being linear.
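As a purely geometric illustration of this remapping (not the redirection algorithm actually used), the sketch below maps arc length walked along a circular laboratory path onto distance along the straight virtual walkway and counter-rotates the heading; the 1.6-m radius is an assumption chosen so that roughly four laps cover the 40-m walkway while the circle fits inside the 7-m laboratory.

```python
import numpy as np

LAB_RADIUS = 1.6        # m (assumed): 4 laps * 2 * pi * 1.6 m is roughly 40 m
WALKWAY_LENGTH = 40.0   # m, length of the virtual walkway

def motor_to_visual(lab_angle_rad, lateral_offset_m=0.0):
    """Map the angle walked along the laboratory circle (plus any sideways deviation
    from it) to a position on the straight virtual walkway."""
    arc_length = LAB_RADIUS * lab_angle_rad
    return np.array([lateral_offset_m, min(arc_length, WALKWAY_LENGTH)])

def visual_heading(lab_heading_rad, lab_angle_rad):
    """Counter-rotate the tracked head orientation so the rendered walkway stays straight."""
    return lab_heading_rad - lab_angle_rad

# After half a lap the subject has advanced about 5 m along the virtual walkway.
print(motor_to_visual(np.pi))
```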
Experimental conditions
The environment in which subjects were immersed consisted of a linear walkway 40 m long and 1.8 m wide within the cityscape. At the end of this walkway, subjects arrived at a road crossing, where the trial ended. A total of 40 purple and 40 blue rectangular objects were placed along the walkway. On normal trials, these objects were placed randomly according to a uniform distribution that extended 1.5 m to both sides of the walkway. On half of the trials, the purple objects had a height of 1.5 m and the blue objects a height of 2 m, whereas on the other half of the trials the heights were exchanged. Moreover, on half of the trials, the purple objects were described to the subjects as “litter” and the blue objects as “obstacles”; on the other half of the trials, the blue objects were termed “litter.” The random positions of the objects were different across the task conditions but the same across subjects; that is, all objects were always at the same positions in the “pickup” condition across all subjects, independent of the color. A typical trial started with the subject being immersed in the environment and standing still at the beginning of the walkway. Subjects then listened for approximately 15 s to the instructions describing the task they were asked to carry out. After they had listened to the instructions, subjects walked along the walkway and executed the task. At the end of the walkway, subjects arrived at the intersection and waited for approximately 15 additional seconds, at which point the trial ended. The overall duration of a single trial was 1 min and 48 s on average. The time during which subjects were walking along the walkway and executing the instructed task was 80 s on average, with a standard deviation of 25.0 s.
The task priorities were changed across trials by giving the subjects different verbal instructions. In the “pickup litter” condition, subjects were instructed to pick up the litter objects, which were purple in one half of the conditions and blue in the other half. Picking up was achieved by approaching a litter object, which disappeared when the subject's body reached a distance of 30 cm. In the second, “avoid obstacles” condition, the task was to avoid the obstacle objects. In the “combination” condition, subjects both picked up the litter objects and avoided the obstacle objects. The order in which individual subjects carried out these three tasks was randomized across subjects, but the spatial arrangements were the same across subjects. On additional “salient” trials, a large number of additional objects were introduced into the scene. These objects were multicolored and in part moving or changing shape. Their common property was that they scored high on common measures of visual saliency, such as high luminance contrast, high color contrast, or high edge density, with respect to the entire scene. In the “salient” condition, subjects carried out the same task as in the “combination” condition. These “salient” trials were presented to the subjects after they had carried out one to three training trials to familiarize themselves with the environment and three further trials corresponding to the “pickup,” “avoid,” and “combination” conditions. Accordingly, subjects were not expecting the new, additional, and surprising objects in the scene. Additionally, four subjects were presented with a “narrow” condition, in which the same four tasks with purple litter objects and blue obstacles as described above were executed, but the placement of the objects was altered such that they were confined to the width of the walkway. Figure 1 shows individual views of the scene for three conditions.
Figure 1
 
Subject's view of the walkway from the starting position in three different conditions. Left: “pickup purple litter and avoid blue obstacles” condition with normal spatial distribution of objects on the walkway. Middle: same condition as in the left view but with a tighter distribution of objects. Right: normal spatial distribution of objects on the walkway with additional salient objects in the scene.
All subjects were undergraduates at the University of Rochester who were compensated for their participation. Subjects were naive with respect to the purpose of the experiment. 
Analysis of experimental data
The data provided by the eye tracker were analyzed in order to segment the eye movements at saccades. Saccades were determined using in-house Fixation Finder software, which implements an adaptive velocity-based algorithm. The algorithm obtains an estimate of the noise level present in the signal during the entire trial and compares this estimate with a local estimate of the noise present in a window of one second around the current data sample. The algorithm then adapts the current velocity threshold depending on the local noise estimate. All recorded fixations had to meet the additional criteria of having an angular velocity below 65 deg/s for at least 60 ms, occurring less than 30 ms apart, and being displaced by more than 1° of visual angle. Data collected during a track loss were excluded. The automated segmentation had previously been compared with a manual frame-by-frame analysis obtained from three different experts on data recorded in other experiments and could not be distinguished from their classification. It should be noted that under the conditions described, human subjects executed sequences of complex head and eye movements, including frequent vestibular ocular reflexes superimposed on fixations and saccades, as described in Pelz and Rothkopf (2007).
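The in-house Fixation Finder software is not publicly available, but the stated criteria translate into a simple velocity-based segmentation of the kind sketched below; the adaptive, noise-dependent threshold and the merging of nearby fixations are omitted, and only the 60-Hz sampling rate and the 65 deg/s and 60 ms criteria are taken from the text.

```python
import numpy as np

FS = 60.0            # Hz, eye tracker sampling rate
VEL_THRESH = 65.0    # deg/s, fixation velocity criterion
MIN_FIX_DUR = 0.060  # s, minimum fixation duration

def segment_fixations(gaze_deg, fs=FS):
    """Simplified velocity-based fixation detection; gaze_deg is an (N, 2) array
    of gaze angles in degrees sampled at fs."""
    velocity = np.linalg.norm(np.diff(gaze_deg, axis=0), axis=1) * fs   # deg/s
    slow = velocity < VEL_THRESH
    fixations, start = [], None
    for i, is_slow in enumerate(np.append(slow, False)):
        if is_slow and start is None:
            start = i
        elif not is_slow and start is not None:
            if (i - start) / fs >= MIN_FIX_DUR:
                fixations.append((start, i))
            start = None
    return fixations

# Example: 0.5 s of noisy fixation, a 5-degree saccade, then another fixation.
gaze = np.zeros((60, 2))
gaze[30:] += 5.0
gaze += np.random.default_rng(1).normal(0, 0.05, gaze.shape)
print(segment_fixations(gaze))
```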
Data from a total of eight subjects were excluded for two reasons. First, three subjects needed considerably longer to finish at least one of the four trials; data from these subjects, who reported feeling “uncomfortable” in the virtual environment, were excluded. Second, the eye tracks obtained from five further subjects were of too low quality to be used for further analysis. In all, 19 subjects were able to navigate the virtual environment comfortably and provided good eye tracking data.
Given that the position of the subject and the gaze direction in the scene are known, the intersection of the gaze vector with the scene can be determined for each moment in time in the virtual environment. However, this procedure by itself is not very robust: if, for example, subjects consistently tended to fixate the walkway close to the edge of an obstacle, using only the object class at the fixation location would not reveal this contingency. Therefore, the video sequence showing the scene from the subject's point of view was used to extract the object class and the visual features at the point of gaze. In-house Matlab (MathWorks) functions were used to extract the central patch at fixation for each frame of the video sequence. This patch was the circular region of 3° diameter around the point of gaze. This choice was motivated by two factors. First, although the density of cones in the retina falls off continuously, the central rod-free region of highest acuity is approximately 2° in diameter. Second, data were only used if the inaccuracy due to the eye tracker was less than approximately 1°. Therefore, a region with a total size of 3° was used.
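A sketch of this patch extraction is given below, converting the 3° diameter into pixels via the headset's 52° horizontal field of view spread over 640 pixels; the masking approach and the synthetic frame are our illustration, not the in-house Matlab code.

```python
import numpy as np

H_FOV_DEG, H_RES = 52.0, 640      # headset field of view and horizontal resolution
PX_PER_DEG = H_RES / H_FOV_DEG    # roughly 12.3 pixels per degree

def gaze_patch(frame, gaze_xy_px, diameter_deg=3.0):
    """Return the pixels inside the circular region of diameter_deg of visual angle
    around the gaze point, together with the boolean mask."""
    radius_px = 0.5 * diameter_deg * PX_PER_DEG
    h, w = frame.shape[:2]
    ys, xs = np.ogrid[:h, :w]
    mask = (xs - gaze_xy_px[0]) ** 2 + (ys - gaze_xy_px[1]) ** 2 <= radius_px ** 2
    return frame[mask], mask

# Example on a synthetic RGB frame with gaze near the image center.
frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
pixels, mask = gaze_patch(frame, gaze_xy_px=(320, 240))
print(pixels.shape, mask.sum())
```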
The color at fixation was used to determine the object class at the point of gaze for each frame during the trial. Each fixation was then classified as being on the object class that was present at the point of gaze for most of the duration of that fixation. This took advantage of the fact that the virtual environment was designed in such a way that the colors of the objects uniquely identified the object category. The classes used were “litter,” “obstacle,” “walkway,” “lawn,” “other,” and “none.” The class “none” included saccades, track losses from the eye tracker, and fixations at points in the scene that were too close to the boundaries of the scene for a patch of 1° diameter to be extracted. The classes “litter” and “obstacle” were assigned depending on the mapping of the colors purple and blue in the respective trial. The classes “walkway” and “lawn” were the same for all conditions. Finally, the object class “other” comprised all buildings and roads as well as the trees in the surroundings of the scene. Additionally, in the trials in which a large number of salient objects were introduced into the scene, the additional object category “salient” was used.
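This classification amounts to a nearest-color assignment of the pixels in the gaze patch followed by a majority vote over the frames of a fixation, as sketched below; the RGB values are placeholders, since the text specifies only that the colors uniquely identified the categories.

```python
import numpy as np

# Placeholder class colors; the exact RGB values are not given in the text.
CLASS_COLORS = {
    "litter":   np.array([128,   0, 128]),   # purple (assumed)
    "obstacle": np.array([  0,   0, 255]),   # blue (assumed)
    "walkway":  np.array([128, 128, 128]),   # gray (assumed)
    "lawn":     np.array([  0, 128,   0]),   # green (assumed)
}

def classify_pixels(pixels):
    """Assign every pixel of a gaze patch to the nearest class color."""
    names = list(CLASS_COLORS)
    palette = np.stack([CLASS_COLORS[n] for n in names]).astype(float)
    dists = np.linalg.norm(pixels[:, None, :].astype(float) - palette[None], axis=2)
    return [names[i] for i in np.argmin(dists, axis=1)]

def classify_fixation(patches):
    """Label a fixation with the object class present most often across the
    gaze patches of all video frames belonging to that fixation."""
    labels = [label for patch in patches for label in classify_pixels(patch)]
    return max(set(labels), key=labels.count)

# Example: two frames whose gaze patches contain mostly purple pixels.
patches = [np.tile([128, 0, 128], (50, 1)), np.tile([0, 0, 255], (10, 1))]
print(classify_fixation(patches))   # litter
```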
Furthermore, the properties of the visual features at the point of gaze were investigated. The analysis of the data in terms of local image properties was based on the three-dimensional image cube obtained by extracting the image patch at the fixation location for each video frame. Here, results are reported for the responses of two derivative-of-Gaussian filters oriented horizontally and vertically. The central image patch was normalized and convolved with the two filters. Response distributions were obtained for the two filters for gaze directed to litter in the “pickup” condition and to obstacles in the “avoid” condition.
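A sketch of this filter analysis is shown below, using horizontal and vertical first-derivative-of-Gaussian filters on a normalized patch; the single scale (sigma of 2 pixels) is an assumption, whereas the analysis in the paper used multiple spatial scales.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_responses(patch, sigma=2.0):
    """Mean absolute responses of vertical- and horizontal-edge detectors
    (first derivative of a Gaussian) on a normalized image patch."""
    p = (patch - patch.mean()) / (patch.std() + 1e-8)
    vertical_edges = gaussian_filter(p, sigma, order=(0, 1))    # derivative along columns
    horizontal_edges = gaussian_filter(p, sigma, order=(1, 0))  # derivative along rows
    return np.abs(vertical_edges).mean(), np.abs(horizontal_edges).mean()

# A patch containing a vertical boundary yields a larger vertical-edge response.
patch = np.hstack([np.zeros((37, 18)), np.ones((37, 19))])
print(edge_responses(patch))
```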
Although the image cube contains the relevant data for the analysis, this type of data structure is difficult to represent. To obtain an informative two-dimensional rendition of the scan path, we developed a visualization of a trial, as shown in Figure 2. The central fixation patch of 3° diameter was unrolled in an outward-spiraling way for each time sample. This was achieved by transforming the patch to polar coordinates and selecting all pixels over all angles for successively increasing discrete-valued radii. The resulting linear patch represents the fixated region, with increasing eccentricity from the central fixation point mapped to a linear distance: the center of fixation is represented at the bottom of this linear strip, and distance from the fovea increases upward. As Figure 2 demonstrates, this visualization shows gaze directed toward an edge of an object as a sequence of stripes, whereas the central part of an object shows up as a solid line. Concatenating these vertical stripes over the duration of a trial results in a visualization of the gaze targets in two dimensions, as shown in Figure 3.
Figure 2
 
Center: typical view of the subject during execution of the task showing the walkway with blue obstacles and purple pickup objects, the cityscape in the background. Left: schematic representation of the scene context represented by the color histogram of the scene, which shows the proportion of colors summed over the entire field of view. Right: representation of the image patch at two fixation locations (see text). Note how gaze directed to the central region of a solid object is represented as a solid line whereas gaze directed toward an edge is represented as a succession of stripes.
Figure 3
 
Visualization as introduced in Figure 2 of the image patch at the center of gaze for subject M. L. in two different conditions. The X-axis corresponds to the normalized trial duration and the Y-axis is foveal eccentricity in degrees. Top: “pickup purple litter”; bottom: “avoid blue obstacles.” The duration of the entire trial consists of the time the subject spent immersed in the environment and listening to the instructions, the time during the execution of the task, and the time during which the subject has finished the task and is waiting to start the next trial. The figure clearly shows the predominance of fixations on purple objects for the pickup task and a similar predominance of fixations on blue objects in the obstacle avoidance task.
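A simplified version of this unrolling is sketched below: each video frame contributes one column in which row r holds the average color at radius r from the gaze point, and the columns are concatenated over the trial. The published visualization keeps every pixel along the spiral rather than averaging per radius, so this is an approximation.

```python
import numpy as np

def unroll_patch(frame, gaze_xy, max_radius_px=18):
    """Collapse the circular gaze patch into one column: row r holds the mean color
    of all pixels at (rounded) radius r, so the fixation center sits at the bottom
    and eccentricity increases upward."""
    h, w = frame.shape[:2]
    ys, xs = np.ogrid[:h, :w]
    radius = np.sqrt((xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2).round().astype(int)
    column = np.zeros((max_radius_px + 1, frame.shape[2]))
    for r in range(max_radius_px + 1):
        column[r] = frame[radius == r].mean(axis=0)
    return column

def gaze_strip(frames, gaze_points):
    """Concatenate the per-frame columns over a trial into a 2-D visualization."""
    return np.stack([unroll_patch(f, g) for f, g in zip(frames, gaze_points)], axis=1)

# Example: five synthetic frames with gaze at the image center.
frames = [np.random.randint(0, 255, (480, 640, 3)) for _ in range(5)]
print(gaze_strip(frames, [(320, 240)] * 5).shape)   # (radius bins, frames, color channels)
```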
Additionally, a global representation of the scene was obtained from the color histogram of the field of view for each frame. The color histogram of the scene in the field of view of a subject is shown on the left of Figure 2. A color histogram is a general representation of the distribution of colors present in an image, obtained by counting the number of pixels of each color. Given that the color range in the virtual environment was controlled, the individual colors could be segmented without overlap and corresponded to individual object classes.
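Because the environment's colors map one-to-one onto object classes, this histogram reduces to counting pixels per class color, as in the sketch below (the palette is again a placeholder).

```python
import numpy as np

def scene_color_histogram(frame, palette):
    """Fraction of the field of view covered by each class color, using a
    nearest-color assignment of every pixel."""
    pixels = frame.reshape(-1, 3).astype(float)
    names = list(palette)
    colors = np.stack([palette[n] for n in names]).astype(float)
    nearest = np.argmin(np.linalg.norm(pixels[:, None] - colors[None], axis=2), axis=1)
    counts = np.bincount(nearest, minlength=len(names))
    return dict(zip(names, counts / counts.sum()))

palette = {"litter": np.array([128, 0, 128]), "obstacle": np.array([0, 0, 255]),
           "walkway": np.array([128, 128, 128]), "lawn": np.array([0, 128, 0])}
frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
print(scene_color_histogram(frame, palette))
```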
Finally, in order to compare the human gaze distributions to random and saliency-based models of gaze allocation, fixation sequences were constructed for each subject and each task by selecting a particular location in the scene on the last frame before the subject actually executed a saccade in the trial. The scene visible in the subject's field of view at each moment in time was available from the recorded video stream, so that the next fixation location according to the different gaze allocation models could be determined. Gaze allocation was modeled (1) with a random gaze distribution, in which a location was drawn weighted by the subject's distribution of gaze relative to the field of view, and (2) according to the saliency model of Itti (2000).
Given that the standard saliency model contains approximately 40 free parameters, these values had to be chosen. The parameters were adjusted to match the values given in Itti (2000) and Itti and Koch (2000). Those parameters for which values were not given in these references, or which differed between them, were chosen so as to reproduce the published saliency maps most faithfully.
Although most of the discussion is based on the proportion of time spent on the different object classes, inferential statistics were not performed on the necessarily correlated percentages but on the original total fixation times. A two-factor within-subjects ANOVA with repeated measures was performed with task and object class as factors. The durations of individual fixations were compared using a one-way ANOVA with repeated measures. The Greenhouse–Geisser correction was used in all cases to take possible violations of sphericity in the repeated-measures data into account. The p values for these tests, which all used a significance level of α = .05, are reported. Because of the intricacies relating to multiple comparisons with repeated measures, paired t tests with Bonferroni correction were used post hoc to compare individual looking times on object classes across tasks. This type of correction of the applicable p values is quite conservative because it takes all possible pairwise post hoc tests into account.
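On simulated data, this analysis pipeline might look as follows. Note that statsmodels' AnovaRM reports the uncorrected repeated-measures ANOVA, so the Greenhouse–Geisser adjustment used in the paper would have to be applied separately, and the data-generating numbers below are invented.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Simulated total fixation times per subject, task, and object class.
rng = np.random.default_rng(0)
subjects = range(19)
tasks = ["pickup", "avoid", "combination"]
classes = ["litter", "obstacle", "walkway"]
rows = [{"subject": s, "task": t, "object_class": c,
         "fixation_time": rng.gamma(5, 2)
         + (10 if (t, c) in [("pickup", "litter"), ("avoid", "obstacle")] else 0)}
        for s in subjects for t in tasks for c in classes]
df = pd.DataFrame(rows)

# Two-factor within-subjects ANOVA (uncorrected; see note above).
res = AnovaRM(df, depvar="fixation_time", subject="subject",
              within=["task", "object_class"]).fit()
print(res.anova_table)

# Post hoc: Bonferroni-corrected paired t test on litter fixation times, pickup vs. avoid.
pickup = df.query("task == 'pickup' and object_class == 'litter'").sort_values("subject")["fixation_time"]
avoid = df.query("task == 'avoid' and object_class == 'litter'").sort_values("subject")["fixation_time"]
t_stat, p = stats.ttest_rel(pickup, avoid)
print("Bonferroni-corrected p:", min(p * 15, 1.0))   # 15 possible pairwise comparisons
```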
Results
Effect of the task on fixation proportions
The proportion of fixations on the different object classes was highly dependent on the particular task subjects were executing. Figure 4 shows the representation of the gaze sequences described above for 10 subjects while executing the tasks “pickup purple litter” and “avoid blue obstacles.” These representations are useful because they provide a simple visualization of the similarity of the gaze sequences across subjects and of the difference in gaze allocation across tasks, as well as revealing details about the features at the point of gaze. This representation already carries relevant information accessible to visual inspection. First, the proportion of colors in the two conditions shown is different and reflects the conditions of picking up purple litter versus avoiding blue obstacles. Second, subjects tended to fixate regions of uniform color more than regions containing a high density of edges: the center of an object results in a solid stripe at each moment in time, whereas regions with edges are represented as a succession of horizontal bars of different colors, as shown in Figure 2. These observations can be quantified by obtaining the total times spent fixating the different object classes in the different conditions; this is shown in Figure 5, averaged across subjects.
Figure 4
 
Comparison of the gaze over time for 10 subjects labeled S1 to S10 using the visualization introduced in Figure 2. Note that the sequences shown in Figure 3 represent the entire trial duration, whereas the sequences shown here are taken only from the time during the trial in which the subjects were moving. Left: gaze visualization during the execution of the task “pickup purple objects.” Right: gaze visualization during the execution of the task “avoid blue obstacles.”
Figure 5
 
Proportion of fixation time spent on the object classes across subjects ( n = 19) for four different tasks. The shown proportions are colored according to the color convention depicted in Figure 1; that is, (1) purple represents pickup objects, (2) blue represents obstacles, (3) gray represents the walkway, (4) green represents the lawn and the tree, (5) represents the background buildings, and (6) light green represents the salient distractors. The diagrams were obtained by averaging over the two color conditions. From left to right: “pickup purple litter,” “avoid blue obstacles,” “pickup purple litter and avoid blue obstacles,” “pickup purple litter and avoid blue obstacles” with salient distractors in the scene. Error bars are ±1 SEM across subjects.
The figure shows that the distribution of gaze across the different object classes varies among the four conditions. This was tested with a two-way ANOVA with repeated measures and Greenhouse–Geisser correction. The interaction between task and object class was significant (p < .001). Subjects fixated the litter more in the “pickup” condition than in the “avoid” condition, whereas fixation times on the obstacles were greater in the “avoid” condition than in the “pickup” condition. Note also that the proportion of time subjects fixated the walkway increased in the “avoid” condition. Indeed, the differences between the “pickup” and “avoid” conditions were significant for litter (p < .0002), obstacles (p = .018), and walkway (p < .001), as revealed by post hoc paired t tests with Bonferroni correction.
The distributions shown in Figure 5 are obtained from both color conditions and are therefore independent of the color identity of the respective object class. Note that the variability in the time spent on the different object classes across subjects is remarkably small. As an example, in the “pickup” condition, subjects spent an average of 18% of the time on blue objects with an SEM of only 0.05%. This demonstrates that the gaze deployment behavior was very similar across subjects. 
The distribution of gaze time on the different object classes in the “combination” condition is intermediate between the distributions observed in the component tasks. Although the differences in fixation time between the “pickup” and “avoid” conditions were statistically significant, the differences between each of these conditions and the “combination” condition were not statistically significant for litter, obstacles, or the walkway. This may be because the variability in the looking times was still large owing to the different trial durations across subjects, and because the Bonferroni correction results in conservative p values owing to the fifteen possible pairings of object classes. Inspection of the individual subjects' data nevertheless revealed consistent performance: in 18 of 19 subjects, litter fixation times were intermediate in the combined condition, and in 17 of 19 subjects obstacle fixation times were intermediate in the combined condition.
These distributions also demonstrate that most fixations were directed toward objects relevant for the pickup and avoidance tasks: the grass and the buildings in the background together were fixated only 15% of the time on average across all conditions. Differences in gaze directed to the background objects were not significant (p = .549). Thus, during execution of the task, subjects predominantly fixated objects that were relevant for the ongoing task, and changes in the task priorities resulted in changes in the proportion of gaze directed to the different object classes.
The effect of the salient distractors on the proportion of gaze during the execution of the task was minimal. A comparison of the mean gaze proportions spent on the different object classes in the “combination” and “salient” conditions shows that the two distributions are almost identical. To compare the two conditions, we excluded the fixation durations on salient objects, because no salient objects were present in the scene in the “combination” condition. The distribution of fixations in the two conditions was not significantly different (p = .35, two-way ANOVA with repeated measures). Only 0.2% of the time was spent fixating the salient objects in the “salient” condition. Thus, objects with a high “saliency” value did not attract gaze during the execution of the walking, approaching, and avoiding tasks.
Effects of the task on the selected features
If a subject is fixating an edge, the spiral-coding scheme shown in Figure 2 will repeatedly alternate in color as it crosses that boundary. Inspection of Figure 4 reveals more of these alternating patterns in the “avoid” condition, suggesting that subjects looked at edges more frequently in this condition. To quantify this, we used two methods. First, each saccade that was directed toward a litter object in the “pickup” condition or an obstacle in the “avoid” condition was marked on the surface of the corresponding object. The colored bars in Figure 6 show the results for a single subject. The saccade target patterns suggest that the subject was indeed more likely to target the edges of the obstacles, whereas the centers of the litter objects were more likely to be targeted in the “pickup” condition.
Figure 6
 
Horizontal and vertical marginal distributions of gaze targets on litter objects in the “pickup” condition (left) and on obstacles in the “avoid” condition (right). These distributions were obtained using data from all 19 subjects. The plots with targets marked on the objects were obtained from the data of subject M. B. and are representative of the entire data set.
The distribution of saccade targets on the two object types in the two conditions was obtained across all subjects. The horizontal and vertical projections correspond to the marginal distributions shown at the top and the side of Figure 6. They demonstrate that subjects were more likely to target a location closer to the center of the object in the “pickup” condition than in the “avoid” condition, in which gaze was directed closer to the edge of the obstacles. Furthermore, subjects tended to look more toward the lower part of the obstacles when avoiding them. These effects are robust across subjects, as they are based on a total of 1,917 saccades.
A second way of quantifying this tendency is to measure the responses of edge filters at the fixation locations for the two object classes. A total of 45,000 patches at the point of gaze were sampled across subjects from the times during which gaze was relatively stationary, that is, during fixations. After normalization of the image patch, responses to the first directional derivatives of Gaussian filters were calculated at multiple spatial scales, separately for gaze directed toward litter in the “pickup” condition and toward obstacles in the “avoid” condition. Given that neither the litter nor the obstacles were textured, the responses of the edge filters are highly sensitive to edges between objects. Figure 7 shows the distribution of the filter responses and demonstrates that fixations directed toward obstacles were more likely to contain a vertical edge than fixations directed toward pickup objects. Again, this effect showed little variability across subjects, as reflected by the small standard errors of the means. The fact that subjects tended to direct their gaze closer to the edge of obstacles when avoiding them can therefore also explain why the proportion of gaze directed toward the walkway increased from the “pickup” to the “avoid” condition, as noted when comparing the proportions of gaze time on object classes: when avoiding obstacles, gaze often landed close to the edge of the obstacle, resulting in the fixation being classified as falling on the walkway. The additional analysis of the specific features at the fixation location, together with the map of the fixations relative to the pickup objects and the obstacles, shows that the increased proportion of gaze directed toward the walkway reflects the difference in the features targeted depending on whether subjects walked toward or around the object (Figure 8).
Figure 7
 
Distribution of horizontal and vertical edge filter responses separated for gaze targeting litter in the “pickup” condition and obstacles in the “avoid” condition together with the saccade targets for the corresponding tasks by one subject. Left: responses of the horizontal and vertical first derivative of Gaussian filters at the target of the saccade directed to litter in the “pickup” condition. Right: filter responses at the target of the saccade directed to obstacles in the “avoid” condition.
Figure 8
 
Probability of fixating the pickup objects and the obstacles given the proportion of the respective object class in the current field of view. Left: proportion of fixation on a pickup-object (purple) and an obstacle (blue) given the proportion of pickup objects in the current scene (0–0.25, 0.25–0.5, 0.5–0.75, 0.75–1.0). Right: proportion of fixation on a pickup object (purple) and an obstacle (blue) given the proportion of obstacles in the current scene (0–1.0).
Effect of the scene context
As described above, subjects still fixated litter objects 26% of the time when instructed to avoid obstacle objects. What is responsible for these fixations? Figure 9 shows the fixation sequences of one subject in the two conditions “pickup litter” and “avoid obstacles,” together with the color histogram of the scene visible in the field of view. Visual inspection of these plots suggests that the proportion of fixations on the object classes depended on the task priority but was also influenced by the proportion of the scene covered by objects of a particular class. In the “pickup purple object” condition, subjects tended to fixate blue obstacles only if the proportion of purple objects within the field of view was small compared to the other object classes. Again, these observations were quantified by extracting the proportion of fixations on the respective object classes given the distribution of litter objects in the current field of view. The proportion of the visual field covered by litter objects was extracted from the video sequence for each frame and then related to the current object class at the point of gaze.
Figure 9
 
Same representation of the fixation distribution of subject S. G. over time as in Figure 3. Additionally, the color histogram of the scene in the field of view of the subject, as described in the methods section and shown in Figure 2, has been plotted for each of the two trials. Top: the fixations and color histogram for “pickup purple litter objects.” Bottom: the fixations and color histogram for “avoid blue obstacles.”
Figure 8 shows the proportion of time spent fixating litter and obstacles, depending on the proportion of either litter or obstacles in the field of view, averaged across subjects. It therefore quantifies the degree to which the scene context influences the object class selected by gaze. These graphs reveal that the more the scene is covered with objects that are relevant for the task, the higher the probability of a fixation on a task-relevant object; correspondingly, the fixation time spent on the task-irrelevant objects decreases. Note that the small standard errors demonstrate that this effect is robust across subjects and color mappings. In summary, these histograms demonstrate that the context of the current scene is highly predictive of the object class that is fixated at each moment in time, given the particular task the subject is engaged in. 
Effect of the spatial distribution of objects
A model of this setting in which all the individual tasks competed for the gaze vector produced many more fixations on the walkway than were actually observed in our subjects (Sprague & Ballard, 2003). We hypothesized that our subjects were taking advantage of the fact that litter was always on the walkway, so that heading toward litter also kept them on the walkway. We tested this hypothesis with an additional condition in which four subjects were asked to pick up litter objects and avoid obstacles while staying on the walkway, but the distribution of the objects relative to the walkway was altered. Figure 1 shows the different spatial distributions of the objects on the walkway. In the "normal width" condition, the objects extended over the walkway into the region of the lawn; this was the distribution that most subjects encountered. The "narrow" condition restricted the position of the objects to the walkway. Figure 10 demonstrates that this affected the proportion of fixations directed to the walkway. Although subjects fixated the walkway 18% of the time in the "normal width" condition, they fixated it only 4% of the time in the "narrow" condition (significance of condition: p = .004, two-way ANOVA). This suggests that subjects were able to reduce the number of fixations directed toward the walkway because the information acquired by looking at the walkway in the other conditions could instead be obtained from fixations on the objects positioned on the walkway. 
Figure 10
 
Comparison of the proportion of fixations on object classes between the normal condition and the "narrow" condition in which the objects were placed only on the walkway. Top row: "pickup" and "avoid" in the normal condition. Bottom row: "pickup" and "avoid" in the "narrow" condition.
Effect of salient distractors
In the "salient" condition, in which subjects executed the same task as in the "combination" condition, the allocation of gaze to the different object classes was analyzed separately for the duration of the task itself and for the time in which the subjects were immersed in the environment but were not yet, or no longer, executing the task. Figure 11 shows a typical example of a sequence of fixations in this condition. At the beginning of the trial, the subject is immersed in the cityscape and is listening to the verbal instructions. During this time, the subject explores the visual scene, and a large number of the executed fixations fall on the salient objects and on the background containing the buildings of the cityscape. While the subject is executing the task of staying on the walkway, picking up litter objects, and avoiding obstacles, almost no fixations fall on objects that score high on the saliency measure. Subjects then reach the road crossing, at which point the task is finished and they wait for the trial to end. During this time, a large proportion of fixations is again directed toward the "salient" objects in the background. Figure 12 quantifies these differences in terms of the proportion of gaze time spent on the different object classes. The differences between looking times were significant for the interaction between object classes and task execution (p < .001, two-way ANOVA with repeated measures). Thus, subjects fixated the salient objects when they were not executing a specific task: during the time in which they were listening to the task instructions and while waiting for the trial to end, subjects divided their gaze much more evenly across the different object classes, including the salient objects (Figure 12). 
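As a sketch of how such a phase-wise breakdown can be computed, the snippet below (hypothetical data layout, not the study's actual processing pipeline) splits a list of fixations into before-task, during-task, and after-task phases and normalizes the fixation time per object class within each phase:

```python
# A minimal sketch, with a hypothetical data layout, of splitting fixations into
# "before task", "during task", and "after task" phases and tallying the
# proportion of fixation time spent on each object class within each phase.
from collections import defaultdict

def proportions_by_phase(fixations, task_start, task_end):
    """fixations: list of (onset_s, duration_s, object_class) tuples."""
    totals = {"before": defaultdict(float),
              "during": defaultdict(float),
              "after": defaultdict(float)}
    for onset, duration, obj in fixations:
        if onset < task_start:
            phase = "before"
        elif onset < task_end:
            phase = "during"
        else:
            phase = "after"
        totals[phase][obj] += duration
    # Normalize each phase to proportions of the total fixation time in that phase.
    return {phase: {obj: t / sum(per_obj.values()) for obj, t in per_obj.items()}
            for phase, per_obj in totals.items() if per_obj}

# Hypothetical fixation records: (onset in s, duration in s, object class).
fixes = [(0.5, 0.2, "salient"), (1.0, 0.3, "background"),
         (3.0, 0.4, "litter"), (3.6, 0.3, "obstacle"), (9.0, 0.2, "salient")]
print(proportions_by_phase(fixes, task_start=2.0, task_end=8.0))
```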
Figure 11
 
Visualization as introduced in Figure 2 of the unrolled gaze for subject NM in the condition “pickup litter objects and avoid obstacles” when additionally a large number of salient objects are present in the scene.
Figure 12
 
Proportion of fixations on object classes for the condition in which a large number of additional salient objects were introduced into the scene. Left: proportion of fixations during the execution of the task. Right: proportion of fixations before and after executing the task.
Effect of the task on the fixation durations
Several past studies have demonstrated that differences in fixation durations can be related to the cognitive processes involved in executing tasks (Hayhoe, 2004; Pelz, Canosa, & Kucharcyk, 2000). Figure 13 shows the average fixation duration for the subjects, separated by task. A two-way ANOVA with repeated measures confirmed significant differences between fixation times (interaction between tasks and object classes: p = .0003). The mean fixation durations averaged 0.35 s and were comparable across tasks for the object classes of pickup objects, obstacles, walkway, and grass. By contrast, the fixation duration for gaze directed to the background or the salient objects was significantly shorter, with a mean duration of 0.17 s in all conditions (p < .01 for all paired t tests with Bonferroni correction). Furthermore, fixation durations on litter were significantly longer than those on obstacles (all paired t tests with Bonferroni correction, p < .01). These results suggest that subjects executed different computations while looking at the background or the salient objects than when they looked at objects they were navigating toward or avoiding. 
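The statistical comparisons reported here follow a standard pattern: per-subject mean durations are compared with paired t tests, and the significance threshold is adjusted for the number of comparisons. A minimal sketch with simulated durations and a hypothetical number of comparisons (only the 19 subjects match the study) is shown below:

```python
# A minimal sketch of the kind of paired comparison reported here: per-subject
# mean fixation durations on two object classes are compared with a paired
# t test, and the alpha level is Bonferroni-adjusted for the number of such
# comparisons. The durations and the number of comparisons are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_comparisons = 19, 6
litter_dur = rng.normal(0.38, 0.05, n_subjects)      # mean duration per subject (s)
obstacle_dur = rng.normal(0.32, 0.05, n_subjects)

t, p = stats.ttest_rel(litter_dur, obstacle_dur)      # paired t test across subjects
alpha = 0.05 / n_comparisons                          # Bonferroni-corrected threshold
print(f"t = {t:.2f}, p = {p:.4f}, significant at corrected alpha: {p < alpha}")
```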
Figure 13
 
Average fixation durations on object classes for the four conditions. From left to right: “pickup,” “avoid,” “combination,” and “salient.”
Comparison to random gaze allocation and feature saliency-based models
As described above, fixation targets were calculated for each subject according to the saliency model of Itti and Koch (2000) whenever a saccade was executed. Figure 14 shows the three conspicuity maps and the resulting saliency map, together with the most likely fixation location after applying spatial competition to the saliency map, for one particular scene. The sequence of fixations obtained from the model was used to compute the proportion of gaze on each of the object classes and compared to the proportions actually observed in the subjects' gaze sequences. As a comparison, Figure 15 shows the gaze visualizations for the original gaze sequence of one subject in the "pickup" condition, when litter was purple and obstacles were blue. These are compared to the traces obtained by choosing, on each saccade, either a random location or the most salient location within the field of view at the subject's current position. 
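For readers unfamiliar with the structure of the Itti and Koch (2000) model, the sketch below is a highly simplified stand-in, not the implementation used here: it forms intensity, color-opponency, and edge-energy conspicuity maps by difference-of-Gaussians filtering, sums them, and takes the maximum of the combined map as the predicted fixation. The orientation channel is approximated by gradient magnitude rather than the oriented filter pyramid of the original model, and the filter scales are arbitrary assumptions:

```python
# A highly simplified stand-in for an Itti & Koch (2000)-style pipeline, meant
# only to illustrate its structure: per-channel conspicuity maps are formed by
# center-surround (difference-of-Gaussians) filtering, summed, and the maximum
# of the combined map is taken as the predicted fixation (in place of the full
# winner-take-all network). Filter scales and channel weights are arbitrary.
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def center_surround(channel, sigma_c=2, sigma_s=8):
    cs = np.abs(gaussian_filter(channel, sigma_c) - gaussian_filter(channel, sigma_s))
    return cs / (cs.max() + 1e-8)              # crude normalization to [0, 1]

def simple_saliency(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0
    # Conspicuity maps: intensity, red-green / blue-yellow opponency, edge energy.
    c_int = center_surround(intensity)
    c_col = center_surround(r - g) + center_surround(b - (r + g) / 2.0)
    c_ori = center_surround(np.hypot(sobel(intensity, 0), sobel(intensity, 1)))
    saliency = c_int + c_col + c_ori
    return saliency, np.unravel_index(np.argmax(saliency), saliency.shape)

rng = np.random.default_rng(1)
frame = rng.random((120, 160, 3))               # hypothetical video frame
smap, (row, col) = simple_saliency(frame)
print("predicted fixation at pixel", (row, col))
```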
Figure 14
 
From left to right: an original scene from the sequence recorded during the subject's interaction with the environment, for which the saliency map was calculated; conspicuity maps for intensity, orientation, and color; and the final saliency map after applying the winner-take-all competitive network to the sum of the conspicuity maps.
Figure 15
 
Comparison of gaze visualizations obtained from subject S. G. under condition “pickup” (top) with the visualizations obtained using a random gaze allocation (middle) and the described saliency model (bottom).
Visual inspection of Figure 4 points toward the similarity of the sequences across subjects, and Figure 15 hints at the differences between the human gaze targets and the two models. Neither the random selection of fixation points nor the saliency model describes the sequential order of fixations well, and the dissimilarity with the saliency model is striking. One way to compare the models' predictions with the observed gaze targets is again to compute the proportions of fixations on the respective object classes. These proportions, averaged across subjects, are shown in Figure 16 for the random gaze model and for the saliency model. Although inspection of the histograms describing the gaze allocation for the random model suggests a bias toward litter in the "pickup" condition and toward obstacles in the "avoid" condition, these differences were significant neither with respect to the task (p = .749) nor with respect to an interaction between task and object classes (p = .503), as assessed by a two-way ANOVA with repeated measures. Moreover, the proportion of fixations directed to the background objects is disproportionately high in the random model: while subjects looked at the background objects 2% of the time, the random gaze allocation directed fixations to the background 22.5% of the time. Similarly, the random model directed 15% of gaze toward the salient objects in the "salient" condition, whereas subjects looked at the salient objects only 0.2% of the time during the task. 
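The random baseline can be illustrated with a short sketch. Note that the model shown in Figure 16 was weighted by the spatial gaze distribution; the version below samples uniformly within the field of view and assumes a hypothetical per-pixel class labeling, so it is a simplification for illustration only:

```python
# A minimal sketch, under a hypothetical frame representation, of a random
# gaze baseline: one fixation is drawn uniformly within the current field of
# view on every simulated saccade, and the object class under that location is
# tallied exactly as for the human gaze data. (The model compared in the paper
# was additionally weighted by the observed spatial gaze distribution.)
import numpy as np
from collections import Counter

def random_gaze_proportions(label_frames, rng):
    """label_frames: iterable of 2-D arrays giving the object class at each pixel."""
    counts = Counter()
    for labels in label_frames:
        h, w = labels.shape
        y, x = rng.integers(h), rng.integers(w)   # uniform location in the field of view
        counts[labels[y, x]] += 1
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical labeled frames: 0 = background, 1 = litter, 2 = obstacle, 3 = walkway.
rng = np.random.default_rng(2)
frames = [rng.integers(0, 4, size=(60, 80)) for _ in range(200)]
print(random_gaze_proportions(frames, rng))
```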
Figure 16
 
Comparison of proportions of fixation times spent on the object classes for human subjects (top), a random gaze allocation model weighted by the spatial gaze distribution (middle), and a saliency model (bottom). The four conditions are shown by column: “pickup,” “avoid,” “combination,” and “salient” conditions.
The differences in the allocation of gaze according to the saliency model are even more pronounced. The proportions of fixation times did not differ significantly between task conditions for the conditions without salient objects (p = .849, two-way ANOVA with repeated measures), and no significant interaction with object classes was found (p = .77). Given that the task-relevant objects are uniformly colored, untextured, and do not contrast with the average luminance of the scene, these objects are particularly nonsalient. Furthermore, during task execution subjects are close to the task-relevant objects, which can then cover a large part of the visual field, and such areas of homogeneous color are not deemed salient by the model. Instead, regions of the scene are labeled salient if they have high contrast in the edge channels and contrast with the colors dominating the scene. The background contains highly textured buildings and plants with strong contrasts due to the shadows in the scene. It is therefore not surprising that more than 70% of the fixations predicted by the saliency model are directed toward the background of the scene. Furthermore, in the "salient" condition, 39% of the model's gaze was directed toward the salient objects in the scene, whereas subjects spent only 0.2% of gaze time on these objects. 
Discussion
How does the brain select where to direct the gaze during active purposeful behavior? Most natural behaviors involve multiple tasks. We devised an environment in which the allocation of gaze in component tasks could be systematically investigated. In the context of walking, we examined the allocation of gaze in path following, obstacle avoidance, and target approach. The overwhelming determinant of the subjects' visual behavior was the ongoing component task. Several different measures of the behavior all demonstrated a consistent influence of the task across subjects. 
Task-weighted fixation proportions
The proportion of time spent on the target, the obstacle, and the walkway changed depending on the task instructions. Previous work has demonstrated that fixations are directed to task-relevant objects and locations within a scene when these are needed (Ballard et al., 1995; Hayhoe et al., 2003; Johansson et al., 2001; Land, 2004). Shinoda, Hayhoe, and Shrivastava (2001) also observed that the distribution of fixations depends on the goal. It has also been observed that at the beginning of a trial, when subjects have not yet inspected the scene, a number of fixations are used to explore the entire environment (Hayhoe et al., 2003; Land et al., 1999), presumably to build up a spatial representation of the surroundings. Similarly, when subjects were immersed in the scene in the current experiments, they first executed a sequence of fixations that were distributed almost equally between the different object classes. But when walking along the walkway and interacting with the pickup objects and obstacles, fixations were almost entirely directed toward the objects relevant to the ongoing tasks. 
It is, of course, not possible to control the allocation of attention to the different component tasks. For example, subjects usually avoided obstacles, as expected for overlearned behavior, and obstacles were always fixated to some extent, even when the instructions were to pick up litter objects. Moreover, in all conditions subjects had to navigate down the walkway. It has been shown that gaze in human locomotion can utilize optic flow (Warren, Kay, Zosh, Duchon, & Sahuc, 2001), landmarks, or scale changes (Schrater, Knill, & Simoncelli, 2001) for navigation, and that the degree to which these cues are utilized depends on the task and affects the distribution of gaze accordingly (Turano, Yu, Hao, & Hicks, 2005). Thus, a certain proportion of fixations in all conditions can be interpreted as aiding navigation along the walkway and is therefore not determined solely by the tasks of picking up or avoiding. An added complication in mapping fixations onto tasks is that the different tasks require gaze in different ways and for different durations. Consequently, a large part of the gaze time was spent on pickup objects, which may reflect that gaze was maintained on a pickup object until contact with the body was made, whereas gaze left an obstacle once avoidance could be assured. 
Task-sensitive feature statistics
The different interactions with the objects in the scene that are required by the tasks also determine the statistics of the features at the fixation location. The density of horizontal edges differed between fixations directed toward objects that were approached and objects that were avoided. It has been shown that image feature properties at fixation locations differ from those at randomly chosen image locations (Mannan et al., 1996; Parkhurst & Niebur, 2003; Reinagel & Zador, 1999; Tatler et al., 2005), but this likely results from object properties rather than from low-level image properties that attract visual attention by themselves (Einhäuser & König, 2003). The current experiments clearly show that most fixations are directed toward regions of the scene that are uniform in contrast, have low edge density, and contain no color contrast. More important, the density of horizontal edges differed depending on how subjects interacted with the object: if an object was approached, the edge filter response of the fixated image patch was significantly lower than when an obstacle was avoided. This almost certainly reflects the information being extracted for controlling the body. For example, Johansson et al. (2001) showed that subjects fixate the edge of an obstacle to be avoided by a hand movement. Thus, one cannot draw conclusions about the significance of image properties at fixation in the absence of a known task context. 
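A minimal sketch of such an edge measurement is given below, using first-derivative-of-Gaussian filters on a patch around the fixation point; the patch size and filter scale are arbitrary assumptions, not the values used in the study:

```python
# A minimal sketch of measuring horizontal and vertical edge energy at a fixated
# image patch with first-derivative-of-Gaussian filters; patch size and filter
# scale are placeholder values.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def edge_responses_at_fixation(gray_frame, fix_row, fix_col, half=16, sigma=2.0):
    patch = gray_frame[fix_row - half:fix_row + half, fix_col - half:fix_col + half]
    # First derivative of Gaussian along each axis = smoothed gradient.
    d_vertical = gaussian_filter1d(patch, sigma, axis=0, order=1)    # responds to horizontal edges
    d_horizontal = gaussian_filter1d(patch, sigma, axis=1, order=1)  # responds to vertical edges
    return np.mean(np.abs(d_vertical)), np.mean(np.abs(d_horizontal))

rng = np.random.default_rng(3)
frame = rng.random((240, 320))           # hypothetical grayscale video frame
print(edge_responses_at_fixation(frame, fix_row=120, fix_col=160))
```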
The primary influence of the task on the selection of fixation targets persisted even during trials in which a large number of salient objects were introduced into the scene. Although regions of an image that score high on saliency measures have been shown to be more likely to be selected during the first few fixations in picture viewing (Parkhurst & Niebur, 2003), the current experiments show that this result cannot be generalized to extended navigation tasks such as the one considered here. Interestingly, subjects did fixate these objects of high color and luminance contrast, but only during the initial and final parts of the trial, that is, during the time in which they were not asked to execute a specific task. This result suggests that previous findings obtained under passive viewing of stimuli presented on a screen, in which human gaze was directed toward salient stimuli (Parkhurst & Niebur, 2003) or new objects (Hillstrom & Yantis, 1994), or in which attention was captured by transients (Remington, Johnston, & Yantis, 1992), may need to be reconsidered under different task conditions such as extended visuomotor tasks. 
Context dependence of gaze
A novel finding was the influence of context on gaze location, as demonstrated in Figure 8. Independently of the task instructions, subjects fixated the pickup and obstacle objects in proportion to their areal extent in the field of view. How can this be explained? In this study, a color histogram was used as a global descriptor of the scene: a statistical measure that quantifies the distribution of colors within the subject's field of view at each moment in time. First, color has been demonstrated to be a powerful cue in natural scene recognition in picture-viewing tasks (Oliva & Schyns, 2000; Wichmann, Sharpe, & Gegenfurtner, 2002). Perhaps the simplest explanation is that if a pickup object covers almost the entire field of view in the head-mounted display, only limited alternative gaze targets exist. Second, because the colors in the virtual environment reflect object categories, the proportion of a color in the color histogram directly reflects the distance to the closest object as well as the number of objects in the current field of view. Thus, if there are no litter objects close by, subjects will not be able to gaze at litter but will instead look at obstacles, even if the instructions are to pick up litter. In addition, more complex contextual effects could be influencing gaze (e.g., Chun & Jiang, 1998). Torralba et al. (2006) recently demonstrated that, in a search task, contextual features can be extracted from a large number of labeled images in order to characterize the likelihood of objects being present at specific points in an image. 
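Because object classes are color coded in the virtual environment, such a color histogram can be computed by assigning each pixel to the nearest class color. The sketch below uses placeholder reference colors, not the actual palette of the environment:

```python
# A minimal sketch of the color-histogram scene descriptor: counting how many
# pixels fall near each class color approximates the proportion of the field of
# view occupied by each object class. The reference colors are placeholders.
import numpy as np

CLASS_COLORS = {                      # hypothetical RGB anchors per object class
    "litter":   (128, 0, 128),        # purple
    "obstacle": (0, 0, 255),          # blue
    "walkway":  (128, 128, 128),      # gray
    "lawn":     (0, 128, 0),          # green
}

def color_histogram(rgb_frame):
    """Assign every pixel to the nearest class color and return area proportions."""
    pixels = rgb_frame.reshape(-1, 3).astype(float)
    anchors = np.array(list(CLASS_COLORS.values()), dtype=float)
    dists = np.linalg.norm(pixels[:, None, :] - anchors[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    names = list(CLASS_COLORS.keys())
    return {name: float(np.mean(nearest == i)) for i, name in enumerate(names)}

rng = np.random.default_rng(4)
frame = rng.integers(0, 256, size=(60, 80, 3))   # hypothetical frame
print(color_histogram(frame))
```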
Gaze sharing
It was also observed that subjects take advantage of the layout of the scene in directing their path: the number of fixations directed toward the walkway decreased when the target objects were confined to the walkway. It is possible that subjects use peripheral vision to guide their position on the walkway. However, these results argue instead that the fixations on objects do double duty and serve to aid navigation as well. If subjects used peripheral vision for path control, the walkway fixations would not be expected to depend on the placement of the obstacles and the litter objects. Instead, subjects reduced the proportion of walkway fixations from 12% to 4% when objects were confined to the walkway. This demonstrates that the fixation time spent on the walkway during navigation depends on the layout of the scene and that subjects can take advantage of the fact that the walkway position can be inferred from the obstacle and target positions. 
Comparison with gaze allocation models
The proportions of fixations observed were compared to the specific saliency model proposed by Itti and Koch (2000). This saliency model did not predict the proportions of fixations on the different object classes. Moreover, in the experiments it was observed that the target features depended on the ongoing task, that is, on the type of action the subject was executing: when navigating toward a pickup object, subjects fixated its center, so that a region with low spatial-frequency content was fixated, whereas when navigating around an obstacle, they tended to fixate the edge of the obstacle around which they were steering. Current saliency models do not have sufficient complexity to capture such subtle task-dependent differences in fixation strategy. An advantage of the present paradigm is that it can be directly compared to gaze allocation models, such as that introduced by Sprague and Ballard (2003), which describes a task-based model of gaze allocation in the identical environment. Quantifying human gaze allocation in multiple tasks can therefore be used to further develop such models. 
Conclusions
Although task effects on the control of gaze in natural environments have been extensively documented, the mechanisms by which tasks control gaze are not well explored. To understand more precisely how gaze is allocated in the context of complex extended tasks, we carried out a series of experiments in a naturalistic virtual reality environment that supported eye tracking in a head-mounted display. This environment allowed quantitative evaluation of gaze allocation as a function of the ongoing task. The experiments demonstrate that, in the execution of extended natural tasks, human gaze is directed toward regions of the visual scene that are determined primarily by the task requirements; bottom-up saliency is not a good predictor of the direction of human gaze under such circumstances. 
Acknowledgments
This research was supported by National Institutes of Health Grants EY05729 and RR09283. 
Commercial relationships: none. 
Corresponding author: Constantin A. Rothkopf. 
Email: crothkopf@cvs.rochester.edu. 
Address: Meliora Hall, Rochester, NY 14627. 
References
Baddeley, R. J., & Tatler, B. W. (2006). High frequency edges (but not contrast) predict where we fixate: A Bayesian system identification analysis. Vision Research, 46, 2824–2833.
Ballard, D., Hayhoe, M. M., Pook, P. K., & Rao, R. P. (1997). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20, 723–767.
Ballard, D. H. (1991). Animate vision. Artificial Intelligence Journal, 48, 57–86.
Ballard, D. H., Hayhoe, M. M., & Pelz, J. B. (1995). Memory representations in natural tasks. Journal of Cognitive Neuroscience, 7, 68–82.
Brouwer, A., Franz, V. H., & Gegenfurtner, K. R. (in press). Differences between fixations when objects are grasped or only viewed.
Buswell, G. T. (1935). How people look at pictures: A study of the psychology of perception in art. Chicago: University of Chicago Press.
Chun, M. M., & Jiang, Y. (1998). Contextual cueing: Implicit learning and memory of visual context guides spatial attention. Cognitive Psychology, 36, 28–71.
Deubel, H., Shimojo, S., & Paprotta, I. (1997). The preparation of goal-directed eye movements requires visual attention: Evidence from the line–motion illusion. Perception, 26, 72.
Droll, J. A., Hayhoe, M. M., Triesch, J., & Sullivan, B. T. (2005). Task demands control acquisition and storage of visual information. Journal of Experimental Psychology: Human Perception and Performance, 31, 1416–1438.
Einhäuser, W., & König, P. (2003). Does luminance-contrast contribute to a saliency map for overt visual attention? European Journal of Neuroscience, 17, 1089–1097.
Findlay, J. M., & Gilchrist, I. D. (2003). Active vision: The psychology of looking and seeing. Oxford, UK: Oxford University Press.
Flanagan, J. R., & Johansson, R. S. (2003). Action plans used in action observation. Nature, 424, 769–771.
Frens, M. A., & Erkelens, C. J. (1991). Coordination of hand movements and saccades: Evidence for a common and separate pathway. Experimental Brain Research, 85, 682–690.
Glimcher, P. W. (2003). The neurobiology of visual-saccadic decision making. Annual Review of Neuroscience, 26, 133–179.
Hayhoe, M. (2000). Visual routines: A functional account of vision.
Hayhoe, M., & Ballard, D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9, 188–194.
Hayhoe, M. M. (2004). Advances in relating eye movements and cognition. Infancy, 6, 267–274.
Hayhoe, M. M., Shrivastava, A., Mruczek, R., & Pelz, J. B. (2003). Visual memory and motor planning in a natural task. Journal of Vision, 3(1):6, 49–63, http://journalofvision.org/3/1/6/, doi:10.1167/3.1.6.
He, P. Y., & Kowler, E. (1991). Saccadic localization of eccentric forms. Journal of the Optical Society of America A, Optics and Image Science, 8, 440–449.
Henderson, J. M. (2003). Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7, 498–504.
Henderson, J. M., Brockmole, J. R., Castelhano, M. S., & Mack, M. (2006). Image salience versus cognitive control of eye movements in real-world scenes: Evidence from visual search. In R. van Gompel, M. Fischer, W. Murray, & R. Hill (Eds.), Eye movement research: Insights into mind and brain. Oxford, England: Elsevier.
Henderson, J. M., & Hollingworth, A. (1999). High-level scene perception. Annual Review of Psychology, 50, 243–271.
Hillstrom, A. P., & Yantis, S. (1994). Visual motion and attentional capture. Perception & Psychophysics, 55, 399–411.
Itti, L. (2000). Models of bottom-up and top-down visual attention.
Itti, L. (2005). Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes. Visual Cognition, 12, 1093–1123.
Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of attention. Vision Research, 40, 1489–1506.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1254–1259.
Johansson, R. S., Westling, G., Bäckström, A., & Flanagan, J. R. (2001). Eye–hand coordination in object manipulation. Journal of Neuroscience, 21, 6917–6932.
Kayser, C., Körding, K. P., & König, P. (2004). Processing of complex stimuli and natural scenes in the visual cortex. Current Opinion in Neurobiology, 14, 468–473.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.
Kowler, E., Anderson, E., Dosher, B., & Blaser, E. (1995). The role of attention in the programming of saccades. Vision Research, 35, 1897–1916.
Krieger, G., Rentschler, I., Hauske, G., Schill, K., & Zetzsche, C. (2000). Object and scene analysis by saccadic eye-movements: An investigation with higher-order statistics. Spatial Vision, 13, 201–214.
Land, M. (2004). Eye movements in daily life. In L. Chalupa & J. Werner (Eds.), The visual neurosciences (Vol. 2, pp. 1357–1368). Cambridge, MA: MIT Press.
Land, M., & Hayhoe, M. (2001). In what ways do eye movements contribute to everyday activities? Vision Research, 41, 3559–3566.
Land, M., Mennie, N., & Rusted, J. (1999). The roles of vision and eye movements in the control of activities of daily living. Perception, 28, 1311–1328.
Land, M. F., & Lee, D. N. (1994). Where we look when we steer. Nature, 369, 742–744.
Land, M. F., & McLeod, P. (2000). From eye movements to actions: How batsmen hit the ball. Nature Neuroscience, 3, 1340–1345.
Li, W., Piech, V., & Gilbert, C. D. (2004). Perceptual learning and top-down influences in primary visual cortex. Nature Neuroscience, 7, 651–657.
Liversedge, S. P., & Findlay, J. M. (2000). Saccadic eye movements and cognition. Trends in Cognitive Sciences, 4, 6–14.
Mannan, S. K., Ruddock, K. H., & Wooding, D. S. (1996). The relationship between the location of spatial features and those of fixations made during visual examination of briefly presented images. Spatial Vision, 10, 165–188.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.
Navalpakkam, V., & Itti, L. (2005). Modeling the influence of task on attention. Vision Research, 45, 205–231.
Nelson, J. D., & Cottrell, G. W. (2007). A probabilistic model of eye movements in concept formation. Neurocomputing, 70, 2256–2272.
Oliva, A., & Schyns, P. G. (2000). Diagnostic colors mediate scene recognition. Cognitive Psychology, 41, 176–210.
Oliva, A., Torralba, A., Castelhano, M. S., & Henderson, J. M. (2003). Top-down control of visual attention in object detection. Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain.
Olshausen, B. A., & Field, D. J. (2005). How close are we to understanding V1? Neural Computation, 17, 1665–1699.
O'Regan, J. K., & Noë, A. (2001). A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24, 939–1031.
Paprotta, I., Deubel, H., & Schneider, W. X. (1999). Object recognition and goal-directed eye or hand movements are coupled by visual attention. In W. Becker, H. Deubel, & Th. Mergner (Eds.), Current oculomotor research: Physiological and psychological aspects (pp. 241–248). New York: Plenum.
Parkhurst, D. J., & Niebur, E. (2003). Scene content selected by active vision. Spatial Vision, 16, 125–154.
Pelz, J. B., Canosa, R. L., & Kucharcyk, D. (2000). Portable eyetracking: A study of natural eye movements. Proceedings of SPIE, Human Vision and Electronic Imaging.
Pelz, J. B., & Rothkopf, C. (2007). Oculomotor behavior in natural and man-made environments. In R. van Gompel, M. Fischer, W. Murray, & R. Hill (Eds.), Eye movement research: Insights into mind and brain. Oxford, England: Elsevier.
Posner, M. I., & Cohen, Y. (1984). Components of visual orienting. In H. Bouma & D. Bouwhis (Eds.), Attention and performance X. Hillsdale, NJ: Erlbaum.
Razzaque, S., Swapp, D., Slater, M., Whiton, M., & Steed, A. (2002). Redirected walking in space.
Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network, 10, 341–350.
Remington, R. W., Johnston, J. C., & Yantis, S. (1992). Involuntary attentional capture by abrupt onsets. Perception & Psychophysics, 51, 279–290.
Schrater, P. R., Knill, D. C., & Simoncelli, E. P. (2001). Perceiving visual expansion without optic flow. Nature, 410, 816–819.
Schultz, W. (2000). Multiple reward signals in the brain. Nature Reviews Neuroscience, 1, 199–207.
Shinoda, H., Hayhoe, M. M., & Shrivastava, A. (2001). What controls attention in natural environments? Vision Research, 41, 3535–3546.
Simons, D. J., & Rensink, R. A. (2005). Change blindness: Past, present, and future. Trends in Cognitive Sciences, 9, 16–20.
Sprague, N., & Ballard, D. H. (2003). Eye movements for reward maximization.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268, 1632–1634.
Tatler, B. W., Baddeley, R. J., & Gilchrist, I. D. (2005). Visual correlates of fixation selection: Effects of scale and time. Vision Research, 45, 643–659.
Torralba, A., Oliva, A., Castelhano, M., & Henderson, J. M. (2006). Contextual guidance of eye movements and attention in real-world scenes: The role of global features on object search. Psychological Review, 113, 766–786.
Triesch, J., Ballard, D. H., Hayhoe, M. M., & Sullivan, B. T. (2003). What you see is what you need. Journal of Vision, 3(1):9, 86–94, http://journalofvision.org/3/1/9/, doi:10.1167/3.1.9.
Trommershäuser, J., Maloney, L. T., & Landy, M. S. (2003). Statistical decision theory and trade-offs in motor response. Spatial Vision, 16, 255–275.
Turano, K. A., Yu, D., Hao, L., & Hicks, J. C. (2005). Optic-flow and egocentric-direction strategies in walking: Central vs peripheral visual field. Vision Research, 45, 3117–3132.
Warren, W. H., Jr., Kay, B. A., Zosh, W. D., Duchon, A. P., & Sahuc, S. (2001). Optic flow is used to control human walking. Nature Neuroscience, 4, 213–216.
Wichmann, F. A., Sharpe, L. T., & Gegenfurtner, K. R. (2002). The contributions of color to recognition memory for natural scenes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 509–520.
Wolfe, J. M. (1998). Visual search. In H. Pashler (Ed.), Attention. London, UK: University College London Press.
Yarbus, A. (1967). Eye movements and vision. New York: Plenum.
Figure 1
 
Subject's view of the walkway from the starting position in three different conditions. Left: “pickup purple litter and avoid blue obstacles” condition with normal spatial distribution of objects on the walkway. Middle: same condition as in the left view but with a tighter distribution of objects. Right: normal spatial distribution of objects on the walkway with additional salient objects in the scene.
Figure 2
 
Center: typical view of the subject during execution of the task showing the walkway with blue obstacles and purple pickup objects, the cityscape in the background. Left: schematic representation of the scene context represented by the color histogram of the scene, which shows the proportion of colors summed over the entire field of view. Right: representation of the image patch at two fixation locations (see text). Note how gaze directed to the central region of a solid object is represented as a solid line whereas gaze directed toward an edge is represented as a succession of stripes.
Figure 3
 
Visualization as introduced in Figure 2 of the image patch at the center of gaze for subject M. L. in two different conditions. The X-axis corresponds to the normalized trial duration and the Y-axis is foveal eccentricity in degrees. Top: “pickup purple litter”; bottom: “avoid blue obstacles.” The duration of the entire trial consists of the time the subject spent immersed in the environment and listening to the instructions, the time during the execution of the task, and the time during which the subject has finished the task and is waiting to start the next trial. The figure clearly shows the predominance of fixations on purple objects for the pickup task and a similar predominance of fixations on blue objects in the obstacle avoidance task.
Figure 4
 
Comparison of the gaze over time for 10 subjects labeled S1 to S10 using the visualization introduced in Figure 2. Note that the sequences shown in Figure 3 represent the entire trial duration, whereas the sequences shown here are taken only from the time during the trial in which the subjects were moving. Left: gaze visualization during the execution of the task “pickup purple objects.” Right: gaze visualization during the execution of the task “avoid blue obstacles.”
Figure 5
 
Proportion of fixation time spent on the object classes across subjects ( n = 19) for four different tasks. The shown proportions are colored according to the color convention depicted in Figure 1; that is, (1) purple represents pickup objects, (2) blue represents obstacles, (3) gray represents the walkway, (4) green represents the lawn and the tree, (5) represents the background buildings, and (6) light green represents the salient distractors. The diagrams were obtained by averaging over the two color conditions. From left to right: “pickup purple litter,” “avoid blue obstacles,” “pickup purple litter and avoid blue obstacles,” “pickup purple litter and avoid blue obstacles” with salient distractors in the scene. Error bars are ±1 SEM across subjects.
Figure 6
 
Horizontal and vertical marginal distributions of gaze targets on litter objects in the “pickup” condition (left) and on obstacles in the “avoid” condition (right). These distributions were obtained using data from all 19 subjects. The plots with targets marked on the objects were obtained from the data of subject M. B. and are representative of the entire data set.