Abstract
Due to the technical challenges of real-world eye tracking, most of our knowledge about how we direct our gaze in scenes comes from laboratory studies in which observers remain still while viewing static images for a prolonged period. How do we direct our gaze when walking through a familiar environment? And what is the relationship between gaze and the spatial distribution of low- and high-level visual structure under these conditions? Eleven participants recorded first-person videos with binocular eye tracking (Pupil Labs) while repeatedly walking around a college campus pond over the course of a year. Each of the 97 walks began with a 16-point calibration procedure and was followed by a validation procedure to estimate gaze error. We identified four geographic regions of interest within the pond environment and extracted two-second clips from each region in each walk. For each first-person video frame, we computed maps from two popular models of gaze behavior: saliency (Itti & Koch, 2001) and meaning maps (Henderson & Hayes, 2017). In addition, maps of scene aesthetics were created by asking 100 observers on Prolific to click on the ten most beautiful regions of each image. We quantified the spatial spread of each of these three maps using entropy. As expected, all maps were broadly distributed across each frame (saliency: 21.2 bits; beauty: 20.2 bits; meaning: 21.5 bits). In contrast, gaze within the frames was highly focal, often confined to a narrow region on the path in front of the participant, and its entropy was markedly lower than that of the predictors: 14.8 bits. These results suggest that although saliency and meaning are important predictors of gaze for stationary observers, they do not contribute strongly to active vision, which may rely more on navigationally relevant features.
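To make the entropy comparison concrete, the sketch below shows one plausible way to compute Shannon entropy (in bits) over a per-frame map, assuming each map (saliency, meaning, beauty, or gaze density) is normalized to a probability distribution over pixels; the function name and frame size are illustrative assumptions, not the authors' code. A uniform map over an HD frame yields roughly 21 bits, matching the scale of the reported predictor values, while a map concentrated on a small region yields far fewer bits, as with the reported gaze entropy.

```python
import numpy as np

def map_entropy(weight_map, eps=1e-12):
    """Shannon entropy (bits) of a non-negative 2D map, treated as a
    probability distribution over pixels (illustrative sketch)."""
    p = np.asarray(weight_map, dtype=float).ravel()
    p = p / (p.sum() + eps)       # normalize so the map sums to 1
    p = p[p > 0]                  # zero-probability pixels contribute nothing
    return float(-(p * np.log2(p)).sum())

# Example: a uniform map vs. a map concentrated on a small patch.
uniform = np.ones((1080, 1920))           # hypothetical HD frame
focal = np.zeros((1080, 1920))
focal[500:540, 940:980] = 1.0             # 40 x 40 pixel region
print(map_entropy(uniform))               # ~21.0 bits
print(map_entropy(focal))                 # ~10.6 bits
```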