Abstract
What jumps out in a single glance at an image is different from what you might notice after closer inspection. Despite this, current computational models of visual saliency predict human gaze patterns at an arbitrary, fixed viewing duration (one image: one saliency map). This offers a limited view of the rich interactions between image content and gaze, and obscures the fact that different image content might be salient at different time points. In this paper we propose to capture gaze as a series of snapshots (one image: multiple saliency maps). Rather than aggregating individual scanpaths, we directly generate population-level saliency heatmaps for multiple viewing durations. Towards this goal, we turn to the CodeCharts UI, a cost-effective interface for crowdsourcing gaze data without requiring an eye tracker. This interface provides precise control over timing, which allows us to gather attention patterns at different viewing durations. We collect the CodeCharts1K dataset with attention data for 0.5, 3, and 5 seconds of free-viewing on images from action, memorability, and out-of-context datasets. We find that gaze locations differ significantly across the three viewing durations but are consistent across participants within a duration, leading to multiple distinct heatmaps per image. Using insights from our analysis of human gaze data, we develop a temporally-aware deep learning model of saliency that is trained simultaneously on data from multiple viewing durations. Our computational model achieves competitive performance on the LSUN 2017 Saliency Prediction Challenge when tested at the same viewing duration used to collect the ground-truth human data. Importantly, our model also produces predictions for multiple viewing durations simultaneously. We discuss how knowing what is salient over different viewing windows can be used for image cropping, compression, and captioning applications.