Abstract
Humans view images in scanpaths of fixations, moving their gaze over the image to explore its interesting parts. Which factors govern such scanpaths, and how they change over time, has been the subject of substantial research. The deep-learning-based DeepGaze III model currently sets the state of the art in free-viewing scanpath prediction on natural images. It combines a spatial prediction module, which captures the influence of scene content on fixation placement, with a scanpath history module, which captures the influence of earlier fixations and therefore the dynamics of the scanpath. Here, we conduct a series of ablation studies in which we train variants of DeepGaze III with no access to scene content, scanpath history, or both, and analyse how well fixations are predicted over the course of free-viewing scanpaths. We find that the overall predictability of fixations decays substantially over the course of scanpaths. Comparing the ablated models allows us to attribute this decay nearly completely to a decay in the influence of scene content: early fixations focus on the most salient image areas. Additionally, there is an influence of the initial fixation, which has mostly decayed after one saccade. Interestingly, the effect of scanpath history stays nearly constant over the scanpaths, suggesting that the dynamics of scene viewing do not change substantially over time. Given the capacity of our models relative to the size of the dataset, we can assume that the models are able to capture all substantive ways in which fixations could depend on image content and previous fixations (in the context of free viewing). This avoids the possible confounder that the models are simply not good enough to predict fixations well. This study provides an example of how deep-learning-based models can be used to further our understanding of human behaviour.
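
The ablation comparison described above can be illustrated with a minimal sketch. It assumes that each model variant yields one score per fixation (for example a log-likelihood, which is our assumption here and is not stated in the abstract) and that the effect of a component is read off as the score difference between the full model and the variant lacking that component. The function names and the toy numbers are hypothetical and chosen purely for illustration; they are not results from the study.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_index(records):
    """Average per-fixation scores by position in the scanpath.

    records: iterable of (fixation_index, score) pairs.
    Returns {fixation_index: mean score}, ordered by index.
    """
    by_index = defaultdict(list)
    for index, score in records:
        by_index[index].append(score)
    return {index: mean(scores) for index, scores in sorted(by_index.items())}


def attribute_effects(full, no_scene, no_history, baseline):
    """Per-index score differences between the full model and its ablations.

    Each argument maps fixation_index -> mean score for one variant:
    full (both inputs), no_scene (no scene content), no_history (no
    scanpath history), baseline (neither). This differencing scheme is
    an assumption of this sketch, not necessarily the paper's analysis.
    """
    effects = {}
    for index in sorted(full):
        effects[index] = {
            "total": full[index] - baseline[index],
            "scene_content": full[index] - no_scene[index],
            "scanpath_history": full[index] - no_history[index],
        }
    return effects


# Made-up toy scores per fixation index (NOT data from the study).
full = {1: 2.5, 2: 2.0, 3: 1.75}
no_scene = {1: 0.75, 2: 0.75, 3: 0.75}
no_history = {1: 1.75, 2: 1.25, 3: 1.0}
baseline = {1: 0.0, 2: 0.0, 3: 0.0}

for index, effect in attribute_effects(full, no_scene, no_history, baseline).items():
    print(index, effect)
```

Plotting each effect against the fixation index would then show which component accounts for a change in predictability over the scanpath: an effect that shrinks with the index points to a decaying influence, whereas a flat one points to a constant influence.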