Abstract
The DeepGaze III model currently sets the state of the art in predicting free-viewing human scanpaths on natural images by predicting future fixations from the observed image and recent fixation locations. Inspired by gain control mechanisms in neuroscience, we introduce gain control layers into the network architecture that modulate the activity of certain channels of the network depending on additional factors, such as observer biases or search targets. By comparing the prediction performance of the baseline model with that of such an extended model in terms of information gain, we can quantify the amount of information that additional factors contribute to fixation placement. Due to the modular DeepGaze III architecture, we can decompose this information gain into three components: (1) a component affecting only the modulation amplitude of the fixation distribution, (2) a component modulating which image features are salient, and (3) a component affecting the scanpath dynamics. Applying this approach, we quantify how much a fixation's index in a scanpath, subject identity, and search targets affect scanpaths in free viewing and visual search. For free viewing, we find that fixation index and subject identity contribute to a similar degree to fixation placement. In the case of fixation index, the information is split equally between a part making the fixation density more uniform over time and a part changing which image features are salient. The contribution of subject identity is mostly due to different subjects preferring different image features. For visual search on the COCO-Search18 dataset, knowing the search target increases the explained information by 18% compared to knowing only the presented image, suggesting substantial similarities in fixation behavior across targets.
Our work demonstrates how contrast gain control can be used as a very general and sample-efficient mechanism to flexibly modify neural network computation to account for additional factors of interest.
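To make the mechanism concrete, the following is a minimal sketch (not the authors' implementation) of a multiplicative gain control layer: a linear readout maps an embedding of the additional factor (e.g. subject identity, fixation index, or search target) to per-channel log-gains, which then rescale the channel activations. The function name `gain_control` and the parameters `W` and `b` are illustrative assumptions.

```python
import numpy as np

def gain_control(features, factor_embedding, W, b):
    """Scale each feature channel by a positive gain computed from an
    external factor embedding.

    features: (channels, height, width) activation map
    factor_embedding: (d,) vector encoding the additional factor
    W: (channels, d) and b: (channels,) parameters of a linear readout
       producing per-channel log-gains
    """
    log_gains = W @ factor_embedding + b      # (channels,)
    gains = np.exp(log_gains)                 # exponentiate: gains stay positive
    return features * gains[:, None, None]    # broadcast over spatial dims
```

With `W` and `b` initialized to zero, every gain equals 1 and the layer is the identity, so the extended model starts out equivalent to the baseline; the comparison in information gain then isolates what the additional factor contributes.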