Matthias Kümmerer, Thomas Wallis, Matthias Bethge; Extending DeepGaze II: Scanpath prediction from deep features. Journal of Vision 2018;18(10):371. doi: 10.1167/18.10.371.
© ARVO (1962-2015); The Authors (2016-present)
Predicting where humans choose to fixate can help us understand a variety of human behaviour. Recent years have seen substantial progress in predicting spatial fixation distributions for the viewing of static images. Our own model "DeepGaze II" (Kümmerer et al., ICCV 2017) extracts features from input images using the pretrained VGG deep neural network and uses a simple pixelwise readout network to predict fixation distributions from these features. DeepGaze II is state-of-the-art for predicting free-viewing fixation densities according to the established MIT Saliency Benchmark. However, DeepGaze II predicts only spatial fixation distributions, not scanpaths. The model therefore ignores crucial structure in the fixation selection process. Here we extend DeepGaze II to predict fixation densities conditioned on the previous scanpath. We add feature maps encoding the previous scanpath (e.g. the distance of image pixels to previous fixations) to the input of the readout network. Apart from these few additional feature maps, the architecture is identical to that of DeepGaze II. The model is trained on ground truth human fixation data (MIT1003) using maximum-likelihood optimization. Even using only the last fixation location increases performance by approximately 30% relative to DeepGaze II and reproduces the strong spatial fixation clustering effect reported previously (Engbert et al., JoV 2015). This contradicts the way Inhibition of Return has often been used in computational models of fixation selection. Using a history of two fixations increases performance further, and the model learns a suppression effect around the earlier fixation location. Because our model is probabilistic, we can sample new scanpaths from it; these capture the statistics of human scanpaths much better than scanpaths sampled from a purely spatial distribution.
The modular architecture of our model allows us to explore the effects of many different possible factors on fixation selection.
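The following is a minimal NumPy sketch, not the authors' implementation, of the three ideas described above: distance-to-previous-fixation feature maps, a fixation density conditioned on the last fixation, and sampling of scanpaths from that density. Everything in it is an assumption made for illustration. The function names, the (x, y) pixel convention, and the parameters `sigma` and `boost` are hypothetical, and the hand-set Gaussian boost near the last fixation merely stands in for the learned readout network.

```python
import numpy as np


def distance_feature_maps(height, width, previous_fixations):
    """Return one map per previous fixation, holding each pixel's
    Euclidean distance (in pixels) to that fixation.
    Fixations are (x, y) tuples; this convention is an assumption."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [np.hypot(xs - fx, ys - fy) for fx, fy in previous_fixations]
    return np.stack(maps, axis=0)  # shape: (n_fixations, height, width)


def conditional_density(spatial_log_density, distance_maps, sigma=30.0, boost=2.0):
    """Hypothetical stand-in for the learned readout: raise the log density
    near the most recent fixation, then normalise with a softmax."""
    last = distance_maps[-1]
    logits = spatial_log_density + boost * np.exp(-0.5 * (last / sigma) ** 2)
    logits = logits - logits.max()  # numerical stability before exponentiating
    p = np.exp(logits)
    return p / p.sum()


def scanpath_log_likelihood(spatial_log_density, scanpath):
    """Sum of log p(next fixation | last fixation) over a scanpath,
    i.e. the quantity a maximum-likelihood fit would increase."""
    h, w = spatial_log_density.shape
    total = 0.0
    for i in range(1, len(scanpath)):
        maps = distance_feature_maps(h, w, scanpath[i - 1:i])
        p = conditional_density(spatial_log_density, maps)
        x, y = scanpath[i]
        total += float(np.log(p[y, x]))
    return total


def sample_scanpath(spatial_log_density, start, n_fixations, rng):
    """Draw fixations one at a time, each conditioned on the previous one."""
    h, w = spatial_log_density.shape
    scanpath = [start]
    for _ in range(n_fixations):
        maps = distance_feature_maps(h, w, scanpath[-1:])
        p = conditional_density(spatial_log_density, maps)
        idx = int(rng.choice(h * w, p=p.ravel()))
        y, x = divmod(idx, w)  # row-major flat index -> (row, column)
        scanpath.append((x, y))
    return scanpath


# Usage with a flat (uniform) spatial prior on a small 48x64 image.
rng = np.random.default_rng(0)
prior = np.zeros((48, 64))
path = sample_scanpath(prior, start=(32, 24), n_fixations=5, rng=rng)
ll = scanpath_log_likelihood(prior, path)
```

In this sketch, conditioning on a history of two fixations would amount to passing `scanpath[-2:]` to `distance_feature_maps` and letting the readout use both maps, for instance with a negative weight on the earlier one to mimic the suppression effect described above.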
Meeting abstract presented at VSS 2018