September 2018
Volume 18, Issue 10
Open Access
Vision Sciences Society Annual Meeting Abstract  |   September 2018
Extending DeepGaze II: Scanpath prediction from deep features
Author Affiliations
  • Matthias Kümmerer
    Werner-Reichardt-Centre for Integrative Neuroscience, University Tübingen
  • Thomas Wallis
    Werner-Reichardt-Centre for Integrative Neuroscience, University Tübingen; Bernstein Center for Computational Neuroscience, Tübingen
  • Matthias Bethge
    Werner-Reichardt-Centre for Integrative Neuroscience, University Tübingen; Bernstein Center for Computational Neuroscience, Tübingen
Journal of Vision September 2018, Vol.18, 371. doi:
      Matthias Kümmerer, Thomas Wallis, Matthias Bethge; Extending DeepGaze II: Scanpath prediction from deep features. Journal of Vision 2018;18(10):371.

      © ARVO (1962-2015); The Authors (2016-present)

Predicting where humans choose to fixate can help us understand a variety of human behaviours. Recent years have seen substantial progress in predicting spatial fixation distributions for static images. Our own model "DeepGaze II" (Kümmerer et al., ICCV 2017) extracts features of input images from the pretrained VGG deep neural network and uses a simple pixelwise readout network to predict fixation distributions from these features. DeepGaze II is state-of-the-art for predicting free-viewing fixation densities according to the established MIT Saliency Benchmark. However, DeepGaze II predicts only spatial fixation distributions rather than scanpaths, and therefore ignores crucial structure in the fixation selection process. Here we extend DeepGaze II to predict fixation densities conditioned on the previous scanpath. We add feature maps encoding the previous scanpath (e.g. the distance of image pixels to previous fixations) to the input of the readout network. Apart from these few additional feature maps, the architecture is identical to that of DeepGaze II. The model is trained on ground-truth human fixation data (MIT1003) using maximum-likelihood optimization. Even using only the last fixation location increases performance by approximately 30% relative to DeepGaze II and reproduces the strong spatial fixation clustering effect reported previously (Engbert et al., JoV 2015). This clustering contradicts the way Inhibition of Return has often been used in computational models of fixation selection. Using a history of two fixations increases performance further, and the model learns a suppression effect around the earlier fixation location. Because our model is probabilistic, we can sample new scanpaths from it that capture the statistics of human scanpaths much better than scanpaths sampled from a purely spatial distribution.
The modular architecture of our model allows us to explore the effects of many different possible factors on fixation selection.
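
The conditioning idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the VGG features and the learned readout network are replaced by a given map of spatial logits, and the scanpath dependence is a hypothetical toy readout (a weighted Gaussian bump of the distance to each previous fixation). All function names, the `weights` vector (most recent fixation first) and the `scale` parameter are assumptions made for illustration. The sketch shows distance feature maps for previous fixations, a normalized conditional fixation density, the scanpath log-likelihood that maximum-likelihood training would maximize (the sum of log p(f_i | f_1, ..., f_{i-1}) over fixations), and sampling of new scanpaths from the model.

```python
import numpy as np

def distance_feature_maps(shape, previous_fixations):
    """One map per previous fixation (x, y): pixel distance to that fixation."""
    height, width = shape
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [np.hypot(xs - fx, ys - fy) for fx, fy in previous_fixations]
    return np.stack(maps, axis=0)  # shape: (n_fixations, height, width)

def log_softmax_2d(logits):
    """Normalize a 2D logit map into a log-density over pixels."""
    flat = logits.ravel()
    flat = flat - flat.max()  # subtract the maximum for numerical stability
    return (flat - np.log(np.exp(flat).sum())).reshape(logits.shape)

def conditional_log_density(spatial_logits, dist_maps, weights, scale=20.0):
    """Hypothetical toy readout: spatial logits plus weighted distance bumps.

    A positive weight attracts fixations towards the corresponding previous
    fixation (clustering); a negative weight suppresses the area around it.
    """
    bumps = np.exp(-0.5 * (dist_maps / scale) ** 2)
    logits = spatial_logits + np.tensordot(weights, bumps, axes=1)
    return log_softmax_2d(logits)

def scanpath_log_likelihood(spatial_logits, scanpath, weights, scale=20.0):
    """Sum of log p(f_i | previous fixations); the first fixation is given."""
    total = 0.0
    for i in range(1, len(scanpath)):
        prev = scanpath[:i][::-1][:len(weights)]  # most recent first
        dist = distance_feature_maps(spatial_logits.shape, prev)
        log_p = conditional_log_density(
            spatial_logits, dist, weights[:len(prev)], scale)
        x, y = scanpath[i]
        total += log_p[y, x]
    return total

def sample_scanpath(spatial_logits, start, n_fixations, weights, rng,
                    scale=20.0):
    """Draw fixations one at a time, each conditioned on the history so far."""
    history = [start]
    for _ in range(n_fixations):
        prev = history[::-1][:len(weights)]  # most recent first
        dist = distance_feature_maps(spatial_logits.shape, prev)
        log_p = conditional_log_density(
            spatial_logits, dist, weights[:len(prev)], scale)
        p = np.exp(log_p).ravel()
        p /= p.sum()  # guard against floating-point drift
        idx = rng.choice(p.size, p=p)
        y, x = np.unravel_index(idx, log_p.shape)
        history.append((int(x), int(y)))
    return history
```

In this toy setting, a call such as `sample_scanpath(np.zeros((32, 48)), (10, 10), 5, np.array([2.0, -1.0]), np.random.default_rng(0))` draws five further fixations with attraction to the last fixation and suppression around the one before it, loosely mirroring the two effects reported in the abstract. In the actual model, the distance maps would instead be fed to the readout network alongside the deep features, and the dependence on them would be learned from data.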

Meeting abstract presented at VSS 2018

