Open Access
Vision Sciences Society Annual Meeting Abstract  |   October 2020
Measuring the importance of temporal features in video saliency
Author Affiliations & Notes
  • Matthias Tangemann
    University of Tübingen
  • Matthias Kümmerer
    University of Tübingen
  • Thomas S.A. Wallis
    University of Tübingen
    Amazon Research Tübingen (this work was done prior to joining Amazon)
  • Matthias Bethge
    University of Tübingen
  • Footnotes
    Acknowledgements  We acknowledge support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A) and from the German Science Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, project number 276693517.
Journal of Vision October 2020, Vol.20, 1061. doi:https://doi.org/10.1167/jov.20.11.1061
Citation: Matthias Tangemann, Matthias Kümmerer, Thomas S.A. Wallis, Matthias Bethge; Measuring the importance of temporal features in video saliency. Journal of Vision 2020;20(11):1061. https://doi.org/10.1167/jov.20.11.1061.

Models predicting human gaze positions on still images have improved greatly in recent years. Since motion patterns are also an important factor driving human gaze (Rosenholtz 1999, Itti 2005, Dorr et al. 2010), there is growing interest in modelling human gaze positions on videos. In our work, we explore to what extent human gaze positions on recent video saliency benchmarks can be explained by static features. We apply models that by design cannot learn temporal patterns to the LEDOV and DIEM datasets and compare them to a gold standard model as an estimate of the explainable information. We first consider DeepGaze II (Kümmerer et al. 2017), the current state-of-the-art model for images, applying it to every frame individually. To incorporate the time lag in human responses, we consider two adaptations of DeepGaze II that predict gaze positions on the last frame of a fixed-length window: first, we temporally average the predictions of DeepGaze II. Additionally, we propose a new model, "DeepGaze MR", that temporally averages image features and uses an adapted, nonlinear readout network to predict gaze positions. By design, all model variants remain unable to detect movements, appearances, or interactions of objects. Our new model substantially outperforms previous video saliency models and explains 75% of the information on the LEDOV dataset and 43% on the DIEM dataset. By analyzing failure cases of our model, we find that clear temporal effects on human gaze placement exist, but are rare in the benchmarks considered. Moreover, none of the recent video saliency models considered is able to predict human gaze in those cases better than our static baselines. To foster the data-driven modelling of temporal features affecting human gaze, we propose a meta-benchmark consisting of the hard cases found by our analysis.
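The first static baseline described above (temporally averaging per-frame predictions over a fixed-length window) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `predict_frame_saliency` is a trivial hypothetical stand-in for a real per-frame model such as DeepGaze II, and the window averaging is the only part that mirrors the adaptation described in the abstract.

```python
import numpy as np

def predict_frame_saliency(frame):
    # Hypothetical placeholder for a per-frame saliency model (e.g. DeepGaze II).
    # Here we simply normalize pixel intensities into a probability map.
    s = frame.astype(float) + 1e-9
    return s / s.sum()

def temporally_averaged_saliency(frames):
    """Static baseline: predict gaze on the last frame of a window by
    averaging the per-frame saliency predictions over that window.
    By construction, this cannot capture motion or temporal structure."""
    preds = np.stack([predict_frame_saliency(f) for f in frames])
    avg = preds.mean(axis=0)
    return avg / avg.sum()  # renormalize to a valid probability map
```

Because the prediction for the window's last frame depends only on a per-frame static model, any gaze behavior this baseline fails to explain is a candidate temporal effect, which is the logic behind the failure-case analysis above.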

