Abstract
Models predicting human gaze positions on still images have improved greatly in recent years. Since motion patterns are also an important factor driving human gaze (Rosenholtz 1999; Itti 2005; Dorr et al. 2010), there is growing interest in modelling human gaze positions on videos.
In our work, we explore to what extent human gaze positions on recent video saliency benchmarks can be explained by static features alone. We apply models that, by design, cannot learn temporal patterns to the LEDOV and DIEM datasets and compare them to a gold standard model as an estimate of the explainable information. We first consider DeepGaze II (Kümmerer et al. 2017), the current state-of-the-art model for still images, applied to every frame individually. To account for the time lag in human responses, we consider two adaptations of DeepGaze II that predict gaze positions on the last frame of a fixed-length window: first, we temporally average the predictions of DeepGaze II; second, we propose a new model, "DeepGaze MR", that temporally averages image features and uses an adapted, nonlinear readout network to predict gaze positions. By design, none of these model variants can detect movements, appearances, or interactions of objects.
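To make the two window-based baselines concrete, the following minimal sketch illustrates the difference between averaging per-frame predictions and averaging image features before a readout. All function names (predict_saliency, extract_features, readout_network) are hypothetical placeholders standing in for the actual DeepGaze II components, not the authors' implementation.

```python
# Minimal sketch (hypothetical interfaces) of the two static baselines:
# (a) temporally averaging per-frame saliency predictions, and
# (b) temporally averaging deep image features before a readout network.
import numpy as np

def predict_saliency(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a per-frame saliency model (e.g. DeepGaze II on one frame)."""
    return frame.mean(axis=-1)  # dummy stand-in: mean over colour channels

def extract_features(frame: np.ndarray) -> np.ndarray:
    """Placeholder for deep image features of a single frame."""
    return frame  # dummy stand-in: identity

def readout_network(features: np.ndarray) -> np.ndarray:
    """Placeholder for a (nonlinear) readout from features to a saliency map."""
    return features.mean(axis=-1)

def averaged_predictions(window: list) -> np.ndarray:
    # Baseline (a): run the static model on every frame of the window,
    # then average the resulting predictions for the last frame.
    return np.mean([predict_saliency(f) for f in window], axis=0)

def averaged_features_readout(window: list) -> np.ndarray:
    # Baseline (b), in the spirit of "DeepGaze MR": average the image features
    # over the window, then apply the readout to the averaged features.
    mean_features = np.mean([extract_features(f) for f in window], axis=0)
    return readout_network(mean_features)

if __name__ == "__main__":
    window = [np.random.rand(64, 64, 3) for _ in range(16)]  # fixed-length frame window
    print(averaged_predictions(window).shape)       # (64, 64)
    print(averaged_features_readout(window).shape)  # (64, 64)
```

Note that neither variant can represent temporal order within the window; averaging discards it, which is what keeps both baselines purely static.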
Our new model substantially outperforms previous video saliency models and explains 75% of the explainable information on the LEDOV dataset and 43% on the DIEM dataset. By analyzing failure cases of our model, we find that clear temporal effects on human gaze placement do exist, but are rare in the benchmarks considered. Moreover, none of the recent video saliency models we consider predicts human gaze in those cases better than our static baselines. To foster the data-driven modelling of temporal features affecting human gaze, we propose a meta-benchmark consisting of the hard cases identified by our analysis.