Open Access
Vision Sciences Society Annual Meeting Abstract  |   December 2022
DeepGaze vs SceneWalk: what can DNNs and biological scan path models teach each other?
Author Affiliations & Notes
  • Lisa Schwetlick
    University of Potsdam
  • Matthias Kümmerer
    University of Tübingen
  • Ralf Engbert
    University of Potsdam
  • Matthias Bethge
    University of Tübingen
  • Footnotes
    Acknowledgements: German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A & Deutsche Forschungsgemeinschaft (DFG): Collaborative Research Center (SFB) 1294, projects B03 and B05 (project no. 318763901)
Journal of Vision December 2022, Vol.22, 3986.

      Lisa Schwetlick, Matthias Kümmerer, Ralf Engbert, Matthias Bethge; DeepGaze vs SceneWalk: what can DNNs and biological scan path models teach each other? Journal of Vision 2022;22(14):3986.

      © ARVO (1962-2015); The Authors (2016-present)


Eye movements on natural scenes are driven by image content as well as by saccade dynamics and sequential dependencies. Recent research has produced a variety of models that aim to predict time-ordered fixation sequences, including statistical, mechanistic, and deep neural network (DNN) models, each with its own advantages and shortcomings. Here we show how a synthesis of different modeling frameworks may offer fresh insights into the underlying processes. First, the explanatory power of biologically inspired models can help develop an understanding of the mechanisms learned by DNNs. Second, DNN performance can be used to estimate data predictability and thereby help uncover new mechanisms. DeepGaze3 (DG3) is currently the best-performing DNN model for scan path prediction (Kümmerer & Bethge, 2020); SceneWalk (SW) is the best-performing biologically inspired dynamical model (Schwetlick et al., 2021). Both models can be fitted using maximum likelihood estimation and compute per-fixation likelihood predictions, so we can analyze prediction divergence at the level of individual fixations. DG3 generally outperforms SW, indicating that the DNN accounts for variance by learning mechanisms not yet included in the mechanistic SW model. Preliminary results show that SW tends to underestimate the probability of long, explorative saccades. In SW this behavior could be captured by replacing the Gaussian attention span with a function with heavier tails or by implementing temporal fluctuation of the attention span. Furthermore, DG3 appears to compress previously unexplored areas, increasing the likelihood of saccades toward the region center; once a region is fixated, DG3 broadens the local probability, consistent with a dualistic exploration-exploitation strategy. Adding corresponding mechanisms to SW may improve model performance and help develop more advanced dynamical models.
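The per-fixation comparison described above can be sketched as follows. This is an illustrative example only: the probability arrays are random stand-ins for the per-fixation likelihoods that fitted DG3 and SW models would produce, not the actual model APIs.

```python
import numpy as np

# Stand-in per-fixation probabilities from two fitted scan path models
# (hypothetical values; not the real DeepGaze3 or SceneWalk outputs).
rng = np.random.default_rng(0)
p_dg3 = rng.uniform(0.01, 0.5, size=20)  # P(fixation_i | history) under DG3
p_sw = rng.uniform(0.01, 0.5, size=20)   # same fixations under SceneWalk

# Per-fixation log-likelihood difference: positive values mean DG3
# assigns higher probability to that fixation than SceneWalk does.
delta = np.log(p_dg3) - np.log(p_sw)

# Fixations where the models diverge most strongly are candidates for
# mechanisms that DG3 has learned but SceneWalk does not yet include.
worst_for_sw = np.argsort(delta)[::-1][:5]
```

Because both models are fitted by maximum likelihood and emit a probability for every individual fixation, this divergence can be inspected fixation by fixation rather than only as an aggregate score.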
Finding the synergies between different modeling approaches, specifically high-performing DNNs and more transparent dynamical models, is a valuable tool for improving our understanding of fixation selection during scene viewing.
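The heavy-tails idea mentioned above can be made concrete with a toy comparison of a Gaussian attention span against a Cauchy-shaped alternative. The kernels and parameter values here are illustrative assumptions, not the actual SceneWalk parameterization.

```python
import numpy as np

def gaussian_kernel(d, sigma=1.0):
    # Gaussian attention span: probability decays like exp(-d^2),
    # so long saccade amplitudes d become vanishingly unlikely.
    return np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def cauchy_kernel(d, gamma=1.0):
    # Heavy-tailed alternative: probability decays only like 1/d^2,
    # leaving far more mass on long, explorative saccades.
    return gamma / (np.pi * (d**2 + gamma**2))

d_long = 5.0  # a long saccade amplitude, in units of the kernel scale
ratio = cauchy_kernel(d_long) / gaussian_kernel(d_long)
# At this amplitude the heavy-tailed kernel assigns several orders of
# magnitude more probability than the Gaussian does.
```

This is the sense in which swapping the Gaussian attention span for a heavier-tailed function could let SW stop underestimating long explorative saccades.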

