In a stationary condition, where the target’s position and identity do not change over time, each saccade thus provides a new viewpoint on the scene, allowing one to form a new estimate of the target identity. Following the active inference setup (
Najemnik & Geisler, 2005;
Friston et al., 2012), we assume that, instead of trying to detect the actual position of the target, the agent tries to maximize the counterfactual benefit in scene understanding that would be gained by any potential saccade. The focus is thus on the action selection metric rather than on the spatial representation. In short, this means estimating how accurate a categorical target classifier will be after moving the eye. In a full setup, predictive action selection means first predicting the future visual field, denoted
\(x^{\prime }\), which is obtained at the center of fixation, and then predicting how good the estimate of the target identity, denoted
\(y\), that is,
\(p(y|x^{\prime })\), will be at this location. In practice, predicting a future visual field over all possible saccades is too computationally expensive. A cheaper alternative is to record, for every context
\(x\), the improvement obtained in recognizing the target after different saccades
\(a, a^{\prime }, a^{\prime \prime }, \ldots\). If
\(a\) is a possible saccade and
\(x^{\prime }\) the corresponding future visual field, the result of the central categorical classifier over
\(x^{\prime }\) can either be correct (1) or incorrect (0). If this experiment is repeated many times over many visual scenes, the probability of correctly classifying the future visual field
\(x^{\prime }\) from
\(a\) is a number between 0 and 1 that reflects the frequency of correct classifications. The putative effect of every saccade can thus be condensed into a single number, the
accuracy, that quantifies the final benefit of issuing saccade
\(a\) from the current observation
\(x\). Extended to the full action space
\(A\), this forms an accuracy map that should guide the selection of saccades. This accuracy map can be trained by trial and error, with the final classification success or failure used as a teaching signal. Our main assumption here is that such a
predictive accuracy map is at the core of a realistic saccade-based vision system.
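
To make the trial-and-error estimate concrete, the following is a minimal sketch and not the model used in this work. It assumes, purely for illustration, that contexts \(x\) and saccades \(a\) can be indexed by integers, so that the accuracy map reduces to a table of success frequencies. The class name `TabularAccuracyMap`, its methods, and the choice of scoring untried saccades as 0 are all hypothetical; a realistic system would replace the table with a function learned from the visual field.

```python
import numpy as np


class TabularAccuracyMap:
    """Empirical accuracy map over (context, saccade) pairs.

    Hypothetical helper: contexts x and saccades a are integer indices,
    which is an illustrative simplification of a continuous visual field.
    """

    def __init__(self, n_contexts: int, n_actions: int) -> None:
        # Number of correct classifications and number of trials per (x, a).
        self.successes = np.zeros((n_contexts, n_actions))
        self.trials = np.zeros((n_contexts, n_actions))

    def update(self, x: int, a: int, correct: bool) -> None:
        """Record one trial: saccade a issued from context x, and whether
        the central classifier was correct (1) or incorrect (0)."""
        self.trials[x, a] += 1
        self.successes[x, a] += float(correct)

    def accuracy(self, x: int) -> np.ndarray:
        """Frequency of correct classification for every saccade from x.

        Saccades never tried from x are scored 0 (an arbitrary choice).
        """
        s, t = self.successes[x], self.trials[x]
        return np.divide(s, t, out=np.zeros_like(s), where=t > 0)

    def select_saccade(self, x: int) -> int:
        """Pick the saccade with the highest estimated accuracy."""
        return int(np.argmax(self.accuracy(x)))


# Usage: six recorded trials from context 0, over three possible saccades.
amap = TabularAccuracyMap(n_contexts=2, n_actions=3)
for a, correct in [(0, 1), (0, 0), (1, 1), (1, 1), (1, 1), (1, 0)]:
    amap.update(x=0, a=a, correct=bool(correct))
print(amap.accuracy(0))
print(amap.select_saccade(0))
```

The binary classification outcome is the only teaching signal here, which mirrors the description above: no future visual field \(x^{\prime }\) is ever predicted, only the frequency with which classification succeeds after each saccade.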