In the mid-20th century, careful recordings confirmed that the human eye is never still: it continues to move even when we try to maintain steady gaze on a point. These movements are far from negligible, shifting stimuli quite rapidly across many receptors, especially in the fovea, where cones are most densely packed (Figure 1). Importantly, these movements do not reflect unavoidable noise (Barlow, 1952), and it was soon observed that they represent a form of slow control (Steinman, Haddad, Skavenski, & Wyman, 1973). Additionally, it was discovered that vision tends to fade when stimuli are immobilized on the retina (Steinman & Levinson, 1990) and that retinal neurons respond most strongly to temporal changes in luminance (Lee, 1996). These observations lie at the foundation of the so-called dynamic theories of vision, which argue that the perception of spatial relationships relies on luminance changes induced by both eye movements and environmental changes, all encoded by a moving retina (Ahissar & Arieli, 2001; Rucci, Ahissar, & Burr, 2018; Rucci & Victor, 2015). In this view, information about space is carried by a spatiotemporal code: the temporal structure of receptor activation is as important as the receptors' spatial locations.
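To make the notion of a spatiotemporal code concrete, the following minimal Python sketch (our illustration, not taken from the cited work; all parameter values are assumed for the example) simulates a single receptor viewing a static grating while the eye drifts. The spatial frequency of the pattern reappears as a temporal frequency in the receptor's signal, so that temporal structure carries spatial information.

```python
# Minimal sketch (illustrative assumptions throughout): ocular drift turns
# a static spatial pattern into a temporal signal at each receptor.
import numpy as np

# Static 1D luminance pattern: a sinusoidal grating (5 cycles/deg, assumed).
spatial_freq = 5.0
luminance = lambda x: 0.5 + 0.5 * np.sin(2 * np.pi * spatial_freq * x)

# Ocular drift: the eye slides slowly across the scene during fixation.
drift_speed = 0.5                 # deg/s, a plausible order of magnitude
t = np.arange(0.0, 0.5, 0.001)    # 500 ms of fixation, 1 ms steps
eye_position = drift_speed * t    # simple linear drift (assumed)

# Signal at a single receptor at retinal location x0 = 0: with the eye
# moving, the receptor samples luminance(x0 + eye_position) over time.
receptor_signal = luminance(eye_position)

# A grating of spatial frequency k swept at speed v modulates the
# receptor in time at f_t = k * v cycles per second.
modulation = receptor_signal.max() - receptor_signal.min()
print(f"temporal frequency at the receptor: {spatial_freq * drift_speed:.1f} Hz")
print(f"modulation depth: {modulation:.2f}")
```

With these numbers, a purely spatial pattern produces a 2.5 Hz temporal modulation at the receptor; with the eye perfectly still, the same receptor would see a constant luminance and, per the fading results above, the pattern would disappear.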
Although the first proposals that vision relies on changes rather than stationary images date back over a century, to the dynamic theories of visual acuity (Arend, 1973; Averill & Weymouth, 1925; Marshall & Talbot, 1942), these theories lost traction in later decades, when they seemed to be contradicted by experiments that attempted to eliminate retinal image motion (Riggs, Ratliff, Cornsweet, & Cornsweet, 1953; Tulunay-Keesey & Jones, 1976). In parallel, the rise of a reductionist approach aimed at elucidating spatial processing shifted vision research toward static conditions, dominated by studies in anaesthetized animals and by unnatural sustained-fixation paradigms in humans. During this shift, the earlier insights were forgotten, and vision came to be conceived as based solely on spatial coding: the so-called camera model.
This model, which relies on a hierarchy of spatial processing operations, has become the standard textbook account despite the conceptual difficulties and implausible assumptions it entails. Gur's Perspective provides an excellent example, with its assumption that the eye's landing after a saccade creates a flash-like imprint of the image on the retina, allowing image details to be decoded from spatial information alone, as in a camera. But saccades are not instantaneous; the stimulus moves over the retina both before and after saccade landing, all the more so in the presence of normal head and body movements; there is no "shutter" in the visual system to freeze the image; and retinal integration times are on the order of tens of milliseconds. In other words, during natural viewing there is no moment at which the visual system experiences a "frozen input," as it does in simplified computer simulations.
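As a rough illustration of the last point, the sketch below (again ours, with assumed numbers: a grating near the foveal acuity limit, post-saccadic drift of 1 deg/s, and a 30 ms integration window) averages a moving image over the retinal integration time. Even this brief integration substantially smears fine spatial detail, which is hard to reconcile with a shutter-like frozen imprint.

```python
# Minimal sketch (assumed numbers): integrating a moving image over tens
# of milliseconds smears it, unlike an instantaneous camera exposure.
import numpy as np

spatial_freq = 30.0               # cycles/deg, near the foveal acuity limit
luminance = lambda x: 0.5 + 0.5 * np.sin(2 * np.pi * spatial_freq * x)

integration_time = 0.030          # ~30 ms retinal integration (order of magnitude)
drift_speed = 1.0                 # deg/s of residual motion (assumed)
times = np.linspace(0.0, integration_time, 300)

x = np.linspace(0.0, 1.0, 1000)   # 1 deg of retina

# Time-average the shifting image over the integration window.
integrated = np.mean([luminance(x - drift_speed * ti) for ti in times], axis=0)

snapshot_contrast = luminance(x).max() - luminance(x).min()
integrated_contrast = integrated.max() - integrated.min()
print(f"snapshot contrast:   {snapshot_contrast:.2f}")   # ~1.00
print(f"integrated contrast: {integrated_contrast:.2f}") # ~0.11, motion smear
```

The motion of only 0.03 deg during the window spans nearly a full cycle of the grating, so the time-averaged "imprint" retains roughly a tenth of the original contrast; whatever the visual system decodes, it is not a pristine snapshot.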
A critical observation, ignored both in the 1970s and in Gur's Perspective, is that the pioneering experiments that supposedly contradicted dynamic theories suffered from serious technological and methodological limitations (Kelly, 1979). In the last 20 years or so, accurate measurements of eye movements, carefully tailored experiments in humans, neurophysiological results in macaques and humans, and computational analyses, together with comparisons with other dynamic sensory modalities, have revived and expanded the dynamic theory of vision. This theory takes different forms, but its core concept remains the same: visual encoding is a spatiotemporal process, and eye movements play a crucial role in this process.