Abstract
Stable visual perception during eye and body movements suggests the existence of neural algorithms that convert location information ("where"-type signals) across multiple frames of reference, for instance from retinocentric to craniocentric coordinates. Accordingly, numerous theoretical studies have proposed biologically plausible computational processes to achieve such transformations. However, how coordinate transformations are then used by the hierarchy of cortical visual areas to produce stable perception remains largely unknown. Here, we explore the hypothesis that perceptual stability equates to classification of visual features that is robust to movements, that is, a "what" type of information processing. We demonstrate in convolutional neural networks (CNNs) that neural signals related to eye and body movements support accurate image classification by making the "where"-type computations underlying coordinate transformations faster to learn and more robust to input perturbations. Accordingly, movement signals contributed to the emergence of activity manifolds associated with image categories in late CNN layers, and to movement-related response modulations in network units similar to those observed experimentally during saccadic eye movements. Therefore, by equating perception with classification, we provide a simple unifying computational framework that explains how movement signals support stable perception during dynamic interactions with the environment.
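To make the core idea concrete, the following is a minimal, hypothetical sketch (not the model used in this study): a small PyTorch CNN whose retinocentric image features are concatenated with an eye-position ("movement") signal before classification, so that downstream layers can combine retinal location with eye position when making the "what" decision. The class name, layer sizes, and input dimensions are illustrative assumptions.

```python
# Minimal illustrative sketch (an assumption, not the paper's architecture):
# a CNN classifier that receives an eye-position signal alongside the
# retinal image, allowing later layers to combine retinocentric features
# with eye position (e.g., craniocentric location = retinal location + eye position).
import torch
import torch.nn as nn

class MovementConditionedCNN(nn.Module):
    def __init__(self, n_classes: int = 10, eye_dim: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # The classifier sees image features concatenated with the eye-position vector.
        self.classifier = nn.Sequential(
            nn.Linear(32 * 7 * 7 + eye_dim, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, image: torch.Tensor, eye_pos: torch.Tensor) -> torch.Tensor:
        x = self.features(image).flatten(1)   # retinocentric "where/what" features
        x = torch.cat([x, eye_pos], dim=1)    # inject the movement signal
        return self.classifier(x)

# Example usage: a batch of 28x28 "retinal" images plus 2-D eye positions.
model = MovementConditionedCNN()
logits = model(torch.randn(8, 1, 28, 28), torch.randn(8, 2))
print(logits.shape)  # torch.Size([8, 10])
```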