Abstract
Neural data and models suggest that the brain achieves invariant object recognition by learning and combining several views of a three-dimensional object. How are such invariant codes learned when active eye movements scan a scene, given that cortical magnification introduces a large source of variability in the visual representation even for a single view of the object? How does the brain avoid the problem of erroneously classifying parts of different objects together when an eye movement changes the cortical representation from one object to the other? How does the brain differentiate between saccades within the same object and saccades between different objects? The biologically inspired ARTSCAN model of visual object learning and recognition with active eye movements proposes answers to these questions. The model explains how surface attention interacts with eye-movement generation modules and object-recognition modules so that views of the same object are selectively clustered together. This interaction does not require prior knowledge of object identity. The modules of the model correspond to brain regions in the What and Where cortical streams of the visual system. The What stream learns a spatially invariant and size-invariant representation of an object using bottom-up filtering and top-down attentional mechanisms. The Where stream computes indices of object location and guides attentive eye movements. Preprocessing in the primary visual areas includes log-polar compression of the periphery, contrast enhancement, and parallel processing of boundary and surface properties. ARTSCAN was tested on a scene filled with letters of different sizes and orientations; after real-time incremental learning controlled by attention shifts and active eye movements, it classified the letters with better than 95% accuracy.
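To make the variability problem concrete, the following minimal Python sketch (illustrative only, not the paper's implementation; the function log_polar and its parameters are hypothetical) maps a retinal point to log-polar cortical coordinates and shows how the same scene location lands at very different cortical coordinates after a saccade shifts fixation.

    # Minimal illustrative sketch of a log-polar retina-to-cortex mapping,
    # approximating the cortical magnification discussed above.
    import numpy as np

    def log_polar(x, y, fx, fy, eps=1e-6):
        """Map retinal point (x, y), given fixation (fx, fy),
        to log-polar coordinates (rho, theta)."""
        dx, dy = x - fx, y - fy
        rho = np.log(np.hypot(dx, dy) + eps)  # log-compressed eccentricity
        theta = np.arctan2(dy, dx)            # polar angle
        return rho, theta

    # The same scene point, before and after a saccade, yields very
    # different cortical coordinates:
    print(log_polar(30.0, 40.0, 0.0, 0.0))    # fixation at scene origin
    print(log_polar(30.0, 40.0, 25.0, 35.0))  # fixation near the point

The two printouts differ sharply in both rho and theta, which is the representational variability that the model's view-invariant clustering must overcome.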
Supported in part by the National Science Foundation (NSF SBE-0354378) and the Office of Naval Research (ONR N00014-01-1-0624).