Abstract
A remarkable property of biological visual systems is their ability to infer and represent invariances in the visual environment. This information is important for determining ‘what’ we are seeing, i.e., recognizing objects and interpreting scenes. However, such a representation addresses only half of the story: the variant part, such as the motion of an object, captures the ‘where’ or ‘how’ information, which is equally important for interpreting and interacting with the environment. Therefore, a complete visual representation should capture both the invariant and variant parts of images. Here we present a model that learns to separate the variant from the invariant part of time-varying natural images. First, we reformulate the sparse coding model [Olshausen and Field, 1996] so that images are explained in terms of a multiplicative interaction between two sets of causal variables. One set of variables is constrained to change slowly over time (the invariant representation), and another set of variables is allowed to change quickly over time and is encoded as a phase angle (the variant representation). After training on natural image sequences, the learned basis functions are similar to those produced by the original sparse coding model: a set of Gabor-like functions that are spatially localized, oriented, and bandpass. In this case, though, the multiplicative decomposition produces both invariant components with slowly changing responses, representing aspects of visual shape, and variant components in the form of phase angles precessing over time, representing their transformations. The model predicts the existence of two classes of cells in primary visual cortex that form the beginnings of a ‘what’ and ‘where’ representation of images. Moreover, the decomposition provided by this model paves the way toward the construction of hierarchical models for capturing more global aspects of the ‘what’ and ‘where’ structure in natural images.
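The multiplicative decomposition described above can be illustrated with a minimal numerical sketch. This is not the authors' implementation; all dimensions, variable names, and the random stand-ins for the learned Gabor-like basis functions are hypothetical. The sketch shows the generative structure only: each coefficient factors into a slowly varying amplitude (the invariant part) times a unit-magnitude phase factor (the variant part), and a slowness penalty on the amplitudes is the kind of term a learning objective would include.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64-pixel image patches, 32 basis functions,
# 10 movie frames. In the actual model these bases would be learned.
n_pix, n_basis, n_frames = 64, 32, 10

# Complex basis functions: the real and imaginary parts play the role of
# even/odd (quadrature-pair) Gabor-like filters. Random stand-ins here.
phi = rng.standard_normal((n_pix, n_basis)) + 1j * rng.standard_normal((n_pix, n_basis))

# Multiplicative decomposition of each coefficient:
#   z_i(t) = a_i(t) * exp(i * theta_i(t))
# a_i(t): slowly varying amplitude (invariant, 'what')
# theta_i(t): phase precessing over time (variant, 'where'/'how')
a = np.abs(rng.standard_normal((n_basis, n_frames)))            # amplitudes
theta = np.cumsum(rng.standard_normal((n_basis, n_frames)), 1)  # precessing phases
z = a * np.exp(1j * theta)                                      # complex coefficients

# Each frame is reconstructed as the real part of the complex expansion.
frames = np.real(phi @ z)  # shape: (n_pix, n_frames)

# A slowness penalty on the amplitudes (used during learning) penalizes
# frame-to-frame changes in a while leaving the phases theta unconstrained.
slowness = np.sum(np.diff(a, axis=1) ** 2)
```

Rigid transformations of a pattern (e.g., translation) mostly shift the phases `theta` while leaving the amplitudes `a` nearly constant, which is why penalizing amplitude change while leaving phase free separates shape from its transformations.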
This work was supported by NGA grant MCA 015894-UCB and NSF grant IIS-06-25223.