Abstract
Most approaches to image recognition focus on inferring a categorical label or action code from a static image, ignoring dynamic aspects of appearance that may be critical to perception. Even methods that examine behavior over time, such as in a video sequence, tend to label each frame independently, ignoring frame-to-frame dynamics. This viewpoint suggests that what matters is time-independent categorical information, not the patterns of actions that relate stimulus configurations across time. The current work focuses on face perception and demonstrates that important information can be extracted from pairs of images by examining how the face transforms in appearance from one image to the other. Using a biologically plausible neural network model, a conditional Restricted Boltzmann Machine that performs unsupervised Hebbian learning, we show that the network can infer various facial actions from a sequence of images (e.g., transforming a frown into a smile, or moving the face from one location in the image frame to another). Critically, after inferring the actions relating two face images of one individual, the network can apply the same transformation to a test face from an unknown individual, without any knowledge of facial identity, expressions, or muscle movements. By visualizing the factors that decompose facial actions into a distributed representation, we demonstrate a factorial action code, learned without supervision, that separates identity characteristics from rigid (affine) and non-rigid expression transformations. Models of this sort suggest that neural representations of action can factor out information about a face or object that remains constant, such as its identity, from its dynamic behavior; both are important aspects of perceptual inference.
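To make the role of the factors concrete, the following is a minimal sketch of a standard factored three-way energy function for a conditional (gated) RBM; the particular parameterization, including the symbols $\mathbf{x}$ (conditioning image), $\mathbf{y}$ (output image), $\mathbf{h}$ (hidden action units), and the per-factor weights $w^{x}_{if}$, $w^{y}_{jf}$, $w^{h}_{kf}$, is assumed here for illustration rather than taken from the abstract:
\[
E(\mathbf{y}, \mathbf{h} \mid \mathbf{x})
= -\sum_{f}\Bigl(\sum_{i} w^{x}_{if}\, x_i\Bigr)\Bigl(\sum_{j} w^{y}_{jf}\, y_j\Bigr)\Bigl(\sum_{k} w^{h}_{kf}\, h_k\Bigr)
- \sum_{j} b^{y}_{j}\, y_j - \sum_{k} b^{h}_{k}\, h_k ,
\]
so that the hidden action units are inferred from an image pair via
\[
p(h_k = 1 \mid \mathbf{x}, \mathbf{y})
= \sigma\Bigl( \sum_{f} w^{h}_{kf}\Bigl(\sum_{i} w^{x}_{if}\, x_i\Bigr)\Bigl(\sum_{j} w^{y}_{jf}\, y_j\Bigr) + b^{h}_{k} \Bigr),
\]
where $\sigma$ is the logistic function. Under this formulation, each factor $f$ multiplies a projection of the conditioning image with a projection of the output image, so the hidden units encode the transformation relating the pair rather than the content of either image; this is what would allow an action inferred from one individual's face pair to be reapplied to a face of a different identity.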