Research Article  |   December 2008
A recurrent dynamic model for correspondence-based face recognition
Philipp Wolfrum, Christian Wolff, Jörg Lücke, Christoph von der Malsburg
Journal of Vision December 2008, Vol. 8, 34. https://doi.org/10.1167/8.7.34
Citation: Philipp Wolfrum, Christian Wolff, Jörg Lücke, Christoph von der Malsburg; A recurrent dynamic model for correspondence-based face recognition. Journal of Vision 2008;8(7):34. https://doi.org/10.1167/8.7.34.
Abstract

Our aim here is to create a fully neural, functionally competitive, and correspondence-based model for invariant face recognition. By recurrently integrating information about feature similarities, spatial feature relations, and facial structure stored in memory, the system evaluates face identity (“what”-information) and face position (“where”-information) using explicit representations for both. The network consists of three functional layers of processing, (1) an input layer for image representation, (2) a middle layer for recurrent information integration, and (3) a gallery layer for memory storage. Each layer consists of cortical columns as functional building blocks that are modeled in accordance with recent experimental findings. In numerical simulations we apply the system to standard benchmark databases for face recognition. We find that recognition rates of our biologically inspired approach lie in the same range as recognition rates of recent and purely functionally motivated systems.

Introduction
Over the past decades the task of visual object recognition has emerged as an intriguing and difficult scientific problem. Its many facets have been studied within and across disciplines such as physics, mathematics, biology, psychology, and computer science. Biological vision has in this context long been a source of inspiration for solutions to technical vision tasks because of its far-reaching capabilities, which are so far unmatched by artificial systems. For instance, computational vision systems possess multiple layers of processing, which have been inspired by and are regarded as analogous to the multiple stages of processing in the visual system of animals and humans. In mammals the primary visual cortex (V1), the secondary visual cortex (V2), and the inferotemporal cortex (IT) constitute such stages. These different layers process and exchange information to infer knowledge from a given input. An important open question we address in this paper is how the information transfer between the stages of processing is organized. Models of visual object recognition can differ significantly in this respect. They range from systems with static connectivity for information transmission between layers (e.g., LeCun, Huang, & Bottou, 2004; Mel, 1997; Riesenhuber & Poggio, 1999; Rosenblatt, 1961), through systems that moderately manipulate transmission (e.g., Grimes & Rao, 2005; Walther, Itti, Riesenhuber, Poggio, & Koch, 2002), to systems that employ explicit mechanisms to dynamically modulate information transmission between any two stages (e.g., Arathorn, 2002; Hinton, 1981; Kree & Zippelius, 1988; Lücke, Keck, & von der Malsburg, 2008; Olshausen, Anderson, & Van Essen, 1993; Weber & Wermter, 2007; Wiskott & von der Malsburg, 1996). Likewise, models range from systems that process information in a pure feed-forward fashion (e.g., LeCun et al., 2004; Mel, 1997; Riesenhuber & Poggio, 1999; Rosenblatt, 1961), through systems with little or intermediate recurrence (e.g., Walther et al., 2002), to models with fully recurrent processing (e.g., Lücke et al., 2008; Wiskott & von der Malsburg, 1996). Systems with static connectivity for information transmission between processing stages are usually feed-forward, whereas systems with dynamic manipulation of information transmission are usually recurrent.
The great diversity of models of visual recognition is, at least partly, due to different strategies in addressing the problem of invariance—a problem that is central to object recognition in general. In one class of systems, here referred to as feature-based systems, invariance is achieved by letting signals from feature detectors at different positions (and scale and orientation) in an earlier layer converge onto one or a few units in a later layer (e.g., LeCun et al., 2004; Mel, 1997; Riesenhuber & Poggio, 1999; Rosenblatt, 1961; Walther et al., 2002). This strategy, often referred to as pooling, preserves feature information but not position information. Feature pooling is usually implemented in static architecture with an appropriate pooling operation such as selecting the maximal incoming signal (Riesenhuber & Poggio, 1999). Another class of systems, here referred to as correspondence based (cf. Zhu & von der Malsburg, 2004), achieves invariance by actively routing signals through rapidly changing connections, transmitting information not only about the identity of features but also about their spatial arrangement, and mapping the structure of an object into an object-centered frame of reference (e.g., Anderson, Essen, & Olshausen, 2005; Arathorn, 2002; Hinton, 1981; Kree & Zippelius, 1988; Olshausen et al., 1993; Wiskott & von der Malsburg, 1996). 
Feature-based systems and correspondence-based systems have been successful in different application domains (see Biederman & Kalocsai, 1997, for a discussion). Feature-based systems (e.g., LeCun et al., 2004; Riesenhuber & Poggio, 1999) are successful in classification tasks (but see Pinto, Cox, & DiCarlo, 2008, for a critical discussion). In these tasks the relative insensitivity of feature-based systems to small metric variation of object parts is advantageous. Also, the static connectivity in feature-based systems allows them to be tuned to specific image databases. A problem for these systems is, however, a strong sensitivity to background effects (as discussed, e.g., in Zhu & von der Malsburg, 2004, or Lücke et al., 2008), which often requires an additional segmentation mechanism.
Correspondence-based systems, on the other hand, prevail in recognition tasks in which small differences in features and their arrangements are important. A typical such task is face recognition. Face recognition has attracted much attention because faces are a natural focus of human attention and because automatic face recognition is of high commercial value. Furthermore, much is known about it in terms of psychophysics and neurophysiology. Compared to other tasks, an important point for the purposes of this paper is, in addition, the existence of stiff competitive tests on widely available image galleries (e.g., Messer et al., 2004; Phillips et al., 2005; Phillips, Moon, Rizvi, & Rauss, 2000). In the most comprehensive commercial test (Phillips, Grother, Micheals, Blackburn, & Tabassi, 2003, see also www.frvt.org), the best performing systems all used active mechanisms to establish correspondences between input and a memory.
Other than having different degrees of success in different application domains, some types of data about the mammalian visual system seem to be explained better by feature-based systems, whereas other data are more convincingly explained by correspondence-based models. A main argument for feature-based feed-forward recognition has, for instance, been the processing speed of the human visual cortex. Thorpe and coworkers have shown (Thorpe, 1988; Thorpe, Fize, & Marlot, 1996) that humans can decide whether or not an image contains an animal in less than 150 ms. In the area of face recognition, Debruille, Guillem, and Renault (1998) found that event-related potentials (ERPs) in response to novel vs. known faces start to differ as early as 76 to 130 ms. Since such times are not much longer than the time required for a first wave of spikes to travel through the ventral stream after presentation of an image, it has been argued that visual recognition must be feed-forward. On closer inspection, however, such an interpretation seems to capture only part of the story. For instance, population codes can increase the speed of information transmission. The average spike rate of large excitatorily coupled neuron populations can be read out on a timescale that is much faster than the average spike latency of their single constituting neurons (van Vreeswijk & Sompolinsky, 1998). Thus, networks with such “high gain” connectivity can respond very sensitively to subtle and fast input changes (similar to the principle of criticality; Bak, 1996). Our model makes use of such a population code (see A dynamic model of cortical columns section). Furthermore, independently of population coding, correspondence-based systems can react very fast if their dynamic connections have already been primed for a specific stimulus. This might be the case in simple classification tasks like in the experiments of Thorpe (1988).
A further argument in favor of correspondence-based mechanisms comes from psychophysical experiments that found priming or congruency effects to play a strong role (for a review, see Graf, 2006). Recognition performance and reaction times depend on the primed orientation (Jolicoeur, 1985; Lawson & Jolicoeur, 1999) or size of objects (Bundesen & Larsen, 1975). These findings show that it does take effort and time to align the external world with internal representations, suggesting active dynamic processes for correspondence finding rather than passive pooling operations. There is also abundant physiological evidence that connectivities in the visual system are not static. Shifting receptive fields have been found in lateral intraparietal cortex (Duhamel, Colby, & Goldberg, 1992; Kusunoki & Goldberg, 2003), in MT (Womelsdorf, Anton-Erxleben, Pieper, & Treue, 2006), and even in V2 and V4 (Luck, Chelazzi, Hillyard, & Desimone, 1997), suggesting that effective receptive fields in the visual system change from one instance to the next to route and match the current stimulus of interest to representations in memory. Combining the evidence for feed-forward processing on the one hand and correspondence-based processes on the other, it appears likely that the brain employs both strategies (as argued, for example, by Yuille & Kersten, 2006). This could be in the form of a first feed-forward sweep, which is fast and unconscious, and a second step of in-depth recurrent processing, which leads to conscious perception (compare Lamme, 2003). This view is consistent with findings by Johnson and Olshausen (2003), who report two ERP signals related to object recognition, an early presentation-locked one and a later signal that correlates in timing with the response times for recognition.
In summary, we have discussed three sources of evidence that argue for the existence of correspondence-based mechanisms:
  1.  
    the necessity to detect delicate differences between features and their arrangements in functional applications,
  2.  
    psychophysical findings suggesting active coordinate transformations taking place during visual information processing, and
  3.  
    evidence from physiological data about the mammalian visual system showing quickly changing neuronal receptive fields.
To help integrate these research directions, we propose in this paper a correspondence-based model that is both neurally inspired and functionally competitive. Our model is similar to neural network approaches to correspondence-based recognition such as shifter circuits (Anderson et al., 2005; Olshausen et al., 1993) and dynamic link matching (Lades et al., 1993; Wiskott & von der Malsburg, 1996; Würtz, 1995). The neural dynamics used in our system is similar to the one recently suggested in Lücke et al. (2008)—a system that advances earlier approaches in that it computes feature similarities neurally and is fast in terms of recognition times. The system in Lücke et al. (2008) is only applicable to the correspondence problem, however, whereas the system discussed in this paper is capable of full recognition.
Our paper is organized as follows. In the Correspondence-based recognition section we briefly review the correspondence problem, discuss different neurally inspired approaches to correspondence-based recognition, and present the basic architecture of our model. In the A dynamic model of cortical columns section we define and discuss the basic computational elements of our system, and in The network model section their connectivity and dynamic interaction. The model is simulated and applied to face-recognition databases in the Simulations section. The Discussion section places the system and the simulation results in the context of experimental findings and other systems in the literature.
Correspondence-based recognition
Central to correspondence-based systems is the so-called correspondence problem, which is illustrated in Figure 1. Figure 1A contains two stick figures as images, one on the input side and one on the model side. Both of these images are represented by a layer of feature detectors or nodes (black circles). The correspondence problem is now simply the problem of finding the connections between input and model nodes that link corresponding object parts, e.g., connections between head and head, neck and neck, etc. In Figure 1A the black connections or links show the correct correspondences as a subset of all possible ones. 
Figure 1
 
The visual correspondence problem is the task of linking corresponding points between two images. (A) Input and model images are represented by arrays of feature nodes (black circles). All potential correspondences are symbolized by lines between the feature nodes. The correct correspondences are indicated as black lines. (B) Examples of wrong correspondences, although connecting equal feature types.
As a prerequisite for correspondence finding, feature similarities must be computed. Unfortunately, in realistic applications high feature similarity is not sufficient to find correct correspondences. Different images of the same object may vary greatly, leading to high similarity between non-corresponding points (see, e.g., Wiskott, 1999). Figure 1B shows this in cartoon form: black lines connect the features with highest similarity, which in this case results in wrong correspondences. For realistic inputs such situations are very frequent, and the ambiguities increase the more feature detectors are used. For a human observer, in distinction, it is easy to find the correct correspondences, also in Figure 1B. The reason for this is that an object is defined by its features and their spatial arrangement. Correspondence-based systems therefore have to take both of these cues into account. Shifter circuits (Anderson et al., 2005; Olshausen et al., 1993) are relatively rigid in activating topologically consistent sets of links, whereas dynamic link matching (Lades et al., 1993; Wiskott & von der Malsburg, 1996; Würtz, 1995) has a more flexible dynamic control that lets neighboring links communicate directly with each other. In this paper, we use a flexible dynamics, as in earlier DLM systems, together with explicit units that control the connectivity between layers, similar to the control units in shifter circuits. A two-layer model of this type was suggested in Lücke et al. (2008) to study fast and neurally plausible solutions to the correspondence problem. Here, we study a three-layer system and its application to the more complex task of recognition.
The principal architecture of the system discussed in this work is shown in Figure 2. It consists of three layers: an Input Layer for image representation, an Assembly Layer, and a Gallery Layer as memory. The Assembly Layer establishes correspondences between input and memory. It recurrently integrates information about feature similarity, feature arrangement, and face identity. Given an input, the integration of these information components causes the system to converge to a state that represents a percept. Figure 2 sketches the system after such a convergence, when it has correctly established correspondences between a person's face stored in memory (i.e., in the Gallery Layer) and a given input image of this person. In the following, the system's architecture and neurodynamic mechanisms will be discussed in detail.
Figure 2
 
Principle of object recognition in our system. The system has to simultaneously represent information about position and identity of the input face and its parts. Positional information is represented by dynamic links establishing correspondences between points in the input image and in the internal reference frame (“Assembly Layer”). Identity information is represented by the activity of Gallery units, different graphs storing memories of different faces. Both modalities contribute to the internal Assembly Layer, which reconstructs the visual input information.
A dynamic model of cortical columns
The computational elements of our system are motivated by anatomical and physiological properties of the cortex on the scale of a few hundred microns. In particular, our modeling reflects the cortex's columnar organization (see, e.g., Mountcastle, 1997) and the concept of canonical cortical microcircuits as, e.g., suggested by Douglas, Martin, and Whitteridge (1989). That is, we take cortical columns as basic computational elements of our network and assume that all columns perform similar stereotypical computations. Depending on the perspective or the cortical area, a cortical column is commonly referred to as macrocolumn (Mountcastle, 1997), segregate (Favorov & Diamond, 1990), hypercolumn (Hubel & Wiesel, 1977), or simply column (e.g., Yoshimura, Dantzker, & Callaway, 2005) and, for instance in primary visual cortex, comprises roughly all neurons that can be activated from one point in visual space.
The analysis of the fine structure within a column suggests disjunct populations of excitatory neurons as functional elements. Anatomically, axons and dendrites of pyramidal cells have been found to bundle together and to extend orthogonally to the pial surface through the cortical layers. All neurons that directly contribute to one such bundle form a thin columnar module of just a few tens of microns in diameter (Buxhoeveden & Casanova, 2002; Peters & Sethares, 1996; Peters & Yilmaz, 1993). Together with associated inhibitory neurons (see, e.g., DeFelipe, Hendry, Hashikawa, Molinari, & Jones, 1990; Peters & Sethares, 1997) such a module was termed minicolumn (see, e.g., Buxhoeveden & Casanova, 2002; Favorov & Kelly, 1994; Mountcastle, 1997, 2003; Peters & Sethares, 1996) and was suggested as the basic computational unit of cortical processing (but see Jones, 2000; Rockland & Ichinohe, 2004, for critical discussions). More recent evidence for disjunct functional units within a cortical column comes from experiments using focal uncaging of glutamate combined with intracellular recordings (Yoshimura et al., 2005). It was found that a column has a fine structure of functionally relatively disjunct populations of layer 2/3 pyramidal cells. The relation of these populations to the cortical minicolumn has yet to be clarified, however. The main potential difference is that the concept of a minicolumn requires neurons in a population to be spatially adjacent whereas for neurons in the functional populations described in Yoshimura et al. (2005) this is not necessarily the case. 
Independent of the spatial arrangement of a column's functional subpopulations, there is little dispute about the existence of lateral coupling of such populations via a system of inhibitory neurons (Peters & Sethares, 1997; Yoshimura et al., 2005). For example, in Yoshimura et al. (2005), the excitatory populations of layer 2/3 have been found to receive common and population-unspecific input from inhibitory neurons of the same layer as well as from inhibitory neurons of layer 4 (see also Dantzker & Callaway, 2000). 
We will define our dynamic model of a cortical column in accordance with these experimental findings. To be somewhat independent of the different terminologies used in different communities, we will refer to the cortical column simply as column (instead of, e.g., macrocolumn or hypercolumn), and we will refer to its functional subpopulations as the column's units.
In our model of a column we use dynamic variables that describe population rate activity, following mean-field arguments discussed in numerous contributions (see, e.g., Gerstner, 2000; Latham, Richmond, Nelson, & Nirenberg, 2000; Lücke & von der Malsburg, 2004; Marti, Deco, Giudice, & Mattia, 2006; van Vreeswijk & Sompolinsky, 1998; Wilson & Cowan, 1973, for a columnar model). We describe a unit's neural activity by a differential equation called the modified evolution equation. This equation represents our model of inhibition among the column's units and is a generalization of the well-known deterministic evolution equation (see, e.g., Eigen, 1971).
The activity x_i of the ith unit in a column of K units is given by

$$\tau \frac{d}{dt} x_i = x_i^{\nu}\, I_i - x_i \sum_{j=1}^{K} I_j x_j, \qquad (1)$$

where τ is a time constant and the exponent ν parameterizes the competition strength among the units. I_i represents the input signal to unit x_i.
For ν = 0, there is no competition, and Equation 1 simplifies to

$$\tau \frac{d}{dt} x_i = I_i - x_i \sum_{j=1}^{K} I_j x_j. \qquad (2)$$

In this case, all units represent their input proportionally, while the interaction term Σ_j I_j x_j leads to activity normalization in the column (see the Appendix for a proof). For ν = 1, on the other hand, we have strong competition among the units, leading to winner-take-all (WTA) behavior (see the Appendix).
In our model of object recognition we assume that there are two types of columns with different functions. Dynamically, they differ only in the use of the competition parameter ν:
  1.  
    Feature columns represent their input in a linear fashion. Consequently, the units in a feature column have no need to compete among each other, i.e., for them the parameter ν = 0.
  2.  
Decision columns show a WTA behavior leading toward a state where only the unit getting the strongest input remains active. These columns receive a ν signal that rises linearly from 0 to 1. So they start out with linear dynamics like feature columns. With rising ν, competition increases, finally leading to a WTA behavior that leaves only the unit with the strongest input active. The typical dynamics of a decision column is shown in Figure 3.
Figure 3
 
Typical time course of the unit activities in an isolated decision column. The inputs to the K = 10 units are spread equidistantly between 0 and 1. The competition parameter ν rises from 0 to 1 during a time period of T = 400 τ. Note that the WTA behavior seen here results directly from the growth of the competition parameter ν. The internal dynamics of a column is much faster, so that with respect to the slow growth of ν, a column is always in quasi-steady state. This can be seen also in the fast rise of the unit activities from very small initial values to the significantly higher steady states.
In our system the crucial computations are performed by decision columns, whereas feature columns serve for information representation. Both kinds of columns may actually have the same neural substrate, the only difference being that feature columns do not receive (or simply do not respond to) the ν signal.
In the networks that we will introduce in the following section, the units of a column communicate with units of other columns. For this communication, a column scales the output activities of its K units such that its output energy (i.e., the 2-norm of the column activity vector) stays constant:

$$x_i := \frac{x_i}{\sqrt{\sum_{j=1}^{K} x_j^2}}. \qquad (3)$$
This kind of output normalization is advantageous for maintaining homeostasis in networks of columns and may be carried out by neurons in layer 5 of the cortex, as suggested by Douglas and Martin (2004).
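For concreteness, the following minimal sketch integrates an isolated decision column under Equations 1 and 3 with K = 10 equidistant inputs and a ν signal rising linearly over T = 400 τ, as in Figure 3. The use of plain Euler integration and the step size are illustrative assumptions, not taken from the paper's implementation.

```python
# Minimal sketch: isolated decision column (Equations 1 and 3).
# Euler integration and the step size dt are illustrative assumptions.
import numpy as np

K = 10                              # units per column (as in Figure 3)
I = np.linspace(0.0, 1.0, K)        # inputs spread equidistantly between 0 and 1
x = np.full(K, 0.01)                # small non-zero initial activities
tau, T, dt = 1.0, 400.0, 0.01       # time constant, total time T = 400 tau, Euler step

steps = int(T / dt)
for step in range(steps):
    nu = step / steps               # competition parameter rises linearly 0 -> 1
    # Equation 1: tau * dx_i/dt = x_i^nu * I_i - x_i * sum_j I_j x_j
    dx = (x ** nu) * I - x * np.sum(I * x)
    x += (dt / tau) * dx

out = x / np.sqrt(np.sum(x ** 2))   # Equation 3: constant 2-norm column output
print(np.argmax(out))               # WTA: the unit with the strongest input survives
```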
The network model
In Lücke et al. (2008) a model for correspondence finding is described that makes use of a population code within cortical columns, which allows fast point-to-point matching between two patterns (estimated to take on the order of or below 100 ms). Here we extend this model to a system that matches images of different geometry and can compare input images to a gallery of many stored models simultaneously. Preliminary results of this work have been described in Wolfrum, Lücke, & von der Malsburg (2008).
Our network is made up of layers, which loosely correspond to the different cortical areas that make up the visual system (we are not speaking here of the layers of anatomically different neurons that can be distinguished within one area of cortex). Layers are organized topologically, with a topology that may be stimulus space, as in V1 and somatosensory cortex, or a more abstract space. The layers of our network interact recurrently, and activity collectively converges toward a final state that represents the “percept” of the network, in our case the possible recognition of a face.
Layers may contain both feature columns and decision columns. If we assume every feature column to represent all relevant features at one position of a retinal image, then layers of feature columns can represent whole images. The network introduced below uses layers of two different spatial arrangements: 
  1.  
    Rectangular grid. Straightforward representation suitable for any image. Every column represents one specific geometric location (see Figure 4A).
  2.  
    Face graph structure. An arrangement specifically suited for faces, where each column represents an important landmark position on a face (see Figure 4B). Note that in this case, a column does not necessarily represent a fixed spatial location in the image, but rather a fixed semantic location (nose, mouth, eye, chin, etc.). Spatial locations of landmarks can change according to the face they represent.
Figure 4
 
Different representations of facial images. A rectangular grid graph (A) is used for input image representation, a face graph (B) consisting of characteristic points (landmarks) is a dedicated data structure used for internal face representation.
The network consists of the following three layers (see Figure 5): 
Figure 5
 
Architecture of our network. The gray oval structures represent columns (the vertical ones feature columns, the horizontal ones decision columns), with units as lighter cylinders inside. The numbers of units and columns shown here are chosen exemplarily for visualization purposes only and are not identical to the real numbers of units used in this work. The Input Layer is organized in a rectangular grid (represented by the light blue lines connecting columns), while both the Assembly Layer and the Gallery Layer have face graph topology. At each landmark in the Assembly Layer there are three columns: two feature columns (Input Assembly and Gallery Assembly) and one control column. Input and Assembly are connected all to all (shown exemplarily for the left lowermost point in the Assembly Layer), while Assembly landmarks are connected only to the same landmarks in the Gallery, but to all identity units there (see also Figure 6). The green lines connecting the three layers and the subset of green highlighted (= activated) Gallery units represent a possible final state of the network.
  1.  
    Input Layer I: Represents the input image in a rectangular grid.
  2.  
    Assembly Layer: Integrates intermediate information from both the input image (represented in the Input Assembly units IA, see Figure 6) and the gallery (represented by the Gallery Assembly units GA).
  3.  
    Gallery Layer G: Represents all gallery faces in terms of the weights of its afferent and efferent connections to the Assembly Layer.
Figure 6
 
Information flow in our system. Visual information in the form of Gabor jets J extracted from an input image activates the Input Layer I. It flows to the Assembly Layer (Input Assembly, IA) and from there to the Gallery G, where it activates, via receptive fields v, some memories more strongly than others. Information representing the active memories (stored in projection fields w analogous to v) flows back to the Gallery Assembly GA. Information flow from the Input Layer I to the Input Assembly IA is modulated by the control units C, which in turn are driven by the similarity of those image patches in the Input Layer and the Gallery Assembly that they connect. By activating those control units that connect positions of the Input Layer containing information similar to the Gallery Assembly, the system effectively focuses on those parts of the input image that contain visual information most similar to the current reconstruction in the Gallery Assembly, formed by superposition of active units in the Gallery Layer. The thick black arrows represent the competition among the decision columns of which the Gallery and the control columns consist. The symbols correspond to those used in the text.
The following three subsections describe these layers in detail. 
Input layer
The Input Layer represents the input image using 400 feature columns arranged in a rectangular grid of P = 20 × 20 points. Each feature column represents, by its units' activities, K features extracted from the image at that position.
If we neglect color and binocularity, the response properties of neurons in primary visual cortex are commonly described by the well-known Gabor wavelets (Daugman, 1980; Jones & Palmer, 1987; Ringach, 2002). Note that other wavelets like the Cauchy filter (Wallis, 2001) would presumably also lead to good results. In our model we use a predefined set of Gabor wavelets that appropriately sample orientation (over 8 orientations) and spatial frequency (over 5 scales) space, resulting in K = 40 features at each point. That is, we use Gabor filter responses to model the RFs of the feature units in the Input Layer. For extracting the filter responses, we use the standard Gabor transform, as described in the Appendix. As feature values we use the magnitudes |J| of the responses, thus ignoring Gabor phase, to model complex cell responses (Hubel & Wiesel, 1977). Implicitly, Gabor phase is still represented by the positions of the feature columns in the input image. In applications using Gabor features it has turned out that good results can be achieved with K = 40, as above (Wundrich, von der Malsburg, & Würtz, 2004). Performance increases with more wavelets, but 40 represents a good compromise between performance and computational cost.
Each Input Layer unit being responsive to a certain Gabor feature J_i^p at its position p on the input grid, the unit activities follow the dynamics (cf. Equation 2)

$$\tau \frac{d}{dt} x_i^{I_p} = J_i^p - x_i^{I_p} \sum_{j=1}^{K} J_j^p\, x_j^{I_p}. \qquad (4)$$
Assembly layer
The Assembly Layer integrates intermediate information from both the input image (represented in the Input Assembly units) and the gallery (represented by the Gallery Assembly units, see Figure 6). The role of the Input Assembly is to represent a normalized version of the input image, while the Gallery Assembly accommodates a weighted average of all Gallery faces. This information is organized in a face graph arrangement with Q = 48 landmarks (see Figure 4B). Since the face graph in the Assembly Layer has to be able to represent many different faces, we determine its geometry by averaging over several hundred face graphs of individual faces.
The columns of the Input Assembly and Gallery Assembly are feature columns, i.e., they integrate their inputs (defined below) according to Equation 2. The input I^{IA} to the ith Input Assembly unit at position q of the face graph is a weighted sum of the ith Gabor feature at all grid positions p of the Input Layer, modulated by the respective control units:

$$I_{q,i}^{IA} = \frac{1}{P} \sum_{p=1}^{P} C_{p,q}\, I_{p,i}, \qquad (5)$$

with C_{p,q} being the output strength of the dynamic link (see below) controlling the flow of the output of Input column I_p to Input Assembly column IA_q.
The input I^{GA} to a Gallery Assembly unit at position q is the superposition of all Gallery activities at the same landmark, filtered/multiplied by the feature vector represented by the respective Gallery unit:

$$I_{q,i}^{GA} = \frac{1}{M} \sum_{m=1}^{M} w_{q,m,i}\, G_{q,m}, \qquad (6)$$

with the “efferent weight” w_{q,m,i} representing the strength of Gabor feature i in landmark q of Gallery image m (of M in total).
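In array form the two feeds can be written compactly. The sketch below evaluates Equations 5 and 6 with numpy; the array layout (indices ordered as in the text) and the random placeholder activities are illustrative assumptions.

```python
# Sketch of the Assembly Layer inputs (Equations 5 and 6).
# Array shapes and random placeholder activities are illustrative.
import numpy as np

P, Q, K, M = 400, 48, 40, 100       # grid points, landmarks, features, gallery faces

I_out = np.random.rand(P, K)        # Input Layer column outputs  I_{p,i}
C     = np.random.rand(P, Q)        # control unit activities     C_{p,q}
G     = np.random.rand(Q, M)        # Gallery unit activities     G_{q,m}
w     = np.random.rand(Q, M, K)     # efferent weights            w_{q,m,i}

# Equation 5: control-gated sum of Input Layer features at each landmark q
I_IA = (C.T @ I_out) / P            # shape (Q, K), entries I^{IA}_{q,i}

# Equation 6: activity-weighted superposition of stored features per landmark
I_GA = np.einsum('qmi,qm->qi', w, G) / M   # shape (Q, K), entries I^{GA}_{q,i}
```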
Control units
The Assembly Layer also contains the control units mentioned above, which mediate the signal coming in from the Input Layer. These control units provide potential connections (dynamic links) between every Input Layer point and every point in the Input Assembly. The activity of the control units is driven by the feature similarity of the corresponding points in the Input Layer and the Gallery Assembly. That is, the similarity between the non-normalized input face in the Input Layer and the weighted average face in the Gallery Assembly controls, via the control units, how input information flows to the Input Assembly. In that sense the control units define a geometric mapping between the Input and Assembly Layers. In addition to the feature similarity input, control units receive support from neighboring control units that represent similar mappings (see Figure 7 and the paragraph below for details).
Figure 7
 
Interaction among control units to achieve a topologically consistent (i.e., continuous) mapping. The unit controlling the blue link strengthens control units in neighboring columns that represent links of similar orientation. Maximal cooperation would occur with perfectly parallel links (the green dashed axes of the gray cones). Since in reality links in the network only exist to the nodes of the input grid (full lines), the strength of cooperation (represented by the shade of green) depends on the degree of parallelity with the blue link, equivalent to the distance of a link's end point from the cone center.
The dynamic links are decision units, meaning that their dynamics follow Equation 1. The input I^C to a dynamic link C_{p,q} connecting input position p and assembly position q is given by the scalar product between both column outputs plus a topological interaction term:

$$I_{p,q}^{C} = \sum_{i=1}^{K} I_{p,i}\, GA_{q,i} + \frac{c_{top,C}}{|neighbors|} \sum_{\tilde{p},\tilde{q}} f_{top}(p,q,\tilde{p},\tilde{q})\, C_{\tilde{p},\tilde{q}}, \qquad (7)$$

where c_{top,C} defines the maximal strength of topological interaction between control units (see below), and |neighbors| is the number of topological neighbors the control column has in the face graph.
Topological cooperation among control units
As mentioned before, there is topological cooperation among the control units of the Assembly Layer. The purpose of this cooperation is to establish a continuous mapping between the different geometries of the Input Layer and the Input Assembly. A given dynamic link connects a specific column A of the Input Layer with a column B of the Input Assembly. Due to the geometry of both layers, the two columns represent distinct positions z_A and z_B in retinal coordinates and internal image representation space, respectively. Consequently, the dynamic link between them represents a certain geometric distance d_i = z_B − z_A.
The idea is now to have topological connections in order to support parallel or near-parallel dynamic links. Therefore we define the strength of a topological connection between any two dynamic links i and j whose columns are neighbors in the face graph through a monotonically decreasing function of their nonparallelity/disparity:

$$f_{top}(i,j) = f\left(\|d_j - d_i\|_2\right). \qquad (8)$$

Here we use a linearly decreasing thresholded function of the form

$$f(y) = \max\left(0,\, 1 - \frac{y}{\beta}\right). \qquad (9)$$

Thus topological interaction is always positive and acts only between more or less (depending on β) parallel neighboring links. This principle is depicted in Figure 7. To obtain the topological interaction f_top in Equation 7 between two control units C_{p,q} and C_{p̃,q̃}, we first calculate, from the coordinates of the columns they control in the Input and the Assembly Layer, the geometric distances d_{p,q} and d_{p̃,q̃} that they represent. From these we calculate the disparity of the two control columns according to Equation 8 and the topological interaction via Equation 9.
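A small sketch of this computation (Equations 8 and 9), assuming 2-D coordinates for the linked grid points and landmarks; β and the example coordinates are illustrative.

```python
# Sketch of the topological cooperation between two dynamic links
# (Equations 8 and 9); beta and the coordinates are illustrative.
import numpy as np

def f_top(z_in_p, z_as_q, z_in_pt, z_as_qt, beta=1.0):
    """Cooperation between links (p, q) and (p~, q~) of neighboring columns."""
    # Each link represents a geometric distance d = z_B - z_A between the
    # Input Layer position it leaves and the Assembly position it reaches.
    d_pq   = np.asarray(z_as_q)  - np.asarray(z_in_p)
    d_ptqt = np.asarray(z_as_qt) - np.asarray(z_in_pt)
    disparity = np.linalg.norm(d_ptqt - d_pq)          # Equation 8
    return max(0.0, 1.0 - disparity / beta)            # Equation 9

# Perfectly parallel links cooperate maximally:
print(f_top((0, 0), (5, 5), (1, 0), (6, 5)))   # -> 1.0
```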
Gallery layer
The Gallery Layer represents all M gallery face images in a face graph of Q decision columns. Each column corresponds to one landmark, with the units representing, by their afferent and efferent connections, specific feature vectors for the individual faces at the respective landmarks (see Figure 6). The units in the Input Assembly activate the Gallery units through receptive fields v representing the stored facial landmark features, activating more strongly the units of faces that are similar to the normalized input image in the Input Assembly:

$$I_{q,m}^{G} = \sum_{i=1}^{K} v_{q,m,i}\, IA_{q,i} + \frac{c_{top,G}}{Q} \sum_{\tilde{q}=1}^{Q} G_{\tilde{q},m}. \qquad (10)$$

Additionally, there is interaction among the Gallery units, with c_{top,G} defining how strongly Gallery units representing the same face cooperate. That is, all landmarks that belong to the same face cooperate, and at each landmark the corresponding features of all different faces compete.
The Gallery projects a weighted superposition of its stored faces to the Gallery Assembly through efferent weights w that are identical to its afferent weights v (cf. Equation 6). Point-to-point comparison with the Input Assembly and competition among stored models leaves only the correctly recognized identity active in the end. 
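Using the same array conventions as the sketch above, the Gallery input of Equation 10, including the cooperation term among landmarks of the same face, could be evaluated as follows (shapes and placeholder activities are again illustrative):

```python
# Sketch of the Gallery Layer input (Equation 10).
import numpy as np

Q, K, M = 48, 40, 100
v       = np.random.rand(Q, M, K)   # receptive fields v_{q,m,i} (identical to w)
IA      = np.random.rand(Q, K)      # Input Assembly activities IA_{q,i}
G       = np.random.rand(Q, M)      # current Gallery activities G_{q,m}
c_top_G = 0.5                       # cooperation strength (illustrative value)

# First term: feature match between the normalized input and each stored face.
# Second term: cooperation among all landmarks belonging to the same face.
I_G = np.einsum('qmi,qi->qm', v, IA) + (c_top_G / Q) * G.sum(axis=0)[None, :]
# I_G has shape (Q, M); it drives the decision-column dynamics of Equation 1.
```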
Simulations
We now simulate the dynamics defined in the above sections using natural images of faces as input and as memories in the gallery. To integrate Equation 1, we simply use the Euler method but adapt its time step dynamically to the average change of activity in the network in order to keep the system stable. All units have a small, but non-zero initial activity x(0) = 0.01. The units in the Input Layer, which receive input directly from the incoming image (cf. Equation 4), quickly converge to a state where they represent the input image via the different Gabor feature values at all grid positions. This information flows to the Input Assembly modulated by the activities of the control units (Equation 5), which connect every point in the Input Layer with every point in the Input Assembly. Since initially all control units have equal activity, this leads to a superposition of image information from all Input Layer points at each Input Assembly location, resulting in a featureless, more or less homogeneous image in the Input Assembly (first image in Figure 8). In the Gallery Layer, all faces are equally active initially. The Gallery Assembly, which receives input from all Gallery units (Equation 6), will therefore initially receive a superposition of all Gallery faces, resembling an “average face” (like the first image in Figure 9). 
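The paper states only that the Euler step is adapted to the average change of activity in the network; the concrete adaptation rule in the following sketch (capping the mean update per step by a target value) is our assumption for illustration.

```python
# Sketch of an adaptive Euler step for integrating Equation 1; the specific
# adaptation rule (bounding the mean activity change per step) is assumed.
import numpy as np

def euler_step(x, dxdt, tau, target_change=1e-3, dt_max=0.1):
    """Advance activities x by one Euler step whose size tracks network change."""
    rate = dxdt / tau
    mean_change = np.mean(np.abs(rate))
    dt = dt_max if mean_change == 0 else min(dt_max, target_change / mean_change)
    return x + dt * rate, dt
```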
Figure 8
 
The process (from top to bottom) of finding the correct mapping between the Input Layer and the Input Assembly. Each row shows the control unit activities on the left side, and on the right first the constant input image, and then an image reconstructed from the activities of the 48 landmarks of the Input Assembly. Initially, the control units have all nearly identical activity, and therefore the Input Assembly receives a superposition of all input information, resulting in the same uniform image information at all landmarks (row one). With the control units developing a topologically consistent match between Input and Input Assembly (rows two and three), this image starts to differentiate toward a normalized (i.e., shifted and deformed if necessary) version of the input image. The mapping via the control units is also visualized by the colored lines connecting the input image with the Input Assembly. Each line represents the “center of mass” of a control column, i.e., the location in the input image where its units are pointing to as a group, weighted by their activity. Click here to view movies of the correspondence finding process.
Figure 9
 
Time course (from top to bottom) of the Gallery unit activities (left) and of the resulting image representation in the Gallery Assembly (right). The Gallery Assembly gets input from all Gallery units and thus contains an activity-weighted average of all faces in the gallery. Initially, when all Gallery units are nearly equally active, this weighted average is a real average of all gallery faces, i.e., a mean face (uppermost row). With ongoing dynamics and rising competition, the Gallery units fitting the input image better get stronger, and the Gallery Assembly activity develops toward the respective gallery faces. Finally, only one unit of all Gallery columns is active, and the Gallery Assembly contains a representation of the image the system has recognized (which is not identical to the input image in most applications, cf. input image in Figure 8). Click here to view movies showing the identification process.
To each control unit in the Assembly Layer a unique pair of feature columns is assigned, one in the Input Layer and the other one in the Gallery Assembly. The control units are driven by the similarity (expressed in terms of the scalar product) of the information stored in their dedicated feature columns, see Equation 7. Therefore control units that connect points of the average face with similar input points will become stronger, while control units representing irrelevant matches will be weakened. Over the process of recognition, the activity distribution of the control units becomes more and more sparse, until it finally represents a unique mapping between the Input Layer and the Assembly Layer (see left column in Figure 8). Since purely local similarity of images can be quite ambiguous, the additional topological interaction among the control units is necessary in this process to achieve a globally consistent match. As the information flow from the Input Layer to the Input Assembly is modulated by the control units, the image in the Input Assembly will start to develop from a gray nondescript superposition to a more and more clear version of the input image (right column in Figure 8). It may be shifted and possibly distorted such that it conforms to the topology of the face graph of the Gallery Assembly. 
The image information in the Input Assembly in turn acts as input to the Gallery units, where it gets filtered through the individual receptive fields of the units (Equation 10), exciting those units more that represent faces more similar to the input image. Due to competition between the units of each Gallery column and cooperation among units of different landmarks representing the same face the Gallery will start to favor some of the stored faces over others (cf. left column of Figure 9). This in turn changes the image in the Gallery Assembly from an average face to a superposition that is already biased toward one or several of the better fitting gallery faces (second and third face images of Figure 9). This sharpened target face now helps to position the normalized input image even more precisely, and so forth. In the final state, the Input Assembly will contain a shifted and maybe distorted version of the input image, while in the Gallery Layer the units of only one face are still active, and the Gallery Assembly contains a copy of that face of the Gallery that the system judges to be most similar to the input image. 
Note that we can numerically simulate the dynamics (Equation 1) without specifying a value of the time constant τ. As long as the simulation time T remains constant relative to τ, simulation results will be independent of τ. The question of how the time course of the dynamics translates to recognition times in biological terms does, however, crucially depend on the actual choice of τ. Using a time constant of, e.g., τ = 0.2 ms results in a system that selects the winning subpopulations of its decision columns in about 80 ms (with T = 400 τ, compare Figure 3). The whole network could consequently converge to a face position and identity within about the same time. Numerical simulations of a single column with explicitly modeled spiking neurons suggest an even smaller time constant (see Lücke & von der Malsburg, 2004, or compare Muresan & Savin, 2007, for population activation times on the order of 10 ms, which suggest similarly fast deactivation times). For more detailed conclusions it would, however, be necessary to model a system with a setup like the presented one but based on detailed single neuron models instead of abstractions for populations. The velocity of recognition in such a system would ultimately depend on the time constants of the single neurons' ion channels and on dendritic and axonal conduction times.
To quantitatively compare our system to other approaches, we tested it on the FERET (Phillips, Wechsler, Huang, & Rauss, 1998) and the AR (Martinez & Benavente, 1998) benchmark databases. We followed the testing protocol of Phillips et al. (2005) and of Tan, Chen, Zhou, and Zhang (2005). The FERET database contains images of 1196 individuals, while the subsets of the AR database used by Tan et al. (2005) and by us contain 100 faces. The measured recognition rates (see Table 1) show that our system is competitive with purely functionally motivated approaches, although it cannot compete with the best performing system for each individual test category. It should be noted, however, that most of those systems were only tested on a single database, allowing fine-tuning for these specific circumstances, while we used the same parameter settings on both databases. In general, our focus was rather on creating a neurally plausible system than on investing much effort in parameter tuning. For a more extensive discussion of these results see Wolfrum et al. (2008). 
Table 1
 
Recognition rates of our system compared to those reported in the literature. Results are given in %, the best performance in any testing category typed in bold. The first column shows the recognition rates of our system (in %) for different probe sets of the FERET and the AR databases. The following three columns (A, B, and C) show those systems evaluated in Phillips et al. (2000) and Tan et al. (2005), which performed best in at least one category. Note that these systems were mostly only tested on either the FERET or the AR database.
Recognition rates [%]      Our system    A     B     C
FERET   fafb                   95        95    92     –
        duplicate I            47        59     –     –
        duplicate II           26        52     –     –
AR      Emotion                91         –    95    82
        Em. duplicate          61         –    81    82
        Occlusion              73         –    96    81
        Occ. duplicate         36         –    56    51
Discussion
We present here a fully neural model for face recognition that goes in essential ways beyond previous work in our own group (Wiskott, Fellous, Krüger, & von der Malsburg, 1997; Wiskott & von der Malsburg, 1996; Zhu & von der Malsburg, 2004) and the recent model in Lücke et al. (2008). It combines findings from psychophysics, imaging studies, and physiology, while performing competitively on benchmark tests for face recognition. 
The basic building block of our system is a model of the cortical column, the biological relevance of which has been discussed earlier (Lücke & von der Malsburg, 2004). That model makes use of a population code for stimulus representation, thus allowing significantly faster computations than with rate codes of single neurons (for early arguments in this direction see, e.g., van Vreeswijk & Sompolinsky, 1998). An essential ingredient of the model is formed by dynamic links. These are synaptic connections that are modulated by the activity of control units, whose activity in turn is controlled by signal comparisons. For a discussion of the hypothesis that control units might be formed by astrocytes, see Möller, Lücke, Zhu, Faustmann, and von der Malsburg (2007). There is strong experimental evidence that receptive fields of neurons are not static (see Luck et al., 1997, and other references in the Introduction section), suggesting the existence of dynamic links. Other models in the literature argue for similar concepts, like the control units of Olshausen et al. (1993) or Sigma-Pi neurons (e.g., Weber & Wermter, 2007).
In anatomical terms, the different layers of our model can be interpreted as follows. The input layer represents incoming image information by Gabor wavelets, which resemble the receptive field properties in primary visual cortex (V1). The biological counterpart of our Assembly Layer would be an area like central or anterior inferotemporal cortex. Neurons here respond to stimuli from large parts of the visual field, and they code for complex shapes (Tanaka, 1996, 2003) similar to the face parts represented by the Assembly Layer. The fact that information about object position and scale can be read out from IT neurons (Hung, Kreiman, Poggio, & DiCarlo, 2005), which disagrees with the assumptions made by pure pooling models, points to the possibility of our control units residing there as well. Of course, in the cortex the mapping from V1 to IT does not happen directly, but via intermediate stages including V2 and V4. This is not accounted for in our current model but will be included in future extensions. We have described previously the likely form (Wolfrum & von der Malsburg, 2007b) of such routing over several stages and a possible ontogenetic mechanism (Wolfrum & von der Malsburg, 2007a). Finally, the Gallery of our model might correspond to an area like the fusiform face area (ffa), which is specialized for face recognition (Kanwisher & Yovel, 2006; Tsao, Freiwald, Tootell, & Livingstone, 2006). Note that the detection of faces (as distinct from recognition) is not modeled by us. Also in the brain, this appears to happen outside of ffa. Summerfield et al. (2006) find neurons in medial frontal cortex that are selectively active when subjects have to make a face vs. non-face decision, independently of face identity. Likewise, prosopagnosia patients recognize objects as faces but cannot identify them (Zhao, Chellappa, Phillips, & Rosenfeld, 2003). As discussed before, face recognition is special because faces have a generic shape, but recognition from thousands of individuals requires high sensitivity to detailed differences. This might become possible through competitive interaction (which in fact is the mechanism by which recognition happens in our Gallery Layer) in the small and compact ffa (Kanwisher, 2006). Apart from faces, there is evidence suggesting that ffa can also serve as an area of expertise for other object classes (Gauthier, Skudlarski, Gore, & Anderson, 2000; Tarr & Gauthier, 2000). In the same sense, our model is not confined to face recognition but could be used for recognition of any kind of object type that has a prototypical shape and requires high sensitivity to small differences among objects. 
Our system is correspondence based, as are those of Arathorn (2002), Hinton (1981), and Olshausen et al. (1993), in distinction to the majority of object recognition models (e.g., Mel, 1997; Serre, Wolf, Bileschi, Riesenhuber, & Poggio, 2007). As discussed in the Introduction section, several arguments speak in favor of correspondence-based object recognition. One is functionality: while feature-based systems are very good at classifying images into categories (Fei-Fei, Fergus, & Perona, 2006; Serre et al., 2007), they perform poorly in areas like face recognition, where high sensitivity to subtle image differences is required. As pointed out in the Introduction section, according to the latest test (Phillips et al., 2003; see also www.frvt.org) the best-performing commercial systems are all correspondence based. 
Second, the system is extensible to active perception and vision in dynamic environments. In distinction to feed-forward models without dynamics (like Serre et al., 2007), which behave like static filter banks, in the proposed system the activity traces of previous processing (a previous ν cycle) can influence the outcome of current perception. This makes it possible to model temporal effects in perception, such as priming and congruency effects, which abound in psychophysics (for a review, see Graf, 2006) and which can easily be replicated in our framework (Wolfrum & von der Malsburg, 2008): preactivating a subset of the control units at the beginning of a ν cycle (equivalent to spatial attention) results in a bias in favor of a specific location, so that for a large input image containing several faces, the system preferentially processes and recognizes objects at that position. Note that priming pathways instead of feature activity merely biases the system toward a certain decision without distorting the content that is being processed. In contrast, the model of Deco and Rolls (2004), for example, implements spatial attention by increasing activity at the input level, which might distort information content. Experimental findings (Luck et al., 1997) likewise suggest that priming shifts effective receptive fields (that is, pathways) without changing activity within the afferent layer. Similarly, objects can be primed by preactivating search images, in the form of arbitrary combinations of facial features, in the Gallery Layer, leading to preferential detection and recognition of a similar face among several others in the visual field. In distinction, a feed-forward model would only permit the preactivation of objects for which there is already a dedicated representation. Furthermore, preactivation of dedicated, implicit representations would not generate an explicit search image at lower layers but a large unstructured activation of all features that could potentially give rise to the primed object. 
A third argument for our model, and for correspondence-based recognition in general, is its use of explicit object representations. In a cardinal cell, whose activity stands for a specific object or property, the representation is implicit: the activity of the cell does not express the structure of the object, which is contained only implicitly in the synaptic patterns that define the cell's firing condition. An explicit representation, on the other hand, makes a wealth of distinctions available to recipients elsewhere in the brain. An explicit representation therefore forms not just a sample point in a structure space but represents a whole space of variations. However, in distinction to the generality of the retinal representation, which captures only properties common to all visual scenes, the ultimate goal of the visual system is to arrive at explicit representations that are tightly committed to distinctions applying to the actually inspected scene and that permit strong probabilistic coupling to variables in other modalities, such as action planning, motion control, or language. It is these explicit, temporarily committed representations that give us the feeling of being in direct contact with the visual reality out there. 
How does our model reflect these distinct types of representation? The representation in the Input Layer is explicit and uncommitted, and its generality is restricted only by the incomplete sampling of Gabor space. The representation in individual units in the Gallery Layer is implicit, highly specific and completely committed (to individual landmarks in individual faces). Given its activity state, the Gallery Layer creates, via its output connections to the Gallery Assembly, an explicit representation of a face. As long as inhibition is low, this representation is still uncommitted to an individual (see Figure 9, uppermost image), whereas at the end of the selection process it is fully explicit and committed to one individual face. While models of object recognition that explicitly reconstruct the input are common in probabilistic modeling (conceptually discussed, e.g., in Yuille & Kersten, 2006), it is still unclear how such reconstructions are realized by the brain. We hope that this paper can contribute to deepening our insight in this respect. 
Our model certainly leaves open a number of important problems for future work. As it stands, the model is invariant only to translation and needs to be generalized to changes of scale and orientation (which will require dynamic relinking of feature connections; see Sato, Jitsev, & von der Malsburg, 2008) as well as to other image transformations such as changing illumination and perspective deformation. As discussed before, the system as proposed here assumes direct dynamic links from all positions in the Input Layer to all positions of the Input Assembly, which would require unrealistic fiber convergence numbers. Following the proposal of Olshausen et al. (1993), this problem can be solved with the help of a switchyard of connections over several intermediate layers. An optimized system of this kind (Wolfrum & von der Malsburg, 2007b) has reasonable complexity, of O(n log n) with a small prefactor, in terms of numbers of connections and intermediate units (see the estimate below). A further problem yet to be solved is the generation of the gallery domain with the help of autonomous learning mechanisms, and the generation of the very specific connectivity patterns of the link control units by ontogenetic mechanisms. A possibly unrealistic aspect of our model in its present form is the maintenance of the same low-level feature types (Gabor wavelets) through large parts of the system; a hierarchy of feature types of growing complexity is more likely to be realized in the visual system. Incorporating this will be most natural in the context of a more comprehensive model that also encompasses ontogenesis and learning mechanisms. Another unrealistic aspect of the model is the flat structure of the gallery domain. As Biederman (1987) has convincingly argued, many objects are recognized as ordered arrays of simpler subshapes. Again, realizing such structure in the gallery domain will require potent learning mechanisms. 
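To make the complexity claim concrete, here is a back-of-the-envelope count of our own (the constants are illustrative and not taken from Wolfrum & von der Malsburg, 2007b): routing between n positions through L = log_b n stages of constant fan-out b replaces the quadratic cost of direct all-to-all links,

\[
\underbrace{n^{2}}_{\text{direct links}}
\qquad \text{vs.} \qquad
\underbrace{b \, n \, \log_b n}_{L = \log_b n \ \text{stages, fan-out } b}
\;=\; O(n \log n) \quad \text{for fixed } b .
\]

Each of the roughly n units per stage needs only b outgoing links, so the total link count grows only log-linearly in the number of routed positions.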
Object recognition is only one of a great multitude of functions of the visual system and of the brain. We feel, however, that the principle of correspondence finding by active information routing is of general importance as a paradigm of brain function. 
Supplementary Materials
Supplementary Movie 1
Supplementary Movie 2
Supplementary Movie 3
Supplementary Movie 4
Appendix A
Parameters
The size of the network is determined by the following parameters. The input grid consists of P = 400 columns; all face graphs (Assembly and Gallery Layers) contain Q = 48 columns. Consequently, there is a total of 400 × 48 = 19200 control units. For the representation of visual information in the Input and Assembly Layers, we use K = 40 Gabor wavelets. The number M of Gallery faces depends on the size of the database on which the system is tested: in the case of the FERET database, M_FERET = 1196; for the AR database, M_AR = 100. 
For the simulation results of the Simulations section we used the following settings and parameters. We chose a time constant of τ = 0.2 ms. As mentioned in the Simulations section, the specific choice of τ does not influence the behavior of the system as long as the ratio T/τ between recognition time and time constant is fixed. The length of the ν cycle we chose is T = 400τ = 80 ms. The radius for topological interaction among control units was β = 0.05 × image size. The maximum strength of this topological interaction was c_top,C = 3.5; the cooperation strength between neighboring gallery units was c_top,G = 0.1. 
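For reference, the following sketch collects these settings in one place. It is our own convenience structure; the names are illustrative and not taken from the original implementation.

from dataclasses import dataclass

@dataclass(frozen=True)
class ModelParams:
    P: int = 400            # columns in the input grid
    Q: int = 48             # landmark columns per face graph
    K: int = 40             # Gabor wavelets per jet (5 scales x 8 orientations)
    M: int = 1196           # gallery size (1196 for FERET, 100 for AR)
    tau: float = 0.2        # unit time constant [ms]
    T_over_tau: int = 400   # length of one nu cycle in units of tau
    beta: float = 0.05      # control-unit interaction radius, x image size
    c_top_C: float = 3.5    # maximum topological interaction, control units
    c_top_G: float = 0.1    # cooperation strength, neighboring gallery units

    @property
    def n_control_units(self) -> int:
        return self.P * self.Q             # 400 x 48 = 19200 control units

    @property
    def T_ms(self) -> float:
        return self.T_over_tau * self.tau  # 400 * 0.2 ms = 80 ms

params = ModelParams()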
Proof of self-normalization properties
Since the ratio T/τ is very large, i.e., the time constant is much shorter than the overall simulation time, the unit activities are close to the adiabatic state, i.e., \( \frac{d}{dt} x_i \approx 0 \). We therefore analyze the steady state of the unit activities to derive the self-normalization properties. 
For the case of ν = 0 (linear representation), the unit dynamics are given by Equation 2:
\[ \tau \frac{d}{dt} x_i = I_i - x_i \sum_{j=1}^{K} I_j x_j . \tag{A1} \]
The steady state of this is
\[ x_i = \frac{I_i}{\sum_j I_j x_j} . \tag{A2} \]
Multiplying both sides with x_i yields
\[ x_i^2 = \frac{I_i x_i}{\sum_j I_j x_j} , \tag{A3} \]
and the sum of this term over all i is
\[ \sum_i x_i^2 = 1 , \tag{A4} \]
i.e., for ν = 0 the column activity is normalized to a 2-norm of 1. 
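This steady-state property is easy to verify numerically. The sketch below (our own check, not the authors' code) integrates Equation A1 with a forward-Euler scheme for arbitrary positive inputs and confirms that the activity vector converges to x = I / ||I||_2, whose 2-norm is 1.

import numpy as np

# Forward-Euler integration of Eq. A1: tau dx_i/dt = I_i - x_i * sum_j I_j x_j
rng = np.random.default_rng(0)
K, tau, dt = 40, 0.2, 0.002            # units per column, time constant [ms], step [ms]
I = rng.uniform(0.1, 1.0, K)           # arbitrary positive inputs
x = np.full(K, 1e-3)                   # small initial activities

for _ in range(200_000):               # integrate well past the transient
    x += dt / tau * (I - x * np.dot(I, x))

print(np.linalg.norm(x))                      # -> 1.0: the 2-norm self-normalizes
print(np.allclose(x, I / np.linalg.norm(I)))  # -> True: steady state x = I / ||I||_2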
In the case of ν = 1 (WTA behavior), the unit dynamics are
\[ \tau \frac{d}{dt} x_i = x_i I_i - x_i \sum_{j=1}^{K} I_j x_j , \tag{A5} \]
so in steady state we have
\[ \tau \sum_i \frac{d}{dt} x_i = \sum_i x_i I_i - \sum_i x_i \sum_j x_j I_j = \Big( 1 - \sum_i x_i \Big) \sum_i x_i I_i = 0 . \tag{A6} \]
If there is any activity in the column, i.e., not all unit activities are zero simultaneously, this requires
\[ \sum_i x_i = 1 , \tag{A7} \]
which means that for ν = 1 the column activity self-normalizes to a 1-norm of 1. Consequently, the interaction term \( \sum_j I_j x_j \) is the activity-weighted average input to the column. Hence only those unit activities grow whose input is higher than this weighted mean input to the column; the others decay. This in turn lets the weighted input average grow, because the bias shifts toward strong inputs. Eventually, all unit activities decrease to 0 except for the unit with the strongest input, whose activity approaches 1. For the final steady state we can show this by setting the time derivative for a single unit to 0:
\[ x_i \left( I_i - \sum_{j=1}^{K} I_j x_j \right) = 0 . \tag{A8} \]
Here we see that for any i we either have x_i = 0 or \( I_i = \sum_j I_j x_j \), which can only be true for x_i = 1 and all other x_j = 0 (except for the degenerate case of two or more of the I_i being exactly identical, which would presumably be resolved in the brain by spontaneous symmetry breaking). 
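The same kind of numerical check illustrates the winner-take-all behavior. The sketch below (again ours, not the authors' code) integrates Equation A5 with the inputs spread as in Figure 3 and confirms that the activities self-normalize to a 1-norm of 1 while only the unit with the strongest input survives.

import numpy as np

# Forward-Euler integration of Eq. A5: tau dx_i/dt = x_i * (I_i - sum_j I_j x_j)
K, tau, dt = 10, 0.2, 0.002
I = np.linspace(0.1, 1.0, K)           # inputs spread equidistantly, as in Figure 3
x = np.full(K, 0.05)                   # all units weakly active at the start

for _ in range(200_000):
    x += dt / tau * x * (I - np.dot(I, x))

print(x.sum())                          # -> 1.0: the 1-norm self-normalizes
print(x.argmax(), x.max())              # -> 9, ~1.0: the strongest input wins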
Gabor transform
The model described in this paper uses a set of Gabor wavelets that appropriately sample orientation (8 orientations) and spatial frequency (5 scales). If V is an image with V(z) denoting the gray value of a pixel at the geometric position z = (x, y), the filter responses R_i(z) are given by
\[ R_i(z) = \int V(z') \, \psi_i(z - z') \, d^2 z' , \tag{A9} \]
\[ \psi_i(\zeta) = \frac{\|k_i\|^2}{\sigma^2} \exp\!\left( -\frac{\|k_i\|^2 \|\zeta\|^2}{2 \sigma^2} \right) \left[ \exp( i \, k_i \cdot \zeta ) - \exp\!\left( -\frac{\sigma^2}{2} \right) \right] , \quad \sigma = 2\pi , \tag{A10} \]
where the wave vector is parameterized as
\[ k_i = \begin{pmatrix} k_{ix} \\ k_{iy} \end{pmatrix} = \begin{pmatrix} k_\rho \cos \varphi_\mu \\ k_\rho \sin \varphi_\mu \end{pmatrix} , \quad k_\rho = 2^{-\frac{\rho + 2}{2}} \pi , \quad \varphi_\mu = \mu \frac{\pi}{8} , \tag{A11} \]
with orientation parameter μ = 1, …, 8 and scale parameter ρ = 1, …, 5. That is, (R_1(z), …, R_40(z)) is a vector of Gabor filter responses in which each entry corresponds to one of the 40 combinations of ρ and μ. As feature values we use the magnitudes
\[ J_i^p = \left| R_i(z_p) \right| , \tag{A12} \]
thus ignoring Gabor phase. 
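For readers who want to reproduce the feature extraction, the following sketch implements the filter bank of Equations A9–A11 and the jet of Equation A12. It is our own rendering of these standard formulas (cf. Lades et al., 1993), not the authors' code; the kernel size and the per-point convolution are illustrative simplifications.

import numpy as np
from scipy.signal import fftconvolve

SIGMA = 2.0 * np.pi                      # sigma = 2 pi, as in Eq. A10

def gabor_kernel(rho, mu, size=101):
    """Complex Gabor wavelet psi_i for scale rho (1..5) and orientation mu (1..8)."""
    k_rho = 2.0 ** (-(rho + 2) / 2.0) * np.pi       # k_rho = 2^{-(rho+2)/2} pi
    phi = mu * np.pi / 8.0                          # phi_mu = mu pi / 8
    kx, ky = k_rho * np.cos(phi), k_rho * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    ksq = kx * kx + ky * ky                         # ||k_i||^2
    envelope = (ksq / SIGMA**2) * np.exp(-ksq * (x * x + y * y) / (2.0 * SIGMA**2))
    # Oscillatory part minus the DC-correction term exp(-sigma^2 / 2):
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-SIGMA**2 / 2.0)
    return envelope * carrier

def gabor_jet(image, z):
    """40-dimensional jet J^p of Eq. A12 at pixel position z = (row, col)."""
    jet = []
    for rho in range(1, 6):                         # 5 scales
        for mu in range(1, 9):                      # 8 orientations
            R = fftconvolve(image, gabor_kernel(rho, mu), mode="same")  # Eq. A9
            jet.append(abs(R[z]))                   # magnitude only; phase ignored
    return np.array(jet)

In a real system one would compute all 40 response maps for the whole image once and sample them at the grid positions, rather than convolving anew for each point as this sketch does.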
Acknowledgments
We thank Urs Bergmann for his help in programming, Alexander Heinrichs for helping to preprocess the database images, Cornelius Weber and Junmei Zhu for useful comments on the manuscript, and two anonymous reviewers whose criticism has helped us to clarify a number of points. This work was supported by the European Union through Project FP6-2005-015803 (“Daisy”), by the Gatsby Charitable Foundation, and by the Hertie Foundation. 
Commercial relationships: none. 
Corresponding author: Philipp Wolfrum. 
Email: wolfrum@fias.uni-frankfurt.de 
Address: FIAS, Ruth-Moufang-Str. 1, 60433 Frankfurt, Germany. 
Footnotes
1  Note that the concept of a canonical microcircuit is not necessarily tied to the concept of anatomically disjunct columns as promoted in Mountcastle (1997).
2  In principle, the competition parameter ν could be set to a constant value of ν = 1. However, slowly increasing competition within the columns of a network has in earlier systems proven to efficiently avoid local optima (Lücke et al., 2008). This is related to the slow change of the temperature parameter in simulated annealing-like systems (Kirkpatrick, Gelatt, & Vecchi, 1983), which serves the same purpose. In Körner, Gewaltig, Körner, Richter, and Rodemann (1999) the thalamic complex of intralaminar nuclei is discussed as a possible source of a fast and global modulatory signal to the cortex.
3  For brevity of notation, we will sometimes just use the name of a certain unit type (like C for control units) to denote the output of that unit. We will always point this out when we do so.
References
Anderson C. H. Van Essen D. C. Olshausen B. A. (2005). Directed visual attention and the dynamic control of information flow. In Neurobiology of attention. Amsterdam, Netherlands: Academic Press/Elsevier.
Arathorn D. W. (2002). Map-seeking circuits in visual cognition: A computational mechanism for biological and machine vision. Palo Alto, USA: Stanford University Press.
Bak P. (1996). How nature works: The science of self-organized criticality. Heidelberg, Germany: Springer-Verlag.
Biederman I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115–147. [PubMed] [CrossRef] [PubMed]
Biederman I. Kalocsai P. (1997). Neurocomputational bases of object and face recognition. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 352, 1203–1219. [PubMed] [Article] [CrossRef]
Bundesen C. Larsen A. (1975). Visual transformation of size. Journal of Experimental Psychology: Human Perception and Performance, 1, 214–220. [PubMed] [CrossRef] [PubMed]
Buxhoeveden D. P. Casanova M. F. (2002). The minicolumn hypothesis in neuroscience. Brain, 125, 935–951. [PubMed] [Article] [CrossRef] [PubMed]
Dantzker J. L. Callaway E. M. (2000). Laminar sources of synaptic input to cortical inhibitory interneurons and pyramidal neurons. Nature Neuroscience, 3, 701–707. [PubMed] [CrossRef] [PubMed]
Daugman J. G. (1980). Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, 20, 847–856. [PubMed] [CrossRef] [PubMed]
Debruille J. B. Guillem F. Renault B. (1998). ERPs and chronometry of face recognition: Following-up Seeck et al. and George et al. Neuroreport, 9, 3349–3353. [PubMed] [CrossRef] [PubMed]
Deco G. Rolls E. T. (2004). A neurodynamical cortical model of visual attention and invariant object recognition. Vision Research, 44, 621–642. [PubMed] [CrossRef] [PubMed]
DeFelipe J. Hendry S. H. Hashikawa T. Molinari M. Jones E. G. (1990). A microcolumnar structure of monkey cerebral cortex revealed by immunocytochemical studies of double bouquet cell axons. Neuroscience, 37, 655–673. [PubMed] [CrossRef] [PubMed]
Douglas R. J. Martin K. A. (2004). Neuronal circuits of the neocortex. Annual Review of Neuroscience, 27, 419–451. [PubMed] [CrossRef] [PubMed]
Douglas R. J. Martin K. A. Witteridge D. (1989). A canonical microcircuit for neocortex. Neural Computation, 1, 480–488. [CrossRef]
Duhamel J. R. Colby C. L. Goldberg M. E. (1992). The updating of the representation of visual space in parietal cortex by intended eye movements. Science, 255, 90–92. [PubMed] [CrossRef] [PubMed]
Eigen M. (1971). Self-organization of matter and the evolution of biological macromolecules. Naturwissenschaften, 58, 465–523. [PubMed] [CrossRef] [PubMed]
Favorov O. V. Diamond M. E. (1990). Demonstration of discrete place-defined columns—segregates—in the cat SI. Journal of Comparative Neurology, 298, 97–112. [PubMed] [CrossRef] [PubMed]
Favorov O. V. Kelly D. G. (1994). Minicolumnar organization within somatosensory cortical segregates: II. Emergent functional properties. Cerebral Cortex, 4, 428–442. [PubMed] [CrossRef] [PubMed]
Fei-Fei L. Fergus R. Perona P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 594–611. [PubMed] [CrossRef] [PubMed]
Gauthier I. Skudlarski P. Gore J. C. Anderson A. W. (2000). Expertise for cars and birds recruits brain areas involved in face recognition. Nature Neuroscience, 3, 191–197. [PubMed] [CrossRef] [PubMed]
Gerstner W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89. [PubMed] [CrossRef] [PubMed]
Graf M. (2006). Coordinate transformations in object recognition. Psychological Bulletin, 132, 920–945. [PubMed] [CrossRef] [PubMed]
Grimes D. B. Rao R. P. (2005). Bilinear sparse coding for invariant vision. Neural Computation, 17, 47–73. [PubMed] [CrossRef] [PubMed]
Hinton G. (1981). A parallel computation that assigns canonical object-based frames of reference. Proceedings of the Seventh International Joint Conference on Artificial Intelligence, 2, 683–685.
Hubel D. H. Wiesel T. N. (1977). Ferrier lecture. Functional architecture of macaque visual cortex. Proceedings of the Royal Society of London B: Biological Sciences, 198, 1–59. [PubMed] [CrossRef]
Hung C. P. Kreiman G. Poggio T. DiCarlo J. J. (2005). Fast readout of object identity from macaque inferior temporal cortex. Science, 310, 863–866. [PubMed] [CrossRef] [PubMed]
Johnson J. S. Olshausen B. A. (2003). Timecourse of neural signatures of object recognition. Journal of Vision, 3(7):4, 499–512, http://journalofvision.org/3/7/4/, doi:10.1167/3.7.4. [PubMed] [Article] [CrossRef]
Jolicoeur P. (1985). The time to name disoriented natural objects. Memory & Cognition, 13, 289–303. [PubMed] [CrossRef] [PubMed]
Jones E. G. (2000). Microcolumns in the cerebral cortex. Proceedings of the National Academy of Sciences of the United States of America, 97, 5019–5021. [PubMed] [Article] [CrossRef] [PubMed]
Jones J. P. Palmer L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58, 1233–1258. [PubMed] [PubMed]
Kanwisher N. (2006). Neuroscience. What's in a face? Science, 311, 617–618. [PubMed] [CrossRef] [PubMed]
Kanwisher N. Yovel G. (2006). The fusiform face area: A cortical region specialized for the perception of faces. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 361, 2109–2128. [PubMed] [Article] [CrossRef]
Kirkpatrick S. Gelatt C. D. Vecchi M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680. [PubMed] [CrossRef] [PubMed]
Körner E. Gewaltig M. Körner U. Richter A. Rodemann T. (1999). A model of computation in neocortical architecture. Neural Networks, 12, 989–1005. [PubMed] [CrossRef] [PubMed]
Kree R. Zippelius A. (1988). Recognition of topological features of graphs and images in neural networks. Journal of Physics A, 21, 813–818. [CrossRef]
Kusunoki M. Goldberg M. E. (2003). The time course of perisaccadic receptive field shifts in the lateral intraparietal area of the monkey. Journal of Neurophysiology, 89, 1519–1527. [PubMed] [Article] [CrossRef] [PubMed]
Lades M. Vorbrüggen J. Buhmann J. Lange J. von der Malsburg C. Würtz R. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311. [CrossRef]
Lamme V. A. (2003). Why visual attention and awareness are different. Trends in Cognitive Sciences, 7, 12–18. [PubMed] [CrossRef] [PubMed]
Latham P. E. Richmond B. J. Nelson P. G. Nirenberg S. (2000). Intrinsic dynamics in neuronal networks. I. Theory. Journal of Neurophysiology, 83, 808–827. [PubMed] [Article] [PubMed]
Lawson R. Jolicoeur P. (1999). The effect of prior experience on recognition thresholds for plane-disoriented pictures of familiar objects. Memory & Cognition, 27, 751–758. [PubMed] [CrossRef] [PubMed]
LeCun Y. Huang F. J. Bottou L. (2004). Learning methods for generic object recognition with invariance to pose and lighting. Proceedings of CVPR 2004 (pp. 97–104). IEEE Computer Society.
Luck S. J. Chelazzi L. Hillyard S. A. Desimone R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. Journal of Neurophysiology, 77, 24–42. [PubMed] [Article] [PubMed]
Lücke J. Keck C. von der Malsburg C. (2008). Rapid convergence to feature layer correspondences. Neural Computation, 20, 2441–2463. [PubMed] [CrossRef] [PubMed]
Lücke J. von der Malsburg C. (2004). Rapid processing and unsupervised learning in a model of the cortical macrocolumn. Neural Computation, 16, 501–533. [PubMed] [CrossRef] [PubMed]
Marti D. Deco G. Giudice P. D. Mattia M. (2006). Reward-biased probabilistic decision-making: Mean-field predictions and spiking simulations. Neurocomputing, 69, 1175–1178. [CrossRef]
Martinez A. Benavente R. (1998). The AR face database. Technical Report 24, CVC.
Mel B. W. (1997). SEEMORE: Combining color, shape, and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 9, 777–804. [PubMed] [CrossRef] [PubMed]
Messer K. Kittler J. Sadeghi M. Hamouz M. Kostin A. Cardinaux F. (2004). Face authentication test on the BANCA database. Proceedings of the International Conference on Pattern Recognition, 4, 523–532.
Möller C. Lücke J. Zhu J. Faustmann P. M. von der Malsburg C. (2007). Glial cells for information routing? Cognitive Systems Research, 8, 28–35. [CrossRef]
Mountcastle V. B. (1997). The columnar organization of the neocortex. Brain, 120, 701–722. [PubMed] [Article] [CrossRef] [PubMed]
Mountcastle V. B. (2003). Introduction. Cerebral Cortex, 13, 2–4. [PubMed] [Article] [CrossRef] [PubMed]
Muresan R. C. Savin C. (2007). Resonance or integration? Self-sustained dynamics and excitability of neural microcircuits. Journal of Neurophysiology, 97, 1911–1930. [PubMed] [Article] [CrossRef] [PubMed]
Olshausen B. A. Anderson C. H. Van Essen D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13, 4700–4719. [PubMed] [Article] [PubMed]
Peters A. Sethares C. (1996). Myelinated axons and the pyramidal cell modules in monkey primary visual cortex. Journal of Comparative Neurology, 365, 232–255. [PubMed] [CrossRef] [PubMed]
Peters A. Sethares C. (1997). The organization of double bouquet cells in monkey striate cortex. Journal of Neurocytology, 26, 779–797. [PubMed] [Article] [CrossRef] [PubMed]
Peters A. Yilmaz E. (1993). Neuronal organization in area 17 of cat visual cortex. Cerebral Cortex, 3, 49–68. [PubMed] [CrossRef] [PubMed]
Phillips P. Flynn P. Scruggs T. Bowyer K. Chang J. Hoffman K. (2005). Overview of the face recognition grand challenge. IEEE Conference on Computer Vision and Pattern Recognition (pp. 947–954). IEEE.
Phillips P. Grother P. Micheals R. Blackburn D. Tabassi E. (2003). FRVT 2002 evaluation report. Technical Report 6965, NISTIR. http://www.frvt.org/.
Phillips P. Moon H. Rizvi S. Rauss P. (2000). The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1090–1104. [CrossRef]
Phillips P. J. Wechsler H. Huang J. Rauss P. J. (1998). The FERET database and evaluation procedure for face recognition algorithms. Image and Vision Computing, 16, 295–306. [CrossRef]
Pinto N. Cox D. D. Dicarlo J. J. (2008). Why is real-world visual object recognition hard? PLoS Computational Biology, 4, e27. [PubMed] [Article] [CrossRef] [PubMed]
Riesenhuber M. Poggio T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025. [PubMed]
Ringach D. L. (2002). Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88, 455–463. [PubMed] [Article] [PubMed]
Rockland K. S. Ichinohe N. (2004). Some thoughts on cortical minicolumns. Experimental Brain Research, 158, 265–277. [PubMed] [Article] [CrossRef] [PubMed]
Rosenblatt F. (1961). Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Washington, DC: Spartan Books.
Sato Y. D. Jitsev J. von der Malsburg C. (2008). A visual object recognition system invariant to scale and rotation. Proceedings of the International Conference on Artificial Neural Networks (pp. 991–1000). Heidelberg, Germany: Springer-Verlag.
Serre T. Wolf L. Bileschi S. Riesenhuber M. Poggio T. (2007). Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 411–426. [PubMed]
Summerfield C. Egner T. Greene M. Koechlin E. Mangels J. Hirsch J. (2006). Predictive codes for forthcoming perception in the frontal cortex. Science, 314, 1311–1314. [PubMed]
Tan X. Chen S. Zhou Z. H. Zhang F. (2005). Recognizing partially occluded, expression variant faces from single training image per person with SOM and soft kappa-NN ensemble. IEEE Transactions on Neural Networks, 16, 875–886. [PubMed] [CrossRef] [PubMed]
Tanaka K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19, 109–139. [PubMed] [CrossRef] [PubMed]
Tanaka K. (2003). Columns for complex visual object features in the inferotemporal cortex: Clustering of cells with similar but slightly different stimulus selectivities. Cerebral Cortex, 13, 90–99. [PubMed] [Article] [CrossRef] [PubMed]
Tarr M. J. Gauthier I. (2000). FFA: A flexible fusiform area for subordinate-level visual processing automatized by expertise. Nature Neuroscience, 3, 764–769. [PubMed] [CrossRef] [PubMed]
Thorpe S. (1988). Identification of rapidly presented images by the human visual system. Perception, 17, A77.
Thorpe S. Fize D. Marlot C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522. [PubMed] [CrossRef] [PubMed]
Tsao D. Y. Freiwald W. A. Tootell R. B. Livingstone M. S. (2006). A cortical region consisting entirely of face-selective cells. Science, 311, 670–674. [PubMed] [CrossRef] [PubMed]
van Vreeswijk C. Sompolinsky H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10, 1321–1371. [PubMed] [CrossRef] [PubMed]
Wallis G. (2001). Linear models of simple cells: Correspondence to real cell responses and space spanning properties. Spatial Vision, 14, 237–260. [PubMed] [CrossRef] [PubMed]
Walther D. Itti L. Riesenhuber M. Poggio T. Koch C. (2002). Attentional selection for object recognition—A gentle way. Proceedings of the Second International Workshop on Biologically Motivated Computer Vision (pp. 472–479). London, U.K.: Springer-Verlag.
Weber C. Wermter S. (2007). A self-organizing map of sigma-pi units. Neurocomputing, 70, 2552–2560. [CrossRef]
Wilson H. R. Cowan J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80. [PubMed] [CrossRef] [PubMed]
Wiskott L. (1999). Role of topographical constraints in face recognition. Pattern Recognition Letters, 20, 89–96. [CrossRef]
Wiskott L. Fellous J. Krüger N. von der Malsburg C. (1997). Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 775–779. [CrossRef]
Wiskott L. von der Malsburg C. (1996). Face recognition by dynamic link matching. In Sirosh J. Miikkulainen R. Choe Y. (Eds.), Lateral interactions in the cortex: Structure and function (chap. 11). ISBN: 0-9647060-0-8.
Wolfrum P. Lücke J. von der Malsburg C. (2008). Invariant face recognition in a network of cortical columns. Proceedings of International Conference on Computer Vision Theory and Applications, 2, 38–45.
Wolfrum P. von der Malsburg C. (2007a). A marker-based model for the ontogenesis of routing circuits. Artificial Neural Networks—ICANN 2007 (LNCS Vol. 4669, pp. 1–8). Heidelberg, Germany: Springer-Verlag.
Wolfrum P. von der Malsburg C. (2007b). What is the optimal architecture for visual information routing? Neural Computation, 19, 3293–3309. [PubMed] [CrossRef]
Wolfrum P. von der Malsburg C. (2008). Attentional processes in correspondence-based object recognition. Proceedings of COSYNE, 330. http://cosyne.org/cosyne08/posters/COSYNE2008_0226_poster.pdf.
Womelsdorf T. Anton-Erxleben K. Pieper F. Treue S. (2006). Dynamic shifts of visual receptive fields in cortical area MT by spatial attention. Nature Neuroscience, 9, 1156–1160. [PubMed] [CrossRef] [PubMed]
Wundrich I. J. von der Malsburg C. Würtz R. P. (2004). Image representation by complex cell responses. Neural Computation, 16, 2563–2575. [PubMed] [CrossRef] [PubMed]
Würtz R. P. (1995). Building visual correspondence maps—from neural dynamics to a face recognition system. In Moreno-Díaz R. Mira-Mira J. (Eds.), Brain processes, theories and models (pp. 420–429). Cambridge, USA: MIT Press.
Yoshimura Y. Dantzker J. L. Callaway E. M. (2005). Excitatory cortical neurons form fine-scale functional networks. Nature, 433, 868–873. [PubMed] [CrossRef] [PubMed]
Yuille A. Kersten D. (2006). Vision as Bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10, 301–308. [PubMed] [CrossRef] [PubMed]
Zhao W. Chellappa R. Phillips P. J. Rosenfeld A. (2003). Face recognition: A literature survey. ACM Computing Surveys, 35, 399–458. [CrossRef]
Zhu J. von der Malsburg C. (2004). Maplets for correspondence-based object recognition. Neural Networks, 17, 1311–1326. [PubMed] [CrossRef] [PubMed]
Figure 1
 
The visual correspondence problem is the task of linking corresponding points between two images. (A) Input and model images are represented by arrays of feature nodes (black circles). All potential correspondences are symbolized by lines between the feature nodes; the correct correspondences are indicated as black lines. (B) Examples of wrong correspondences, which nevertheless connect equal feature types.
Figure 2
 
Principle of object recognition in our system. The system has to simultaneously represent information about position and identity of the input face and its parts. Positional information is represented by dynamic links establishing correspondences between points in the input image and in the internal reference frame (“Assembly Layer”). Identity information is represented by the activity of Gallery units, different graphs storing memories of different faces. Both modalities contribute to the internal Assembly Layer, which reconstructs the visual input information.
Figure 3
 
Typical time course of the unit activities in an isolated decision column. The inputs to the K = 10 units are spread equidistantly between 0 and 1. The competition parameter ν rises from 0 to 1 during a time period of T = 400 τ. Note that the WTA behavior seen here results directly from the growth of the competition parameter ν. The internal dynamics of a column is much faster, so that with respect to the slow growth of ν, a column is always in quasi-steady state. This can be seen also in the fast rise of the unit activities from very small initial values to the significantly higher steady states.
Figure 4
 
Different representations of facial images. A rectangular grid graph (A) is used for input image representation; a face graph (B), consisting of characteristic points (landmarks), is a dedicated data structure used for internal face representation.
Figure 5
 
Architecture of our network. The gray oval structures represent columns (the vertical ones feature columns, the horizontal ones decision columns), with units shown as lighter cylinders inside. The numbers of units and columns shown here were chosen for visualization purposes only and do not correspond to the actual numbers used in this work. The Input Layer is organized in a rectangular grid (represented by the light blue lines connecting columns), while both the Assembly Layer and the Gallery Layer have face graph topology. At each landmark in the Assembly Layer there are three columns: two feature columns (the Input Assembly and the Gallery Assembly) and one control column. Input and Assembly are connected all to all (shown only for the left lowermost point in the Assembly Layer), while Assembly landmarks are connected only to the same landmarks in the Gallery, but to all identity units there (see also Figure 6). The green lines connecting the three layers and the subset of green highlighted (= activated) Gallery units represent a possible final state of the network.
Figure 6
 
Information flow in our system. Visual information in the form of Gabor jets J extracted from an input image activates the Input Layer I. It flows to the Assembly Layer (Input Assembly, IA) and from there to the Gallery G, where it activates, via receptive fields v, some memories more strongly than others. Information representing the active memories (stored in projection fields w analogous to v) flows back to the Gallery Assembly GA. Information flow from the Input Layer to the Input Assembly is modulated by the control units C, which in turn are driven by the similarity of those image patches in the Input Layer and the Gallery Assembly that they connect. By activating those control units that connect positions of the Input Layer containing information similar to the Gallery Assembly, the system effectively focuses on those parts of the input image that contain visual information most similar to the current reconstruction in the Gallery Assembly, formed by superposition of active units in the Gallery Layer. The thick black arrows represent the competition among the decision columns of which the Gallery and the control columns consist. The symbols correspond to those used in the text.
Figure 7
 
Interaction among control units to achieve a topologically consistent (i.e., continuous) mapping. The unit controlling the blue link strengthens control units in neighboring columns that represent links of similar orientation. Maximal cooperation would occur for perfectly parallel links (the green dashed axes of the gray cones). Since in reality links in the network exist only to the nodes of the input grid (full lines), the strength of cooperation (represented by the shade of green) depends on the degree of parallelism with the blue link, equivalent to the distance of a link's end point from the cone center.
Figure 8
 
The process (from top to bottom) of finding the correct mapping between the Input Layer and the Input Assembly. Each row shows the control unit activities on the left side and, on the right, first the constant input image and then an image reconstructed from the activities of the 48 landmarks of the Input Assembly. Initially, all control units have nearly identical activity, and therefore the Input Assembly receives a superposition of all input information, resulting in the same uniform image information at all landmarks (row one). As the control units develop a topologically consistent match between Input and Input Assembly (rows two and three), this image differentiates toward a normalized (i.e., shifted and deformed if necessary) version of the input image. The mapping via the control units is also visualized by the colored lines connecting the input image with the Input Assembly. Each line represents the “center of mass” of a control column, i.e., the location in the input image to which its units point as a group, weighted by their activity. The correspondence-finding process is shown in the supplementary movies.
Figure 9
 
Time course (from top to bottom) of the Gallery unit activities (left) and of the resulting image representation in the Gallery Assembly (right). The Gallery Assembly receives input from all Gallery units and thus contains an activity-weighted average of all faces in the gallery. Initially, when all Gallery units are nearly equally active, this weighted average is a true average of all gallery faces, i.e., a mean face (uppermost row). With ongoing dynamics and rising competition, the Gallery units that fit the input image better grow stronger, and the Gallery Assembly activity develops toward the respective gallery faces. Finally, only one unit per Gallery column is active, and the Gallery Assembly contains a representation of the image the system has recognized (which in most applications is not identical to the input image, cf. the input image in Figure 8). The identification process is shown in the supplementary movies.
Table 1
 
Recognition rates of our system compared to those reported in the literature. Results are given in %, with the best performance in each testing category set in bold. The first column shows the recognition rates of our system for different probe sets of the FERET and the AR databases. The following three columns (A, B, and C) show those systems evaluated in Phillips et al. (2000) and Tan et al. (2005) that performed best in at least one category. Note that these systems were mostly tested on only one of the two databases, FERET or AR.
Recognition rates [%] Our system A B C
FERET fafb 95 95 92
duplicate I 47 59
duplicate II 26 52
AR Emotion 91 95 82
Em. duplicate 61 81 82
Occlusion 73 96 81
Occ. duplicate 36 56 51