Abstract
Neurally inspired approaches to object recognition often employ a hierarchy of feature-detecting neurons that become increasingly shape-specific and spatially invariant across alternating processing stages (Fukushima, 1982); the approach is loosely patterned after the progression of receptive field properties along the primate ventral visual processing stream. A key question in the design of such systems is how the parameters of the feature-extraction hierarchy should be set, namely (1) which lower-order features should be bound together into higher-order combinations to increase shape selectivity at each stage, and (2) what kind and degree of spatial pooling should be carried out at each stage to increase spatial invariance. Several previous workers have explored “trace”-based learning rules, in which the spatio-temporal contiguity of visual features acts as a cue for learning higher-order invariant features (e.g., Földiák, 1991; Wallis & Rolls, 1997; Wiskott & Sejnowski, 2002). However, a system capable of a difficult, fully viewpoint-invariant 3D object recognition task involving a large set of self-similar objects has yet to emerge. We are developing a hierarchical feature-learning architecture based on a set of nested self-organizing maps (SOMs), each biased to learn either a binding or a pooling operation at its stage. Our training set contains multiple instances of chairs; features are extracted from interactive 3D virtual-reality movies. The network has so far been developed and tested through the level of V1-like simple and complex cells; development of the V2, V4, and IT layers is ongoing.