One potential strategy for overcoming this problem is to employ higher order templates that extract more complex aspects of local image structure. We know, for example, that the primate visual cortex has a hierarchical structure, in which neurons in the earliest stages behave much like Gabor filters (e.g., De Valois & De Valois,
1988; Hubel & Wiesel,
1968), whereas those farther along the ventral stream are tuned to more complex visual features and exhibit a greater degree of position invariance (e.g., Desimone, Albright, Gross, & Bruce,
1984; Tanaka, Saito, Fukada, & Moriya,
1991; see Ungerleider & Bell,
2011, for a recent review). Some of the most successful models of object recognition have been designed to mimic this type of organization and are referred to in the literature as
feature hierarchy models (e.g., Fukushima,
1980; Perrett & Oram,
1993; Wallis & Rolls,
1997). These models can be implemented with a predetermined set of higher order filters (e.g., Riesenhuber & Poggio,
1999), or they can be trained to learn a set of features from a training set (e.g., LeCun, Haffner, & Bottou,
1999; Mutch & Lowe,
2008; Serre, Oliva, & Poggio,
2007). For any given input image, the responses of these filters define a vector in a high-dimensional space, and a standard classifier, such as a support vector machine, is used to select the most appropriate response from the set of categories on which the model has been trained (e.g., houses, cars, pianos). Although these models tolerate some distortions of the input, it is not at all obvious how they could successfully cope with the wide range of structural variations over which human observers can identify configural categories.
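The filter-to-classifier pipeline described above can be illustrated with a minimal sketch. The code below is a toy example, not an implementation of any of the cited models: it builds a small bank of Gabor filters at four orientations, summarizes an image patch by its vector of absolute filter responses, and then assigns a category label. For simplicity it substitutes a nearest-centroid rule for the support vector machine stage, and the two "categories" (vertical vs. horizontal gratings), along with all parameter values (patch size, spatial frequency, envelope width), are arbitrary choices for the demonstration.

```python
import numpy as np

SIZE, FREQ, SIGMA = 9, 0.2, 3.0  # patch size, cycles/pixel, Gaussian envelope width
half = SIZE // 2
Y, X = np.mgrid[-half:half + 1, -half:half + 1]

def gabor(theta):
    """Gabor filter: a sinusoidal carrier at angle theta under a Gaussian envelope."""
    xr = X * np.cos(theta) + Y * np.sin(theta)
    envelope = np.exp(-(X**2 + Y**2) / (2 * SIGMA**2))
    return envelope * np.cos(2 * np.pi * FREQ * xr)

# A small filter bank spanning four orientations.
THETAS = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
BANK = [gabor(t) for t in THETAS]

def feature_vector(img):
    """The filter responses define a point in a (here 4-dimensional) feature space."""
    return np.array([abs((img * g).sum()) for g in BANK])

def grating(theta):
    """A sinusoidal grating patch oriented at angle theta (a stand-in stimulus)."""
    xr = X * np.cos(theta) + Y * np.sin(theta)
    return np.cos(2 * np.pi * FREQ * xr)

# One training exemplar per category; its feature vector serves as the centroid.
train = {"vertical": feature_vector(grating(0)),
         "horizontal": feature_vector(grating(np.pi / 2))}

def classify(img):
    """Nearest-centroid rule, standing in for the SVM stage of the real models."""
    f = feature_vector(img)
    return min(train, key=lambda k: np.linalg.norm(f - train[k]))

# A slightly tilted vertical grating still lands nearest the 'vertical' centroid.
print(classify(grating(0.1)))
```

The point of the sketch is the division of labor: the filter bank fixes the representation, and all category knowledge lives in the classifier over feature vectors. This is also where the limitation noted above bites, since a structural variation that moves an image far in this feature space will defeat the classifier even when human observers still see the same configural category.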