Abstract
HMAX (Serre et al. 2005), a model of processing in the primate visual cortex, has been referred to (by its authors) as the “standard model.” HMAX extends a classical Gabor-filter model of V1 by interleaving layers that pool spatially (to achieve invariance) with layers that compute feature conjunctions, some of them learned, to build more complex features.
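To make the architecture concrete, the sketch below is a minimal Python illustration of the S/C alternation under our own assumptions (it is not the authors' implementation; all function names and parameter values here are illustrative): S1 applies a bank of Gabor filters, C1 max-pools locally for invariance, S2 matches stored C1 patches (the "learned features"), and C2 takes a global maximum over position.

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import maximum_filter

def gabor(size=11, wavelength=5.0, sigma=3.0, theta=0.0):
    """Return a size x size Gabor kernel at orientation theta (illustrative parameters)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

def s1_c1(image, n_orientations=4, pool=8):
    """S1: Gabor filtering at several orientations; C1: local max pooling + subsampling."""
    thetas = [k * np.pi / n_orientations for k in range(n_orientations)]
    s1 = np.stack([np.abs(convolve2d(image, gabor(theta=t), mode="same")) for t in thetas])
    return maximum_filter(s1, size=(1, pool, pool))[:, ::pool, ::pool]

def s2_c2(c1, patches):
    """S2: radial-basis match of C1 windows against stored patches; C2: global max per patch."""
    ph, pw = patches.shape[2], patches.shape[3]
    _, H, W = c1.shape
    c2 = np.full(patches.shape[0], -np.inf)
    for i in range(H - ph + 1):
        for j in range(W - pw + 1):
            window = c1[:, i:i + ph, j:j + pw]
            d2 = ((patches - window) ** 2).sum(axis=(1, 2, 3))
            c2 = np.maximum(c2, np.exp(-d2))  # best match anywhere in the image
    return c2

# Toy usage: "learned" patches are simply sampled from a C1 response.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
c1 = s1_c1(img)
patches = np.stack([c1[:, :4, :4], c1[:, 2:6, 2:6]])
print(s2_c2(c1, patches).shape)  # (2,)
```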
From object line drawings we produced local-feature-deleted (LFD) complementary pairs (A, B) by deleting every other vertex from one image and the alternating vertices from the other. We scrambled the contour fragments of A (by translation only) to generate A_SCR, then conducted match-to-sample trials (Is A more similar to B or to A_SCR?). Subjects invariably chose B. HMAX chose A_SCR in 95% of trials; with learned features, HMAX performed close to chance. In a separate test, we created match-to-sample trials in which the target depicted Object1, the first test image also depicted Object1 but with the complementary vertices, and the second test image depicted Object2 but matched the target in local vertex content. HMAX (with and without learned features) matched the Object2 image to the target, exactly the opposite of what humans did.
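The toy sketch below (hypothetical names and simplified geometry, not the stimulus-generation code actually used) illustrates the logic of LFD complements, translation-only scrambling, and a feature-vector match-to-sample decision, assuming each line drawing is given as a list of vertex arrays.

```python
import numpy as np

def complementary_pair(contours):
    """contours: list of (n, 2) vertex arrays from a line drawing.
    A keeps every other vertex; B keeps the alternating vertices."""
    A = [c[0::2] for c in contours]
    B = [c[1::2] for c in contours]
    return A, B

def scramble_by_translation(fragments, spread=40.0, seed=0):
    """A_SCR: translate each contour fragment by a random offset
    (no rotation or scaling), preserving local vertex content."""
    rng = np.random.default_rng(seed)
    return [f + rng.uniform(-spread, spread, size=2) for f in fragments]

def choose(target_vec, test_vec_1, test_vec_2):
    """Match-to-sample for a feature-list model: pick whichever test image
    yields the nearer feature vector (e.g., C2-like codes) to the target."""
    d1 = np.linalg.norm(target_vec - test_vec_1)
    d2 = np.linalg.norm(target_vec - test_vec_2)
    return 1 if d1 <= d2 else 2
```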
HMAX fails on these tests because it perceives an object only as a list of features. Parts-based structural descriptions (SDs) can explain these results because LFD complements contain sufficient information for the same parts to be extracted. Recent single-unit studies in IT support SDs: Yamane et al. (2006) reported IT neurons tuned to individual parts and to the relations between parts. We describe ways to revise feature-hierarchy models such as HMAX to yield a model of human performance that more closely accommodates both our psychophysical results and the neural data.