Abstract
A popular method for representing images is to compute histograms of pixel intensities, wavelet responses, or the outputs of more complex filters that are tuned to specific shapes (e.g. Riesenhuber & Poggio, 1999). Because these representations do not retain the locations of the activated filters, they may not be well suited for the analysis of configural relations among image features. The present study was designed to address this issue.
Method: We created 4 classes of objects based on the classic Vernier acuity and bisection tasks. Each object was composed of 3 irregularly shaped white dots embedded in a larger irregular black disc. Class membership was determined by whether the dots were arranged collinearly and whether they were equally spaced. Naive observers were trained to classify the stimuli in two separate conditions: one in which they were trained and tested with all possible stimulus orientations, and a second in which they were trained with one set of orientations and then tested with a different set. We also evaluated these same two conditions using a recent implementation by Mutch & Lowe (2006) of the HMAX model originally developed by Riesenhuber & Poggio (1999).
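The four classes correspond to the 2 × 2 combinations of collinear vs. non-collinear and equally vs. unequally spaced dots. A minimal sketch of that stimulus geometry, in Python with assumed names and arbitrary units (the actual stimuli used irregularly shaped dots on an irregular black disc, which this sketch omits):

```python
import math
import random

def make_stimulus(collinear, equal_spacing, jitter=0.15, rng=None):
    """Return 3 dot centres for one stimulus (hypothetical sketch).

    Classes are the 2x2 combinations of collinear/non-collinear and
    equally/unequally spaced, echoing Vernier and bisection tasks.
    """
    rng = rng or random.Random()
    # Three points on a line through the origin at a random orientation.
    angle = rng.uniform(0, 2 * math.pi)
    dx, dy = math.cos(angle), math.sin(angle)
    if equal_spacing:
        ts = [-1.0, 0.0, 1.0]                     # middle dot bisects the outer pair
    else:
        ts = [-1.0, rng.uniform(0.25, 0.6), 1.0]  # middle dot shifted along the line
    pts = [(t * dx, t * dy) for t in ts]
    if not collinear:
        # Displace the middle dot perpendicular to the line (a Vernier-style offset).
        off = jitter * rng.choice([-1, 1])
        x, y = pts[1]
        pts[1] = (x - off * dy, y + off * dx)
    return pts
```

Drawing the orientation at random for every stimulus matches the all-orientations training condition; restricting `angle` to disjoint ranges would correspond to the transfer condition.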
Results: Most subjects exceeded 85% accuracy in both conditions after ∼180 exposures to each class. Performance was much worse for the HMAX model. When trained with 300 images from each class, its average classification accuracy was only 31%, and was reduced to chance (i.e. 26%) in the transfer condition.
Conclusion: Human observers can easily learn to classify objects based on configural properties such as collinear alignment or bisection, but comparable performance cannot be achieved with histograms of higher-order features as implemented in the HMAX model. These findings identify an important limitation of representing images with histograms of filter activations that do not retain the relative spatial locations of those filters.
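The limitation can be made concrete with a toy sketch (assumed names; this is an illustration of the general point, not the actual HMAX pipeline): once filter activations are pooled into a position-free histogram, two arrangements that differ only in configuration become indistinguishable.

```python
from collections import Counter

def histogram_of_activations(image):
    """Count activated 'filters' (here: bright pixels), discarding positions."""
    return Counter(val for row in image for val in row if val)

# Two 3x3 binary images, each containing three 'dots'.
collinear = [
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]
non_collinear = [
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
]

# Different configurations, identical histograms.
print(histogram_of_activations(collinear) == histogram_of_activations(non_collinear))  # True
```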
Supported by NSF BCS-0962119.