We quantified the differences between the V1 normalization schemes by testing how well the respective V2 units could be used to perform perceptual classification tasks. We found that the two-stage architectures based on V1 models with surround normalization perform substantially better than those without surround in two object recognition tasks (Figures 6 and 7). This result confirms a previous finding by Jarrett et al. (2009), but our approach based on image statistics provides further insight into why this is the case. In our framework, surround normalization allows V1 neurons to accurately represent natural images by discarding global, uninformative image structure, such as contrast; in more technical terms, normalization amounts to Bayesian inference in the MGSM generative model of images.
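Schematically, and purely as an illustration of the form such inference takes (this generic expression is not the exact one used in our model), the output of a surround-normalized unit resembles the standard divisive form

\[
  R_c \;\approx\; \frac{x_c}{\sqrt{\sigma^2 + \sum_{j \in \mathrm{surround}} w_j \, x_j^2}} ,
\]

where x_c is the linear response of the center filter, the sum runs over the surround normalization pool, and σ and the weights w_j are constants fixed by the model; roughly speaking, in the MGSM the contribution of the surround to the denominator is itself weighted by the inferred probability that center and surround share a common mixer.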
The better performance achieved by models that include normalization can thus be explained by the reduced V1 output correlations (Figure 4; Tripp [2012] reported similar results for orientation discrimination with noisy neurons) and by the fact that features learned downstream without supervision are more informative about object identity (Figures 6 and 7 show that only the first few, high-variance V2 PCs are informative).
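For concreteness, the kind of readout we have in mind can be sketched as follows; this is an illustrative sketch only, not our exact implementation, and the variable names, the number of components, and the choice of a linear SVM are placeholder assumptions:

    # Illustrative readout: unsupervised PCA on V2-like responses, followed by a
    # supervised linear classifier trained on the leading, high-variance components.
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    def readout_accuracy(v2_train, y_train, v2_test, y_test, n_pcs=20):
        """v2_*: (n_images, n_v2_units) response matrices; y_*: object-class labels."""
        clf = make_pipeline(
            StandardScaler(),          # center and scale each V2 unit
            PCA(n_components=n_pcs),   # keep only the first, high-variance PCs
            LinearSVC(),               # the only stage trained with supervision
        )
        clf.fit(v2_train, y_train)
        return clf.score(v2_test, y_test)  # fraction of test images correctly classified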
Shan and Cottrell (2008) used a similar approach and likewise noticed that informative features could be learned in a two-stage system with a nonlinearity that produced Gaussianization of the V1 outputs; they further suggested that the improved classification performance can be partly explained by the expansion of dimensionality in the features represented by the second stage (a sort of “kernel trick”). We note also that the object recognition performance levels we reported fall short of the state of the art (e.g., Boureau, Bach, LeCun, & Ponce, 2010; Coates, Lee, & Ng, 2011; Jarrett et al., 2009; Pinto, Cox, & DiCarlo, 2008).
A number of known factors are likely to play a role, such as using a finer sampling of orientations and spatial scales (Pinto et al., 2008) and positions (Coates et al., 2011), using sparse features (Boureau et al., 2010), and using max rather than average pooling (Jarrett et al., 2009); moreover, in our implementation, there are no free parameters to be learned with supervision (other than the classifier).
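To make the last contrast concrete, the two pooling rules can be sketched in a few lines; the map size and window below are arbitrary choices for illustration:

    import numpy as np

    def pool(responses, window=4, mode="max"):
        """Pool a (height, width) map of rectified responses over non-overlapping windows."""
        h, w = responses.shape
        blocks = responses[:h - h % window, :w - w % window]
        blocks = blocks.reshape(h // window, window, w // window, window)
        if mode == "max":
            return blocks.max(axis=(1, 3))   # keep the strongest response in each window
        return blocks.mean(axis=(1, 3))      # average pooling

    resp = np.abs(np.random.randn(16, 16))  # stand-in for a rectified filter-response map
    pooled_max = pool(resp, mode="max")     # 4 x 4 map of local maxima
    pooled_avg = pool(resp, mode="avg")     # 4 x 4 map of local averages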
However, the maximal performance we obtain is similar to that reported using implementations whose details are more comparable with ours (e.g., in Jarrett et al. [2009], a two-stage system initialized with Gabor filters at four orientations remains well below 60% on Caltech101).