The finding that the best-performing layer of pretrained AlexNet was convolutional layer 1 suggests that much simpler, shallow models have the potential to account well for the response properties of V1 neurons. Whereas AlexNet acquires its front-end filters through supervised training on object classification, another approach is to model V1 responses with hand-engineered filters, which allow one to choose a more evenly distributed or complete set of basis filters.
We were therefore motivated to perform a parallel set of analyses on this V1 dataset using a Gabor-based pyramid model (
Kay, Naselaris, Prenger, & Gallant, 2008;
Lee, 1996) to test the predictive performance of a base model and evaluate whether additional nonlinear operations are necessary to account for V1 responses to complex natural images. Our Gabor pyramid consisted of an array of simulated simple cell and complex cell responses that spanned four spatial scales, eight orientations, and four spatial phases (see Materials and Methods). This array of simulated units was used to calculate response patterns to each image, after which regularized regression was used to predict the responses of individual V1 neurons. Our evaluation of the base V1 Gabor model revealed that it predicted V1 responses quite well and actually exhibited a marginally significant advantage when compared with conv_1 of pretrained AlexNet (mean
r = 0.4946 vs. 0.4887),
t(165) = 1.92,
p = 0.057.
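The pipeline described above (Gabor filter bank, simulated simple and complex cells, then regularized regression) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the wavelengths, the envelope width, the single filter position per image, the use of ridge regression as the regularizer, and all function names (`gabor`, `build_bank`, `pyramid_features`, `ridge_fit`) are hypothetical choices made for the example. The images and responses are random placeholders.

```python
import numpy as np

def gabor(size, wavelength, theta, phase):
    """Zero-mean, unit-norm 2D Gabor filter of shape (size, size)."""
    half = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size] - half
    xr = x * np.cos(theta) + y * np.sin(theta)
    sigma = 0.5 * wavelength  # assumed envelope width, not from the paper
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    g = g * np.cos(2 * np.pi * xr / wavelength + phase)
    g -= g.mean()
    return g / np.linalg.norm(g)

def build_bank(size, wavelengths=(4, 8, 16, 32), n_orient=8, n_phase=4):
    """Filter bank indexed as [scale, orientation, phase, y, x]."""
    bank = np.empty((len(wavelengths), n_orient, n_phase, size, size))
    for s, wl in enumerate(wavelengths):
        for o in range(n_orient):
            for p in range(n_phase):
                bank[s, o, p] = gabor(size, wl, o * np.pi / n_orient, p * np.pi / 2)
    return bank

def pyramid_features(images, bank):
    """Simple- and complex-cell responses for images of shape (n, size, size)."""
    lin = np.einsum("nyx,sopyx->nsop", images, bank)      # linear filter outputs
    simple = np.maximum(lin, 0.0)                         # half-wave rectification
    complex_ = np.sqrt(lin[..., 0]**2 + lin[..., 1]**2)   # energy of a 0/90 degree pair
    n = images.shape[0]
    return np.concatenate([simple.reshape(n, -1), complex_.reshape(n, -1)], axis=1)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    mu_x, mu_y = X.mean(axis=0), y.mean()
    Xc = X - mu_x
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - mu_y))
    return w, mu_y - mu_x @ w

rng = np.random.default_rng(0)
images = rng.normal(size=(20, 32, 32))   # placeholder images, not real stimuli
y = rng.normal(size=20)                  # placeholder responses of one neuron
X = pyramid_features(images, build_bank(32))
w, b = ridge_fit(X, y, lam=10.0)
pred = X @ w + b
```

With four scales, eight orientations, and four phases, this yields 128 simple-cell and 32 complex-cell features per filter position. In practice the regularization strength would be chosen by cross-validation and predictive accuracy assessed on held-out images.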
Moreover, it is well documented that V1 responses increase monotonically as a function of stimulus contrast but in a nonlinear manner, as response saturation or compression occurs at higher stimulus contrasts (
Albrecht & Geisler, 1991;
Boynton, Demb, Glover, & Heeger, 1999;
Ohzawa, Sclar, & Freeman, 1985;
Sclar, Maunsell, & Lennie, 1990;
Skottun et al., 1991;
Tong, Harrison, Dewey, & Kamitani, 2012). Thus, one of the simplest forms of nonlinearity that can be applied to a modeled V1 response, after half-wave rectification, is some type of contrast saturation effect. We mimicked the effects of contrast saturation by applying an exponentiation function to the output response of modeled simple cell and complex cell units, using exponent values ranging from 0.1 to 1.0 with increments of 0.1. This analysis revealed that compression of the Gabor-based responses with an exponent value of 0.6 (i.e., close to a square root function) led to a quantitatively modest but statistically significant improvement in the prediction of V1 responses when compared to the base model with no compression (mean
r of 0.4979 vs. 0.4946, respectively),
t(165) = 3.94,
p = 1.22 × 10⁻⁴.
Moreover, this version of the Gabor model performed significantly better than conv_1 of pretrained AlexNet,
t(165) = 3.13,
p = 0.002. Although our Gabor model with contrast saturation did not quite reach the predictive accuracy of the best-performing layer of VGG-19 or modified AlexNet, the CNN advantage was small, as the differences in mean
r fell below a value of 0.02 (
Figure 7).
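The exponent sweep described above can be illustrated with a toy example. The sketch below applies a compressive power law to non-negative model responses and scans exponents from 0.1 to 1.0 in steps of 0.1. The synthetic "neuron" with a known exponent of 0.3 and the use of a plain correlation as the selection score are assumptions made for illustration only. In the study itself, each exponent would be evaluated by refitting the regression model and measuring prediction accuracy.

```python
import numpy as np

def compress(responses, exponent):
    """Apply a compressive power law to non-negative model responses."""
    return np.power(responses, exponent)

exponents = np.linspace(0.1, 1.0, 10)   # 0.1, 0.2, ..., 1.0

x = np.linspace(0.0, 1.0, 50)           # toy rectified model output
y = x ** 0.3                            # synthetic target with known compression

# Score each exponent by how well the compressed output tracks the target.
scores = [np.corrcoef(compress(x, e), y)[0, 1] for e in exponents]
best = exponents[int(np.argmax(scores))]
```

An exponent of 1.0 leaves the responses unchanged (the base model), whereas values near 0.5 approximate a square-root nonlinearity.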
The effects of both cross-orientation inhibition and orientation-tuned surround suppression have been extensively studied in V1 (
Bair, Cavanaugh, & Movshon, 2003;
Bonds, 1989;
Busse, Wade, & Carandini, 2009;
Cavanaugh, Bair, & Movshon, 2002;
DeAngelis, Freeman, & Ohzawa, 1994;
DeAngelis, Robson, Ohzawa, & Freeman, 1992;
Jones, Grieve, Wang, & Sillito, 2001;
Morrone, Burr, & Maffei, 1982;
Nurminen, Merlin, Bijanzadeh, Federer, & Angelucci, 2018;
Poltoratski & Tong, 2020;
Vinje & Gallant, 2002), and are commonly implemented in computational models by assuming divisive normalization (
Carandini et al., 1997;
Heeger, 1992b). Divisive normalization is believed to reflect a canonical neural computation that occurs in most brain areas, in which the feedforward response of each excitatory neuron is divisively modified by the summed activity of its neighbors, presumably as a consequence of some form of local inhibition (
Carandini & Heeger, 2012). Although cross-orientation inhibition and surround suppression can be readily demonstrated by using artificial stimuli that are specifically tailored to test for these effects (e.g., sinewave gratings), it is unclear whether such modulatory interactions would be readily detectable when evaluating V1 responses to a large diverse set of natural (and synthetic) images.
For our first set of analyses, we constructed V1 Gabor models with different types of divisive normalization. One model mimicked cross-orientation inhibition by applying normalization across all oriented units of a given spatial scale in a location-specific manner. Another model mimicked the effects of surround suppression by applying normalization across spatially neighboring units (using a 2D Gaussian window) that shared a common orientation and spatial frequency preference, and a third model applied spatial normalization without selectivity for orientation. For each model, a constant additive term in the denominator was allowed to vary as a free parameter to modify the strength of divisive normalization (see Materials and Methods).
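The three normalization variants can be sketched for a single spatial scale as follows. This is an illustrative sketch, not the authors' code: pooling by the mean rather than the sum, the Gaussian width `sigma`, the zero-padded borders, and the names `spatial_pool` and `normalize` are all assumptions. The constant `c` plays the role of the additive term in the denominator, so larger values weaken the normalization.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def spatial_pool(maps, sigma):
    """Blur the last two axes with a separable Gaussian (zero-padded edges)."""
    k = gaussian_kernel(sigma)
    blur = lambda v: np.convolve(v, k, mode="same")
    out = np.apply_along_axis(blur, -1, maps)
    return np.apply_along_axis(blur, -2, out)

def normalize(resp, c, mode, sigma=1.0):
    """Divisively normalize resp, a non-negative array (n_orient, H, W)."""
    if mode == "cross_orientation":
        # Pool over all orientations at the same location.
        pool = resp.mean(axis=0, keepdims=True)
    elif mode == "tuned_surround":
        # Pool over spatial neighbors that share the unit's orientation.
        pool = spatial_pool(resp, sigma)
    elif mode == "untuned_surround":
        # Pool over spatial neighbors of every orientation.
        pool = spatial_pool(resp.mean(axis=0, keepdims=True), sigma)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return resp / (c + pool)

rng = np.random.default_rng(0)
resp = rng.random((8, 16, 16))   # placeholder responses: 8 orientations, 16 x 16 grid
modes = ("cross_orientation", "tuned_surround", "untuned_surround")
normalized = {m: normalize(resp, c=0.1, mode=m) for m in modes}
```

As `c` grows large relative to the pooled activity, the output approaches a scaled copy of the input, which is one way to see how the free parameter controls normalization strength.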
In comparison to the base model (mean
r = 0.4946), we found that normalization led to significantly better performance for the cross-orientation inhibition model (mean
r = 0.4976),
t(165) = 4.32,
p = 2.64 × 10⁻⁵;
the orientation-tuned surround suppression model (mean
r = 0.5002),
t(165) = 7.39,
p = 7.12 × 10⁻¹²;
and the non-selective surround suppression model (mean
r = 0.4992),
t(165) = 6.33,
p = 2.21 × 10⁻⁹.
Although the above findings provide tentative evidence that normalization may affect V1 neuronal responses to complex images, it is important to consider whether normalization would further improve the prediction of V1 responses after the nonlinear effect of contrast saturation is taken into consideration. We therefore implemented the contrast saturation effect with an exponent of 0.6 and then performed divisive normalization with the models described above. In comparison with the contrast saturation model without normalization (mean
r = 0.4979), the model with both contrast saturation and cross-orientation inhibition did not show significant gains in performance (mean
r = 0.4977),
t(165) = 0.58,
p = 0.56. Likewise, the model with contrast saturation and orientation-tuned surround suppression did not outperform the model with contrast saturation alone (mean
r = 0.4982),
t(165) = 0.2869,
p = 0.77. However, we did observe significant improvement in V1 predictions for the contrast-saturated model with non-selective surround suppression (mean
r = 0.4991),
t(165) = 2.17,
p = 0.03, when compared to the contrast saturation model without normalization. One interpretation of these findings is that normalization is less influential if one presupposes the existence of a mechanism to mediate contrast saturation in V1 neurons. However, an alternative interpretation is that some form of normalization (e.g., cross-orientation inhibition) is responsible for causing contrast saturation in V1, as has been suggested by prior research (
Heeger, 1992b).
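The reported statistics have 165 degrees of freedom, which is consistent with a paired comparison across 166 neurons. Assuming that the tests are paired t-tests on per-neuron prediction accuracies (an assumption; the source does not spell this out here, and a Fisher z-transform may or may not have been applied), the computation can be sketched as below. The accuracy values are synthetic placeholders and `paired_t` is a hypothetical helper name.

```python
import math
import numpy as np

def paired_t(r_a, r_b):
    """Paired t statistic and degrees of freedom for per-neuron accuracies."""
    d = np.asarray(r_a, dtype=float) - np.asarray(r_b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / math.sqrt(n))
    return t, n - 1

rng = np.random.default_rng(0)
r_base = rng.uniform(0.2, 0.8, size=166)             # synthetic, not real data
r_new = r_base + rng.normal(0.003, 0.01, size=166)   # small, fairly consistent gain
t, df = paired_t(r_new, r_base)
# A two-sided p value would follow from the t distribution with df degrees
# of freedom, e.g. 2 * scipy.stats.t.sf(abs(t), df) if SciPy is available.
```

Because the test operates on within-neuron differences, a very small change in the mean can still be statistically reliable when it is consistent across neurons, which helps explain how differences on the order of 0.003 in mean r can reach significance.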
Our analyses of these V1 responses using multiple variants of a Gabor pyramid model provide positive evidence of additional nonlinear computations that take place prior to or within V1, which remain detectable in neural responses to complex natural images (
Coen-Cagli, Kohn, & Schwartz, 2015;
Vinje & Gallant, 2002). It is also worth noting that, although the potential contributions of divisive normalization appear quite modest in this study of V1 responses to natural and complex images, much more powerful effects have been documented in studies that rely on artificial stimuli (e.g., gratings) and tailored experimental conditions to test for the effects of cross-orientation inhibition and surround suppression (e.g.,
Busse et al., 2009;
Cavanaugh et al., 2002). Thus, whether it is best to investigate the response properties of V1 using natural images or artificial stimuli may depend on the nature of the neuroscientific question to be tested (
Felsen & Dan, 2005;
Kay et al., 2008;
Rust & Movshon, 2005).
Although our best-performing Gabor model with orientation-tuned surround suppression could predict V1 responses reasonably well (mean r = 0.5002), it was still outperformed by VGG-19 (r = 0.5158) and modified AlexNet (r = 0.5167) by a modest but statistically significant margin. Given that these V1 Gabor models are arguably much simpler, more parsimonious, and more readily interpretable, how should these factors be weighed against the more complex and less readily interpretable CNN models? The model of choice may depend on the goals of the study or the preferred theoretical framework of the researcher. From our own perspective, we suspect that the simpler Gabor-based model with additional normalization mechanisms may account for a majority of V1 tuning properties, whereas a subset of V1 neurons may have more complex receptive fields that arise from a longer sequence of nonlinear operations. Whether it might be possible to adjudicate between best-fitting models at the single-neuron level remains a potentially interesting and challenging question for future studies to explore.