Open Access
Article | June 2024
Convolutional neural network models applied to neuronal responses in macaque V1 reveal limited nonlinear processing
Author Affiliations
  • Hui-Yuan Miao
    Department of Psychology, Vanderbilt University, Nashville, TN, USA
    huiyuan.miao@vanderbilt.edu
  • Frank Tong
    Department of Psychology, Vanderbilt University, Nashville, TN, USA
    Vanderbilt Vision Research Center, Vanderbilt University, Nashville, TN, USA
    frank.tong@vanderbilt.edu
Journal of Vision June 2024, Vol. 24, 1. doi: https://doi.org/10.1167/jov.24.6.1
Abstract

Computational models of the primary visual cortex (V1) have suggested that V1 neurons behave like Gabor filters followed by simple nonlinearities. However, recent work employing convolutional neural network (CNN) models has suggested that V1 relies on far more nonlinear computations than previously thought. Specifically, unit responses in an intermediate layer of VGG-19 were found to best predict macaque V1 responses to thousands of natural and synthetic images. Here, we evaluated the hypothesis that the poor performance of lower layer units in VGG-19 might be attributable to their small receptive field size rather than to their lack of complexity per se. We compared VGG-19 with AlexNet, which has much larger receptive fields in its lower layers. Whereas the best-performing layer of VGG-19 occurred after seven nonlinear steps, the first convolutional layer of AlexNet best predicted V1 responses. Although the predictive accuracy of VGG-19 was somewhat better than that of standard AlexNet, we found that a modified version of AlexNet could match the performance of VGG-19 after only a few nonlinear computations. Control analyses revealed that decreasing the size of the input images caused the best-performing layer of VGG-19 to shift to a lower layer, consistent with the hypothesis that the relationship between image size and receptive field size can strongly affect model performance. We conducted additional analyses using a Gabor pyramid model to test for nonlinear contributions of normalization and contrast saturation. Overall, our findings suggest that the feedforward responses of V1 neurons can be well explained by assuming only a few nonlinear processing stages.

Introduction
The primary visual cortex is arguably the best understood cortical area and has served as a critical testbed for developing neurocomputational models of visual processing. Since the pioneering work of Hubel and Wiesel (1962), researchers have sought to characterize and understand the tuning properties of V1 simple cells and complex cells (Carandini et al., 2005; Mechler & Ringach, 2002; Priebe, 2016; Priebe, Mechler, Carandini, & Ferster, 2004; Ringach, Shapley, & Hawken, 2002). Simple cells respond to preferred orientations in a position-specific manner, whereas complex cells are believed to pool information from multiple simple cells that share a common orientation preference. Early conceptual models of V1 neuronal tuning (Hubel & Wiesel, 1962) helped set the foundation for subsequent computational models. Jones and Palmer (1987) modeled simple cell responses by using two-dimensional (2D) Gabor filters to account for a neuron's orientation, spatial frequency, and phase tuning preferences. Likewise, complex cells can be modeled by squaring and summing the outputs from a quadrature pair of linear filters to obtain phase-invariant responses to stimuli (Adelson & Bergen, 1985). Subsequent work has shown that the response properties of simple and complex cells vary along a continuum (Mechler & Ringach, 2002; Ringach et al., 2002), and that the outputs from multiple linear filters can be combined to obtain better predictions of V1 responses (Rust, Schwartz, Movshon, & Simoncelli, 2005; Vintch, Movshon, & Simoncelli, 2015). For example, a top-performing convolutional subunit model consisted of multiple spatially shifted copies of a single linear filter, followed by nonlinear rectification, weighted pooling of these subunit responses, and a final nonlinearity applied to the pooled response (Vintch et al., 2015). A key property of these models is that they require only a few nonlinear processing stages to account for V1 neuronal responses.
By contrast, a recent study by Cadena et al. (2019) suggested that V1 neurons perform far more nonlinear computations than previously expected. The authors used the layer-wise unit activity of a deep convolutional neural network (CNN), VGG-19 (Simonyan & Zisserman, 2015), to predict neuronal responses in macaque V1 to thousands of natural and synthetic images (see Figure 1). VGG-19 performs convolutional filtering followed by nonlinear rectification in each convolutional layer and max-pooling operations every few layers to learn effective representations for object classification. VGG-19 was found to outperform traditional V1 models, and, more surprisingly, the best performance occurred not in the lower layers of the CNN but rather in an intermediate layer (conv3_1) following five convolutional operations and two max-pooling operations. These findings led the authors to conclude that a large number of nonlinear operations are required to fully capture the response properties of V1 neurons. If true, this would imply that our understanding of the neural computations performed within area V1 was far off the mark and that a major reconceptualization is required.
Figure 1. Examples of the stimuli used in this study. The leftmost column shows examples of two natural images. Columns two to five show synthetic images derived from the natural images to evoke similar responses in the lower layers of VGG-19, with increasing correspondence across multiple layers shown from left to right.
However, given the complexity of deep CNN models, it is not always obvious why one CNN layer performs better at predicting neural responses in a particular visual area when compared to another layer (Guclu & van Gerven, 2015; Khaligh-Razavi & Kriegeskorte, 2014; Schrimpf et al., 2020; Yamins et al., 2014). The visual representations that are learned by the convolutional layers of a CNN are constrained not only by the number of nonlinear operations that occur by a given stage of processing but also by model architecture parameters, such as filter size and stride, that determine the effective receptive field size of the units in each layer. In particular, VGG-19 relies on small convolutional filters (3 × 3 pixels) that are sampled by the subsequent layers with a stride of 1, leading to very small receptive fields in the lower layers (see Table 1). We hypothesized that the mismatch between the image size used to predict V1 responses (40 × 40 pixels) and the receptive field size of units in the lower layers of VGG-19 may have caused systematic biases in model performance to favor higher layers with larger receptive fields, as fewer CNN units would then be needed to predict or account for the response properties of a V1 neuron.
Table 1. The architecture of standard VGG-19. The table shows the output dimensionality of each CNN layer and the associated kernel size, stride length, and receptive field size in pixel units.
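The receptive field sizes in Table 1 follow from a standard recurrence: each layer grows the receptive field by (k − 1) × j pixels, where k is its kernel size and j is the cumulative stride (jump) of its input. A minimal sketch of this arithmetic (the layer specifications follow the standard VGG-19 and AlexNet architectures; the function name is ours):

```python
# Effective receptive field (RF) growth across stacked conv/pool layers,
# using the recurrence RF_out = RF_in + (k - 1) * jump, jump_out = jump_in * s.

def receptive_fields(layers):
    rf, jump = 1, 1
    out = []
    for name, k, s in layers:
        rf = rf + (k - 1) * jump   # each layer widens the RF by (k - 1) * jump pixels
        jump = jump * s            # stride compounds across layers
        out.append((name, rf))
    return out

vgg19_front = [("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
               ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
               ("conv3_1", 3, 1)]
alexnet_front = [("conv1", 11, 4), ("pool1", 3, 2), ("conv2", 5, 1)]

for name, rf in receptive_fields(vgg19_front) + receptive_fields(alexnet_front):
    print(f"{name}: {rf} x {rf} px")
```

Running this recurrence reproduces the values cited in the text, e.g., 24 × 24 pixels for conv3_1 of VGG-19 and 51 × 51 pixels for conv2 of AlexNet.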
To address these concerns, we compared the neural predictivity of VGG-19 with AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), which has much larger receptive fields in its first few layers. Surprisingly, we found that the best predictions of V1 responses arose from the first convolutional layer of a standard pretrained version of AlexNet, in sharp contrast to VGG-19. Because the predictive accuracy of pretrained AlexNet was slightly lower than the best-performing layer of VGG-19, we trained and tested a modified version of AlexNet that was designed to provide more feature maps in the lower layers to match the large number of feature representations available in conv3_1 of VGG-19. These analyses revealed that the first pooling layer of modified AlexNet performed just as well as VGG-19 at predicting V1 responses on an image-by-image basis. In a series of control analyses, we further showed that the best-performing layer of VGG-19 is highly dependent on the size of the input images used to predict V1 responses. 
Given that pretrained AlexNet exhibited the highest neural predictivity in the first convolutional layer, which involves only a single stage of filtering and nonlinear rectification, we were motivated to compare such CNN performance with that of a simple V1 Gabor pyramid model with hand-engineered rather than learned filters. We further generated and tested multiple versions of the V1 Gabor pyramid model to test for multiple effects of divisive normalization, including cross-orientation inhibition and surround suppression (Carandini, Heeger, & Movshon, 1997). These analyses revealed a small but statistically significant benefit of incorporating normalization into these simpler V1 models. Although the Gabor V1 models performed slightly though significantly worse than the best CNN models we tested (i.e., conv3_1 of VGG-19 and pool_1 of modified AlexNet), these computational models are considerably simpler and could be argued to provide a more parsimonious account of V1 neuronal tuning. Taken together, our findings demonstrate that, although V1 processing is not strictly linear, only a small number of nonlinear computations are needed to account for the stimulus-driven response properties of the primary visual cortex. 
Materials and methods
Neurophysiology dataset
The V1 neuronal data and the visual stimuli used in the original study were obtained from a publicly available data repository posted by Cadena et al. (2019). Below, we provide a brief description of their experimental design and methods; further details can be found in the original paper. 
Two awake-behaving monkeys were presented with a set of 7250 complex images while activity from multiple isolated V1 neurons was recorded using a 32-channel linear silicon probe. The monkeys were trained to fixate on a small dot presented in the center of the display throughout each trial. Each trial consisted of a randomized sequence of 29 images, each shown for 60 ms. The images were centered over the population receptive field of the recorded neurons. A total of 250 trials were required to present all 7250 images in a pseudorandomized order while activity from multiple isolated units was recorded. The authors recorded responses from a total of 307 neurons over 23 recording sessions. Recordings of suitable quality were collected from 166 neurons with two or four independent observations obtained for all 7250 images; these data were used for the main analyses. 
The image set consisted of 1450 natural images selected from the ImageNet database (Russakovsky et al., 2015) and four different sets of synthetic textures (four synthetic textures for each individual natural image) that were designed to evoke activation patterns in different convolutional layers of VGG-19 that closely resembled those elicited by the original natural images (Gatys, Ecker, & Bethge, 2015). The natural images were first transformed into 256 × 256 grayscale images and then used to generate the synthetic images. Then, for all the natural and synthetic images, a circular crop was applied (140-pixel diameter, corresponding to 2° of visual angle), followed by a contrast-attenuated windowing procedure so that the image gradually faded beyond a diameter of 1° of visual angle (see Figure 1). 
Image preprocessing
For our study, the following image preprocessing steps were performed prior to analysis by the CNNs. For the images shown to the monkeys, the central 80 × 80-pixel region of the 140 × 140-pixel images was cropped and then rescaled to 40 × 40 pixels using bicubic interpolation to match the input size used by Cadena et al. (2019). Bicubic interpolation was also used for all other image resizing procedures in this project. All grayscale images were z-normalized, based on the mean and standard deviation of the pixel intensity values across all 7250 images, before being fed to the models. The processed images were converted to RGB format, and zero padding was applied if required by the CNN architecture.
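As an illustration, a minimal sketch of this preprocessing pipeline might look as follows (the crop indices, function names, and use of NumPy/Pillow are our assumptions, not details of the published code):

```python
# Sketch of the preprocessing described above: central crop, bicubic rescale,
# z-normalization across the image set, and grayscale-to-RGB conversion.
import numpy as np
from PIL import Image

def preprocess(images_140px):
    """images_140px: iterable of 140 x 140 grayscale arrays."""
    out = []
    for img in images_140px:
        img = np.asarray(img, dtype=np.float32)[30:110, 30:110]     # central 80 x 80 crop
        img = Image.fromarray(img).resize((40, 40), Image.BICUBIC)  # bicubic rescale
        out.append(np.asarray(img, dtype=np.float32))
    out = np.stack(out)
    out = (out - out.mean()) / out.std()        # z-normalize across all images
    return np.repeat(out[..., None], 3, axis=-1)  # replicate channel: grayscale -> RGB
```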
Building V1 models with CNN features
We modeled the V1 dataset using two different approaches. In our first pipeline, we evaluated how well CNNs could predict V1 neuronal responses using linear regression, which was implemented using custom-written code in MATLAB (MathWorks, Natick, MA). For the second pipeline, we adopted the computational approach of Cadena et al. (2019), which relied on TensorFlow and stochastic gradient descent to fit a generalized linear model that assumed a Poisson noise distribution. Both pipelines relied on L1 regularization (Tibshirani, 1996) to select a sparser set of predictive units, which substantially improved model performance. Both pipelines led to very similar results and comparable levels of predictive performance. Here, we report the results of the second approach to facilitate the comparison of findings across studies. 
To model V1 neuronal responses, we presented the entire set of 7250 images to the CNNs to extract the activation patterns from each CNN layer. Specifically, activity patterns were obtained from each convolutional layer after rectified linear unit (ReLU) processing was applied, as well as from each pooling layer. Note that batch normalization was applied such that the activity patterns obtained from each feature map were adjusted to have zero mean and unit variance across all images within a batch (batch size of 256 images). This procedure ensured that L1 regularization would apply the same level of penalty to all unit response predictors within a feature map. The normalized unit responses to each image within a layer were then converted into a vector, and the response patterns to all 256 images were assembled to form a 2D matrix for the iterative regression analysis. We used 80% of the images for model training (training set) and reserved the remaining 20% of the images as the independent test set, using the same images for training and test as Cadena et al. (2019). The 2D matrix of activation patterns for each batch of images was used to construct or refine a generalized linear model to predict each V1 neuron's response to all images in that batch. In Equation 1 below, X represents the 2D matrix of unit responses to a batch of images, w is a vector that specifies how each CNN unit will be weighted to predict the V1 neuron's response to the images, and b is the bias term. To guarantee that the model predictions were non-negative and to maintain consistency with the approach of Cadena et al. (2019), we applied Equation 2 to the output of Equation 1. The resulting output, y_adjusted, consists of a vector of predicted neuronal responses. Note that L1 regularization was applied to improve model performance and stability with a lambda of 0.03 (determined via cross-validation using training data only). Unlike Cadena et al. (2019), we chose not to apply smoothness or group sparsity regularization, as these factors had a negligible impact on the goodness of fit of the models, as was also noted by the previous authors.
\begin{equation} y = wX + b \tag{1} \end{equation}

\begin{equation} y_{adjusted} = \begin{cases} e^{(y - 1)} & \text{if } y \le 1 \\ y & \text{if } y > 1 \end{cases} \tag{2} \end{equation}
The models and regression analyses were implemented in TensorFlow (Abadi et al., 2016) following the approach of Cadena et al. (2019), and all model training was identical to their study unless otherwise stated. All models were optimized using the Adam optimization algorithm (Kingma & Ba, 2015) with a learning rate of 10⁻⁴ and a batch size of 256. An early stopping technique was used to determine when to cease model training. To implement the early stopping process, we used 80% of the images within the training set to tune the model parameter estimates and the remaining portion of the training data to evaluate model performance every 100 training steps. If the performance did not improve for 10 consecutive evaluations, the learning rate was reduced to 1/3 of its current value, and this procedure was repeated three times.
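To make the fitting procedure concrete, here is a minimal sketch of the per-neuron readout described by Equations 1 and 2, trained with a Poisson negative log-likelihood and an L1 penalty. The original pipeline used TensorFlow; this PyTorch version is a simplified illustration that omits the early stopping schedule, and all names are ours:

```python
# Sketch of a Poisson generalized linear readout from CNN activations to one
# neuron's responses, with L1 regularization (Equations 1-2).
import torch

def fit_neuron(X_train, y_train, lam=0.03, lr=1e-4, steps=5000):
    # X_train: (n_images, n_units) layer activations; y_train: (n_images,) spike counts
    X = torch.as_tensor(X_train, dtype=torch.float32)
    y = torch.as_tensor(y_train, dtype=torch.float32)
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(steps):
        yhat = X @ w + b                                          # Equation 1
        yhat = torch.where(yhat <= 1, torch.exp(yhat - 1), yhat)  # Equation 2
        # Poisson negative log-likelihood plus L1 penalty on the readout weights
        loss = (yhat - y * torch.log(yhat + 1e-8)).mean() + lam * w.abs().sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach(), b.detach()
```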
Model evaluation
After the regression weights were calculated using the training data, we evaluated the predictive accuracy of each CNN layer by calculating the Pearson correlation coefficient between the predicted and actual responses of individual neurons across all test images. Actual responses were based on the average number of spikes that were observed across different repetitions of a given test image for an isolated neuron. For model comparison, Fisher z-transformation was applied to the Pearson correlation scores obtained for each of the 166 neurons, after which we applied paired t-tests to determine which CNN models performed better at predicting neuronal responses. 
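A sketch of this evaluation procedure, assuming per-neuron arrays of predicted and actual test responses (function and variable names are ours):

```python
# Per-neuron Pearson correlation on the test set, Fisher z-transform, then a
# paired t-test across neurons to compare two models.
import numpy as np
from scipy import stats

def compare_models(pred_a, pred_b, actual):
    # pred_a, pred_b, actual: (n_neurons, n_test_images) arrays
    r_a = np.array([stats.pearsonr(p, a)[0] for p, a in zip(pred_a, actual)])
    r_b = np.array([stats.pearsonr(p, a)[0] for p, a in zip(pred_b, actual)])
    z_a, z_b = np.arctanh(r_a), np.arctanh(r_b)  # Fisher z-transform
    t, p = stats.ttest_rel(z_a, z_b)             # paired t-test over neurons
    return r_a.mean(), r_b.mean(), t, p
```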
Noise ceiling
Given that each image was presented only a few times to each monkey (two or four trials per image), it would be challenging to obtain reliable estimates of the lower and upper bounds of the noise ceiling. Nevertheless, we thought it would be informative to compute these estimates as they might provide some benchmark as to how well the best models are performing. We focused on analyzing the neurons with four measurements per image, as the neurons with only two observations per image would lead to an enormous gap between the lower and upper noise ceiling estimates. For the lower bound, we adopted a leave-one-trial-out approach and used the averaged response of the remaining three trials to predict the response observed for the left-out trial; this was performed on each of the four trials. This analysis revealed a lower bound of mean r = 0.3944. For the upper bound, we calculated the correlation between the mean response to an image across all four trials and the response observed on each individual trial. This analysis revealed an upper bound of mean r = 0.6811. We then compared the performance of our CNN models with the lower bound of the noise ceiling. Note that these correlation values are somewhat lower than those reported in the Results, as the CNN model was being used to predict responses to images on individual trials. We found that the predictive accuracy of the best layer of VGG-19 and modified AlexNet was slightly but significantly lower than the lower bound of the noise ceiling: for VGG-19, r = 0.3791, t(114) = 2.45, p = 0.02; for modified AlexNet, r = 0.3809, t(114) = 2.28, p = 0.02. Thus, although CNN-based models performed very well at accounting for these V1 responses, there may still be some room for further improvement.
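For concreteness, the two bounds can be computed as follows for a single neuron with four trials per image (a simplified sketch; the names are ours):

```python
# Noise-ceiling estimates: leave-one-trial-out lower bound, and the correlation
# between each trial and the across-trial mean as the upper bound.
import numpy as np

def noise_ceiling(responses):
    """responses: (n_images, 4) array of spike counts per trial."""
    lower, upper = [], []
    for t in range(4):
        held_out = responses[:, t]
        rest = np.delete(responses, t, axis=1).mean(axis=1)   # mean of other 3 trials
        lower.append(np.corrcoef(rest, held_out)[0, 1])
        upper.append(np.corrcoef(responses.mean(axis=1), held_out)[0, 1])
    return np.mean(lower), np.mean(upper)
```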
Comparisons between AlexNet and VGG-19
We first compared the performance of VGG-19, using the same model version as Cadena et al. (2019), with that of AlexNet by employing the pretrained version available in PyTorch (Paszke et al., 2019). The layer that provided the best predictions of V1 neuronal responses was identified for each model, and performance accuracy (i.e., Pearson correlation) was then compared across the two models. The best-performing layer of VGG-19, conv3_1, consisted of 256 feature maps where each unit had an effective receptive field size of 24 pixels, which corresponded to 0.69° of visual angle in the neurophysiological experiments. The large number of feature maps available at conv3_1 could potentially provide an undue advantage for regression-based models to fit the responses of individual V1 neurons. Thus, in addition to our evaluation of a pretrained AlexNet model, we evaluated the performance of a modified AlexNet architecture that incorporated additional convolutional layers (each followed by a ReLU operation) to support a larger number of feature representations in the lower layers of the CNN (see Table 2). We also made modest adjustments to the filter sizes of this model so that the receptive field size would increase more gradually after layer 1. We trained modified AlexNet on RGB images (224 × 224 × 3 pixels) using the 1000 categories of images from ImageNet (ImageNet Large Scale Visual Recognition Challenge 2012 [ILSVRC2012]). Modified AlexNet was trained using stochastic gradient descent over a period of 100 epochs with an initial learning rate of 0.01 (decreased by a factor of 10 every 30 epochs), a batch size of 256, a weight decay of 0.0005, and a momentum of 0.9. Modified AlexNet reached a top-1 accuracy of 0.51. 
Table 2. Comparison of the architectures of standard AlexNet and modified AlexNet.
A similar training regime was used for the modified VGG-19 models, which had increased stride parameter values in the earliest one or two convolutional layers so that unit receptive fields would be considerably larger in the lower layers (see Table 3). More training epochs were required to train these deeper networks to achieve near-asymptotic levels of classification accuracy. Modified VGG-19 models were trained using stochastic gradient descent over a period of 180 epochs with an initial learning rate of 0.01 (decreased by a factor of 10 every 60 epochs), a batch size of 256, a weight decay of 0.0001, and a momentum of 0.9. Version 1 of modified VGG-19 attained a top-1 accuracy of 0.54, and version 2 attained a top-1 accuracy of 0.57.
Table 3. Architecture of the modified VGG-19 version 1 and version 2.
Evaluation of a V1 Gabor pyramid model
We also evaluated the performance of a Gabor-based V1 model, both to serve as a baseline comparison with the more complex CNN models and to assess whether V1 responses to complex natural images show evidence of divisive normalization (Carandini & Heeger, 2012). The Gabor wavelet pyramid (Lee, 1996) consisted of Gabor filters that occurred at spatial frequencies of 1, 2, 4, or 8 cycles per field of view (40 × 40 pixels) with a specified bandwidth of 1 octave. At each spatial frequency f, the Gabor filters were positioned on a 2f × 2f grid. At each grid location, there was a set of Gabor filters with eight possible orientations (0°, 22.5°, …, 157.5°) and four spatial phases (0°, 90°, 180°, and 270°). The filters were truncated by setting any filter values of less than 1% of the peak value to 0. All filters were normalized to have zero mean and unit length. We simulated simple cell responses by applying half-wave rectification (or ReLU) to the responses of the Gabor units for all four spatial phases. To simulate complex cell responses, the outputs from Gabor filters with phase values of 0° and 90° were squared, summed, and then square-rooted. The responses of all units were rescaled by the (near-)maximum possible response that they might produce if presented with a square-wave grating at the optimal spatial frequency, orientation, and spatial phase.
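A minimal sketch of how the simulated simple and complex cell responses could be computed from such a filter bank (the data layout and names are our assumptions):

```python
# Simple cells: half-wave rectified linear filter outputs at four phases.
# Complex cells: energy model over the 0-degree/90-degree quadrature pair.
import numpy as np

def unit_responses(image, gabors):
    # gabors: dict mapping phase (0, 90, 180, 270) -> filter bank of shape (n_filters, H, W)
    lin = {ph: np.tensordot(bank, image, axes=([1, 2], [0, 1]))
           for ph, bank in gabors.items()}                     # linear filter outputs
    simple = {ph: np.maximum(r, 0) for ph, r in lin.items()}   # half-wave rectification
    complex_ = np.sqrt(lin[0] ** 2 + lin[90] ** 2)             # square, sum, square-root
    return simple, complex_
```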
In comparison with the basic Gabor pyramid model, we evaluated whether the incorporation of contrast saturation (or contrast response nonlinearity) would lead to better prediction of V1 responses by applying an exponentiation function:  
\begin{equation} x_{saturated} = x^k \tag{3} \end{equation}
Given the response of a simulated unit, x, a contrast-saturated response can be generated by applying an exponent factor, k, using parameter values less than 1. We evaluated the performance of this contrast-saturated Gabor model using exponent values ranging from 0.1 to 1.0 (with 0.1 increments) and found that an exponent value of 0.6 led to the highest level of performance. 
In addition, we simulated the effects of divisive normalization to see whether the performance of the Gabor-based V1 model could be further improved. Three types of divisive normalization were simulated (Equation 4): one with spatially restricted divisive normalization (mimicking cross-orientation inhibition), one with orientation-tuned normalization from the surround, and one with non-selective normalization from the surround.  
\begin{equation} x_{normalized} = \frac{x}{c + \sum_{ij} W_{ij} X_{ij}} \tag{4} \end{equation}
Given the response of a simulated unit, x, we could calculate a normalized response by incorporating the linearly weighted sum of the responses of that unit and its relevant neighbors in the denominator of Equation 4. Here, X_{ij} represents the matrix of responses from all units in the normalization pool, W_{ij} represents the weight assigned to each of these responses, and c is a single parameter that controls the strength of divisive normalization. In all cases, normalization was performed separately on simple cell responses and complex cell responses.
For spatially restricted normalization, the normalization pool consisted of units with any orientation preference that shared the same preferences for spatial location and spatial frequency. We tested a range of values for parameter c ranging from 0.01 to 2.0 and found that a value of 0.25 led to the best model performance. 
We evaluated two different models of surround suppression; both relied on a 2D Gaussian function to weigh the responses of neighboring units in a spatially graded manner that declined as a function of distance. For the orientation-tuned surround suppression model, the normalization pool consisted of units that shared the same orientation and spatial frequency preferences. By contrast, the non-selective surround suppression model incorporated the responses of neighboring units with any orientation preference as long as they were tuned to the same spatial frequency. The spatial size of the normalization pool was scaled relative to the receptive field size of the simulated unit. Specifically, the standard deviation of the 2D Gaussian used to generate the linear weighting function (Wij) was obtained by incorporating the standard deviation of the receptive field of the Gabor unit and then scaling that value by a multiplicative factor (α). We evaluated a range of values of α (from 0.1 to 1.5) to identify those that led to the best model performance (α = 0.75 for orientation-selective surround model, α = 0.75 for non-selective surround model). Likewise, we varied the parameter value for c (ranging from 0.05 to 1.0) to find those that led to the best performance for the orientation-tuned surround suppression model (c = 0.25) and the non-selective surround suppression model (c = 0.1). 
Finally, we evaluated the impact of applying divisive normalization after first incorporating contrast saturation (using an exponent of 0.6) to determine whether the performance of the Gabor-based V1 model could be further improved. We used a similar approach as described above to identify parameter values for α and c that led to the best model performance. Because of the nonlinear impact of contrast saturation on the simulated responses, we expanded the range of c (from 0.05 to 2.5) and the range of α (from 0.05 to 1.5) to identify the parameter settings that led to the best performance. 
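The following sketch illustrates the order of operations for the combined model: contrast saturation (Equation 3) followed by divisive normalization (Equation 4) with a non-selective Gaussian surround pool. It simplifies the full models described above (e.g., it ignores the per-unit scaling of the surround width), and all names are ours:

```python
# Contrast saturation followed by non-selective divisive normalization.
import numpy as np
from scipy.signal import convolve2d

def saturate(x, k=0.6):
    return x ** k  # Equation 3: compressive contrast-response nonlinearity

def normalize_nonselective(x_maps, surround, c=0.1):
    # x_maps: (n_orientations, H, W) unit responses at one spatial frequency
    # surround: (h, w) 2D Gaussian weighting profile, playing the role of W_ij
    pooled = sum(convolve2d(m, surround, mode="same") for m in x_maps)
    return x_maps / (c + pooled)  # Equation 4, pooling over all orientations

# Example order of operations for the combined model:
# responses = normalize_nonselective(saturate(responses), surround, c=0.1)
```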
Code accessibility
Following the publication of this article, the code used in the reported analyses will be made publicly available at https://github.com/Huiyuan-Miao/V1-Nonlinear
Results
We compared the layer-wise performance of VGG-19 and AlexNet in terms of their ability to predict V1 neuronal responses to a large set of complex grayscale images (see Materials and Methods). The full image set consisted of 1450 natural images and an additional 5800 synthetic images that were generated to mimic the low-level properties of the natural images; every image was presented twice to one monkey and four times to a second monkey. 
We sought to determine whether AlexNet would necessarily require multiple nonlinear processing steps to attain peak predictive performance, as was the case for VGG-19. The standard AlexNet architecture relies on much larger filters in layer 1 (11 × 11 pixels) and a stride value of 4 to sample from the input images; these factors lead to much larger receptive field sizes in the first few convolutional layers of AlexNet (Table 2) when compared to VGG-19 (Table 1). 
We first replicated the main findings of Cadena et al. (2019) by confirming that the activation map of conv3_1 (the fifth convolutional layer) in VGG-19 provided the best predictions of V1 neuronal responses (Figure 2A). For simplicity of comparison, we adopted the same regression-based modeling approach of Cadena et al. (2019) by using unit activations in each layer as predictors of the response of a V1 neuron to a set of training images (80% of the data), with L1 regularization to estimate the weights for a sparse set of predictor units obtained from each layer. The trained model was then used to predict the response of the neuron to an independent set of test images (20% of the data). 
Figure 2. Layer-wise model predictions of V1 responses for standard versions of VGG-19 (A) and AlexNet (B). Plots show the mean correlation between the predicted and actual responses of 166 neurons, using the patterns of unit activity from individual CNN layers as regressors. Best predictive performance emerges at a much later processing stage for VGG-19 (conv3_1) when compared to AlexNet (conv_1). Error bars indicate ±1 SEM.
Of potential relevance, the receptive field size of the units in conv3_1 (24 × 24 pixels) spanned approximately half of the width of the input images (40 × 40 pixels) that were originally used to calculate CNN model responses, whereas prior stages of convolutional processing consisted of units with comparatively smaller receptive fields ranging from 3 to 16 pixels in width. It is worth noting that the input images used to generate VGG-19 model responses would have spanned only 1.14° of visual angle in the macaque V1 recording study, which recorded neuronal activity from parafoveal regions of V1 at eccentricities ranging from 1.4° to 3.0° of visual angle. 
In sharp contrast with the layer-wise performance of VGG-19, we found that the best predictive performance of AlexNet emerged in the first convolutional layer (Figure 2B), with performance declining in a monotonic fashion for each processing stage thereafter. These findings indicate that AlexNet requires far fewer nonlinear processing stages to obtain its best possible fit of V1 neuronal responses. However, it was still the case that the best-performing layer of VGG-19 showed significantly higher predictive accuracy than the best-performing layer of AlexNet (mean r of 0.5158 vs. 0.4887, respectively, averaged across 166 V1 neurons), t(165) = 10.09, p = 6.02 × 10⁻¹⁹. Thus, it remained possible that the many additional nonlinear computations performed by VGG-19 were still required to attain this higher level of predictive accuracy.
We suspected that the poorer performance of the standard pretrained version of AlexNet might be attributable to a few factors. One consideration was that the first convolutional layer of AlexNet had receptive fields that were considerably smaller than the input images used for model evaluation (about 1/4 in size; see Table 2), whereas the receptive fields for convolutional layer 2 (51 × 51 pixels) and all subsequent layers exceeded the size of the input images. We wondered whether the best possible predictive performance might be easier to attain if the CNN receptive fields are only modestly smaller than that of the input images, leading to a so-called Goldilocks zone. A second consideration was the fact that conv3_1 of VGG-19 has a very large number of feature channels (i.e., 256 channels) in comparison to conv1 of AlexNet (i.e., 64 channels), which might have conferred an advantage to VGG-19 by providing a larger number of predictors or basis functions to account for more subtle aspects of the receptive field structure of a V1 neuron. 
We were therefore motivated to construct a modified architecture for AlexNet, which involved inserting an additional convolutional layer (with ReLU) before the first pooling layer and increasing the number of feature channels in conv_2 and subsequent layers (see Table 2 for a complete description of our modifications). This CNN model was then trained to classify the 1000 object categories from ImageNet (Russakovsky et al., 2015) until classification accuracy reached an asymptotic level of performance after 100 epochs (see Materials and Methods for details). 
We performed the same analyses on this modified version of AlexNet and found that V1 predictivity improved across convolutional layers 1 and 2, peaking at the first max-pooling layer (r = 0.5167), after which performance steadily declined (Figure 3A). A direct comparison of the predictive accuracy of pool_1 of modified AlexNet with conv3_1 of VGG-19 (Figure 3B) indicated that the two CNN models did not reliably differ in their performance, t(165) = 0.680, p = 0.497. These findings rule out the possibility that V1 neuronal responses can only be adequately explained by considering a much more complex set of nonlinear computations, such as those performed by the level of conv3_1 of VGG-19. Instead, our findings indicate that the number of nonlinear steps required to account for V1 responses is not likely to be as large as that suggested by Cadena and colleagues (2019).
Figure 3. Model performance of modified AlexNet. (A) Mean correlation between predicted and actual V1 neuronal responses is highest at the first pooling layer (pool_1) and thereafter decreases monotonically. (B) Comparison of neural predictivity for the best-performing layer of VGG-19 and modified AlexNet. Each plot symbol indicates the prediction accuracy for a single V1 neuron. The performance of the pool_1 layer of modified AlexNet is comparable to that of VGG-19 conv3_1.
If the V1 predictivity of a CNN layer depends at least partly on the relationship between CNN receptive field size and the size of the input images used to generate the responses of the model, then one would expect that manipulations of input size should lead to systematic shifts in the layer that can yield the best performance. Specifically, smaller input images should lead to better performance in the lower or earlier layers of a CNN, whereas larger input images should lead to better performance in higher layers. 
To test this hypothesis, we evaluated the performance of VGG-19 using input images that were scaled to be smaller (20 × 20 pixels) or larger (80 × 80 pixels) than the input size originally used for model evaluation (i.e., 40 × 40 pixels). As can be seen in Figure 4, this analysis confirmed that a smaller input size caused the best performance to shift to a lower layer of VGG-19 (i.e., pool_2 or the sixth layer), whereas the larger input size biased performance to favor a much higher layer (i.e., conv4_1 or the 12th layer). These findings demonstrate how non-neural factors (e.g., choice of input image size), distinct from the complexity or number of computational operations used for modeling, can impact the ability of a CNN model to predict neuronal responses. 
Figure 4. Model performance of VGG-19 when tested with different input image sizes. Images were scaled to pixel dimensions of 20 × 20, 40 × 40 (original analysis), or 80 × 80 pixels. An increase in input image size caused the best performance of VGG-19 to shift to higher layers.
As a further test of the idea that receptive field size can strongly affect which layer of a CNN can attain the best predictive accuracy, we constructed two modified versions of VGG-19 that had larger receptive fields in their lower layers. Specifically, these models sampled from convolutional layer 1 or layers 1 and 2 with increased stride length (see Table 3). After the models were trained on ImageNet object classification, their layer-wise performance was evaluated. Figure 5 shows that the best predictive accuracy was obtained at a much earlier processing stage than the standard pretrained version of VGG-19, as both modified VGG-19 models appeared to peak between the processing stages of conv1_1 and conv2_1, much earlier than was observed for the original version of VGG-19 that we evaluated. Although these models did not attain the level of neural predictivity of the original VGG-19 model, their within-network patterns of performance provide further evidence in favor of the interpretation that only a small number of nonlinear computations are required to account for V1 neuronal responses. 
Figure 5. Predictive performance of the modified VGG-19 models. The mean correlation between predicted and actual V1 responses peaked in the third convolutional layer (conv2_1) for modified VGG-19 version 1 and also version 2.
We conducted a final set of analyses to determine whether the V1 predictive performance of VGG-19 was inherently biased in a manner that could not be explained by the possibility that this model had somehow acquired highly complex yet appropriately tuned representations in its intermediate layers. Using the standard VGG-19 architecture, we randomly initialized the weights of every layer across 10 iterations, calculating the predictive accuracy of individual layers on each occasion. For these control networks, the highest predictive accuracy emerged in pool_2, the sixth layer of VGG-19, just prior to conv3_1 (Figure 6A, solid red curve). Thus, VGG-19 models that lack any knowledge of natural image structure show the same patterns of bias. Our findings involving V1 neural predictivity can be contrasted with a prior study that evaluated CNN models of neural responses to complex objects, for which a variety of control networks with randomized weights (including VGG-16) showed very similar levels of predictive performance from layer 1 onward (Storrs, Kietzmann, Walther, Mehrer, & Kriegeskorte, 2021). We performed an additional control analysis to evaluate the potential contributions of nonlinear complexity; this was done by removing every ReLU operation from VGG-19 except for the final one preceding the layer being analyzed. (Max-pooling operations were retained so that the change in receptive field size across layers would match standard VGG-19.) Despite the greatly reduced nonlinearity of these control networks, the same pattern of bias emerged across layers, with peak V1 predictivity occurring in pool_2 (Figure 6A, dashed red curve). When the same set of control analyses was performed on standard AlexNet (Figure 6B), we found that predictive performance was quite stable over the first three layers, peaking at pool_1, and was generally lower from layer 4 onward. We consider this bias in favor of lower over intermediate layers to be preferable if one seeks to identify simpler trained models that ultimately perform well. For modified AlexNet (Figure 6C), we observed very stable levels of V1 predictive performance across the first several layers, with suggestions of a small peak at pool_1. From these analyses, we can conclude that the architecture of standard VGG-19 and its application to V1 data are inherently biased to favor better neural predictivity in the intermediate layers. Thus, the layer-wise predictive performance of trained VGG-19 should be interpreted with caution (i.e., Figure 2A), as the improvement in V1 predictivity across successive layers most likely benefits from the biases demonstrated here.
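For reference, such control networks can be constructed with a few lines of torchvision code: instantiate VGG-19 with random weights, then replace every ReLU except the one preceding the analyzed layer with an identity operation (a sketch; the index bookkeeping is ours):

```python
# Randomly initialized VGG-19 with all but one ReLU removed.
import torch.nn as nn
from torchvision.models import vgg19

def build_control(keep_relu_index):
    model = vgg19(weights=None)  # random initialization, no ImageNet training
    relu_positions = [i for i, m in enumerate(model.features) if isinstance(m, nn.ReLU)]
    for i in relu_positions:
        if i != keep_relu_index:   # keep only the final ReLU before the analyzed layer
            model.features[i] = nn.Identity()
    return model
```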
Figure 6. Control analyses showing V1 predictive performance using CNN models with randomly initialized weights. Predictive accuracy is shown for standard VGG-19 (A), standard AlexNet (B), and modified AlexNet (C). Solid curves depict correlation values averaged over 10 randomly initialized versions of each model, and dashed curves show the same analysis performed on modified models that perform only a single ReLU operation. Error bars, which are very small, indicate ±1 SEM based on variability across the 10 model iterations.
Evaluation of V1 Gabor pyramid models with nonlinearities
The fact that the best-performing layer of pretrained AlexNet was its first convolutional layer suggests that much simpler, shallower models should have the potential to perform well at accounting for the response properties of V1 neurons. Whereas AlexNet acquires its front-end filters through supervised training on object classification, another approach is to use hand-engineered filters to model V1 responses, as one might be able to choose a more evenly distributed or complete set of basis filters.
We were therefore motivated to perform a parallel set of analyses on this V1 dataset using a Gabor-based pyramid model (Kay, Naselaris, Prenger, & Gallant, 2008; Lee, 1996) to test the predictive performance of a base model and evaluate whether additional nonlinear operations are necessary to account for V1 responses to complex natural images. Our Gabor pyramid consisted of an array of simulated simple cell and complex cell responses that spanned four spatial scales, eight orientations, and four spatial phases (see Materials and Methods). This array of simulated units was used to calculate response patterns to each image, after which regularized regression was used to predict the responses of individual V1 neurons. Our evaluation of the base V1 Gabor model revealed that it predicted V1 responses quite well and actually exhibited a marginally significant advantage when compared with conv_1 of pretrained AlexNet (mean r = 0.4946 vs. 0.4887), t(165) = 1.92, p = 0.057. 
Moreover, it is well documented that V1 responses increase monotonically as a function of stimulus contrast but in a nonlinear manner, as response saturation or compression occurs at higher stimulus contrasts (Albrecht & Geisler, 1991; Boynton, Demb, Glover, & Heeger, 1999; Ohzawa, Sclar, & Freeman, 1985; Sclar, Maunsell, & Lennie, 1990; Skottun et al., 1991; Tong, Harrison, Dewey, & Kamitani, 2012). Thus, one of the simplest forms of nonlinearity that can be applied to a modeled V1 response, after half-wave rectification, is some type of contrast saturation effect. We mimicked the effects of contrast saturation by applying an exponentiation function to the output response of modeled simple cell and complex cell units, using exponent values ranging from 0.1 to 1.0 with increments of 0.1. This analysis revealed that compression of the Gabor-based responses with an exponent value of 0.6 (i.e., close to a square root function) led to a quantitatively modest but statistically significant improvement in the prediction of V1 responses when compared to the base model with no compression (mean r of 0.4979 vs. 0.4946, respectively), t(165) = 3.94, p = 1.22 × 10⁻⁴. Moreover, this version of the Gabor model performed significantly better than conv_1 of pretrained AlexNet, t(165) = 3.13, p = 0.002. Although our Gabor model with contrast saturation did not quite reach the predictive accuracy of the best-performing layer of VGG-19 or modified AlexNet, the advantage of these CNN models was quite modest, with differences in mean r falling below 0.02 (Figure 7).
Figure 7. Comparison between CNN-based V1 models and Gabor-based V1 models. Bar plots show the predictive accuracy of the conv3_1 layer of VGG-19, pool_1 layer of modified AlexNet, and multiple versions of the Gabor-based V1 model.
The effects of both cross-orientation inhibition and orientation-tuned surround suppression have been extensively studied in V1 (Bair, Cavanaugh, & Movshon, 2003; Bonds, 1989; Busse, Wade, & Carandini, 2009; Cavanaugh, Bair, & Movshon, 2002; Deangelis, Freeman, & Ohzawa, 1994; Deangelis, Robson, Ohzawa, & Freeman, 1992; Jones, Grieve, Wang, & Sillito, 2001; Morrone, Burr, & Maffei, 1982; Nurminen, Merlin, Bijanzadeh, Federer, & Angelucci, 2018; Poltoratski & Tong, 2020; Vinje & Gallant, 2002), and are commonly implemented in computational models by assuming divisive normalization (Carandini et al., 1997; Heeger, 1992b). Divisive normalization is believed to reflect a canonical neural computation that occurs in most brain areas, in which the feedforward response of each excitatory neuron is divisively modified by the summed activity of its neighbors, presumably as a consequence of some form of local inhibition (Carandini & Heeger, 2012). Although cross-orientation inhibition and surround suppression can be readily demonstrated by using artificial stimuli that are specifically tailored to test for these effects (e.g., sinewave gratings), it is unclear whether such modulatory interactions would be readily detectable when evaluating V1 responses to a large diverse set of natural (and synthetic) images. 
For our first set of analyses, we constructed V1 Gabor models with different types of divisive normalization. One model mimicked cross-orientation inhibition by applying normalization across all oriented units of a given spatial scale in a location-specific manner. Another model mimicked the effects of surround suppression by applying normalization across spatially neighboring units (using a 2D Gaussian window) that shared a common orientation and spatial frequency preference, and a third model applied spatial normalization without selectivity for orientation. For each model, a constant additive term in the denominator was allowed to vary as a free parameter to modify the strength of divisive normalization (see Materials and Methods). 
In comparison to the base model (mean r = 0.4946), we found that normalization led to significantly better performance for the cross-orientation inhibition model (mean r = 0.4976), t(165) = 4.32, p = 2.64 × 10⁻⁵; the orientation-tuned surround suppression model (mean r = 0.5002), t(165) = 7.39, p = 7.12 × 10⁻¹²; and the non-selective surround suppression model (mean r = 0.4992), t(165) = 6.33, p = 2.21 × 10⁻⁹. Although the above findings provide tentative evidence that normalization may affect V1 neuronal responses to complex images, it is important to consider whether normalization would further improve the prediction of V1 responses after the nonlinear effect of contrast saturation is taken into consideration. We therefore implemented the contrast saturation effect with an exponent of 0.6 and then performed divisive normalization with the models described above. In comparison with the contrast saturation model without normalization (mean r = 0.4979), the model with both contrast saturation and cross-orientation inhibition did not show significant gains in performance (mean r = 0.4977), t(165) = 0.58, p = 0.56. Likewise, the model with contrast saturation and orientation-tuned surround suppression did not outperform the model with contrast saturation alone (mean r = 0.4982), t(165) = 0.2869, p = 0.77. However, we did observe significant improvement in V1 predictions for the contrast-saturated model with non-selective surround suppression (mean r = 0.4991), t(165) = 2.17, p = 0.03, when compared to the contrast saturation model without normalization. One interpretation of these findings is that normalization is less influential if one presupposes the existence of a mechanism to mediate contrast saturation in V1 neurons. However, an alternative interpretation is that some form of normalization (e.g., cross-orientation inhibition) is responsible for causing contrast saturation in V1, as has been suggested by prior research (Heeger, 1992b).
Our analyses of these V1 responses using multiple variants of a Gabor pyramid model provide positive evidence of additional nonlinear computations that take place prior to or within V1, which remain detectable in neural responses to complex natural images (Coen-Cagli, Kohn, & Schwartz, 2015; Vinje & Gallant, 2002). It is also worth noting that, although the potential contributions of divisive normalization appear quite modest in this study of V1 responses to natural and complex images, much more powerful effects have been documented in studies that rely on artificial stimuli (e.g., gratings) and tailored experimental conditions to test for the effects of cross-orientation inhibition and surround suppression (e.g., Busse et al., 2009; Cavanaugh et al., 2002). Thus, whether it is best to investigate the response properties of V1 using natural images or artificial stimuli may depend on the nature of the neuroscientific question to be tested (Felsen & Dan, 2005; Kay et al., 2008; Rust & Movshon, 2005). 
Although our best-performing Gabor model with orientation-tuned surround suppression could predict V1 responses suitably well (mean r = 0.5002), it was still outperformed by VGG-19 (r = 0.5158) and modified AlexNet (r = 0.5167) by a modest but statistically significant margin. Given that these V1 Gabor models are arguably much simpler, more parsimonious and more readily interpretable, how should these factors be weighted in comparison with the more complex and less readily interpretable CNN models? The preferred model of choice may depend on the goals of the study or the preferred theoretical framework of the researcher. From our own perspective, we suspect that the simpler Gabor-based model with additional normalization mechanisms may account for a majority of V1 tuning properties, whereas a subset of V1 neurons may have more complex receptive fields that arise from a longer sequence of nonlinear operations. Whether it might be possible to adjudicate between best-fitting models at the single-neuron level remains a potentially interesting and challenging question for future studies to explore. 
Discussion
Convolutional neural networks have become increasingly influential in neuroscientific research, as they currently provide the best models for predicting neural responses to complex natural images in both lower and higher visual areas (Bankson, Hebart, Groen, & Baker, 2018; Bashivan, Kar, & DiCarlo, 2019; Cadena et al., 2019; Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016; Guclu & van Gerven, 2015; Horikawa & Kamitani, 2017; Jang, McCormack, & Tong, 2021; Kar, Kubilius, Schmidt, Issa, & DiCarlo, 2019; Khaligh-Razavi & Kriegeskorte, 2014; Kietzmann et al., 2019; Nonaka, Majima, Aoki, & Kamitani, 2021; Schrimpf et al., 2020; Tong & Jang, 2021; Xu & Vaziri-Pashkam, 2021; Yamins & DiCarlo, 2016; Yamins et al., 2014). Understanding how these CNN models attain such high levels of neural predictivity, especially when compared to simpler models, is important for appreciating the neurally relevant computations performed by CNNs and for understanding what other factors might account for their gains in performance. Here, we investigated the claims of a recent study that reported obtaining the best predictions of V1 neuronal responses by using the unit activity patterns from an intermediate layer of VGG-19 after several nonlinear operations were performed on the initial input (Cadena et al., 2019). 
Although it is conceivable that stimulus-driven processing in V1 might involve numerous nonlinear computations, we hypothesized that units in the lower layers of VGG-19 may have led to poorer predictions of V1 neuronal responses because of their small receptive field size when compared to the input images used for model evaluation. We obtained multiple lines of evidence to support this view. First, we evaluated AlexNet, a well-known CNN with much larger convolutional filters in the early layers. The effective receptive field size of AlexNet rapidly increases over the first few layers, making it well suited to adjudicate this question. We found that AlexNet achieved its best predictive performance by the first convolutional layer. Although layer conv3_1 of VGG-19 performed somewhat better than this version of AlexNet, we found that a modified version of AlexNet was capable of matching the predictive performance of VGG-19, with the best-performing layer emerging in the first pooling layer after only two prior stages of convolutional processing. If one considers each layer of a CNN as performing a single set of nonlinear operations, then the difference in model complexity between pool_1 of modified AlexNet and conv3_1 of VGG-19 is substantial, akin to comparing the flexibility of a third-order polynomial with that of a seventh-order polynomial. Based on Occam's razor, if two computational models perform comparably well at accounting for a neuronal dataset, one should favor the simpler model over the more complex one. For these reasons, our findings provide compelling evidence that, although V1 is not strictly linear, the number of nonlinear computations required to explain stimulus-driven V1 responses is far smaller than recently claimed by Cadena et al. (2019), consistent with other prominent V1 models (Heeger, 1992a; Jones & Palmer, 1987; Rust et al., 2005; Vintch et al., 2015).
To better understand how CNN model performance may depend on the relationship between input stimulus size and the effective receptive field size of the units in a given layer used for neural prediction, we performed a series of follow-up analyses. In one control analysis, we found that smaller input images caused the best predictions to shift to lower layers of VGG-19 whereas larger input images caused a shift toward higher layers. Note that, although Cadena et al. (2019) performed a similar control analysis, their image size manipulations by a factor of 1.5 were too modest to lead to significant shifts in performance accuracy. 
These findings demonstrate that the relationship between CNN receptive field size and input image size can directly impact how well CNNs can predict neuronal responses. Presumably, if too many units must be considered to account for the full spatial extent of the receptive field of a V1 neuron, the resulting model fits may become less stable or less accurate. In related work, Marques, Schrimpf, & DiCarlo (2021) found that the considerations of stimulus size and presumed visual angle could also impact the estimated similarity between CNN unit responses and V1 responses in a separate neurophysiological study. 
In another control analysis, we constructed modified versions of VGG-19 with increased stride values in the lower layers, so that larger receptive fields would be acquired following supervised training on object classification. These modified versions of VGG-19 attained their highest level of V1 predictivity at a much earlier processing stage than the original VGG-19 model (see Figure 5). Thus, manipulations of CNN receptive field size and stimulus input size can cause systematic shifts in the CNN layer that best predicts neuronal responses in early visual areas, with the best performance typically observed in a CNN layer whose receptive fields are modestly smaller than the input images used to evoke model responses. Finally, one might ask why the original pretrained VGG-19 model is biased to attain peak neural predictivity at a much later processing stage than AlexNet or other models such as modified VGG-19. Is it simply because the CNN receptive fields in the lower layers are too small, and thus mismatched with the input images used for model evaluation, or might there be something intrinsically special about the complex nonlinear representations that VGG-19 has learned from training on large sets of natural images? Recent work has shown that training CNNs on images of objects embedded in randomized noise, or on purposefully generated adversarial noise, can in some cases lead to superior performance at predicting neural responses (Jang et al., 2021; Kong, Margalit, Gardner, & Norcia, 2022). Moreover, augmented training with blurry object images can lead to wide-ranging improvements in neural predictivity, even for out-of-distribution viewing conditions (Jang & Tong, 2024). Thus, the choice of training images can have a major impact on the visual representations learned by a CNN model and on its ability to predict behavioral and neural responses. Although the problem space of possible image training regimes is extremely large and arguably unbounded, here we asked a simpler question: Are the complex learned representations of VGG-19 responsible for the prominent bias we observed, whereby higher predictive accuracy emerges in the intermediate layers? To address this question, we evaluated the predictive performance of multiple randomly initialized versions of VGG-19. These untrained models, which lacked any knowledge of natural image statistics, nevertheless exhibited a steady improvement in neural predictivity from conv1_1 to pool_2, the layer just prior to conv3_1 (see Figure 6). 
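For readers who wish to reproduce the logic of this control analysis, the sketch below shows one way to construct randomly initialized VGG-19 instances in PyTorch; passing weights=None to the torchvision constructor yields a freshly initialized network, and seeding makes each of the 10 instances reproducible. The exact initialization scheme used in our analyses is not reproduced here.

```python
import torch
from torchvision.models import vgg19

# Ten untrained VGG-19 instances with reproducible random weights
# (weights=None requests random initialization rather than pretrained weights).
untrained = []
for seed in range(10):
    torch.manual_seed(seed)
    untrained.append(vgg19(weights=None).eval())
```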
Even though the convolutional filters of these untrained models were randomly structured, we considered whether the additional nonlinear operations performed across successive layers might still have contributed to the improvements in V1 predictivity. Specifically, we evaluated the standard VGG-19 architecture with randomized weights after removing all ReLU operations except for the final one that preceded our neuronal analysis pipeline. (Including a final ReLU operation was generally beneficial for obtaining more stable regression predictions, as it yielded sparser regression weights for the analyzed units.) Our analyses revealed that untrained VGG-19 models with only one ReLU operation nevertheless exhibited the same trend of increasing V1 predictivity from layer 1 through pool_2. From these analyses, we can conclude that the V1 predictive performance of standard VGG-19 is inherently biased in favor of later processing stages, in a manner that is independent of the learned properties of natural images or the degree of nonlinear complexity of the layer being analyzed. In comparison, we found that untrained versions of modified AlexNet exhibited stable levels of V1 predictive performance across the first several layers and thus would not be biased to favor a later processing stage over an earlier one (i.e., a more complex model over a simpler model). 
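A sketch of this single-ReLU manipulation is shown below: every internal ReLU in the convolutional stack of an untrained VGG-19 is replaced with the identity, and one final ReLU is applied to the read-out activations before they enter the regression pipeline. This illustrates the manipulation described above; it is not our exact code.

```python
import torch.nn as nn
from torchvision.models import vgg19

model = vgg19(weights=None)  # untrained instance
# Strip every internal ReLU from the convolutional stack.
for idx, module in enumerate(model.features):
    if isinstance(module, nn.ReLU):
        model.features[idx] = nn.Identity()

# Single nonlinearity applied to the analyzed layer's output before regression.
final_relu = nn.ReLU()
```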
In addition to our evaluation of CNN models, we compared different versions of a Gabor wavelet pyramid model to gain a better understanding of the nonlinear tuning properties of V1 neurons. Although the basic Gabor pyramid model could not attain the predictive accuracy of the best-performing layer of AlexNet or VGG-19, its performance was still quite high given the simplicity of the model. First, we confirmed that V1 neurons showed evidence of contrast saturation in their responses to the complex natural and synthetic images, consistent with previous work that tested simpler stimuli (Albrecht & Geisler, 1991; Boynton et al., 1999; Ohzawa et al., 1985; Sclar et al., 1990; Skottun et al., 1991; Tong et al., 2012). Next, we tested for potential effects of cross-orientation inhibition and surround suppression by incorporating various types of divisive normalization (Carandini & Heeger, 2012) into our Gabor pyramid model. We found clear evidence that divisive normalization from neighboring local units (irrespective of orientation) led to better predictions of V1 responses. Also, when we did not presuppose a separate mechanism to mediate contrast saturation, better model performance arose from incorporating mechanisms of cross-orientation inhibition and orientation-tuned surround suppression into our V1 Gabor models. These findings indicate that the effects of surround suppression in V1 are pervasive enough to be detected with an arbitrarily selected set of complex natural images. That said, artificial stimuli such as sinewave gratings may be more effective for generating tailored tests of specific hypotheses about the response properties of V1 (Rust et al., 2005), including the nature of excitatory center/suppressive surround interactions (Mely, Linsley, & Serre, 2018). 
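The following sketch illustrates the basic form of divisive normalization referred to here (Carandini & Heeger, 2012), in which each unit's energy is divided by the pooled activity of neighboring units across all orientations. The exponent, semisaturation constant, and pooling extent are hypothetical values for illustration, not the fitted parameters of our models.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def divisive_normalize(E, n=2.0, sigma=0.1, pool=5):
    """E: (n_orientations, H, W) array of rectified Gabor filter outputs."""
    En = E ** n
    # Normalization pool: sum over all orientations, then average over a
    # local spatial neighborhood (i.e., untuned for orientation).
    local = uniform_filter(En.sum(axis=0), size=pool)
    return En / (sigma ** n + local[None, :, :])
```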
To fully evaluate the complexity of neuronal tuning in V1, it is necessary to consider both simple and more complex computational models. CNNs can be very informative in this regard: they are designed to simulate the feedforward operations of the visual system, they learn useful statistical properties from natural images, and a single network acquires multilevel representations of varying complexity that can be used to assess neuronal complexity. However, as we show here, it is critically important to consider the impact of correlated variables such as unit receptive field size when using CNNs to evaluate the complexity of neuronal tuning. We conclude that, at least during the first wave of feedforward processing, V1 neuronal responses are well described by simple linear filters followed by only a few nonlinear operations. 
Our findings are consistent with traditional V1 models that have typically relied on only one or two nonlinear operations to fit V1 neuronal responses (Jones & Palmer, 1987; Rust et al., 2005; Vintch et al., 2015). Why, then, do CNN models tend to outperform these traditional V1 models, including the V1 Gabor pyramid model tested here? One possibility is that CNNs learn better filters in layer 1 through their training with large sets of natural images. However, our base V1 Gabor pyramid model exhibited a marginally significant advantage at predicting V1 responses in comparison to layer 1 of pretrained AlexNet (t = 1.92, p = 0.057), the best-performing layer of that model. With respect to modified AlexNet, we found that V1 predictive performance improved significantly between the first and second convolutional layers (t = 6.9, p = 10⁻¹⁰), subsequently reaching a peak by pool_1. However, given the modest margin by which modified AlexNet outperformed our simple Gabor-based models, it seems plausible that only a fraction of the neurons in V1 acquire more complex receptive field structures that require more than one nonlinear operation to be well modeled. Motivated by this question, we conducted a post hoc analysis that compared how well conv_1 versus conv_2 of modified AlexNet could predict the image-by-image responses of individual neurons. Using a two-tailed Fisher test (p < 0.05, uncorrected), we found that 41 of 166 individual neurons were better predicted by the unit responses in conv_2 than in conv_1, whereas none of the 166 neurons was better predicted by conv_1 than by conv_2. A comparison between conv_1 and pool_1 (i.e., the best-performing layer) revealed a similar asymmetry in predictive performance (55/166 in favor of pool_1, 1/166 in favor of conv_1). This notable asymmetry is consistent with the notion that a minority of V1 neurons may have more complex receptive field structures that are better described by at least two convolutional operations and perhaps an additional pooling operation, a finding that is largely consistent with the two-stage linear–nonlinear models proposed by previous researchers (Vintch et al., 2015). 
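For clarity, the sketch below implements a textbook two-tailed Fisher z-test for comparing two correlation values, as one way to carry out the per-neuron layer comparison described above. Our exact procedure (e.g., how the shared test set was handled) is not reproduced here, so this should be read as an illustration of the statistical logic rather than our implementation.

```python
import numpy as np
from scipy.stats import norm

def fisher_compare(r1, r2, n):
    """Two-tailed p-value for the difference between correlations r1 and r2,
    each computed over n observations (Fisher r-to-z transform)."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(2.0 / (n - 3))   # SE of z1 - z2 when both use n samples
    z = (z1 - z2) / se
    return 2 * norm.sf(abs(z))

# e.g., fisher_compare(r_conv2, r_conv1, n_test_images) < 0.05
```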
Although Cadena et al. (2019) sought to characterize the feedforward or stimulus-driven response properties of V1 neurons, it should be acknowledged that top–down feedback from higher visual areas can rapidly modify feedforward responses and can also operate independently of top–down attention (Nurminen et al., 2018; Poltoratski, Maier, Newton, & Tong, 2019). Other studies have shown that recurrent visual processing and top–down effects of task-based attention can have powerful modulatory effects on neuronal responses in macaque V1 (Gilbert & Li, 2013; Lamme & Roelfsema, 2000; Roelfsema, Lamme, & Spekreijse, 1998). Work from our own research group has shown that binocular rivalry competition, spatial and feature-based attention, object-based attentional selection, and visual working memory all lead to powerful modulatory effects in the human primary visual cortex (Cohen & Tong, 2015; Harrison & Tong, 2009; Jehee, Brady, & Tong, 2011; Kamitani & Tong, 2005; Tong & Engel, 2001). By contrast, a limitation of feedforward neural network models is their inability to account for top–down effects of attention and other task-based goals (Kay, Bonnen, Denison, Arcaro, & Barack, 2023; Tong, 2018). It will be of considerable interest for future studies to explore whether variations in CNN architecture, the incorporation of recurrent or top–down processing, or the expansion of stimuli and methods used for network training can further improve the ability of CNN models to predict the nonlinear response properties of V1. 
Finally, our study highlights some of the nuances and challenges of evaluating CNN-based models of visual processing. Seemingly innocuous choices in model architecture can sometimes have a dramatic impact on a model's ability to predict neural data, which in turn can shape one's interpretation of how the visual system actually works. In our study, the relationship between stimulus size and unit receptive field size had a strong influence on which layer of a CNN provided the best predictions of V1 neuronal responses. Without careful examination of these non-obvious parts of the problem space, one might conclude that a far more complex nonlinear model is required to account for the response properties of V1 than was previously thought. Given that the neural predictivity of a CNN model depends on many factors, including model architecture, objective function, training regime, choice of test stimuli, and the methodology used to compare neural data with model responses, it may in some cases be necessary to examine a considerable portion of this larger problem space to determine the generality of a particular finding or outcome. 
Acknowledgments
The authors thank Hojin Jang for input on the CNN analysis and Dave Coggan for feedback on early versions of this manuscript. 
Supported by grants from the National Eye Institute, National Institutes of Health (R01EY029278 and R01EY035157 to FT; P30EY008126 to the Vanderbilt Vision Research Center). 
Commercial relationships: none. 
Corresponding authors: Hui-Yuan Miao and Frank Tong. 
Emails: huiyuan.miao@vanderbilt.edu; frank.tong@vanderbilt.edu. 
Address: Department of Psychology, Vanderbilt University, Nashville, TN 37240, USA. 
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... Zhang, X. (2016). TensorFlow: A system for large-scale machine learning. In OSDI ’16: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (pp. 265–283). Berkeley, CA: USENIX.
Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2), 284–299, https://www.ncbi.nlm.nih.gov/pubmed/3973762. [CrossRef]
Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast-response function of simple cells in the visual-cortex. Visual Neuroscience, 7(6), 531–546. [CrossRef] [PubMed]
Bair, W., Cavanaugh, J. R., & Movshon, J. A. (2003). Time course and time-distance relationships for surround suppression in macaque V1 neurons. The Journal of Neuroscience, 23(20), 7690–7701, https://www.ncbi.nlm.nih.gov/pubmed/12930809. [CrossRef]
Bankson, B. B., Hebart, M. N., Groen, I. I. A., & Baker, C. I. (2018). The temporal evolution of conceptual object representations revealed through models of behavior, semantics and deep neural networks. NeuroImage, 178, 172–182, https://doi.org/10.1016/j.neuroimage.2018.05.037. [CrossRef] [PubMed]
Bashivan, P., Kar, K., & DiCarlo, J. J. (2019). Neural population control via deep image synthesis. Science, 364(6439), eaav9436, https://doi.org/10.1126/science.aav9436. [CrossRef] [PubMed]
Bonds, A. B. (1989). Role of inhibition in the specification of orientation selectivity of cells in the cat striate cortex. Visual Neuroscience, 2(1), 41–55, https://doi.org/10.1017/S0952523800004314. [CrossRef] [PubMed]
Boynton, G. M., Demb, J. B., Glover, G. H., & Heeger, D. J. (1999). Neuronal basis of contrast discrimination. Vision Research, 39(2), 257–269, https://doi.org/10.1016/s0042-6989(98)00113-8. [CrossRef] [PubMed]
Busse, L., Wade, A. R., & Carandini, M. (2009). Representation of concurrent stimuli by population activity in visual cortex. Neuron, 64(6), 931–942, https://doi.org/10.1016/j.neuron.2009.11.004. [CrossRef] [PubMed]
Cadena, S. A., Denfield, G. H., Walker, E. Y., Gatys, L. A., Tolias, A. S., Bethge, M., ... Ecker, A. S. (2019). Deep convolutional models improve predictions of macaque V1 responses to natural images. PLoS Computational Biology, 15(4), e1006897, https://doi.org/10.1371/journal.pcbi.1006897. [CrossRef] [PubMed]
Carandini, M., Demb, J. B., Mante, V., Tolhurst, D. J., Dan, Y., Olshausen, B. A., ... Rust, N. C. (2005). Do we know what the early visual system does? The Journal of Neuroscience, 25(46), 10577–10597. [CrossRef]
Carandini, M., & Heeger, D. J. (2012). Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13(1), 51–62, https://doi.org/10.1038/nrn3136. [CrossRef]
Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. The Journal of Neuroscience, 17(21), 8621–8644, https://www.ncbi.nlm.nih.gov/pubmed/9334433. [CrossRef]
Cavanaugh, J. R., Bair, W., & Movshon, J. A. (2002). Nature and interaction of signals from the receptive field center and surround in macaque V1 neurons. Journal of Neurophysiology, 88(5), 2530–2546, https://doi.org/10.1152/jn.00692.2001. [CrossRef] [PubMed]
Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6, 27755, https://doi.org/10.1038/srep27755. [CrossRef] [PubMed]
Coen-Cagli, R., Kohn, A., & Schwartz, O. (2015). Flexible gating of contextual influences in natural vision. Nature Neuroscience, 18(11), 1648–1655, https://doi.org/10.1038/nn.4128. [CrossRef] [PubMed]
Cohen, E. H., & Tong, F. (2015). Neural mechanisms of object-based attention. Cerebral Cortex, 25(4), 1080–1092, https://doi.org/10.1093/cercor/bht303. [CrossRef]
Deangelis, G. C., Freeman, R. D., & Ohzawa, I. (1994). Length and width tuning of neurons in the cats primary visual-cortex. Journal of Neurophysiology, 71(1), 347–374, https://doi.org/10.1152/jn.1994.71.1.347. [CrossRef] [PubMed]
Deangelis, G. C., Robson, J. G., Ohzawa, I., & Freeman, R. D. (1992). Organization of suppression in receptive-fields of neurons in cat visual-cortex. Journal of Neurophysiology, 68(1), 144–163, https://doi.org/10.1152/jn.1992.68.1.144. [CrossRef] [PubMed]
Felsen, G., & Dan, Y. (2005). A natural approach to studying vision. Nature Neuroscience, 8(12), 1643–1646, https://doi.org/10.1038/nn1608. [CrossRef] [PubMed]
Gatys, L., Ecker, A. S., & Bethge, M. (2015). Texture synthesis using convolutional neural networks. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., & Garnett, R. (Eds.), Advances in Neural Information Processing Systems 28 (pp. 262–270). Red Hook, NY: Curran Associates.
Gilbert, C. D., & Li, W. (2013). Top-down influences on visual processing. Nature Reviews Neuroscience, 14(5), 350–363, https://doi.org/10.1038/nrn3476. [CrossRef] [PubMed]
Guclu, U., & van Gerven, M. A. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. The Journal of Neuroscience, 35(27), 10005–10014, https://doi.org/10.1523/JNEUROSCI.5023-14.2015. [CrossRef]
Harrison, S. A., & Tong, F. (2009). Decoding reveals the contents of visual working memory in early visual areas. Nature, 458(7238), 632–635, https://doi.org/10.1038/nature07832. [CrossRef] [PubMed]
Heeger, D. J. (1992a). Half-squaring in responses of cat striate cells. Visual Neuroscience, 9(5), 427–443, https://doi.org/10.1017/S095252380001124x. [CrossRef] [PubMed]
Heeger, D. J. (1992b). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9(2), 181–197, https://doi.org/10.1017/S0952523800009640. [CrossRef] [PubMed]
Horikawa, T., & Kamitani, Y. (2017). Generic decoding of seen and imagined objects using hierarchical visual features. Nature Communications, 8, 15037. [CrossRef] [PubMed]
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160, 106–154, https://doi.org/10.1113/jphysiol.1962.sp006837. [CrossRef] [PubMed]
Jang, H., McCormack, D., & Tong, F. (2021). Noise-trained deep neural networks effectively predict human vision and its neural responses to challenging images. PLoS Biology, 19(12), e3001418, https://doi.org/10.1371/journal.pbio.3001418. [CrossRef] [PubMed]
Jang, H., & Tong, F. (2024). Improved modeling of human vision by incorporating robustness to blur in convolutional neural networks. Nature Communications, 15(1), 1989. [CrossRef] [PubMed]
Jehee, J. F., Brady, D. K., & Tong, F. (2011). Attention improves encoding of task-relevant features in the human visual cortex. The Journal of Neuroscience, 31(22), 8210–8219, https://doi.org/10.1523/JNEUROSCI.6153-09.2011. [CrossRef]
Jones, H. E., Grieve, K. L., Wang, W., & Sillito, A. M. (2001). Surround suppression in primate V1. Journal of Neurophysiology, 86(4), 2011–2028, https://doi.org/10.1152/jn.2001.86.4.2011. [CrossRef] [PubMed]
Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6), 1233–1258, https://doi.org/10.1152/jn.1987.58.6.1233. [CrossRef] [PubMed]
Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nature Neuroscience, 8(5), 679–685, https://doi.org/10.1038/nn1444. [CrossRef] [PubMed]
Kar, K., Kubilius, J., Schmidt, K., Issa, E. B., & DiCarlo, J. J. (2019). Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior. Nature Neuroscience, 22(6), 974–983, https://doi.org/10.1038/s41593-019-0392-5. [CrossRef] [PubMed]
Kay, K., Bonnen, K., Denison, R. N., Arcaro, M. J., & Barack, D. L. (2023). Tasks and their role in visual neuroscience. Neuron, 111(11), 1697–1713. [CrossRef] [PubMed]
Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). Identifying natural images from human brain activity. Nature, 452(7185), 352–355, https://doi.org/10.1038/nature06713. [CrossRef] [PubMed]
Khaligh-Razavi, S. M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11), e1003915, https://doi.org/10.1371/journal.pcbi.1003915. [CrossRef] [PubMed]
Kietzmann, T. C., Spoerer, C. J., Sorensen, L. K. A., Cichy, R. M., Hauk, O., & Kriegeskorte, N. (2019). Recurrence is required to capture the representational dynamics of the human visual system. Proceedings of the National Academy of Sciences, USA, 116(43), 21854–21863, https://doi.org/10.1073/pnas.1905544116. [CrossRef]
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations. Red Hook, NY: Curran Associates.
Kong, N. C. L., Margalit, E., Gardner, J. L., & Norcia, A. M. (2022). Increasing neural network robustness improves match to macaque V1 eigenspectrum, spatial frequency preference and predictivity. PLoS Computational Biology, 18(1), e1009739. [CrossRef] [PubMed]
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J., Bottou, L., & Weinberger, K. Q. (Eds.), Advances in Neural Information Processing Systems. Red Hook, NY: Curran Associates.
Lamme, V. A., & Roelfsema, P. R. (2000). The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neurosciences, 23(11), 571–579, https://doi.org/10.1016/s0166-2236(00)01657-x. [CrossRef] [PubMed]
Lee, T. S. (1996). Image representation using 2D Gabor wavelets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10), 959–971.
Marques, T., Schrimpf, M., & DiCarlo, J. J. (2021). Multi-scale hierarchical neural network models that bridge from single neurons in the primate primary visual cortex to object recognition behavior. bioRxiv, https://doi.org/10.1101/2021.03.01.433495.
Mechler, F., & Ringach, D. L. (2002). On the classification of simple and complex cells. Vision Research, 42(8), 1017–1033, https://doi.org/10.1016/S0042-6989(02)00025-1. [CrossRef] [PubMed]
Mely, D. A., Linsley, D., & Serre, T. (2018). Complementary surrounds explain diverse contextual phenomena across visual modalities. Psychological Review, 125(5), 769–784, https://doi.org/10.1037/rev0000109. [CrossRef] [PubMed]
Morrone, M. C., Burr, D. C., & Maffei, L. (1982). Functional implications of cross-orientation inhibition of cortical visual cells: I. Neurophysiological evidence. Proceedings of the Royal Society Series B. Biological Sciences, 216(1204), 335–354, https://doi.org/10.1098/rspb.1982.0078.
Nonaka, S., Majima, K., Aoki, S. C., & Kamitani, Y. (2021). Brain hierarchy score: Which deep neural networks are hierarchically brain-like? Iscience, 24(9), 103013. [CrossRef] [PubMed]
Nurminen, L., Merlin, S., Bijanzadeh, M., Federer, F., & Angelucci, A. (2018). Top-down feedback controls spatial summation and response amplitude in primate visual cortex. Nature Communications, 9(1), 2281, https://doi.org/10.1038/s41467-018-04500-5. [CrossRef] [PubMed]
Ohzawa, I., Sclar, G., & Freeman, R. D. (1985). Contrast gain-control in the cat's visual-system. Journal of Neurophysiology, 54(3), 651–667, https://doi.org/10.1152/jn.1985.54.3.651. [CrossRef] [PubMed]
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., & Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32 (pp. 8026–8037). Red Hook, NY: Curran Associates.
Poltoratski, S., Maier, A., Newton, A. T., & Tong, F. (2019). Figure-ground modulation in the human lateral geniculate nucleus is distinguishable from top-down attention. Current Biology, 29(12), 2051–2057.e3. [CrossRef]
Poltoratski, S., & Tong, F. (2020). Resolving the spatial profile of figure enhancement in human V1 through population receptive field modeling. The Journal of Neuroscience 40(16), 3292–3303, https://doi.org/10.1523/JNEUROSCI.2377-19.2020. [CrossRef] [PubMed]
Priebe, N. J. (2016). Mechanisms of orientation selectivity in the primary visual cortex. Annual Review of Vision Science, 2, 85–107, https://doi.org/10.1146/annurev-vision-111815-114456. [CrossRef] [PubMed]
Priebe, N. J., Mechler, F., Carandini, M., & Ferster, D. (2004). The contribution of spike threshold to the dichotomy of cortical simple and complex cells. Nature Neuroscience, 7(10), 1113–1122, https://doi.org/10.1038/nn1310. [CrossRef] [PubMed]
Ringach, D. L., Shapley, R. M., & Hawken, M. J. (2002). Orientation selectivity in macaque V1: Diversity and laminar dependence. The Journal of Neuroscience, 22(13), 5639–5651. [CrossRef]
Roelfsema, P. R., Lamme, V. A., & Spekreijse, H. (1998). Object-based attention in the primary visual cortex of the macaque monkey. Nature, 395(6700), 376–381, https://doi.org/10.1038/26475. [CrossRef] [PubMed]
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252, https://doi.org/10.1007/s11263-015-0816-y. [CrossRef]
Rust, N. C., & Movshon, J. A. (2005). In praise of artifice. Nature Neuroscience, 8(12), 1647–1650, https://doi.org/10.1038/nn1606. [CrossRef] [PubMed]
Rust, N. C., Schwartz, O., Movshon, J. A., & Simoncelli, E. P. (2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6), 945–956, https://doi.org/10.1016/j.neuron.2005.05.021. [CrossRef] [PubMed]
Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., ... DiCarlo, J. J. (2020). Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv, https://doi.org/10.1101/407007.
Sclar, G., Maunsell, J. H., & Lennie, P. (1990). Coding of image contrast in central visual pathways of the macaque monkey. Vision Research, 30(1), 1–10, https://doi.org/10.1016/0042-6989(90)90123-3. [CrossRef] [PubMed]
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. ArXiv, https://doi.org/10.48550/arXiv.1409.1556.
Skottun, B. C., De Valois, R. L., Grosof, D. H., Movshon, J. A., Albrecht, D. G., & Bonds, A. B. (1991). Classifying simple and complex cells on the basis of response modulation. Vision Research, 31(7-8), 1079–1086, https://doi.org/10.1016/0042-6989(91)90033-2. [CrossRef] [PubMed]
Storrs, K. R., Kietzmann, T. C., Walther, A., Mehrer, J., & Kriegeskorte, N. (2021). Diverse deep neural networks all predict human inferior temporal cortex well, after training and fitting. Journal of Cognitive Neuroscience, 33(10), 2044–2064. [PubMed]
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B. Methodological, 58(1), 267–288, https://doi.org/10.1111/j.2517-6161.1996.tb02080.x. [CrossRef]
Tong, F. (2018). Foundations of vision. In Serences, J. T. (Ed.), Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience. Volume 2. Sensation, Perception, and Attention. (pp. 1–61). New York: John Wiley & Sons.
Tong, F., & Engel, S. A. (2001). Interocular rivalry revealed in the human cortical blind-spot representation. Nature, 411(6834), 195–199, https://doi.org/10.1038/35075583. [CrossRef] [PubMed]
Tong, F., Harrison, S. A., Dewey, J. A., & Kamitani, Y. (2012). Relationship between BOLD amplitude and pattern classification of orientation-selective activity in the human visual cortex. NeuroImage, 63(3), 1212–1222, https://doi.org/10.1016/j.neuroimage.2012.08.005. [CrossRef] [PubMed]
Tong, F., & Jang, H. (2021). Noise-robust neural networks and methods thereof. U.S. Patent #11,030,487.
Vinje, W. E., & Gallant, J. L. (2002). Natural stimulation of the nonclassical receptive field increases information transmission efficiency in V1. The Journal of Neuroscience, 22(7), 2904–2915, https://doi.org/10.1523/JNEUROSCI.22-07-02904.2002. [CrossRef]
Vintch, B., Movshon, J. A., & Simoncelli, E. P. (2015). A convolutional subunit model for neuronal responses in macaque V1. The Journal of Neuroscience, 35(44), 14829–14841, https://doi.org/10.1523/Jneurosci.2815-13.2015. [CrossRef]
Xu, Y. D., & Vaziri-Pashkam, M. (2021). Limits to visual representational correspondence between convolutional neural networks and the human brain. Nature Communications, 12(1), 2065, https://doi.org/10.1038/s41467-021-22244-7. [CrossRef] [PubMed]
Yamins, D. L., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3), 356–365, https://doi.org/10.1038/nn.4244. [CrossRef] [PubMed]
Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, USA, 111(23), 8619–8624, https://doi.org/10.1073/pnas.1403112111. [CrossRef]
Figure 1. Examples of the stimuli used in this study. The leftmost column shows examples of two natural images. Columns two to five show synthetic images derived from the natural images to evoke similar responses in the lower layers of VGG-19, with increasing correspondence across multiple layers shown from left to right.
Figure 2. Layer-wise model predictions of V1 responses for standard versions of VGG-19 (A) and AlexNet (B). Plots show the mean correlation between the predicted and actual responses of 166 neurons, using the patterns of unit activity from individual CNN layers as regressors. Best predictive performance emerges at a much later processing stage for VGG-19 (conv3_1) when compared to AlexNet (conv_1). Error bars indicate ±1 SEM.
Figure 3. Model performance of modified AlexNet. (A) Mean correlation between predicted and actual V1 neuronal responses is highest at the first pooling layer (pool_1) and thereafter decreases monotonically. (B) Comparison of neural predictivity for the best-performing layer of VGG-19 and modified AlexNet. Each plot symbol indicates the prediction accuracy for a single V1 neuron. The performance of the pool_1 layer of modified AlexNet is comparable to that of VGG-19 conv3_1.
Figure 4. Model performance of VGG-19 when tested with different input image sizes. Images were scaled to dimensions of 20 × 20, 40 × 40 (original analysis), or 80 × 80 pixels. An increase in input image size caused the best performance of VGG-19 to shift to higher layers.
Figure 5. Predictive performance of the modified VGG-19 models. The mean correlation between predicted and actual V1 responses peaked in the third convolutional layer (conv2_1) for both version 1 and version 2 of modified VGG-19.
Figure 6. Control analyses showing V1 predictive performance using CNN models with randomly initialized weights. Predictive accuracy is shown for standard VGG-19 (A), standard AlexNet (B), and modified AlexNet (C). Solid curves depict correlation values averaged over 10 randomly initialized versions of each model, and dashed curves show the same analysis performed on modified models that perform only a single ReLU operation. Error bars, which are very small, indicate ±1 SEM based on variability across the 10 model iterations.
Figure 7. Comparison between CNN-based V1 models and Gabor-based V1 models. Bar plots show the predictive accuracy of the conv3_1 layer of VGG-19, pool_1 layer of modified AlexNet, and multiple versions of the Gabor-based V1 model.
Table 1. The architecture of standard VGG-19. The table shows the output dimensionality of each CNN layer and the associated kernel size, stride length, and receptive field size in pixel units.
Table 2. Comparison of the architectures of standard AlexNet and modified AlexNet.
Table 3. Architecture of the modified VGG-19 version 1 and version 2.