Deep convolutional neural networks (CNNs) trained on visual objects have shown an intriguing ability to predict some response properties of visual cortical neurons. However, the factors (e.g., whether the model is trained, receptive field size) and computations (e.g., convolution, rectification, pooling, normalization) that give rise to this ability, the level at which it emerges, and the role of intermediate processing stages in explaining changes that develop across areas of the cortical hierarchy are poorly understood. We focused on sensitivity to textures as a paradigmatic example, since recent neurophysiology experiments provide rich data pointing to texture sensitivity in secondary (but not primary) visual cortex (V2). We initially explored the CNN without any fitting to the neural data and found that the first two layers of the CNN showed qualitative correspondence to the first two cortical areas in terms of texture sensitivity. We therefore developed a quantitative approach to select a population of CNN model neurons that best fits the brain neural recordings. We found that the CNN could develop compatibility to secondary cortex in the second layer following rectification, and that this was improved following pooling but only mildly influenced by the local normalization operation. Higher layers of the CNN could further, though modestly, improve the compatibility with the V2 data. The compatibility was reduced when incorporating random rather than learned weights. Our results show that the CNN class of model is effective for capturing changes that develop across early areas of cortex and has the potential to help identify the computations that give rise to hierarchical processing in the brain (code is available on GitHub).

*convolution* in the neural network community. This corresponds to the cross-correlation between an input image and a filter.

*x*, *y*) position in the *i*th channel, then the normalized response \(b_{x,y}^{i}\) is defined by Krizhevsky et al. (2012):

\[
b_{x,y}^{i} = a_{x,y}^{i} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-m/2)}^{\min(N-1,\, i+m/2)} \big(a_{x,y}^{j}\big)^{2} \Big)^{\beta},
\]

*m* is the size of the normalization neighborhood, and *N* is the total number of model units in the layer. The constants *k*, *m*, α, and β are hyperparameters with default values of 2, 5, 10^{−4}, and 0.75, respectively.
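The cross-channel normalization described above can be sketched in NumPy (a minimal sketch; `local_response_norm` is our name, and we assume activations arranged as channels × height × width):

```python
import numpy as np

def local_response_norm(a, k=2.0, m=5, alpha=1e-4, beta=0.75):
    """Cross-channel local response normalization (Krizhevsky et al., 2012).

    a: activations of shape (N, H, W), with N the number of channels.
    Each unit is divided by a factor computed from the squared activities
    of m neighboring channels at the same spatial position.
    """
    N = a.shape[0]
    b = np.empty_like(a, dtype=float)
    for i in range(N):
        # Channel neighborhood, clipped at the layer boundaries.
        lo, hi = max(0, i - m // 2), min(N - 1, i + m // 2)
        denom = (k + alpha * (a[lo:hi + 1] ** 2).sum(axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```

With the default hyperparameters the denominator stays close to \(k^{\beta}\) for moderate activities, so normalization only reshapes responses strongly when nearby channels are very active.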

*family* and all the images within the same family as *samples*. Naturalistic textures for a given family were each generated with a different random seed. Spectrally matched noise images (which we denote noise images) were generated by randomizing the phase of the synthetic images. The noise images have the same spatial frequency distribution of energy as the original images but lack the differences in higher-order statistics.
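The phase-randomization step can be sketched as follows (a minimal sketch, not the authors' exact stimulus pipeline; `phase_scramble` is a hypothetical helper). Taking the random phases from a real-valued noise image keeps the spectrum conjugate-symmetric, so the result is itself a real image with the same amplitude spectrum:

```python
import numpy as np

def phase_scramble(img, seed=None):
    """Spectrally matched noise: keep the Fourier amplitude spectrum of
    img but replace its phases with those of a random noise image."""
    rng = np.random.default_rng(seed)
    amplitude = np.abs(np.fft.fft2(img))
    # Phases of a real random image preserve conjugate (Hermitian) symmetry.
    random_phase = np.angle(np.fft.fft2(rng.standard_normal(img.shape)))
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * random_phase)))
```

Because only the phases change, the noise image retains the original distribution of energy across spatial frequencies while the higher-order (phase-dependent) structure is destroyed.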

*t*-distributed stochastic neighbor embedding (*t*-SNE) (van der Maaten, 2014; van der Maaten & Hinton, 2008) algorithm to achieve this visualization. *t*-SNE is a dimensionality-reduction technique that models small pairwise distances so as to capture local data structure in a low-dimensional space.

*r*_{na} and *r*_{no} are the responses to naturalistic textures and noise, respectively. Figure 3 (top panel) shows the average modulation index for all texture families in the CNN, for L1 (*red*) and L2 (*blue*).
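The modulation index used throughout can be computed as follows (a minimal sketch, assuming the standard ratio definition of Freeman et al., 2013):

```python
import numpy as np

def modulation_index(r_na, r_no):
    """Texture modulation index (r_na - r_no) / (r_na + r_no): positive
    when the response to naturalistic textures exceeds the response to
    spectrally matched noise, and zero when the two are equal."""
    r_na = np.asarray(r_na, dtype=float)
    r_no = np.asarray(r_no, dtype=float)
    return (r_na - r_no) / (r_na + r_no)
```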

*p* < 0.0000005, *t* test considering signs; *p* < 0.00001, *t* test ignoring signs and considering only the magnitudes) and is qualitatively comparable to, but stronger than, the neurophysiology data (V1: ≈0.00 and V2: ≈0.12; Freeman et al., 2013). More specifically, Figure 3 (top panel) shows that the L2 modulation is more pronounced for some texture families than for others, as also observed in the V2 data (Freeman et al., 2013). However, the rank order of the texture families differed between the CNN and the neurophysiology data, as shown in Figure 3 (bottom panel), prompting the quantitative subset selection approaches described below.

*p* < 0.000003, *t* test). This means that L1 and L2 could still be differentiated according to this metric. Note, however, that the L2 modulation index is substantially reduced when only the L2 weights are randomized. Incorporating learned L2 weights therefore leads to much higher texture sensitivity than random L2 weights, so learning in the second layer adds to the texture sensitivity that develops.

*p* = 0.0025, *t* test on the magnitude). When the weights in both L1 and L2 are randomized (L1L2rand), the modulation indices range from −0.01 to 0.03, a range too small to capture the V2 neuron modulation indices. The texture sensitivity of L2 neurons thus breaks down in the complete absence of trained weights, showing very low modulation similar to L1. This indicates that sensitivity to higher-order statistics such as textures is not a trivial outcome of the deep network architecture: the CNN model with weights learned from natural images corresponds to the neurophysiology data better than the architecture alone does.

*subset greedy*, to choose a subset of 103 model neurons that best match the neurophysiology data from the brain. Briefly, from the set of all possible model neurons, the greedy approach chooses the first neuron with the closest Euclidean distance to the V2 mean modulation index data; then, the second neuron is added to this subset so as to minimize the Euclidean distance and so on until a total of 103 neurons are chosen (see Technical Methods).

*full population* approach. The full population approach finds a weighted sum of the model neurons (under the constraint that the weights are nonnegative and sum to 1) that is closest in squared Euclidean distance to the experimental data. Note that the weighted average may include all available neurons and may weight neurons differently. The greedy approach is, in contrast, an approximation that finds a subset of 103 equally weighted neurons that best matches the neural data.

*subset regularized*.

*Euclidean error* distance *E* between the mean modulation indices in the neurophysiological data and the modulation indices obtained from the CNN for each family. A smaller Euclidean distance indicates a better fit to the V2 data and hence higher correspondence to the brain (see the Technical Methods section for details). The rationale for using the Euclidean distance as a measure of correspondence is that it is directly related to the root mean squared error (RMSE) up to a normalization constant. We chose an error metric that is sensitive to absolute rather than relative values, since we are fitting modulation indices. Our optimal weighted and subset regularized fits are done in terms of squared Euclidean distance, which for the optimal fitting method makes the error and regularization terms operate at similar scales. In the subset greedy approach, RMSE and Euclidean (and even squared Euclidean) distances indicate the same outcome. Second, we quantified the fits between the V2 data and the CNN using *Spearman’s rank-order correlation*, in which a larger correlation corresponds to a better fit.
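Spearman’s rank-order correlation is simply the Pearson correlation computed on ranks; a minimal tie-free NumPy sketch (for data with ties, `scipy.stats.spearmanr` applies the appropriate correction):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank-order correlation for tie-free data:
    Pearson correlation of the ranks of x and y."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because it depends only on ranks, this metric rewards matching the ordering of texture families across the model and the data, complementing the Euclidean error, which rewards matching absolute modulation values.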

*R*^{2}) was 0.60 for the subset greedy, 0.54 for the subset regularized, and 0.70 for the full population approach. In contrast, the explained variance for a random population of 103 neurons was 0.40. Note that this represents a lower bound, since we are neither considering the variability due to samples within a family nor the variability in the experiments due to stimulus repeats.

*Layer 1*) exhibited the highest fitting errors and lowest correlation among all layers and controls, meaning they resulted in a poor fit (hence little correspondence) to the neurophysiology V2 data (see also Figures 7a–c). The trained L2 model neurons exhibited the lowest fitting errors (greedy subset 0.22, full population 0.19, and regularized subset 0.24) and the highest correlations (greedy subset 0.80, full population 0.86, and regularized subset 0.75) in all fitting techniques, meaning that L2 achieved better correspondence to the V2 data (Figures 7d–f).

*rand 2*) led to significantly worse fits than when both the L1 and L2 weights were trained, across all three model neuron selection techniques (*p* < 0.000002 in greedy; *p* < 0.00007 in optimal; *p* < 0.00002 in regularized; one-sample *t* test). This indicates that training the full CNN model (i.e., both the L1 and L2 weights) improved the fit relative to training the L1 weights alone. Randomizing both the Layer 1 and Layer 2 weights (*rand 12*) led to a more dramatic increase of the fitting errors (see also Figure 8), indicating that training the first layer alone went some way toward obtaining a better fit. Overall, the CNN model trained on natural images had the closest fit to the neurophysiology V2 data, in line with the qualitative results shown earlier in the article. Arguably, the relevant texture statistics arise from interactions at particular frequencies, which random weights cannot capture because of their poor frequency localization.

*rand 123*) and randomizing the Layer 1, 2, 3, and 4 weights (*rand 1234*). For these conditions, we fit the outputs of Layers 3 and 4, respectively, to the data. The goal was to see whether stacking more random layers helped obtain a better fit. However, the error remained high even when we stacked four layers together (compare *rand 12* with *rand 123* and *rand 1234*). Therefore, a deeper random network did not rescue the fit.

*shuffle 12*). We found that this resulted in a slightly better fit than the randomized counterpart but still remained poor (compare *rand 12* vs. *shuffle 12*).

*conv2* (i.e., after only the convolution in the second layer) had high fitting errors. This is because the response of the conv layers can be negative before the ReLU. We found that compatibility with the V2 data starts to develop already after the rectification in the second layer (i.e., ReLU2), as indicated by the Euclidean fitting error (greedy: from 0.62 in L1 to 0.33 in L2; full population: from 0.59 in L1 to 0.31 in L2; subset regularized: from 0.62 in L1 to 0.34 in L2). The fitting errors improved further after *pooling* (subset greedy 0.22; full population 0.20; subset regularized 0.24). After the local normalization (i.e., the point in L2 to which we initially referred in all our measurements), the fitting errors were 0.22 for the subset greedy, 0.19 for the full population, and 0.24 for the subset regularized approach. The main improvement in the fit thus appeared at the L2 rectification (ReLU) stage.

*norm1*) and L2 (*norm2*) layers altogether?

*rand 2* vs. *rand 2 noN*; *p* < 0.001 in greedy; *p* = 0.70 in full population, which did not pass significance; and *p* < 0.009 in subset regularized; independent-sample *t* test). Taken together, our results suggest that normalization had only a mild role in improving compatibility for the texture data.

- The CNN L1 was not compatible with the V2 texture data (Figures 7a–c), as revealed by the large Euclidean error and small Spearman’s rank-order correlation in the model fits (see Figures 9a and 9b).
- The CNN L2 showed a marked decrease in error and increase in Spearman’s correlation when fitting the V2 data (see Figures 9a and 9b).
- The compatibility between L2 and V2 was first observed following rectification and improved with pooling; local normalization had only a mild effect.
- The compatibility between L2 and V2 improved modestly for higher layers of the CNN and then decreased slightly again in L5, as quantified by the Euclidean error and Spearman’s correlation (see Figures 10a and 10b).
- Training the CNN on natural images was important for obtaining compatibility between L2 and V2; randomizing or shuffling the weights reduced this compatibility (see Figures 9a and 9b).

*l*) is given by the mean pixel intensity of the downsampled image (*I*_{d}), and the contrast (*c*) is given by its standard deviation. The contrast-normalized images (*I*_{n}) are then defined as follows:
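A minimal sketch of this preprocessing step, assuming the direct form (*I*_{d} − *l*)/*c* (the paper's exact equation may include additional constants; `contrast_normalize` is our name):

```python
import numpy as np

def contrast_normalize(I_d):
    """Subtract the mean luminance l and divide by the contrast c
    (standard deviation) of the downsampled image I_d."""
    l = I_d.mean()   # luminance: mean pixel intensity
    c = I_d.std()    # contrast: standard deviation of intensities
    return (I_d - l) / c
```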

*forward selection* to choose a subset of 103 model neurons that best match the data from the cortical neurons. In the greedy approach, the goal is to build a subset incrementally by adding neurons, one at a time, such that each added neuron, in conjunction with the previously selected neurons, minimizes the Euclidean distance between the neural data and the CNN model modulation indices. This incremental process continues until 103 neurons have been chosen from the available CNN layer population. The approach is greedy because it optimizes the selection of the next neuron as best it can given the current set of neurons; it does not guarantee a globally optimal solution.

*n* CNN model neurons, and \(\mathbf {m}_{j}\) the average modulation indices per texture family of the *j*th simulated neuron in \(\mathcal {A}\). Starting from \(\mathcal {S}^{(0)} = \emptyset\), the greedy algorithm adds to the current set of selected model neurons the neuron that minimizes the squared Euclidean distance between the neuronal data and the model average modulation indices:
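The greedy update can be sketched as follows (a minimal sketch; `greedy_subset`, `M`, and `v` are our names for the model modulation-index matrix and the V2 target vector):

```python
import numpy as np

def greedy_subset(M, v, k=103):
    """Greedy forward selection of k model neurons.

    M: (n_neurons, n_families) modulation indices of candidate neurons.
    v: (n_families,) mean V2 modulation index per texture family.
    At each step, add the neuron whose inclusion minimizes the squared
    Euclidean distance between the subset average and v.
    """
    selected, running_sum = [], np.zeros(M.shape[1])
    for step in range(1, k + 1):
        candidates = [j for j in range(M.shape[0]) if j not in selected]
        errors = [np.sum(((running_sum + M[j]) / step - v) ** 2)
                  for j in candidates]
        best = candidates[int(np.argmin(errors))]
        selected.append(best)
        running_sum += M[best]
    return selected
```

Keeping a running sum of the selected profiles makes each step cost O(*n* × families) rather than recomputing the subset average from scratch.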

*n* simulated model neurons. In terms of Euclidean distance, this is the best fit that could be attained by a weighted average of the simulated neurons. Nevertheless, note that the solution need not be sparse, since there is no mechanism forcing weights to become zero, and the disparity of the weight values can be hard to interpret. Fitting results for this technique are shown in Figure 7 (second column).

*w*_{i} can be very disparate. Since we seek to select a subset of the simulated neurons whose regular average (all weights equal) closely follows the physiological experiments, we relax the selection problem by solving a regularized version of the optimization problem (5) as follows:

*w*_{i} ≥ 2 × 10^{−3}, with only a handful of them containing large values that account for \(\sum _i^n w_i = 1\). As λ increases, the regularization term pushes the weights toward the center of the simplex. For instance, for λ = 0.8, we found that approximately \(40\%\) of the model neurons have weights *w*_{i} ≥ 2 × 10^{−3}. The subset of model neurons is selected by applying a threshold to the estimated weights, as proposed in Li, Sundar Rangapuram, and Slawski (2016), and then choosing the 103 neurons with the highest weights. However, a main difference from Li et al. (2016) is that our two-stage procedure is applied to the solution of (5) instead of (4). This approach also yields an excellent fit to the V2 data for the L2 model neurons, as shown in Figure 7 (third column). We used λ = 0.8 for all fits; lower λ increased the fitting error but did not alter the trends (and vice versa).
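The simplex-constrained fits can be sketched with projected gradient descent (a sketch under assumptions: we use a squared-norm penalty λ‖**w**‖², which is minimized at the simplex center, as the regularizer, and the standard Euclidean simplex projection; the paper's exact solver and regularizer may differ):

```python
import numpy as np

def project_simplex(w):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(w) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(w - theta, 0.0)

def simplex_fit(M, v, lam=0.0, steps=3000, lr=0.01):
    """Minimize 0.5*||M.T @ w - v||^2 + 0.5*lam*||w||^2 over the simplex.

    M: (n_neurons, n_families) model modulation indices.
    v: (n_families,) target V2 modulation indices.
    lam = 0 recovers the unregularized weighted-average fit; lam > 0
    pushes the weights toward the uniform (simplex-center) solution.
    """
    w = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(steps):
        grad = M @ (M.T @ w - v) + lam * w
        w = project_simplex(w - lr * grad)
    return w
```

Thresholding the resulting weights and keeping the 103 largest then gives the two-stage subset selection described in the text.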

*n*-space, the Euclidean distance \(\mathrm{d}(\mathbf {x}, \mathbf {y})\) is computed using the 2-norm as follows:

\[
\mathrm{d}(\mathbf {x}, \mathbf {y}) = \Vert \mathbf {x} - \mathbf {y} \Vert _2 = \sqrt{\sum _{i=1}^{n} (x_i - y_i)^2},
\]

where *n* is usually 15, the number of texture families, since we take the average over samples and/or model neurons. Lower Euclidean distances indicate a better fit of the model to the V2 data and therefore higher correspondence of the model to the brain.

*Visual Neuroscience*, 7, 531–546.

*Experimental Brain Research*, 189, 109–120.

*Nature*, 333, 363–364.

*PLoS Computational Biology*, 15, e1006897.

*PLoS Computational Biology*, 10, e1003963.

*Nature Reviews Neuroscience*, 13, 51–62.

*Nature Neuroscience*, 18, 1648–1655.

*Journal of Vision*, 13, 1–20, doi:10.1167/13.8.13.

*Journal of Neuroscience*, 31, 8543–8555.

*Nature Neuroscience*, 16, 974–981.

*Vision Research*, 32, 1409–1410.

*Journal of Neuroscience*, 35, 1005–10014.

*Journal of Neuroscience*, 21, 4490–4497.

*Visual Neuroscience*, 9, 181–197.

*Proceedings of the National Academy of Sciences USA*, 93, 623–627.

*Journal of Neuroscience*, 20, 1–6.

*Journal of Neuroscience*, 35, 10412–10428.

*Journal of Neuroscience*, 24, 3313–3324.

*Scientific American*, 232, 34–43.

*Nature*, 290, 91–97.

*PLoS Computational Biology*, 10, e1003915.

*Journal of Vision*, 19(4), 29, doi:10.1167/19.4.29.

*Annual Review of Vision Science*, 1, 417–446.

*Nature*, 521, 436–444.

*Proceedings of the IEEE*, 86, 2278–2324.

*Journal of the Optical Society of America A*, 923–932.

*Visual Neuroscience*, 31, 8543–8555.

*Neuron*, 78, 10780–10793.

*Proceedings of the National Academy of Sciences USA*, 112, E351–E360.

*Cerebral Cortex*, 27, 4867–4880.

*International Journal of Computer Vision*, 40, 49–71.

*International Journal of Computer Vision*, 115, 211–252.

*Journal of Neuroscience*, 30, 12978–12995.

*Neural Computation*, 31, 2138–2176.

*Nature Neuroscience*, 4, 819.

*IEEE Transactions on Systems, Man, and Cybernetics*, 8, 460–473.

*Journal of Machine Learning Research*, 15, 3221–3245.

*Journal of Machine Learning Research*, 9, 2579–2605.

*Journal of Vision*, 17(5), 1–29, doi:10.1167/17.12.5.

*Nature Neuroscience*, 19, 356–365.

*Proceedings of the National Academy of Sciences USA*, 111, 8619–8624.

*Neuron*, 47, 143–153.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 40(6), 1452–1464.

*Journal of Neuroscience*, 20, 6594–6611.

*Frontiers in Computational Neuroscience*, 11:100, 1–17.

*Proceedings of the National Academy of Sciences USA*, 113, E3140–E3149.

*Journal of Neurophysiology*, 120(2), 409–420.

*Selectivity of contextual modulation in macaque V1 and V2*. Paper presented at the Annual Meeting of the Society for Neuroscience, Washington, D.C.