**Primary visual cortex (V1) is the first stage of cortical image processing, and major effort in systems neuroscience is devoted to understanding how it encodes information about visual stimuli. Within V1, many neurons respond selectively to edges of a given preferred orientation: These are known as either simple or complex cells. Other neurons respond to localized center–surround image features. Still others respond selectively to certain image stimuli, but the specific features that excite them are unknown. Moreover, even for the simple and complex cells—the best-understood V1 neurons—it is challenging to predict how they will respond to natural image stimuli. Thus, there are important gaps in our understanding of how V1 encodes images. To fill these gaps, we trained deep convolutional neural networks to predict the firing rates of V1 neurons in response to natural image stimuli, and we find that the predicted firing rates are highly correlated with the neurons' measured firing rates.**

*simple* or *complex cells*, depending on how sensitive their responses are to shifts in the position of the edge. The simple and complex cells are well studied (Lehky, Sejnowski, & Desimone, 1992; David, Vinje, & Gallant, 2004; Montijn, Meijer, Lansink, & Pennartz, 2016). However, many V1 neurons are neither simple nor complex cells, and the classical models of simple and complex cells often fail to predict how those neurons will respond to naturalistic stimuli (Olshausen & Field, 2005). Thus, much of how V1 encodes visual information remains unknown. We use deep learning to address this longstanding problem.

*convolutional neural networks*, have achieved impressive success in increasingly difficult image-classification tasks (Krizhevsky, Sutskever, & Hinton, 2012; LeCun, Bengio, & Hinton, 2015). Recently, these artificial neural networks have been used to study the visual system (Yamins & DiCarlo, 2016), setting the state of the art for predicting stimulus-evoked neural activity in the retina (McIntosh, Maheswaranathan, Nayebi, Ganguli, & Baccus, 2016) and inferior temporal cortex (Yamins et al., 2014). Despite these successes, we have not yet achieved a full understanding of how V1 represents natural images.

*n*, we calculated the mean firing rate *A*_{n,i} evoked by each image *i* by averaging its firing rate across the 20 repeated presentations of that image. The firing rates were calculated over a window from 50 to 100 ms after the image was presented, to account for the signal-propagation delay from retina to V1 (Figure 1D; V1 firing rates increase dramatically at ∼50 ms after stimulus onset). We separately analyzed firing rates computed over a longer (100-ms) window, from 50 to 150 ms after stimulus onset; the results of that analysis are presented in the Discussion section.
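The rate computation described above can be sketched as follows. This is a minimal illustration; the function name and the nested `spike_times` layout are hypothetical stand-ins for the actual recording format:

```python
import numpy as np

def mean_rates(spike_times, win=(0.050, 0.100)):
    """Mean firing rate per image, averaged over repeated presentations.

    spike_times[i][r]: array of spike times (in seconds, relative to
    stimulus onset) for image i, repeat r -- a hypothetical data layout.
    Counts spikes in the 50-100 ms window and converts to spikes/s.
    """
    dur = win[1] - win[0]
    return np.array([
        np.mean([np.sum((t >= win[0]) & (t < win[1])) for t in reps]) / dur
        for reps in spike_times
    ])
```

Swapping `win=(0.050, 0.150)` reproduces the longer 100-ms analysis window mentioned above.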


*n*), where *i* is the image index, *A*_{n,i} is the measured response, and *y*_{n,i} is the network's predicted response. The neurons' losses are summed, yielding the total loss used by the optimizer. To ensure that the performance generalizes, the training data were subdivided into data used by the optimizer to train the weights (66% of the images) and another small subset (14% of the images) used to stop the training when accuracy stops improving (early stopping).
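As a sketch of this setup, the code below assumes a Poisson negative log-likelihood as the per-neuron loss (a common choice for firing-rate models; the paper's exact loss equation is not reproduced in this excerpt), summed over neurons, together with the 66%/14%/remainder data split:

```python
import numpy as np

def poisson_loss(y_pred, a_meas, eps=1e-8):
    # Assumed Poisson negative log-likelihood per neuron, dropping the
    # data-dependent log-factorial constant. y_pred and a_meas have
    # shape (n_images, n_neurons); losses are averaged over images and
    # then summed across neurons, as described in the text.
    per_term = y_pred - a_meas * np.log(y_pred + eps)
    return per_term.mean(axis=0).sum()

def split_indices(n_images, rng):
    # 66% to train the weights, 14% for early stopping, rest held out.
    idx = rng.permutation(n_images)
    n_train = int(0.66 * n_images)
    n_stop = int(0.14 * n_images)
    return (idx[:n_train], idx[n_train:n_train + n_stop],
            idx[n_train + n_stop:])
```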

_{norm} = 1.

_{max}, we followed a bootstrapping procedure (in contrast to Schoppe et al., 2016) where we generated fake data by drawing random numbers from Gaussian distributions with the same statistics as the measured neural data. For each neuron and image, we averaged over 20 of these values to obtain a simulated prediction. We then computed the correlation between these simulated predictions and the neurons' actual mean firing rates to find the maximum correlation CC_{max} possible given the variability in stimulus-evoked neural firing rates. While we acknowledge that neural firing rates are not Gaussian distributed, the CC_{max} estimate, being a second-order statistic of the neural firing rates (and their estimates via the predictor networks), is sensitive only to the first- and second-order statistics of the neural data. A Gaussian distribution captures these first- and second-order statistics while making as few assumptions as possible about the higher-order statistics in the data (i.e., it is a second-order *maximum entropy* model). As a result, our use of Gaussian distributions does not affect the reliability of our estimates of CC_{max}: Using more complex, harder-to-estimate probability distributions would yield the same result. For this reason, we are confident that our bootstrapping procedure, while slightly different from that of Schoppe et al., is comparable to their method.
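The bootstrapping procedure can be sketched as below. The function name, data layout, and number of bootstrap draws are illustrative assumptions, not details from the paper:

```python
import numpy as np

def cc_max_bootstrap(trial_rates, n_boot=1000, rng=None):
    # trial_rates: (n_images, n_repeats) single-trial firing rates for
    # one neuron. For each bootstrap draw, fake data are sampled from
    # Gaussians matching each image's mean and variance, averaged over
    # the repeats to form a simulated prediction, and correlated with
    # the measured mean rates. The average correlation estimates CC_max.
    rng = np.random.default_rng(rng)
    n_images, n_rep = trial_rates.shape
    mu = trial_rates.mean(axis=1)
    sd = trial_rates.std(axis=1, ddof=1)
    ccs = np.empty(n_boot)
    for b in range(n_boot):
        sim = rng.normal(mu[:, None], sd[:, None],
                         size=(n_images, n_rep)).mean(axis=1)
        ccs[b] = np.corrcoef(sim, mu)[0, 1]
    return ccs.mean()
```

For a highly reliable neuron (small trial-to-trial variance), this estimate approaches 1, reflecting the ceiling on any predictor's achievable correlation.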

*A*_{n,i} of neuron *n*, where *x*_{j,i} is the *j*th pixel value in image *i* and the constants *W*_{n,j} and *b* are determined from linear regression using LASSO regularization, a type of L1 (sparse) regularized linear regression. The LASSO regularization parameter was optimized on data from the same experimental session used to optimize the hyperparameters of CNN2. Then, leaving this parameter fixed, we evaluated the model using cross-validation on data from the other nine experimental sessions.
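A minimal version of this pixel-based model, using scikit-learn's `Lasso` as the L1-regularized solver (the paper's actual solver and regularization-parameter search are not specified in this excerpt):

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_pixel_model(images, rates, alpha=0.1):
    # images: (n_images, n_pixels) flattened pixel values x_{j,i};
    # rates: (n_images,) mean responses A_{n,i} of one neuron.
    # LASSO (L1-regularized) linear regression yields a sparse set of
    # pixel weights W_{n,j} and a bias b; alpha is the regularization
    # parameter that the paper tunes on a held-out session.
    model = Lasso(alpha=alpha, max_iter=10000)
    model.fit(images, rates)
    return model.coef_, model.intercept_
```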


*z*_{j,i} of each feature (indexed by *j*) for each image (indexed by *i*). We then constructed a linear predictor of the neuron firing rate from the activations of the sparse-coding features, with prediction weights *W* and biases *b* according to Equation 4, where the variables *z*_{i,j} are the sparse-coding feature activations.

We fit the same form of linear predictor, with weights *W* and biases *b* according to Equation 4, where the variables *z*_{i,j} are BWT wavelet activations.

*W* and biases *b* according to Equation 4, where the variables *z*_{i,j} are VGG activations within the given layer. The five VGG layers we considered are Conv2,1; Conv2,2; Conv3,1; Conv3,2; and Conv3,3 (where Conv*a*,*b* denotes convolutional layer *b* within block *a*).
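For any of these fixed feature bases (sparse-coding features, BWT wavelets, or a VGG layer), the linear readout of Equation 4 can be fit by least squares on precomputed activations. The sketch below adds a small ridge term for numerical stability, which is an assumption for illustration, not the paper's stated fitting procedure:

```python
import numpy as np

def fit_feature_readout(z, rates, ridge=1e-3):
    # z: (n_images, n_features) activations z_{i,j} from a fixed basis;
    # rates: (n_images,) measured mean responses of one neuron.
    # Solves the normal equations for the weights W and bias b of a
    # linear readout, with a tiny ridge penalty on W (not on b) so the
    # system stays well conditioned when features are correlated.
    Z = np.column_stack([z, np.ones(len(z))])  # append bias column
    reg = ridge * np.eye(Z.shape[1])
    reg[-1, -1] = 0.0  # leave the bias unpenalized
    wb = np.linalg.solve(Z.T @ Z + reg, Z.T @ rates)
    return wb[:-1], wb[-1]  # W, b
```

The same routine serves all three feature-based baselines; only the matrix of activations `z` changes.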

*n*, where *σ*(*x*) is a nonlinear function. A parametric rectified linear unit was chosen as the nonlinearity because it outperformed a parameterized sigmoid. The parameters of the model were trained in TensorFlow using the same learning process as for the convolutional models, with early stopping as the primary form of regularization.
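A parametric rectified linear unit is the identity for positive inputs and applies a learned slope to negative inputs. A minimal NumPy version for reference (the paper's TensorFlow implementation is not shown in this excerpt):

```python
import numpy as np

def prelu(x, a):
    # Parametric ReLU: x for x > 0, a * x otherwise, where the slope
    # parameter a is learned jointly with the other model parameters.
    return np.where(x > 0, x, a * x)
```

With `a = 0` this reduces to the ordinary rectified linear function used elsewhere in the network.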

*σ*(·) is the rectified linear function, and the superscripts *ℓ* on *A*, and their reliability CC_{max}.

*A*_{i} is the cell's firing rate, indexed by *i*, over the set of *N* images (Zylberberg & DeWeese, 2013). This index has a value of 0 for neurons that fire equally to all images and a value of 1 for cells that spike in response to only one of the images.
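The equation itself is not reproduced in this excerpt; one standard index with exactly these endpoints is the lifetime-sparseness form, sketched here as an illustration (an assumption, not necessarily the paper's exact formula):

```python
import numpy as np

def selectivity_index(rates):
    # rates: (N,) mean firing rates A_i across the N images.
    # Lifetime-sparseness form (assumed): equals 0 when the cell fires
    # equally to every image, and 1 when it fires to exactly one image,
    # matching the endpoints described in the text.
    r = np.asarray(rates, dtype=float)
    n = r.size
    num = 1.0 - (r.mean() ** 2) / np.mean(r ** 2)
    return num / (1.0 - 1.0 / n)
```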

*A*_{θ} is the neuron's firing rate in response to a grating oriented at angle *θ*. The circular variance is less sensitive to noise than the more commonly used orientation-selectivity index (Mazurek, Kager, & Van Hooser, 2014). Following the results of Mazurek et al., we used thresholds of circular variance < 0.6 to define orientation-selective cells (the simple and complex cells according to Hubel & Wiesel, 1959) and circular variance > 0.75 to define non-orientation-selective cells. We omitted all other cells from these two groupings.
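Circular variance is conventionally computed from the orientation-doubled resultant vector of the tuning curve; the sketch below uses that standard form (assumed here, since the paper's equation is not shown in this excerpt):

```python
import numpy as np

def circular_variance(theta_deg, rates):
    # theta_deg: grating orientations in degrees; rates: responses
    # A_theta at each orientation. Standard form: CV = 1 - |R|, where
    # R = sum(A * exp(2i*theta)) / sum(A). The factor of 2 accounts
    # for the 180-degree periodicity of orientation.
    th = np.deg2rad(np.asarray(theta_deg, dtype=float))
    a = np.asarray(rates, dtype=float)
    R = np.sum(a * np.exp(2j * th)) / np.sum(a)
    return 1.0 - np.abs(R)
```

An untuned cell (equal response at evenly spaced orientations) gives CV = 1; a cell responding at a single orientation gives CV = 0, consistent with the thresholds quoted above.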

_{max}. Comparing the predictability of each cell's firing rates with its respective image-selectivity index (Figure 5A) and circular variance (Figure 5B), we found that the predictability depends only weakly on these characteristics. Thus, orientation selectivity and image selectivity are only minor factors in determining how well our model performs.

_{max} (Figure 5F) are both strongly related to the model's performance. Cells with a low mean firing rate are poorly predicted: as the upper bound on predictability set by the neural reliability CC_{max} decreases, the model performance decreases by far more, meaning that overall the model does far worse at predicting the activity of these neurons. Selecting only the reliable neurons (CC_{max} > 0.80) yields improved predictability.

*M* ± *SEM*), compared with 8.4 ± 0.8 spikes/s for the well-isolated single units (estimated during the 50-ms spike-counting window). Recall that neurons with higher firing rates were generally more predictable (Figure 5C). We thus attribute the higher predictability of the multiunit clusters to their higher mean firing rates.

*Journal of Machine Learning Research*, 13, 281–305.

*bioRxiv*, 201764, https://doi.org/10.1101/201764.

*Nature Neuroscience*, 18 (11), 1648–1655.

*The Journal of Neuroscience*, 24 (31), 6991–7006.

*The Journal of Physiology*, 148 (3), 574–591.

*Advances in Neural Information Processing Systems*, 25, 1097–1105.

*Proceedings of the National Academy of Sciences, USA*, 99 (13), 8974–8979.

*Nature*, 521 (7553), 436–444.

*The Journal of Neuroscience*, 12 (9), 3568–3581.

*IEEE Conference on Computer Vision and Pattern Recognition*, 5188–5196.

*Frontiers in Neural Circuits*, 8, 92.

*Advances in Neural Information Processing Systems*, 29, 1369–1377.

*Cell Reports*, 16 (9), 2486–2498.

*Nature*, 381, 607–609.

*Neural Computation*, 17 (8), 1665–1699.

*Optics Letters*, 40 (11), 2553–2556.

*Neural Networks*, 17 (5), 663–679.

*Frontiers in Computational Neuroscience*, 10, 10.

*arXiv*, 1409.1556.

*Journal of Machine Learning Research*, 15, 1929–1958.

*The Journal of Neuroscience*, 35 (44), 14829–14841.

*The Journal of Neuroscience*, 30 (6), 2102–2114.

*Neural Computation*, 20 (6), 1537–1564.

*Nature Neuroscience*, 19 (3), 356–365.

*Proceedings of the National Academy of Sciences, USA*, 111 (23), 8619–8624.

*Proceedings of the National Academy of Sciences, USA*, 113 (22), E3140–E3149.

*PLoS Computational Biology*, 9 (8), 1–10.

*PLoS Computational Biology*, 7 (10), 1–12.

^{2}In our experience, SAILnet is much faster to train than SparseNet. We used the publicly available SAILnet code out of the box (http://www.jzlab.org/sailcodes.html), without changing any parameter values except image size.