Figure 4 summarizes the outcome of simulations when only horizontal displacements were used, consistent with our psychophysical experiments. When the CNN model included a GAP layer, perfect on-line translation invariance was observed across all displacements, regardless of pretraining or jitter (dashed red line in all four panels of Figure 4B). By contrast, when a standard CNN model was used (solid lines in Figure 4B), robust translation tolerance was only observed when the model was pretrained on ImageNet (Figure 4B, bottom panels), and even then there was a small reduction in performance following translations, with accuracy dropping from approximately 100% to approximately 85%. Jitter had only a minor effect on the pretrained model, but it improved the performance of the untrained model in the region of the jitter. To facilitate comparison between human and CNN performance, the black dashed line in the bottom right panel of Figure 4B plots mean human performance (collapsing over experiments) after converting each displacement (0, 3, 6, 9, and 18 degrees) to an equivalent measurement in pixels based on the proportion of the image size.
Clearly, the pretrained standard CNNs (without GAP) account for the human data most closely, and this is consistent with the fact that the humans in our experiment had extensive prior experience seeing familiar objects at multiple retinal locations.
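Why a GAP layer yields complete on-line translation invariance can be illustrated with a minimal numpy sketch (not the authors' code; a toy model assuming circular padding, under which convolution is exactly shift-equivariant): because global average pooling discards spatial position, the pooled output is identical for any translation of the input.

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
image = rng.random((16, 16))   # toy "retinal" input
kernel = rng.random((3, 3))    # toy convolutional filter

def conv_then_gap(img):
    # mode="wrap" (circular padding) makes the convolution exactly
    # shift-equivariant: shifting the input shifts the feature map
    fmap = convolve(img, kernel, mode="wrap")
    # global average pooling collapses all spatial locations to one number,
    # so the shift in the feature map no longer matters
    return fmap.mean()

# translate the input by an arbitrary displacement
shifted = np.roll(image, shift=(5, 7), axis=(0, 1))

print(np.isclose(conv_then_gap(image), conv_then_gap(shifted)))  # True
```

In a real CNN the pointwise nonlinearities between layers preserve this equivariance, so the same argument carries through the full network up to the GAP layer.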
We also repeated the simulations across a greater range of displacements. As illustrated in
Figure 5, each Leek image was tested on a 19 × 19 grid spanning the canvas, with every stimulus centered at each point of the grid. The results were averaged across 20 replications, and the untested points in the canvas were estimated through cubic interpolation. The results highlight even more clearly the limited on-line translation tolerance obtained with standard deep convolutional neural networks (DCNNs) without pretraining, the extreme on-line translation tolerance obtained with standard DCNNs pretrained on ImageNet, the limited impact of jitter, and the complete on-line translation invariance obtained with DCNNs that include a GAP layer, regardless of pretraining.
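The interpolation step above can be sketched as follows (a hypothetical reconstruction, not the authors' code; the accuracy values are random placeholders standing in for the averages over 20 replications):

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# accuracy measured at each of the 19 x 19 tested centre positions
# (placeholder values; in the simulations these are means over replications)
rows, cols = np.mgrid[0:19, 0:19]
points = np.column_stack([rows.ravel(), cols.ravel()])
accuracy = rng.uniform(0.5, 1.0, size=len(points))

# cubic interpolation estimates the untested canvas locations,
# producing a smooth heat map like the one shown in Figure 5
fine_r, fine_c = np.mgrid[0:18:100j, 0:18:100j]
heatmap = griddata(points, accuracy, (fine_r, fine_c), method="cubic")

print(heatmap.shape)  # (100, 100)
```

Since every query point lies inside the convex hull of the tested grid, the cubic interpolant is defined everywhere on the fine grid.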