Furthermore, we aimed to disentangle the contributions of training and architecture to the observed representational similarities between types of depiction. To this end, we ran our stimuli through a variant of VGG-16 with randomly initialized weights and obtained RDMs for the different types of depiction across layers. Correlating these RDMs between types of depiction, we found that correlations differed significantly across layers for all comparisons (all
p < 0.001, one-sided randomization test). Pairwise tests revealed an increase in correlation for all comparisons between pooling layer one and pooling layer four (all
p < 0.003, one-sided randomization test, FDR-corrected;
Figure 5). After pooling layer four, similarities either increased further for the photo-to-drawing comparison (
p = 0.041, one-sided randomization test, FDR-corrected) or remained at a level that was not significantly different for the photo-to-sketch and drawing-to-sketch comparisons (all
p > 0.318). Next, we directly compared the similarities from the randomly initialized VGG-16 with those from the ImageNet-trained variant (VGG-16 IN). For the photo-to-drawing comparison, similarities in the early layers up to pooling layer four were either not significantly different between the networks or higher in VGG-16 IN (pool 1:
p = 0.263; pool 2:
p = 0.005; pool 3:
p = 0.033; and pool 4: p = 0.059, one-sided randomization test, FDR-corrected). This pattern reversed in the late layers from pooling layer five onward, with significantly lower similarities in VGG-16 IN than in the randomly initialized network (all
p < 0.005). A similar pattern was found for the photo-to-sketch similarity: in the early layers up to pooling layer four, there were no significant differences between the networks (all
p > 0.123, one-sided randomization test, FDR-corrected), whereas, in late layers, similarities were lower for VGG-16 IN (all
p = 0.006). Finally, for the drawing-to-sketch comparison, there were no significant differences in similarity between the two networks (all
p > 0.163). In sum, these results suggest that at least part of the representational similarity between photos and both drawings and sketches in early and intermediate layers can be accounted for by the architecture of the network. Yet, for the photo-to-drawing comparison, training increased the similarities even further in some of these layers (pooling layers two and three). In contrast, the drop in representational similarities in late layers cannot be explained by the architecture alone and may be related to biases induced by the training of the network. Finally, the similarity between drawings and sketches appears to be unaffected by training.
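
The analysis steps described above (one RDM per type of depiction and layer, correlation of RDMs between types of depiction, FDR correction of the resulting p-values) can be illustrated with a minimal sketch. It rests on assumptions not stated in this section: dissimilarity is taken as 1 minus the Pearson correlation between activation patterns, RDMs are compared with a Pearson correlation over their upper triangles, and FDR correction follows the Benjamini-Hochberg procedure. The function names and the synthetic activations are hypothetical, and the randomization test itself is left out because its procedure is not described here.

```python
import numpy as np


def rdm(features):
    """RDM from an (n_stimuli, n_units) activation matrix.

    Assumption: dissimilarity = 1 - Pearson r between stimulus patterns.
    """
    return 1.0 - np.corrcoef(features)


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries of a square matrix as a vector."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def rdm_similarity(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs (assumed metric)."""
    return np.corrcoef(upper_triangle(rdm_a), upper_triangle(rdm_b))[0, 1]


def fdr_bh(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    result = np.empty(m)
    result[order] = np.minimum(adjusted, 1.0)
    return result


# Synthetic stand-ins for the activations of one layer (hypothetical data):
# 12 stimuli, 50 units; "drawings" are noisy copies of the "photos".
rng = np.random.default_rng(0)
photo_features = rng.normal(size=(12, 50))
drawing_features = photo_features + 0.5 * rng.normal(size=(12, 50))

photo_rdm = rdm(photo_features)
drawing_rdm = rdm(drawing_features)
similarity = rdm_similarity(photo_rdm, drawing_rdm)

# Example p-values (made up for illustration), adjusted across four layers.
adjusted = fdr_bh([0.01, 0.04, 0.03, 0.005])
```

In practice, the activation matrices would come from the pooling layers of the randomly initialized and the ImageNet-trained network, and `rdm_similarity` would be evaluated once per layer and per pair of depiction types before any statistical comparison.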