Kushin Mukherjee, Timothy T. Rogers; Finding meaning in simple sketches: How do humans and deep networks compare? Journal of Vision 2020;20(11):1026. https://doi.org/10.1167/jov.20.11.1026.
Picasso famously showed that a single unbroken line, curved and angled just so, can depict a dog, penguin, or camel for the human viewer. What accounts for the ability to discern meaning in such abstract stimuli? Deep convolutional image classifiers suggest one possibility: perhaps the visual system, in learning to recognize real objects, acquires features sufficiently flexible to capture meaningful structure from much simpler figures. Despite training only on color photographs of real objects, such models can recognize simple sketches at human levels of performance (Fan, Yamins, & Turk-Browne, 2018). We consider whether the internal representations arising in such a model can explain the perceptual similarities people discern in sketches of common items. Using a triadic comparison task, we crowdsourced similarity judgments for 128 sketches drawn from 4 categories—birds, cars, chairs, and dogs (Mukherjee, Hawkins, & Fan, 2019). On each trial, participants decided which of two sketches was most perceptually similar to a third. From thousands of judgments we computed low-dimensional nonmetric embeddings, then compared these human-derived embeddings to representational structures extracted for the same sketches from the deepest fully-connected layer of the VGG-19 image classifier. VGG-19 representations predicted human triadic comparison judgments with 59% accuracy, reliably better than chance but still quite poor given chance performance of 50%. Embeddings derived from human judgments predicted held-out judgments with 75% accuracy. 2D embeddings derived from VGG-19 representations versus from triadic comparisons differed starkly, with semantic category structure dominating the human-derived embedding and only weakly discernible in network representations. Yet network representations reliably captured some semantic elements: latent components predicted whether a given sketch depicted a living or non-living thing with 90% accuracy.
Thus while the visual features extracted by VGG-19 discern some semantic structure in sketches, they provide only a limited account of the human ability to find meaning in abstract visual stimuli.
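The prediction scheme described above can be sketched in a few lines: given feature vectors for three sketches (e.g., activations from a network layer or coordinates in a human-derived embedding), a triad judgment is predicted by comparing distances from the reference, and accuracy is the fraction of trials where this prediction matches the human choice. This is a minimal illustration using random toy vectors, not the authors' actual pipeline; the distance metric and feature dimensionality are assumptions.

```python
import numpy as np

def predict_triad(ref, a, b):
    """Predict which of two sketches is more similar to the reference,
    by comparing Euclidean distances in the feature space."""
    return "a" if np.linalg.norm(ref - a) < np.linalg.norm(ref - b) else "b"

def triad_accuracy(triads, choices):
    """Fraction of triads where the distance-based prediction matches
    the human choice ('a' or 'b')."""
    preds = [predict_triad(r, a, b) for (r, a, b) in triads]
    return np.mean([p == c for p, c in zip(preds, choices)])

# Toy feature vectors standing in for fc-layer activations (hypothetical data).
rng = np.random.default_rng(0)
ref = rng.normal(size=4096)
near = ref + 0.1 * rng.normal(size=4096)  # perturbed copy: close to ref
far = rng.normal(size=4096)               # unrelated vector: far from ref

print(predict_triad(ref, near, far))  # prints "a"
```

Chance performance is 50% because each triad is a two-alternative forced choice; the reported 59% (VGG-19) and 75% (human embedding) accuracies are both computed against held-out human judgments on this kind of task.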