October 2020
Volume 20, Issue 11
Open Access
Vision Sciences Society Annual Meeting Abstract | October 2020
Finding meaning in simple sketches: How do humans and deep networks compare?
Author Affiliations
  • Kushin Mukherjee
    University of Wisconsin-Madison
  • Timothy T. Rogers
    University of Wisconsin-Madison
Journal of Vision October 2020, Vol.20, 1026. doi:https://doi.org/10.1167/jov.20.11.1026

      © ARVO (1962-2015); The Authors (2016-present)


Picasso famously showed that a single unbroken line, curved and angled just so, can depict a dog, penguin, or camel for the human viewer. What accounts for the ability to discern meaning in such abstract stimuli? Deep convolutional image classifiers suggest one possibility: perhaps the visual system, in learning to recognize real objects, acquires features sufficiently flexible to capture meaningful structure in much simpler figures. Despite training only on color photographs of real objects, such models can recognize simple sketches at human levels of performance (Fan, Yamins, & Turk-Browne, 2018). We consider whether the internal representations arising in such a model can explain the perceptual similarities people discern in sketches of common items. Using a triadic comparison task, we crowdsourced similarity judgments for 128 sketches drawn from four categories: birds, cars, chairs, and dogs (Mukherjee, Hawkins, & Fan, 2019). On each trial, participants decided which of two sketches was more perceptually similar to a third. From thousands of judgments we computed low-dimensional nonmetric embeddings, then compared these human-derived embeddings to representational structures extracted for the same sketches from the deepest fully connected layer of the VGG-19 image classifier. VGG-19 representations predicted human triadic comparison judgments with 59% accuracy, which is reliably better than the chance level of 50% but still quite poor. By comparison, embeddings derived from human judgments predicted held-out judgments with 75% accuracy. Two-dimensional embeddings derived from VGG-19 representations and from the triadic comparisons differed starkly, with semantic category structure dominating the human-derived embedding but only weakly discernible in the network representations. And yet network representations reliably captured some semantic elements: latent components predicted whether a given sketch depicted a living or non-living thing with 90% accuracy. Thus, while the visual features extracted by VGG-19 discern some semantic structure in sketches, they provide only a limited account of the human ability to find meaning in abstract visual stimuli.
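
To make the accuracy figures above concrete, the following is a minimal sketch of how a set of feature vectors might be scored against triadic comparison judgments. It is illustrative only: the abstract does not state the distance measure or the scoring procedure, so the use of Euclidean distance, the function names (`euclidean`, `triplet_accuracy`), and the toy two-dimensional features are all assumptions rather than the authors' method. A triplet counts as correctly predicted when the sketch the participant chose lies closer to the target than the sketch they rejected, which is why random features would be expected to score about 50%.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_accuracy(features, triplets):
    """Fraction of triplets on which the features agree with the human choice.

    features: dict mapping a sketch id to its feature vector (list of floats).
    triplets: list of (target, chosen, rejected) ids, meaning a participant
              judged `chosen` to be more similar to `target` than `rejected`.
    Distance metric (Euclidean) is an assumption, not taken from the abstract.
    """
    correct = 0
    for target, chosen, rejected in triplets:
        d_chosen = euclidean(features[target], features[chosen])
        d_rejected = euclidean(features[target], features[rejected])
        if d_chosen < d_rejected:
            correct += 1
    return correct / len(triplets)


# Hypothetical toy data: 2-D features for four sketches.
features = {
    "bird_1": [0.0, 0.0],
    "bird_2": [0.1, 0.2],
    "car_1": [1.0, 1.0],
    "car_2": [1.1, 0.9],
}

# Hypothetical judgments in (target, chosen, rejected) order.
triplets = [
    ("bird_1", "bird_2", "car_1"),  # features agree with the choice
    ("car_1", "car_2", "bird_2"),   # features agree with the choice
    ("bird_2", "car_1", "bird_1"),  # features disagree with the choice
    ("car_2", "car_1", "bird_1"),   # features agree with the choice
]

accuracy = triplet_accuracy(features, triplets)
print(accuracy)  # 0.75 for this toy data
```

The same function could in principle be applied both to network-derived features and to a human-derived embedding evaluated on held-out triplets, which is the kind of comparison the 59% and 75% figures describe.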

