August 2023
Volume 23, Issue 9
Open Access
Vision Sciences Society Annual Meeting Abstract
Evaluating machine comprehension of sketch meaning at different levels of abstraction
Author Affiliations & Notes
  • Xuanchen Lu
    University of California, San Diego
  • Kushin Mukherjee
    University of Wisconsin-Madison
  • Rio Aguina-Kang
    University of California, San Diego
  • Holly Huey
    University of California, San Diego
  • Judith E. Fan
    University of California, San Diego
  • Footnotes
    Acknowledgements  This work was supported by NSF CAREER Award #2047191.
Journal of Vision August 2023, Vol.23, 5938. doi:
Xuanchen Lu, Kushin Mukherjee, Rio Aguina-Kang, Holly Huey, Judith E. Fan; Evaluating machine comprehension of sketch meaning at different levels of abstraction. Journal of Vision 2023;23(9):5938.

      © ARVO (1962-2015); The Authors (2016-present)


Human visual understanding is not unitary, but spans multiple levels of abstraction. For example, people know that real-world objects possess both distinctive features and features shared with other members of the same category. One of the most salient manifestations of this ability is in the domain of sketch comprehension: while one sketch may be evocative of a general category such as “flower” without resembling any particular flower, another sketch may faithfully depict a specific rose. To what degree are current vision algorithms sensitive to such variation in the degree of semantic abstraction expressed in sketches? Here we leveraged a recently collected dataset containing 6,144 sketches of 1,024 real-world objects belonging to 32 categories (Yang & Fan, 2021). Half of these drawings were intended to depict a general category, while the other half were intended to depict a specific object shown in a color photograph. We measured a sketch’s “genericity” as its ability to selectively evoke its target category without strongly evoking any individual exemplar within that category; similarly, we measured a sketch’s “specificity” as its ability to evoke the exemplar it was meant to depict but not other exemplars. Using this dataset and these metrics, we fit mixed-effects linear regression models to evaluate two state-of-the-art vision models: VGG-19, a convolutional neural network trained on image categorization; and CLIP-ViT, a transformer-based model trained to match images with their text descriptions. We found that the latent representation in CLIP-ViT was more sensitive overall to these sketches’ genericity (p<.05) and specificity (p<.001) than that learned by VGG-19. Additionally, we found that the degree to which exemplar sketches achieved higher specificity than category sketches was greater for CLIP-ViT than for VGG-19 (p<.001). More broadly, this work provides a general protocol for evaluating how well models emulate key aspects of human sketch comprehension.
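The abstract does not give formulas for the two metrics, but one plausible reading can be sketched in code. The sketch below (a hypothetical illustration, not the authors' actual analysis pipeline) assumes each sketch and photograph has already been embedded as a vector by some vision model, that a category is summarized by the mean of its exemplar embeddings, and that "evoking" is operationalized as cosine similarity. Under those assumptions, genericity rewards similarity to the category prototype while penalizing a strong match to any single exemplar, and specificity rewards similarity to the target exemplar relative to the other exemplars in the category.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def genericity(sketch, category_prototype, exemplars):
    """Hypothetical metric: category match minus the strongest
    match to any individual exemplar in that category."""
    return cosine(sketch, category_prototype) - max(
        cosine(sketch, e) for e in exemplars
    )

def specificity(sketch, target_exemplar, other_exemplars):
    """Hypothetical metric: match to the intended exemplar minus
    the average match to the category's other exemplars."""
    return cosine(sketch, target_exemplar) - float(
        np.mean([cosine(sketch, o) for o in other_exemplars])
    )

# Synthetic stand-ins for model embeddings (32 exemplars, 64-dim).
rng = np.random.default_rng(0)
exemplars = rng.normal(size=(32, 64))
prototype = exemplars.mean(axis=0)

# A sketch nearly identical to exemplar 0: it should score high on
# specificity (it singles out one exemplar) and low on genericity.
sketch = exemplars[0] + 0.01 * rng.normal(size=64)
spec = specificity(sketch, exemplars[0], exemplars[1:])
gen = genericity(sketch, prototype, exemplars)
```

With per-sketch scores like these in hand, a mixed-effects regression (as in the abstract) could then test whether sketch type (category vs. exemplar) predicts the scores, with random effects for category and sketcher.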

