Abstract
Human visual understanding is not unitary, but spans multiple levels of abstraction. For example, people know that real-world objects possess both distinctive features and features shared with other members of the same category. One of the most salient manifestations of this capacity for multi-level understanding is in the domain of sketch comprehension: while one sketch may be evocative of a general category such as “flower” without resembling any particular flower, another sketch may faithfully depict a specific rose. To what degree are current vision algorithms sensitive to such variation in the level of semantic abstraction expressed in sketches? Here we leveraged a recently collected dataset containing 6,144 sketches of 1,024 real-world objects belonging to 32 categories (Yang & Fan, 2021). Half of these drawings were intended to depict a general category, while the other half were intended to depict a specific object shown in a color photograph. We measured a sketch’s “genericity” as its ability to selectively evoke its target category without strongly evoking any particular exemplar within that category; similarly, we measured a sketch’s “specificity” as its ability to evoke the exemplar it was meant to depict but not other exemplars. Using this dataset and these metrics, we fit mixed-effects linear regression models to evaluate two state-of-the-art vision models: VGG-19, a convolutional neural network trained on image categorization tasks; and CLIP-ViT, a transformer-based model trained to distinguish individual image-text pairs. We found that the latent representation learned by CLIP-ViT was more sensitive overall to these sketches’ genericity (p<.05) and specificity (p<.001) than that learned by VGG-19. Additionally, the degree to which exemplar sketches achieved higher specificity than category sketches was greater for CLIP-ViT than for VGG-19 (p<.001). More broadly, this work provides a general protocol for evaluating how well vision models emulate key aspects of human sketch comprehension.
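To make the abstract’s metric definitions concrete, the following is a minimal sketch of how “genericity” and “specificity” might be operationalized, assuming unit-length embeddings from a frozen encoder (e.g., VGG-19’s penultimate layer or CLIP-ViT’s image embedding) and cosine similarity as the measure of how strongly a sketch “evokes” a category or exemplar. The function names, inputs, and particular difference scores are illustrative assumptions, not the exact formulation used in the paper.

```python
# Illustrative sketch only: cosine similarity stands in for "evoking";
# the paper's exact scoring procedure may differ.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def genericity(sketch_emb, category_protos, target_cat, exemplar_embs):
    """High when the sketch evokes its target category prototype more than
    other categories, without strongly evoking any single exemplar."""
    target_sim = cosine(sketch_emb, category_protos[target_cat])
    other_cat_sim = np.mean([cosine(sketch_emb, proto)
                             for cat, proto in category_protos.items()
                             if cat != target_cat])
    max_exemplar_sim = max(cosine(sketch_emb, e)
                           for e in exemplar_embs[target_cat])
    return (target_sim - other_cat_sim) - max_exemplar_sim

def specificity(sketch_emb, target_exemplar_emb, other_exemplar_embs):
    """High when the sketch evokes the exemplar it was meant to depict more
    than other exemplars from the same category."""
    target_sim = cosine(sketch_emb, target_exemplar_emb)
    other_sim = np.mean([cosine(sketch_emb, e) for e in other_exemplar_embs])
    return target_sim - other_sim
```

Scores of this form, computed separately from each model’s embeddings, could then serve as the dependent variables in the mixed-effects regressions described above.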