August 2023, Volume 23, Issue 9
Open Access
Vision Sciences Society Annual Meeting Abstract
Can deep convolutional networks explain the semantic structure that humans see in photographs?
Author Affiliations
  • Siddharth Suresh
    University of Wisconsin-Madison, Department of Psychology
    McPherson Eye Research Institute
  • Kushin Mukherjee
    University of Wisconsin-Madison, Department of Psychology
    McPherson Eye Research Institute
  • Timothy T. Rogers
    University of Wisconsin-Madison, Department of Psychology
    McPherson Eye Research Institute
Journal of Vision August 2023, Vol.23, 5634. doi:https://doi.org/10.1167/jov.23.9.5634
Abstract

Deep convolutional networks (DCNs) have been proposed as useful models of the ventral visual processing stream. This study evaluates whether such models can capture the rich semantic similarities that people discern among photographs of familiar objects. We first created a new dataset that merges representative images of everyday concepts (taken from the Ecoset database) with the large semantic feature set collected by the Leuven group. The resulting set includes ~300,000 images depicting items in 86 different semantic categories, including 46 animate categories (reptiles, insects, and mammals) and 40 inanimate categories (vehicles, instruments, tools, and kitchen items). Each category is also associated with values on a set of ~2000 semantic features, generated by human raters in a prior study. We then trained two variants of the AlexNet architecture on these items: one that learned to activate just the corresponding category label, and a second that learned to generate all of an item's semantic features. Finally, we evaluated how accurately the learned representations in each model could predict human decisions in a triplet-judgment task conducted using photographs from the training set. Both models predicted some human triplet judgments better than chance, but the model trained to output semantic feature vectors performed better and captured more levels of semantic similarity. Neither model, however, performed as well as an embedding computed directly from the semantic feature norms themselves. The results suggest that deep convolutional image classifiers alone do a poor job of capturing the semantic similarity structure that drives human judgments, but that changes to the training task, in particular training on output vectors that express richer semantic structure, can largely overcome this limitation.
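The modeling setup lends itself to a compact sketch. The Python/PyTorch snippet below illustrates the two AlexNet variants and a triplet read-out of the kind described above; it is not the authors' code. The category count (86) and approximate feature count (~2000) come from the abstract, while the loss functions, the embed helper, and all variable names are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models import alexnet

NUM_CATEGORIES = 86    # semantic categories in the merged dataset (from the abstract)
NUM_FEATURES = 2000    # approximate size of the Leuven semantic feature set (assumed exact value)

# Variant 1: standard classifier trained to activate only the category label.
label_model = alexnet(num_classes=NUM_CATEGORIES)
label_loss = nn.CrossEntropyLoss()

# Variant 2: identical backbone, but the output layer is trained to reproduce
# the item's full (binary) semantic feature vector.
feature_model = alexnet(num_classes=NUM_FEATURES)
feature_loss = nn.BCEWithLogitsLoss()

def predict_triplet(embed, anchor, option_a, option_b):
    """Given an embedding function mapping an image tensor to a vector,
    choose which of two options is more similar to the anchor image."""
    with torch.no_grad():
        z_anchor, z_a, z_b = embed(anchor), embed(option_a), embed(option_b)
    sim_a = nn.functional.cosine_similarity(z_anchor, z_a, dim=-1)
    sim_b = nn.functional.cosine_similarity(z_anchor, z_b, dim=-1)
    return "a" if sim_a > sim_b else "b"

In such a setup, the embedding passed to predict_triplet could be read from any learned layer of either trained network; the abstract does not specify which representations were used, so that choice is left open here.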
