Abstract
Deep convolutional networks (DCNs) have been proposed as useful models of the ventral visual processing stream. This study evaluates whether such models can capture the rich semantic similarities that people discern amongst photographs of familiar objects. We first created a new dataset that merges representative images of everyday concepts (taken from Ecoset) with the large set of semantic feature norms collected by the Leuven group. The resulting set includes ~300,000 images depicting items in 86 different semantic categories: 46 animate (reptiles, insects, and mammals) and 40 inanimate (vehicles, instruments, tools, and kitchen items). Each category is also associated with values on a set of ~2000 semantic features generated by human raters in a prior study. We then trained two variants of the AlexNet architecture on these items: one that learned to activate only the corresponding category label, and a second that learned to generate all of an item’s semantic features. Finally, we evaluated how accurately the learned representations in each model could predict human decisions in a triplet-judgment task conducted with photographs from the training set. Both models predicted human triplet judgments better than chance, but the model trained to output semantic feature vectors performed better and captured more levels of semantic similarity. Neither model, however, performed as well as an embedding computed directly from the semantic feature norms themselves. The results suggest that deep convolutional image classifiers alone do a poor job of capturing the semantic similarity structure that drives human judgments, but that changes to the training task, in particular training on output vectors that express richer semantic structure, can largely overcome this limitation.
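
To make the two training objectives and the triplet evaluation concrete, the following is a minimal sketch assuming a PyTorch/torchvision implementation of AlexNet. The category count (86) and feature count (~2000) come from the abstract; the loss functions, layer choices, and the cosine-similarity rule for predicting the odd item are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (not the authors' code) of the two AlexNet training variants
# and a simple odd-one-out rule for triplet evaluation.
import torch
import torch.nn as nn
from torchvision.models import alexnet

N_CATEGORIES = 86    # animate + inanimate categories in the merged dataset
N_FEATURES = 2000    # approximate size of the Leuven semantic feature norms

def make_label_model():
    """Variant 1: classifier that activates only the corresponding category label."""
    model = alexnet(weights=None)
    model.classifier[6] = nn.Linear(4096, N_CATEGORIES)  # replace 1000-way head
    return model, nn.CrossEntropyLoss()

def make_feature_model():
    """Variant 2: trained to generate the item's full semantic feature vector."""
    model = alexnet(weights=None)
    model.classifier[6] = nn.Linear(4096, N_FEATURES)    # one unit per feature
    return model, nn.BCEWithLogitsLoss()                 # multi-label target (assumed loss)

def odd_one_out(embeddings):
    """Predict the odd item in a triplet as the one least similar to the other two.
    `embeddings` is a (3, d) tensor of penultimate-layer activations."""
    sims = torch.nn.functional.cosine_similarity(
        embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)  # (3, 3) similarity matrix
    pair_sims = sims.sum(dim=1) - sims.diagonal()  # similarity of each item to the other two
    return int(pair_sims.argmin())
```

A model's predicted odd-one-out choices can then be compared against human responses on the same image triplets to estimate how well each learned representation captures human semantic similarity judgments.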