Abstract
Convolutional neural networks (CNNs) have become a standard for modeling several aspects of human visual processing, especially natural object classification, where they rival humans in performance. Most recent work on improving the correspondence between CNNs and humans has focused on low-level architectural modifications and has paid less attention to changes in training supervision. We identify one way in which the network's training objective differs greatly from that of humans: CNNs are almost exclusively trained on fine-grained, subordinate-level labels (e.g., Dalmatian), whereas humans also make use of coarser-grained, basic-level labels (e.g., dog) that unify otherwise perceptually divergent subordinate classes. Through a series of experiments, we show that the level of abstraction of the labels used to train a network largely determines how it generalizes and, consequently, how closely its generalization behavior corresponds to that of humans.
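To make the label-abstraction manipulation concrete, the sketch below (not taken from the paper) shows one way subordinate-level class names might be collapsed into basic-level training targets before fitting a classifier; the class names, mapping, and function names are hypothetical and purely illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from fine-grained, subordinate-level classes
# to coarse-grained, basic-level classes (illustrative only).
SUBORDINATE_TO_BASIC: Dict[str, str] = {
    "Dalmatian": "dog",
    "Siberian husky": "dog",
    "golden retriever": "dog",
    "tabby cat": "cat",
    "Siamese cat": "cat",
    "fire truck": "truck",
    "pickup truck": "truck",
}


def relabel_to_basic_level(
    subordinate_labels: List[str],
) -> Tuple[List[int], Dict[str, int]]:
    """Collapse subordinate-level string labels into basic-level integer targets.

    Returns the new integer targets and the basic-level class index, so a
    classifier can be trained with a smaller, coarser output layer.
    """
    basic_names = sorted(set(SUBORDINATE_TO_BASIC.values()))
    basic_index = {name: i for i, name in enumerate(basic_names)}
    targets = [basic_index[SUBORDINATE_TO_BASIC[label]] for label in subordinate_labels]
    return targets, basic_index


if __name__ == "__main__":
    labels = ["Dalmatian", "Siamese cat", "pickup truck", "golden retriever"]
    targets, index = relabel_to_basic_level(labels)
    print(index)    # {'cat': 0, 'dog': 1, 'truck': 2}
    print(targets)  # [1, 0, 2, 1]
```

Training on these coarser targets, versus the original subordinate-level ones, is the kind of supervision change whose effect on generalization the experiments examine.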