Talia Konkle, George Alvarez; Deepnets do not need category supervision to predict visual system responses to objects. Journal of Vision 2020;20(11):498. doi: https://doi.org/10.1167/jov.20.11.498.
© ARVO (1962-2015); The Authors (2016-present)
Substantial progress has been made in understanding the nature of the representations along the visual hierarchy by focusing on the goal of core object recognition. As evidence, many deep neural networks trained only to do object categorization develop internal representations that are closely aligned with primate cortical responses. However, these recognition-trained deepnets receive extensive supervision with millions of object category labels, unlike natural visual experience. Here we test the hypothesis that visual system responses may be better captured under a different, unsupervised goal: to remember and uniquely represent everything the system sees.
To examine this possibility, we trained sets of models that shared a base architecture (alexnet, resnet18, or corrnet) but differed in training task. Supervised models were trained on 1000-way object categorization. Unsupervised models were trained to maximize instance-level discrimination within a fixed-size memory bank (leveraging and extending Wu et al., 2018). The size of the memory bank was varied across models to have 128, 256, or 1000 dimensions. Human brain responses to a set of 72 inanimate object images were measured with functional magnetic resonance imaging (n=10), and we calculated the similarity structure evident in both the responses of object-selective cortex and all deepnet model layers.
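As an illustration of the unsupervised objective, the following is a minimal sketch of an instance-level discrimination loss in the spirit of Wu et al. (2018): each image is treated as its own class, and its embedding is scored against every entry of a memory bank with a temperature-scaled softmax. The function names, the temperature of 0.07, and the toy two-dimensional bank are assumptions for illustration only. The full softmax is used here for clarity, whereas Wu et al. approximate it with noise-contrastive estimation, and the models described above extend that approach in ways not detailed here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def instance_discrimination_loss(embedding, memory_bank, target_index,
                                 temperature=0.07):
    """Negative log-probability that `embedding` is recognized as instance
    `target_index` among all instances stored in `memory_bank`.

    Hypothetical helper: the temperature value is an assumption, not a
    setting reported in the abstract.
    """
    v = l2_normalize(embedding)
    logits = [
        sum(a * b for a, b in zip(v, l2_normalize(stored))) / temperature
        for stored in memory_bank
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_sum_exp - logits[target_index]


# Toy memory bank with three stored instances (illustrative values only).
bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.9, 0.1]  # embedding of a new view of instance 0

loss_match = instance_discrimination_loss(query, bank, target_index=0)
loss_mismatch = instance_discrimination_loss(query, bank, target_index=1)
```

Minimizing this loss pulls an image's embedding toward its own memory-bank entry and away from all others, so the loss is lower when the query is paired with the instance it resembles.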
Overall, we observed that human object-selective cortex responses were captured equally well or better by the unsupervised instance-level deepnets than by the supervised categorization deepnets, particularly in the later representational stages. These results were robust across memory-bank sizes and base network architectures.
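One common way to quantify how well a model layer captures cortical responses is to compare similarity structure: compute the pairwise dissimilarity between items separately for the brain and for the layer, then correlate the two sets of dissimilarities. The sketch below assumes correlation distance (1 minus Pearson r) for the item pairs and a Pearson correlation for the final comparison; the abstract does not state which metrics were used, and the function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def dissimilarity_vector(patterns):
    """Upper triangle of the item-by-item dissimilarity matrix,
    using 1 - Pearson r between each pair of response patterns."""
    n = len(patterns)
    return [
        1.0 - pearson(patterns[i], patterns[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def representational_fit(brain_patterns, layer_patterns):
    """Correlation between brain and model-layer similarity structure.
    Rows are items (images); columns are voxels or model units."""
    return pearson(
        dissimilarity_vector(brain_patterns),
        dissimilarity_vector(layer_patterns),
    )


# Toy data: 4 items, 3 voxels for the brain and 4 units for the layer.
brain = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 1.0, 2.0], [2.9, 1.2, 2.1]]
layer = [[0.5, 1.0, 1.5, 0.2], [0.6, 1.1, 1.4, 0.1],
         [1.5, 0.2, 0.9, 1.0], [1.4, 0.3, 1.0, 1.1]]

fit = representational_fit(brain, layer)
```

Because the comparison is made on item-pair dissimilarities rather than raw responses, the brain and model can have different numbers of measurement channels, which is why the same procedure can be applied to every layer of every network.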
These data provide the first evidence, to our knowledge, that unsupervised networks can fit brain responses better than matched supervised networks. Further, these data provide computational support for the broader hypothesis that the visual system’s goal may be better conceptualized as providing unique, compressed descriptions of the visual world.