Abstract
A major obstacle to understanding human visual object recognition is our lack of behaviourally faithful models. Even the best models based on deep learning classifiers deviate strikingly from human perception in many ways. To study this deviation in more detail, we collected a massive set of human psychophysical classification data under highly controlled conditions (17 datasets, 85K trials across 90 observers). We made these data publicly available as an open-source Python toolkit and behavioural benchmark called "model-vs-human", which we use to investigate the latest generation of models. In terms of robustness, standard machine vision models make many more errors than humans on distorted images, and in terms of image-level consistency, they make very different errors than humans do. Excitingly, however, a number of recent models make substantial progress towards closing this behavioural gap: "simply" training models on large-scale datasets (between one and three orders of magnitude larger than standard ImageNet) is sufficient to, first, reach or surpass human-level distortion robustness and, second, improve image-level error consistency between models and humans. This is significant given that none of these models is particularly biologically faithful at the implementational level; in fact, large-scale training appears much more effective than, e.g., biologically motivated self-supervised learning. In light of these findings, it is hard to avoid drawing parallels to the "bitter lesson" formulated by Rich Sutton, who argued that "building in how we think we think does not work in the long run" and that, ultimately, scale is all that matters. While human-level distortion robustness and improved behavioural consistency with human decisions through large-scale training are certainly a sweet surprise, this leaves us with a nagging question: Should we, perhaps, worry less about biologically faithful implementations and more about the algorithmic similarities between human and machine vision induced by training on large-scale datasets?
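The image-level error consistency referred to above is commonly quantified as Cohen's kappa over trial-by-trial binary error patterns, i.e. how much two observers' errors overlap beyond what their accuracies alone would predict. The following is a minimal illustrative sketch of that computation on synthetic data; the function name, interface, and toy arrays are assumptions for illustration and are not taken from the model-vs-human toolkit's actual API.

```python
import numpy as np

def error_consistency(errors_a, errors_b):
    """Cohen's kappa between two binary error patterns on the same trials.

    errors_a, errors_b: boolean arrays, True where the observer/model
    answered a trial incorrectly. Returns 0 for chance-level overlap
    (errors coincide only as often as the two accuracies predict) and
    1 for identical trial-by-trial errors.
    """
    errors_a = np.asarray(errors_a, dtype=bool)
    errors_b = np.asarray(errors_b, dtype=bool)
    assert errors_a.shape == errors_b.shape, "both observers need the same trials"

    # observed agreement: fraction of trials where both are right or both are wrong
    c_obs = np.mean(errors_a == errors_b)

    # agreement expected from the two accuracies alone, assuming independent errors
    acc_a, acc_b = 1.0 - errors_a.mean(), 1.0 - errors_b.mean()
    c_exp = acc_a * acc_b + (1.0 - acc_a) * (1.0 - acc_b)

    if c_exp == 1.0:
        # degenerate case: agreement is fully determined by the accuracies
        return float("nan")
    return (c_obs - c_exp) / (1.0 - c_exp)

# toy example (hypothetical data): a human and a model that partly share errors
human_errors = np.array([0, 0, 1, 0, 1, 0, 0, 1], dtype=bool)
model_errors = np.array([0, 1, 1, 0, 0, 0, 1, 1], dtype=bool)
print(error_consistency(human_errors, model_errors))  # 0.25 on this toy data
```

The key design point is the normalisation by expected agreement: two observers with similar accuracies can agree on many trials purely by chance, so raw overlap alone would overstate how human-like a model's errors are.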