Abstract
Invariant object recognition is a hallmark of human vision: humans recognize objects across a wide range of rotations, positions, and scales. A good model of human object recognition should, like humans, generalize across real-world object transformations. Deep neural networks are currently the most popular computational models of the human ventral visual stream. Prior studies have reported that these models show signatures of invariant object recognition but have yielded mixed results on how closely the models match human performance. These inconsistencies may stem from differences in the tested model architectures or training regimes. Here we test the object recognition performance of different families of pretrained feedforward deep neural networks across object rotation, position, and scale. We included 95 models and defined model families along three dimensions: model architecture, visual diet, and learning objective. Along the architecture dimension, we tested convolutional neural networks and vision transformers. For each architecture, we tested models trained on relatively poor and relatively rich visual diets, ranging from 1.2 to 14 million training images, and models trained with supervised and unsupervised learning objectives. We created test images in ThreeDWorld, a 3D virtual world simulation platform, using 583 3D objects from 58 ImageNet categories. We found that all tested model families show a drop in object recognition performance when object transformations are applied, with the lowest performance for object rotation and scale. Model architecture did not noticeably affect performance, but models trained on rich visual diets with unsupervised generative learning objectives outperformed the other model families in our set. Our results suggest that, while different models agree on which object transformations are most challenging, visual diet and learning objective affect a model's ability to match human performance at invariant object recognition.