Abstract
Behavioral comparisons of human and deep neural network (DNN) models of object recognition help to benchmark and improve DNN models, but they may also illuminate the intricacies of human visual perception. However, machine-to-human comparisons are often fraught with difficulty: unlike DNNs, which typically learn from scratch using static, uni-modal data, humans process continuous, multi-modal information and leverage prior knowledge. Additionally, while DNNs are predominantly trained in a supervised manner, human learning relies heavily on interactions with unlabeled data. We address these disparities by aligning the learning processes and examining not only the outcomes but also the dynamics of representation learning in humans and DNNs. We engaged humans and DNNs in a task to learn representations of three novel 3D object classes. Participants completed six epochs of an image classification task, mirroring the train-test iteration cycle common in machine learning, with feedback provided only during training phases. To align the starting point of learning, we used pre-trained DNNs. This experimental design ensured that both humans and models learned new representations from the same static, uni-modal inputs in a supervised setting. We collected ~6,300 trials from human participants in the laboratory and compared the observed dynamics with those of various DNNs. While DNNs exhibit learning dynamics with fast training progress but lagging generalization, human learners often display a simultaneous increase in train and test performance, indicating immediate generalization. However, when focusing solely on test performance, DNNs align well with the human generalization trajectory. By synchronizing the learning environment and examining the full scope of the learning process, the present study offers a refined comparison of representation learning.
The collected data reveal both similarities and differences between human and DNN learning dynamics. These discrepancies emphasize that global assessments of DNNs as models of human visual perception are problematic without considering specific modeling objectives.