Abstract
A central aim in vision science is to elucidate the structure of human mental representations of objects. A key ingredient in this endeavor is the assessment of psychological similarities between objects. A challenge of current methods for exhaustively sampling similarities is their high resource demand, which grows non-linearly with the number of stimuli. To overcome this challenge, here we introduce an efficient method for generating similarity scores of real-world object images, using a combination of deep neural network activations and human feature ratings. Rather than directly predicting similarity for pairs of images, our method first predicts each image’s values on a set of 49 previously established representational dimensions (Hebart et al., 2020). These values are then used to generate similarities for arbitrary pairs of images. We evaluated the performance of this method using dimension predictions derived from the neural network architecture CLIP-ViT as well as direct human ratings of object dimensions collected through online crowdsourcing (n = 25 per dimension). Human ratings were collected on a set of 200 images, and generated similarity was evaluated on two separate sets of 48 images. CLIP-ViT performed very well at predicting global similarity for 1,854 objects (r = 0.89). Applied to three existing neuroimaging datasets, CLIP-ViT-based similarity predictions rivaled, and often even outperformed, previously collected behavioral similarity datasets. For the two sets of 48 images, both humans and CLIP-ViT provided good predictions of image dimension values across several datasets, leading to very good predictions of similarity (humans: R² = 74-77%, CLIP-ViT: R² = 76-82% of explainable variance). Combining dimension predictions across humans and CLIP-ViT yielded a strong additional increase in performance (R² = 84-87%). Together, our method offers a powerful and efficient approach for generating similarity judgments and opens up the possibility of extending research using image similarity to large stimulus sets.
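To make the two-step logic of the abstract concrete, the following minimal sketch illustrates how per-image dimension values could be turned into pairwise similarities. It assumes, as in the dimension framework of Hebart et al. (2020), that similarity is approximated by the dot product of non-negative 49-dimensional embeddings; the function name, the simple averaging of CLIP-ViT and human predictions, and the random placeholder data are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def pairwise_similarity(embedding: np.ndarray) -> np.ndarray:
    """Generate a similarity matrix from per-image dimension values.

    embedding : array of shape (n_images, n_dims), e.g. (48, 49), holding
        each image's predicted value on every representational dimension
        (derived from CLIP-ViT activations, human ratings, or both).
    Returns an (n_images, n_images) matrix of predicted similarities,
    here approximated by the dot product of the dimension vectors.
    """
    return embedding @ embedding.T

# Illustrative use: combine two sources of dimension predictions by simple
# averaging before computing similarity (a hypothetical combination rule).
rng = np.random.default_rng(0)
clip_dims = rng.random((48, 49))    # stand-in for CLIP-ViT dimension predictions
human_dims = rng.random((48, 49))   # stand-in for crowdsourced dimension ratings
combined = (clip_dims + human_dims) / 2
sim = pairwise_similarity(combined)  # 48 x 48 predicted similarity matrix
```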