Abstract
This study investigates how humans and machines judge the visual similarity of paintings drawn from the WikiArt dataset, considering their style, their content, and their overall appearance. We conducted a behavioral study with human participants, trained deep neural networks, and then examined the behavioral data alongside the model results. In the behavioral study, participants rated pairs of paintings on the similarity of their style, their subject matter, and their overall visual appearance. To extract similarity judgments from neural networks, we compared three models: VGG-16 (Simonyan & Zisserman, 2014) and AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), which were pre-trained on ImageNet and have 16 and 8 layers respectively, and a basic five-layer CNN. All three CNNs were trained to classify paintings into four art styles, with the best model achieving an accuracy of 48.95%. We then used each network's final layer to compute cosine similarity scores for the same pairs of paintings shown to participants. Overall, we found that the best-performing CNNs modeled human similarity judgments well, provided we constrained the set of image pairs considered. AlexNet and VGG-16 both modeled human similarity scores well for pairs with matching subject and style, each achieving a correlation of 0.72. Their results also aligned reasonably well with human judgments for pairs with matching subject but different style, with VGG-16 reaching a correlation of 0.48 and AlexNet 0.46. Without restricting the set of image pairs, AlexNet and VGG-16 achieved correlations of just 0.31 and 0.37, respectively. This suggests that neural networks model human style similarity judgments better when the paintings' subject matter is held constant and thus removed as a factor in the similarity judgment.
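To make the model-side procedure concrete, the following is a minimal sketch of how final-layer activations can yield a cosine similarity score for one pair of paintings. It is not the exact pipeline used in the study: it assumes PyTorch with a torchvision VGG-16 carrying stock ImageNet weights rather than the fine-tuned style classifier, and the file paths are hypothetical.

```python
# Sketch only: score a pair of paintings by the cosine similarity of their
# final-layer activations. Model weights, preprocessing, and paths are
# illustrative assumptions, not the study's exact configuration.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.eval()

def final_layer_features(path: str) -> torch.Tensor:
    """Return the network's final-layer activations for one painting."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)

def pair_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between the two paintings' feature vectors."""
    a, b = final_layer_features(path_a), final_layer_features(path_b)
    return F.cosine_similarity(a, b, dim=0).item()

# Hypothetical usage for one WikiArt pair, as was done for each pair
# of paintings shown to participants:
# print(pair_similarity("wikiart/pair_01_a.jpg", "wikiart/pair_01_b.jpg"))
```

In the study, scores of this form were collected for every stimulus pair and correlated with the corresponding human similarity ratings.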