Abstract
Why do we enjoy viewing some scenes but not others? While researchers have identified specific answers for what we enjoy seeing in terms of colors, shapes, objects, and even human faces and bodies, it remains largely elusive what we enjoy seeing in realistic scenes, that is, views that encompass many different objects and features. Our study tackled this question by taking advantage of recent advances in computational models of human vision. We demonstrated that prototypicality, or how typical a scene appears relative to other scenes, is part of the answer: We modeled scene prototypicality by averaging a scene's similarity scores to a large reference set of other scenes, based on embeddings extracted by a pretrained deep neural network, AlexNet. Performing this analysis separately for each layer, we discovered that the more prototypical the high-level (layer Conv-5) visual representations are, the more aesthetically pleasing a scene consisting of inanimate content appears to human observers, namely, an aesthetic prototype effect. This positive correlation was replicated in a different group of observers and with a different reference set of scenes. Critically, the correlation disappeared when the model's trained weights were randomly permuted, indicating its specificity to human-like visual processing. These results suggest that high-level visual representations of inanimate scenes play an important role in forming the scene prototype and governing aesthetic preferences. Furthermore, this study shows that human aesthetic preferences for realistic scenes are systematic and explainable, and reflect the organization of visual representations. At the same time, our method demonstrates a new way to explore aesthetics: besides investigating the effects of specific features, we can also explore how the visual processing of realistic, complex scenes interacts with aesthetic experiences.
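For illustration, the sketch below shows one way layer-wise prototypicality could be computed: extract a scene's activation at a given AlexNet layer and average its pairwise similarity to a reference set of scenes. It assumes torchvision's pretrained AlexNet as the backbone, Conv-5 (post-ReLU) activations, and cosine similarity as the pairwise measure; these specifics are assumptions for the sketch, not the exact pipeline reported in the paper.

```python
# Hypothetical sketch of layer-wise prototypicality, assuming torchvision's
# pretrained AlexNet and cosine similarity (not necessarily the paper's
# exact preprocessing or similarity metric).
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# In alexnet.features, Conv-5 is at index 10, followed by ReLU at index 11;
# slicing [:12] yields activations after the Conv-5 ReLU.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
conv5_extractor = torch.nn.Sequential(*list(alexnet.features.children())[:12])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def conv5_embedding(image_path: str) -> torch.Tensor:
    """Flattened Conv-5 activation vector for one scene image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return conv5_extractor(img).flatten()

def prototypicality(scene_path: str, reference_paths: list[str]) -> float:
    """Mean cosine similarity between one scene and a reference set."""
    scene_vec = conv5_embedding(scene_path)
    sims = [F.cosine_similarity(scene_vec, conv5_embedding(p), dim=0).item()
            for p in reference_paths]
    return sum(sims) / len(sims)
```

Under this framing, the aesthetic prototype effect corresponds to a positive correlation between these prototypicality scores and observers' aesthetic ratings, and the permutation control corresponds to recomputing the scores after shuffling the trained weights.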