Abstract
Our experience of a beautiful, moving, or aversive image clearly evokes affective processes beyond vision, but the relative contributions of factors along the spectrum from input (image statistics) to ideation (abstract thought) remain a matter of debate. Machine vision systems, lacking both emotion and higher-order cognitive processes, provide an empirical testbed for isolating the contributions of a purely perceptual representation. How well can we predict human affective responses to an image from the purely perceptual response of a machine? Here, we address this question with a comprehensive survey of deep neural networks (e.g. ConvNets, Transformers, MLP-Mixers) trained on a variety of computer vision tasks (e.g. vision-language contrastive learning, segmentation), examining the degree to which they can predict aesthetic judgment, arousal, and valence for images from multiple categories across two distinct datasets. Importantly, we use the features of these pre-trained models without any additional fine-tuning or retraining, probing whether affective information is immediately latent in the structure of the perceptual representation. We find that these networks have features sufficient to linearly predict (even with nonparametric mappings) average ratings of aesthetics, arousal, and valence with remarkably high accuracy across the board – at or near the predictions we would make based on the responses of the most representative (‘taste-typical’) human subjects. Models trained on object and scene classification, and modern contrastive learning models, produce the best overall features for prediction, while randomly-initialized models yield far lower predictive accuracies. Aesthetic judgments are the most predictable of the affective responses (followed by arousal, then valence), and we can predict these responses with greater accuracy for ‘taste-typical’ subjects than for less ‘taste-typical’ subjects.
Taken together, these results suggest that the fundamental locus of visually evoked affective experience may lie closer to the perceptual system than abstract cognitive accounts of these experiences might otherwise suggest.