Vision Sciences Society Annual Meeting Abstract  |   December 2022
Purely Perceptual Machines Robustly Predict Human Visual Arousal, Valence, and Aesthetics
Author Affiliations & Notes
  • Colin Conwell
    Harvard University
  • Daniel Graham
    Hobart and William Smith Colleges
  • Talia Konkle
    Harvard University
  • Edward Vessel
    Max Planck Institute for Empirical Aesthetics
  • Footnotes
    Acknowledgements: We thank George Alvarez and Gabriel Kreiman for comments and feedback.
Journal of Vision December 2022, Vol.22, 4266. doi:https://doi.org/10.1167/jov.22.14.4266
Colin Conwell, Daniel Graham, Talia Konkle, Edward Vessel; Purely Perceptual Machines Robustly Predict Human Visual Arousal, Valence, and Aesthetics. Journal of Vision 2022;22(14):4266. https://doi.org/10.1167/jov.22.14.4266.

© ARVO (1962-2015); The Authors (2016-present)
Abstract

Our experience of a beautiful, moving, or aversive image clearly evokes affective processes beyond vision, but the relative contributions of factors along the spectrum from input (image statistics) to ideation (abstract thought) remain a matter of debate. Machine vision systems, lacking both emotion and higher-order cognitive processes, provide an empirical testbed for isolating the contributions of a purely perceptual representation. How well can we predict human affective responses to an image from the purely perceptual response of a machine? Here, we address this question with a comprehensive survey of deep neural networks (e.g. ConvNets, Transformers, MLP-Mixers) trained on a variety of computer vision tasks (e.g. vision-language contrastive learning, segmentation), examining the degree to which they can predict aesthetic judgment, arousal, and valence for images from multiple categories across two distinct datasets. Importantly, we use the features of these pre-trained models without any additional fine-tuning or retraining, probing whether affective information is immediately latent in the structure of the perceptual representation. We find that these networks have features sufficient to linearly predict (even with nonparametric mappings) average ratings of aesthetics, arousal, and valence with remarkably high accuracy across the board – at or near the predictions we would make based on the responses of the most representative (‘taste-typical’) human subjects. Models trained on object and scene classification, and modern contrastive learning models, produce the best overall features for prediction, while randomly initialized models yield far lower predictive accuracies. Aesthetic judgments are the most predictable of the affective responses (followed by arousal, then valence), and we can predict these responses with greater accuracy for ‘taste-typical’ subjects than for less ‘taste-typical’ subjects. Taken together, these results suggest that the fundamental locus of visually evoked affective experience may lie closer to the perceptual system than abstract cognitive accounts of these experiences might otherwise suggest.
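
The core analysis described above is a linear probe over frozen network features. The sketch below illustrates that pipeline under stated assumptions, not the authors' actual code: a torchvision ResNet-50 stands in for the surveyed models, scikit-learn's cross-validated ridge regression stands in for the linear mapping, and `load_rated_images()` is a hypothetical placeholder for the rated image datasets; none of these specific names come from the abstract itself.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# Frozen pretrained backbone; replacing the classifier head with an
# identity yields a 2048-d perceptual feature vector per image.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, 224, 224), normalized with ImageNet statistics."""
    return backbone(images).cpu().numpy()

# Hypothetical loader: images plus each image's mean affective rating
# (aesthetic judgment, arousal, or valence, averaged over subjects).
images, ratings = load_rated_images()

X = extract_features(images)   # (N, 2048) frozen features
y = np.asarray(ratings)        # (N,) mean ratings

# Cross-validated ridge regression: only the linear readout is fit; the
# backbone is never retrained, so any predictive accuracy reflects
# information already latent in the perceptual representation.
probe = RidgeCV(alphas=np.logspace(-3, 3, 13))
scores = cross_val_score(probe, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.3f} ± {scores.std():.3f}")
```

Keeping the backbone frozen is the key design choice: it isolates the purely perceptual representation, so that comparing probe accuracy across training regimes (classification, contrastive learning, random initialization) tests where affective information first becomes available.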
