September 2021
Volume 21, Issue 9
Open Access
Vision Sciences Society Annual Meeting Abstract  |   September 2021
Learning to see material from motion by predicting videos
Author Affiliations & Notes
  • Katherine Storrs
    Department of Experimental Psychology, Justus Liebig University Giessen
  • Roland Fleming
    Department of Experimental Psychology, Justus Liebig University Giessen
  • Footnotes
    Acknowledgements: This work was funded by the Alexander von Humboldt Foundation.
Journal of Vision September 2021, Vol. 21, 1993. doi: https://doi.org/10.1167/jov.21.9.1993
Abstract

Despite the impressive achievements of supervised deep neural networks, brains must learn to represent the world without access to ground-truth training data. We propose that perception of distal properties arises instead from unsupervised learning objectives, such as temporal prediction, applied to proximal sensory data. To test this, we rendered 10,000 videos of objects rotating about random axes at random speeds, with random illumination and reflectance. We trained a four-layer recurrent “PredNet” network to predict the pixels of the next frame in each video. After training, object shape, material, position, and illumination could be decoded for new videos by taking linear combinations of unit activations. Representations were hierarchical, with scene properties better estimated from deep than from shallow layers. Visualising single “neurons” revealed selectivity for distal features: a “shadow unit” in layer 4 responds exclusively to image locations containing the object’s shadow, while a “reflectance edge” unit in layer 3 tracks image edges caused by reflectance changes. Material decoding accuracy was higher for moving than for static objects, and increased over the first five frames, demonstrating that the model is sensitive to motion features that disambiguate reflective from textured surfaces. To test whether these features are similar to those used by humans, we rendered test stimuli depicting reflective objects that were either static, moving, or moving with “reflections” fixed to their surface. All conditions had near-identical static image properties, but motion cues in the latter two conditions give rise to glossy vs. matte percepts, respectively. Model-predicted gloss agreed with human judgements of the relative glossiness of all stimuli. Our results suggest that unsupervised deep learning discovers motion cues to material similar to those represented in human vision, and provide a framework for understanding how brains learn rich scene representations without ground-truth world information.
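The training objective described above is next-frame pixel prediction. As a rough illustration only (not the authors' PredNet implementation), the PyTorch sketch below trains a toy convolutional recurrent predictor to reproduce each upcoming frame; the architecture, frame size, and hyperparameters are placeholder assumptions.

```python
# Illustrative sketch only: a toy convolutional recurrent predictor trained with a
# next-frame pixel-prediction loss. This is NOT the authors' PredNet; architecture,
# frame size, and hyperparameters are placeholder assumptions.
import torch
import torch.nn as nn

class TinyFramePredictor(nn.Module):
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.encode = nn.Conv2d(channels, hidden, 3, padding=1)
        # Simple gated update over feature maps, standing in for PredNet's ConvLSTM stack.
        self.update = nn.Conv2d(hidden * 2, hidden, 3, padding=1)
        self.decode = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, video):
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        state = torch.zeros(b, self.hidden, h, w, device=video.device)
        predictions = []
        for frame in video.unbind(dim=1):
            features = torch.relu(self.encode(frame))
            state = torch.tanh(self.update(torch.cat([features, state], dim=1)))
            predictions.append(torch.sigmoid(self.decode(state)))  # guess at the next frame
        # Predictions made from frames 1..T-1 correspond to target frames 2..T.
        return torch.stack(predictions[:-1], dim=1)

model = TinyFramePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
videos = torch.rand(4, 10, 3, 64, 64)                   # stand-in batch of rendered videos
predicted = model(videos)                                # predicted frames 2..10
loss = nn.functional.l1_loss(predicted, videos[:, 1:])   # pixel prediction error
loss.backward()
optimizer.step()
```

The decoding analysis ("linear combinations of unit activations") can be pictured as a cross-validated linear regression from the network's internal responses to each ground-truth scene property. The sketch below uses scikit-learn with placeholder arrays; the variable names and sizes are assumptions, not the authors' code.

```python
# Illustrative linear-readout sketch with placeholder data (random arrays stand in
# for real unit activations and rendered scene parameters).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_videos, n_units = 1000, 2048                           # hypothetical sizes
activations = rng.standard_normal((n_videos, n_units))   # unit activations per held-out video
reflectance = rng.random(n_videos)                       # ground-truth material parameter

decoder = RidgeCV(alphas=np.logspace(-3, 3, 13))         # regularised linear readout
r2 = cross_val_score(decoder, activations, reflectance, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2.mean():.2f}")
```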
