Vision Sciences Society Annual Meeting Abstract  |   December 2022
Global information processing in feedforward deep networks
Author Affiliations & Notes
  • Ben Lonnqvist
    Laboratory of Psychophysics, Brain Mind Institute, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  • Alban Bornet
    Laboratory of Psychophysics, Brain Mind Institute, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  • Adrien Doerig
    Donders Institute for Brain, Cognition & Behaviour, Nijmegen, Netherlands
  • Michael H. Herzog
    Laboratory of Psychophysics, Brain Mind Institute, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  • Footnotes
    Acknowledgements  BL was supported by the Swiss National Science Foundation grant no. 176153, "Basics of visual processing: from elements to figures".
Journal of Vision December 2022, Vol.22, 3212. doi:https://doi.org/10.1167/jov.22.14.3212
Citation

      Ben Lonnqvist, Alban Bornet, Adrien Doerig, Michael H. Herzog; Global information processing in feedforward deep networks. Journal of Vision 2022;22(14):3212. https://doi.org/10.1167/jov.22.14.3212.

Abstract

While deep neural networks are state-of-the-art models of many parts of the human visual system, here we show that they fail to process global information in a humanlike manner. First, using visual crowding as a probe into global visual information processing, we found that, regardless of architecture, feedforward deep networks successfully model an elementary version of crowding but cannot exhibit its global counterpart (“uncrowding”). It is not yet well understood whether this limitation could be ameliorated by substantially larger and more naturalistic training conditions, or by attentional mechanisms. To investigate this, we studied models trained with the CLIP (Contrastive Language-Image Pretraining) procedure, which trains attention-based models for zero-shot image classification. CLIP models are trained by self-supervised pairing of generated labels with image inputs on a composite dataset of approximately 400 million images. Owing to this training procedure, CLIP models have been shown to exhibit highly abstract representations, to achieve state-of-the-art performance in zero-shot classification, and to make classification errors that are more in line with the errors humans make than those of previous models. Despite these advances, we show, by fitting logistic regression models to the activations of layers in CLIP models, that neither training procedure, architectural differences, nor training dataset size ameliorates feedforward networks’ inability to reproduce humanlike global information processing in an uncrowding task. This highlights an important aspect of visual information processing: feedforward computations alone are not enough to explain how visual information is combined globally in humans.
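As a rough illustration of the linear-probe analysis described above, the Python sketch below captures the activations of one transformer block in a pretrained CLIP model and fits a logistic regression classifier to them. It is a minimal sketch, not the authors' pipeline: the probed block index is an arbitrary choice, and load_stimuli() is a hypothetical loader standing in for the (un)crowding stimulus generator; OpenAI's clip package and scikit-learn are assumed to be installed.

import torch
import clip                                    # pip install git+https://github.com/openai/CLIP.git
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

# Capture one transformer block's output with a forward hook.
captured = {}
def hook(module, inputs, output):
    # This ViT implementation emits (tokens, batch, width); averaging
    # over tokens yields a single feature vector per image.
    captured["feat"] = output.mean(dim=0).float().detach().cpu()

block = model.visual.transformer.resblocks[6]  # probe depth: arbitrary, for illustration
handle = block.register_forward_hook(hook)

def layer_features(images):
    # images: list of PIL images; returns an (n_images, width) array.
    batch = torch.stack([preprocess(im) for im in images]).to(device)
    with torch.no_grad():
        model.encode_image(batch)              # the hook fills captured["feat"]
    return captured["feat"].numpy()

# load_stimuli() is a hypothetical placeholder returning stimulus images
# and binary labels (e.g. the reported target-offset direction).
train_imgs, y_train = load_stimuli(split="train")
test_imgs, y_test = load_stimuli(split="test")

probe = LogisticRegression(max_iter=1000).fit(layer_features(train_imgs), y_train)
print("probe accuracy:", probe.score(layer_features(test_imgs), y_test))
handle.remove()

Repeating this fit block by block would give per-layer decoding accuracies of the kind the abstract compares against human uncrowding performance.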
