Vision Sciences Society Annual Meeting Abstract  |  August 2023
Journal of Vision, Volume 23, Issue 9 (Open Access)
Local texture manipulation further illuminates the intrinsic difference between CNNs and human vision
Author Affiliations & Notes
  • Alish Dipani
    Department of Psychology, Northeastern University, Boston, MA
  • Huaizu Jiang
    Khoury College of Computer Sciences, Northeastern University, Boston, MA
  • MiYoung Kwon
    Department of Psychology, Northeastern University, Boston, MA
  • Footnotes
    Acknowledgements  This work was supported by NIH/NEI Grant R01EY027857, Northeastern University Tier-1 Seed grant, and Research to Prevent Blindness (RPB)/Lions Clubs International Foundation (LICF) low vision research award.
Journal of Vision August 2023, Vol.23, 4927. doi:https://doi.org/10.1167/jov.23.9.4927

      Alish Dipani, Huaizu Jiang, MiYoung Kwon; Local texture manipulation further illuminates the intrinsic difference between CNNs and human vision. Journal of Vision 2023;23(9):4927. https://doi.org/10.1167/jov.23.9.4927.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on a wide range of visual tasks and currently provide the best computational models of visual processing in the primate brain. However, CNNs are strongly biased towards textures rather than shapes, which can impair object recognition. This is rather surprising, as human vision benefits from textures and shapes as complementary and independent cues for recognizing objects (e.g., humans can reliably recognize an apple by either its texture or its shape alone). Given this stark contrast, it is important to understand how CNNs use the two cues, object texture and shape, for object recognition. Here we address this question. To this end, we compare multi-label image-classification accuracy when models are trained on either the original (intact), object (local), or scene (global) texture-manipulated datasets, and then evaluate the models' ability to generalize to unseen datasets. We test CNNs as well as Transformers, which are known to have a strong shape bias, and we also conduct a psychophysical experiment to evaluate human performance. We use images from the COCO dataset, which contain natural scenes with multiple objects. Local textures are manipulated by replacing each object's texture with a random, artificial texture drawn from the DTD dataset; global textures are manipulated by applying image style transfer with a random texture. We find noticeable differences in the models' ability to generalize to untrained datasets. Specifically, both CNNs and Transformers trained on the original dataset show a sharp decrease in accuracy when tested on texture-manipulated datasets. However, CNNs, but not Transformers, trained on the local texture-manipulated dataset perform well on both the original and the global texture-manipulated datasets. As expected, human observers have difficulty recognizing local texture-manipulated images. Our findings suggest that, unlike humans, CNNs do not use texture and shape independently. Instead, textures appear to be used to define object shape per se.
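For concreteness, the local (object-level) texture manipulation described above could be sketched roughly as follows. This is an illustrative example only, not the authors' pipeline: it assumes the pycocotools COCO API, a COCO annotation file, and a hypothetical list of DTD texture image paths (dtd_paths), and simply pastes a random DTD texture into each object's segmentation mask. The global (scene-level) manipulation would instead restyle the entire image with a style-transfer model, which is not shown here.

```python
import random
import numpy as np
from PIL import Image
from pycocotools.coco import COCO

def random_dtd_texture(dtd_paths, size):
    # Load a random DTD texture and resize it to cover the target image.
    tex = Image.open(random.choice(dtd_paths)).convert("RGB")
    return np.array(tex.resize(size))

def replace_local_textures(img_path, coco, img_id, dtd_paths):
    # Return a copy of the image with every annotated object's texture replaced.
    img = np.array(Image.open(img_path).convert("RGB"))
    h, w = img.shape[:2]
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
        mask = coco.annToMask(ann).astype(bool)      # per-object binary mask (h, w)
        tex = random_dtd_texture(dtd_paths, (w, h))  # random artificial texture
        img[mask] = tex[mask]                        # overwrite pixels inside the mask
    return Image.fromarray(img)

# Usage sketch (file paths and image id are hypothetical):
# coco = COCO("annotations/instances_train2017.json")
# out = replace_local_textures("train2017/000000000009.jpg", coco, 9, dtd_paths)
```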
