Vision Sciences Society Annual Meeting Abstract | December 2022
A Foveated Vision-Transformer Model for Scene Classification
Author Affiliations
  • Aditya Jonnalagadda
    UCSB
  • Miguel Eckstein
    UCSB
Journal of Vision December 2022, Vol.22, 4440. doi:https://doi.org/10.1167/jov.22.14.4440

Aditya Jonnalagadda, Miguel Eckstein; A Foveated Vision-Transformer Model for Scene Classification. Journal of Vision 2022;22(14):4440. https://doi.org/10.1167/jov.22.14.4440.

© ARVO (1962-2015); The Authors (2016-present)

Abstract

Introduction: Humans can rapidly categorize scenes (VanRullen & Thorpe, 2001), even using peripheral vision (Larson & Loschky, 2009). Various computational models have been proposed to explain rapid scene categorization in terms of low-level properties such as spatial envelopes (Oliva & Torralba, 2001) and texture summary statistics (TTM; Rosenholtz et al., 2012). Yet, these models do not explicitly capture the foveated properties of the visual system nor the interaction between eye movements and the scene-categorization task. We propose a model with a foveated visual system and eye movements that predicts how human categorization performance depends on the number of fixations. The model combines square pooling regions with the Vision Transformer architecture from computer vision (Dosovitskiy et al., 2020; Touvron et al., 2020) and makes multiple fixations, integrating information across them through self-attention (Parikh et al., 2016; Bahdanau et al., 2015) to maximize classification accuracy. Methods: Twenty-two participants classified 360 images (Places365 database, places2.csail.mit.edu) into 30 classes. Images subtended a viewing angle of 22.7 degrees. A gaze-contingent display randomly interrupted image presentation after 1, 2, 3, or 4 fixations, with the initial fixation forced to the bottom-center or top-center of the image. Results: We show that there is no significant improvement in performance after the 2nd fixation (Δ correct categorization = 0.015; p = 0.4729), unlike performance for object search (Koehler & Eckstein, 2017). The model correctly predicts modest classification improvements across free-viewing fixations (Δ = 0.016). The model-human correlation in classification choices was not significantly lower than the human-human correlation. Our findings suggest that human categorization of scenes within a single fixation can be explained by the spatially global distribution of visual information in the scene and its availability even through the bottlenecks of the visual periphery. The proposed hybrid approach, combining biologically based modeling with Transformers, can be flexibly applied to a variety of naturalistic tasks and stimuli.
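
The sketch below illustrates the kind of architecture the abstract describes: square pooling regions whose size grows with eccentricity from the current fixation, patch tokens fed to a Transformer encoder, and tokens from multiple fixations integrated by self-attention before scene classification. It is an illustrative reconstruction in PyTorch, not the authors' implementation; the pooling rule, token dimension, layer counts, and the FoveatedViTSketch/foveate names are assumptions made for this example.

# Minimal sketch (not the authors' code): foveated square pooling feeding a
# Transformer encoder, with tokens accumulated across fixations and combined
# by self-attention. Pooling rule, token dimension, and layer counts are
# illustrative assumptions, not parameters reported in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FoveatedViTSketch(nn.Module):
    def __init__(self, num_classes=30, dim=128, patch=16):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)  # linear patch embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_classes)

    def foveate(self, img, fixation):
        """Blur each patch more strongly with distance from the fixation point
        (a square-pooling stand-in: average-pool, then upsample the patch)."""
        C, H, W = img.shape
        fy, fx = fixation
        out = img.clone()
        for y in range(0, H, self.patch):
            for x in range(0, W, self.patch):
                # eccentricity of the patch centre, in patch units
                ecc = max(abs(y + self.patch // 2 - fy),
                          abs(x + self.patch // 2 - fx)) // self.patch
                # pooling-region size grows with eccentricity (assumed rule)
                pool = min(1 + ecc, self.patch)
                tile = img[:, y:y + self.patch, x:x + self.patch].unsqueeze(0)
                tile = F.avg_pool2d(tile, pool, ceil_mode=True)
                tile = F.interpolate(tile, size=(self.patch, self.patch))
                out[:, y:y + self.patch, x:x + self.patch] = tile[0]
        return out

    def tokens(self, img, fixation):
        # foveate around the fixation, then cut into patch tokens
        fov = self.foveate(img, fixation)
        patches = fov.unfold(1, self.patch, self.patch).unfold(2, self.patch, self.patch)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * self.patch * self.patch)
        return self.embed(patches)                      # (num_patches, dim)

    def forward(self, img, fixations):
        # accumulate tokens from every fixation; self-attention integrates them
        toks = torch.cat([self.tokens(img, f) for f in fixations], dim=0).unsqueeze(0)
        enc = self.encoder(toks)
        return self.head(enc.mean(dim=1))               # scene-class logits

# usage: one 224x224 image (a multiple of the patch size), two fixations
# given as (row, col) pixel coordinates: bottom-center, then center
model = FoveatedViTSketch()
logits = model(torch.rand(3, 224, 224), fixations=[(200, 112), (112, 112)])
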
