Abstract
Convolutional Neural Networks (CNNs) are known to have an inherent bias towards texture and a reliance on high-spatial-frequency image components. These characteristics compromise their classification robustness. How might we incorporate global shape information into the classification pipeline of such networks to capture long-range dependencies? Our electrophysiological studies with human participants provide some clues. We devised an experiment involving high-density EEG measurements from ten participants exposed to low-spatial-frequency, high-spatial-frequency, and full-resolution images of objects and faces. Analyses revealed an unexpected temporal staggering of high versus low spatial frequencies: decoding of neural information to infer stimulus identity was feasible earlier in the timeline with low spatial frequencies than with high spatial frequencies. These findings motivated an analogous strategy of spatial frequency decoupling and temporal staging in convolutional network architectures. We find that CNNs endowed with this biologically inspired architectural bias demonstrate superior resilience in challenging scenarios, such as viewpoint changes and turbulence. Based on these results, we propose that a staggered feedforward processing sequence, progressing from low to high frequencies, may be an important property for boosting network resilience and securing effective out-of-distribution generalization.
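The core idea of spatial frequency decoupling and temporal staging can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the Gaussian low-pass split and the two hypothetical `early_stage`/`late_stage` callables are assumptions standing in for the actual network stages.

```python
# Sketch: split an image into low- and high-spatial-frequency components,
# then stage processing so the low-frequency (coarse shape) pathway runs
# first and high-frequency detail is incorporated later. Hypothetical names.
import numpy as np

def gaussian_kernel(size=9, sigma=2.0):
    # 1-D Gaussian, normalized to sum to 1.
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def low_pass(img, sigma=2.0):
    # Separable Gaussian blur: filter rows, then columns.
    k = gaussian_kernel(sigma=sigma)
    pad = len(k) // 2
    padded = np.pad(img, pad, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def decompose(img, sigma=2.0):
    # Low-frequency component plus high-frequency residual; they sum to img.
    lf = low_pass(img, sigma=sigma)
    return lf, img - lf

def staged_forward(img, early_stage, late_stage):
    # Stage 1 sees only low frequencies; stage 2 receives the coarse
    # representation together with the high-frequency residual.
    lf, hf = decompose(img)
    coarse = early_stage(lf)
    return late_stage(coarse, hf)
```

The decomposition is lossless (`lf + hf` reconstructs the input exactly), so the staging only changes *when* each frequency band becomes available to the network, mirroring the low-before-high ordering observed in the EEG decoding.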