Abstract
Humans sense visual information from the world with high spatial resolution at the center of gaze and progressively lower resolution in the periphery. Why? One explanation is that the number of retinal sensors is limited by biological constraints (e.g., related to the size of the optic nerve). However, it is also possible that spatially varying sampling of the visual world confers a functional advantage for visual processing (e.g., robustness to occlusion). Here, we test this idea computationally using deep convolutional neural networks.
We generated “foveated” versions of scene images, preserving image resolution at the center of “gaze” and emulating progressive crowding in the periphery (Deza et al., 2019). We then trained an ensemble of models (n = 5; AlexNet and ResNet18 architectures) to perform 20-way scene categorization. One set of models was trained on foveated images with varying points of fixation on the image, a form of natural augmentation by eye movements (Human-aug-nets). A second set of models was trained on full-resolution images with typical artificial augmentation procedures that crop, rescale, and randomly mirror the images (Machine-aug-nets). Finally, we assessed each model’s robustness to four types of occlusion: vertical, horizontal, glaucoma, and scotoma.
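To make the two training regimes and the test-time occlusions concrete, the following is a minimal PyTorch-style sketch. The foveation transform itself (following Deza et al., 2019) is treated as a black box, and all names and parameters (e.g., `FoveatedAugmentation`, `occlude`, the 50% occlusion fraction) are illustrative assumptions rather than details taken from the original implementation.

```python
import random

import torch
import torchvision.transforms as T
from torchvision.models import resnet18

# Machine-aug-nets: standard artificial augmentation (crop, rescale, random mirror).
machine_aug = T.Compose([
    T.RandomResizedCrop(224),      # crop + rescale to a fixed input size
    T.RandomHorizontalFlip(),      # random mirroring
    T.ToTensor(),
])

# Human-aug-nets: each sample is foveated around a randomly chosen fixation
# point, emulating natural augmentation by eye movements. `foveate` stands in
# for the texture-based foveation transform of Deza et al. (2019).
class FoveatedAugmentation:
    def __init__(self, foveate, size=224):
        self.foveate = foveate
        self.finish = T.Compose([T.Resize((size, size)), T.ToTensor()])

    def __call__(self, img):
        fx, fy = random.random(), random.random()  # fixation in relative image coords
        return self.finish(self.foveate(img, fixation=(fx, fy)))

def occlude(img, kind, frac=0.5):
    """Zero out part of a CHW image tensor; `kind` selects the occlusion type."""
    _, h, w = img.shape
    out = img.clone()
    if kind == "vertical":                 # occlude a vertical band of the image
        out[:, :, : int(w * frac)] = 0
    elif kind == "horizontal":             # occlude a horizontal band of the image
        out[:, : int(h * frac), :] = 0
    else:                                  # circular masks around the image center
        yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        r = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
        center = r < frac * min(h, w) / 2
        # Scotoma removes foveal (central) information, as stated in the abstract;
        # glaucoma is assumed here to be the complement (periphery removed).
        out[:, center if kind == "scotoma" else ~center] = 0
    return out

# 20-way scene categorization head on a standard backbone (AlexNet trained analogously).
model = resnet18(num_classes=20)
```

In this sketch, the only difference between the two training regimes is the augmentation transform; the occlusion masks are applied only when evaluating the trained models.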
We found that all networks had similar ability to classify scenes after training. However, Human-aug-nets were more robust than Machine-aug-nets to all forms of occlusion. Intriguingly, Human-aug-nets were far more robust to the scotoma type of occlusion (foveal information removed) than Machine-aug-nets. These findings suggest that the local texture statistics captured by peripheral visual computations may be important for robustly encoding scene category information. Broadly, these results provide computational support for the idea that the foveated nature of the human visual system confers a functional advantage for scene representation.