Abstract
The human visual system perceives the environment by integrating multiple eye fixations at different locations in the scene. For each fixation, perceived resolution is higher at the center of the visual field and lower at the periphery, because the receptive field sizes of neurons in the retina and early visual cortex increase with eccentricity, that is, with distance from the fixation point. This eccentricity dependence of receptive field size has been argued to provide invariance to scale and background clutter in object recognition, whereas the eye-fixation mechanism provides invariance to object position. To further test this hypothesis, we propose a novel computational approach that integrates an Eccentricity Dependent Neural Network (ENN) with a Recurrent Attention Model (RAM). ENN, a recently introduced computational model of the visual cortex, processes the input in multiple scale channels, with receptive field sizes that change with eccentricity. This builds an intrinsic scale invariance property into the model. RAM has an attention mechanism trained with reinforcement learning, which learns to fixate on different parts of the visual input at different time steps. In the combined model, RAM finds the best location to fixate on at each time step and then uses that location as the center of the input to ENN. We conducted extensive experiments on the MNIST dataset, training and testing on images of digits at different scales and positions, to compare the proposed system, ENN-RAM, with the original RAM. Our experimental results reveal that, with less training data, the ENN-RAM model is able to generalize to different scales, i.e., it recognizes objects at scales different from those seen during training. We also observe that ENN-RAM is resistant to clutter even when trained without such clutter, whereas vanilla RAM is not.
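As a rough illustration of the multi-scale, fixation-centered input described above, the sketch below extracts several crops of increasing size around a fixation point and resamples each to the same resolution, so that wider (more peripheral) channels are coarser. It is a simplified sketch under stated assumptions, not the ENN implementation used in this work: the function name `multiscale_glimpse`, the parameters `base_size` and `num_scales`, the doubling of crop size per channel, and the use of average pooling are all hypothetical choices made for illustration. The attention policy that selects the fixation point is not shown.

```python
import numpy as np

def multiscale_glimpse(image, center, base_size=8, num_scales=3):
    """Crop `num_scales` square windows centered on `center` (row, col).

    Window s has side base_size * 2**s and is average-pooled down to
    base_size x base_size, so larger windows have coarser resolution.
    Assumes a 2-D image and an even `base_size` (illustrative choices).
    """
    cy, cx = center
    # Zero-pad so that the largest window never leaves the array.
    max_half = (base_size * 2 ** (num_scales - 1)) // 2
    padded = np.pad(image, max_half, mode="constant")
    cy, cx = cy + max_half, cx + max_half

    channels = []
    for s in range(num_scales):
        factor = 2 ** s
        half = (base_size * factor) // 2
        crop = padded[cy - half: cy + half, cx - half: cx + half]
        # Average-pool non-overlapping factor x factor blocks.
        pooled = crop.reshape(base_size, factor, base_size, factor).mean(axis=(1, 3))
        channels.append(pooled)
    return np.stack(channels)  # shape: (num_scales, base_size, base_size)

# Example: a 28 x 28 image (the MNIST image size) fixated at its center.
image = np.arange(28 * 28, dtype=float).reshape(28, 28)
glimpse = multiscale_glimpse(image, center=(14, 14))
```

In this sketch the finest channel is an unmodified crop around the fixation point, while each coarser channel covers twice the extent at half the sampling density, which loosely mirrors receptive fields that grow with eccentricity.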
Acknowledgement: This work was funded by the MOE SUTD SRG grant (SRG ISTD 2017 131) and CBMM NSF STC award CCF-1231216.