Abstract
To achieve behavioral goals, the visual system recognizes and processes the objects in a scene using a sequence of selective glimpses, but how is this attention control learned? Here we present an encoder-decoder model inspired by the interacting visual pathways that make up the recognition-attention system in the brain. The encoder maps onto the ventral ‘what’ pathway: a hierarchy of modules employing feedforward, recurrent, and capsule layers to obtain an object-centric hidden representation for classification. This object-centric capsule representation feeds the dorsal ‘where’ pathway, where the evolving recurrent representation provides top-down attentional modulation to plan subsequent glimpses (analogous to fixations) that route different parts of the visual input for processing; the encoding and decoding steps are taken iteratively. We evaluate our model on multi-object recognition (highly overlapping digits, and digits among distracting clutter) and visual reasoning tasks. Our model achieved 95% accuracy on classifying highly overlapping digits (80% overlap between bounding boxes), significantly outperforming a Capsule Network model (<90%) trained on the same dataset while using one-third as many parameters. Ablation studies show how the recurrent, feedforward, and glimpse mechanisms each contribute to performance on this task. On a same-different task (from the Synthetic Visual Reasoning Tasks benchmark), our model achieved near-perfect accuracy (>99%) at comparing two randomly generated objects, similar to ResNet and DenseNet models and outperforming AlexNet, VGG, and CORnet models. On a challenging generalization task in which the model is tested on stimuli that differ from the training set, our model achieved 82% accuracy, outperforming larger ResNet models (71%) and demonstrating the benefit of contextualized recurrent computation paired with an object-centric attention mechanism that glimpses the objects.
Our work takes a step towards more biologically plausible architectures by integrating a recurrent object-centric representation with the planning of attentional glimpses.