Abstract
Natural views of the world are cluttered, with many kinds of objects present, yet at any given moment only a subset of this information may be task-relevant. Top-down attention can direct visual encoding based on internal goals: when looking for keys, for example, attention mechanisms select and amplify key-like image statistics, aiding detection and modulating gain across the visual hierarchy. Motivated by findings from visual cognition and visual neuroscience, we designed long-range modulatory feedback pathways to outfit deep neural network models, with learnable channel-to-channel influences between source and destination layers that spatially broadcast feature-based gain signals. We trained a series of AlexNets with varying feedback pathways on 1000-way ImageNet classification to be accurate on both their feed-forward and modulated passes. First, we show that models equipped with these feedback pathways naturally show improved image recognition, adversarial robustness, and emergent brain alignment relative to baseline models. Critically, the final layer of these models can serve as a flexible communication interface between visual and cognitive systems: cognitive-level goals (e.g. “key?”) can be specified as vectors in the output space, which naturally leverage the feedback projections to modulate earlier hierarchical processing stages. We compare and identify effective ways to ‘cognitively steer’ the model based on prototype representations, which dramatically improve recognition of categories in composite images containing multiple categories, succeeding where baseline feed-forward models fail. Further, these models recapitulate neural signatures of category-based attention, e.g. showing modulation of face- and scene-selective units inside the model when attending to either faces or scenes in a fixed face-scene composite image.
Broadly, these models offer a mechanistic account of top-down, category-based attention, demonstrating how long-range modulatory feedback pathways allow different goal states to make flexible use of fixed visual circuitry, supporting dynamic goal-based routing of incoming visual information.