Abstract
Deep convolutional networks (DCNs) have been shown to match human visual ability on various tasks, including object classification and segmentation. Nevertheless, DCNs still struggle to match our capacity for abstract visual reasoning. Visual relations can be broadly divided into categorical and directional relations. Previous work has investigated the ability of DCNs to solve categorical visual reasoning (CVR) problems. For instance, it has been found that DCNs can solve spatial reasoning tasks, such as determining whether objects are arranged vertically or horizontally, much more efficiently than same-different tasks, such as determining whether two objects are the same or different. Here, we explore another class of visual reasoning problems known as directional visual relations (DVR), in which the order of the objects in a relation matters. For instance, a visual scene of “a baby on a blanket” differs from a scene of “a blanket on a baby.” We hypothesized that attention and working memory are needed to solve these tasks and that, because DCNs lack these functions, they would be limited in their ability to learn them. First, we studied how DCNs learn to solve DVR tasks that require judging whether a target object is to the left vs. right of, or below vs. above, a reference object. We found that DCNs struggle to learn directional visual relations when stimulus variability makes rote memorization difficult. Extending a DCN architecture to incorporate attention and working memory yields a model that solves the task on par with human judgments. Altogether, our findings suggest that feedforward processing alone is insufficient to solve DVR tasks and that attention and working memory are crucial for modeling how the brain solves them.