Abstract
Transformers have recently achieved state-of-the-art performance in many domains, including object detection and grouping in images. Despite this success, however, transformers are controversial as models of the human brain. In particular, these models have not been shown to capture the way humans group and perceive objects. Here we explore the potential for the attention mechanism in transformers to map onto the dynamics of human object-based attention and grouping. We probe the mechanisms of object-based attention using a two-dot paradigm, in which two markers are placed on an image and the task is to indicate whether they lie on the same object or on different objects. Previous related work found that human reaction time in this task varies with the difficulty of object grouping and the spread of attention within an object. Our model first processes an image through a convolutional neural network and then through a transformer network to obtain the self-attention weights between different pieces of the image, each represented by a token. The model then “spreads” attention through these self-attention weights: starting from the token at the first marker location, it accesses all tokens with strong self-attention weights to that token, allowing an unrestricted spread of attention over the selected tokens. From among these strongly connected tokens, the token closest to the second marker is selected, corresponding to the hypothesized active spreading of attention in the two-dot task. This process repeats until attention reaches the token at the second marker. We show that the model predicts subjects’ reaction times, estimated as the number of steps taken in the image-dependent attention spread. Our work shows that the dynamically formed self-attention connections in transformers play a role similar to that of feedback and lateral connections in the spread of object-based attention in human vision.
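The attention-spreading procedure described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: it assumes a precomputed token-to-token self-attention matrix `attn`, 2D token `positions`, a fixed strength `threshold` for what counts as a "strong" connection, and a greedy choice of the candidate token nearest the second marker at each step. The returned step count stands in for the reaction-time estimate.

```python
import numpy as np

def spread_steps(attn, positions, start, target, threshold=0.1, max_steps=100):
    """Count greedy attention-spreading steps from the start-marker token
    to the target-marker token.

    attn: (N, N) array of self-attention weights between tokens (assumed).
    positions: (N, 2) array of token locations in image coordinates (assumed).
    Returns the number of steps taken, or None if attention never reaches
    the target within max_steps.
    """
    current = start
    visited = {start}
    for step in range(1, max_steps + 1):
        # Tokens strongly connected to the currently attended token.
        candidates = [j for j in range(attn.shape[0])
                      if attn[current, j] >= threshold and j not in visited]
        if not candidates:
            return None  # attention cannot spread any further
        # Greedily pick the strongly connected token nearest the second marker.
        current = min(candidates,
                      key=lambda j: np.linalg.norm(positions[j] - positions[target]))
        visited.add(current)
        if current == target:
            return step
    return None
```

On a chain of three tokens where only 0→1 and 1→2 are strongly connected, attention takes two steps to travel from token 0 to token 2, so the step count grows with the attentional distance between the markers along the object.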