Abstract
Our visual world, and our perception of it, are dynamic. Rapid serial visual presentation (RSVP), a task in which observers view rapid sequences of natural scenes, is one example of such dynamic sequential visual stimulation. Remarkably, humans can still recognise scenes when images are shown for as little as 13 ms/image. This feat has been attributed to the computational power of the first feedforward sweep in sensory processing. In contrast, slower presentation durations, which are linked to better performance, have been suggested to increasingly engage recurrent processing. Yet the computational mechanisms governing human sequential object recognition remain poorly understood. Here, we developed a class of deep learning models capable of sequential object recognition. Using these models, we compared different computational mechanisms: feedforward versus recurrent processing, single versus sequential image processing, and different forms of rapid sensory adaptation. We evaluated how these mechanisms perform on an RSVP task, and to what extent they explain human behavioural patterns (N=36) across varying presentation durations (13, 40, 80 ms/image). We found that only models that integrate images sequentially via lateral recurrence captured human performance levels across presentation durations. These sequential models also displayed a temporal correspondence to single-trial performance, with fewer model time steps best explaining human behaviour at the fastest durations and more steps at the slowest. Importantly, this temporal correspondence was achieved without reducing the model's overall explanatory power. Finally, augmenting this sequential model with a power-law adaptation mechanism was essential to provide a plausible account of how neural processing extracts informative representations from even the briefest visual stimulation. Taken together, these results shed new light on how local recurrence and adaptation jointly enable object recognition to be as fast and effective as a dynamic visual world requires.
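The abstract does not specify the model architecture, but the two mechanisms it names, lateral recurrence over an image sequence and power-law sensory adaptation, can be illustrated compactly. Below is a minimal PyTorch sketch, not the authors' implementation: the class name `LateralRecurrentRSVP`, the layer sizes, and the exact adaptation formula (suppression growing as a power law of the unit's accumulated response) are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's model): a convolutional
# network whose hidden state is carried across RSVP frames via lateral
# (within-layer) recurrence, with an assumed power-law adaptation term.
import torch
import torch.nn as nn


class LateralRecurrentRSVP(nn.Module):
    """Hypothetical sequential recogniser: each image in an RSVP stream
    provides a feedforward drive, while a lateral recurrent connection
    integrates evidence across time steps."""

    def __init__(self, n_classes=10, channels=32, alpha=1.0, beta=0.7):
        super().__init__()
        self.encode = nn.Conv2d(3, channels, kernel_size=3, padding=1)          # feedforward drive
        self.lateral = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # lateral recurrence
        self.readout = nn.Linear(channels, n_classes)
        self.alpha, self.beta = alpha, beta  # adaptation strength / power-law exponent

    def forward(self, frames):
        # frames: (batch, time, 3, H, W) -- one RSVP sequence per batch element
        b, t, _, h, w = frames.shape
        state = torch.zeros(b, self.encode.out_channels, h, w, device=frames.device)
        history = torch.zeros_like(state)  # accumulated past activation
        logits = []
        for step in range(t):
            drive = self.encode(frames[:, step])
            # Assumed power-law adaptation: suppression scales with the
            # response history, decaying sublinearly in elapsed steps.
            suppression = self.alpha * history / (step + 1) ** self.beta
            state = torch.relu(drive + self.lateral(state) - suppression)
            history = history + state
            logits.append(self.readout(state.mean(dim=(2, 3))))
        return torch.stack(logits, dim=1)  # per-time-step class predictions


# Usage example: two 5-frame RSVP streams of 64x64 RGB images.
model = LateralRecurrentRSVP(n_classes=10)
stream = torch.randn(2, 5, 3, 64, 64)
print(model(stream).shape)  # torch.Size([2, 5, 10])
```

Returning a prediction at every time step mirrors the abstract's analysis: reading out after few steps can be compared to human behaviour at the fastest presentation durations, and later readouts to slower ones.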