Abstract
Understanding the computational mechanisms that transform visual features into coherent object percepts requires implementing theories in scalable models. Here we report implementations, using recent deep neural networks, of three previously proposed theories in which the binding of features is achieved (1) through convergence in a hierarchy of representations resulting in object files, (2) through a reconstruction or generative process that can target different features of an object, or (3) through the elevation of activation by spreading attention within an object via association fields. First, we present a model of object-based attention that relies on capsule networks to integrate the features of different objects in a scene. With this grouping mechanism, the model learns to attend sequentially to objects in order to perform multi-object recognition and visual reasoning. The second modeling study shows how top-down reconstructions of object-centric representations in a sequential autoencoder can target different parts of an object, yielding a more robust and human-like object recognition system. The third study demonstrates how object perception and attention could be mediated by flexible object-based association fields at multiple levels of the visual processing hierarchy. Transformers provide a key relational and associative computation that may also be present in the primate brain, albeit implemented by a different mechanism. We show that representations in transformer-based vision models predict human reaction times on an object grouping task, and that their feature maps can model the spreading of attention within an object.