Abstract
Humans are able to partially reconstruct visual information, as evidenced by our ability to imagine and dream, yet it is debated whether a reconstruction process is functionally used for online visual perception. We focus on visual object recognition and propose that reconstruction creates initial hypotheses about an object’s shape and location, and serves as an attentional window that restricts visual encoding to the image region depicting the features needed for recognition. To test this hypothesis, we built an iterative encoder-decoder system in which an object reconstruction generated by the decoder is fed back to the encoder as a mask over the image region to be processed in the next step. We tested the model’s recognition performance on the challenging digit recognition benchmark MNIST-C, in which 15 different types of corruption are applied to handwritten digit images. Our model outperformed models specifically designed for out-of-distribution generalization, e.g., adversarially trained models. Ablation studies further confirmed that using the object reconstruction as a mask during encoding significantly increases robustness compared to a model that learns to reconstruct the object but does not use the reconstruction as a mask. Analyzing performance across the corruption types in MNIST-C revealed that the object reconstruction mask is especially helpful for shape-oriented recognition, rendering the system more resilient to texture perturbations such as fog or salt-and-pepper noise. One vulnerability of our method is evidenced by the (infrequent) cases in which the initial object reconstruction is incorrect, leading the system to reconstruct the wrong object, i.e., a predicted visual hallucination. We discuss this problem and propose methods that use the mismatch between the visual input and its reconstruction as an error signal to obtain even more robust and veridical object representations.
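To make the iterative encode-decode-mask loop concrete, here is a minimal PyTorch sketch of the scheme the abstract describes: the decoder's reconstruction multiplicatively masks the image before the next encoding pass. The class name `MaskedRecognizer`, the layer sizes, and the number of iterations (`steps=3`) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch, assuming a fully connected encoder/decoder on 28x28 MNIST-C inputs.
import torch
import torch.nn as nn

class MaskedRecognizer(nn.Module):
    def __init__(self, n_classes=10, latent=64, steps=3):
        super().__init__()
        self.steps = steps
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, latent), nn.ReLU(),
        )
        # Decoder reconstructs the object from the latent code.
        self.decoder = nn.Sequential(
            nn.Linear(latent, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),
        )
        self.classifier = nn.Linear(latent, n_classes)

    def forward(self, x):  # x: (B, 1, 28, 28)
        mask = torch.ones_like(x)  # first pass sees the full image
        for _ in range(self.steps):
            z = self.encoder(x * mask)          # encode only the masked region
            recon = self.decoder(z).view_as(x)  # hypothesize the object's shape/location
            mask = recon                        # reconstruction becomes the next attention window
        return self.classifier(z), recon

model = MaskedRecognizer()
logits, recon = model(torch.rand(8, 1, 28, 28))
print(logits.shape, recon.shape)  # torch.Size([8, 10]) torch.Size([8, 1, 28, 28])
```

Feeding the reconstruction back as a soft multiplicative mask is what lets shape hypotheses suppress texture corruptions such as fog or salt-and-pepper noise, and it is also the source of the hallucination failure mode: a wrong initial reconstruction masks out the evidence that could correct it.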