Abstract
Humans and many newborn animals effortlessly perceive objects and segment them from one another and from the background, and a long-standing debate concerns whether object segmentation is necessary for object recognition. While deep neural networks (DNNs) are state-of-the-art models of object recognition and representation, their performance on segmentation tasks is generally worse than on recognition tasks. For this reason, object segmentation and recognition are often believed to be separate mechanisms of visual processing. Here, however, we present evidence that in variational autoencoders (VAEs), segmentation and faithful representation of the data can be interlinked. VAEs are encoder-decoder models that learn to represent independent generative factors of the data as a distribution in a very small bottleneck layer; when coding for a face, for example, a VAE may learn to code for mouths and eyes independently. Specifically, we show that VAEs can be made to segment objects without any additional fine-tuning or downstream training. This segmentation is achieved with a procedure that we call the latent space noise trick: by perturbing the activity of the bottleneck units with activity-independent noise, and recurrently recording and clustering the decoder outputs elicited by these small changes, the model is able to segment separate features and bind them together. We demonstrate that VAEs group elements in a human-like fashion, are robust to occlusions, and produce illusory contours in simple stimuli. Furthermore, the model generalizes to the naturalistic setting of faces, producing meaningful subpart and figure-ground segmentation without ever having been trained on segmentation. For the first time, we show that learning to faithfully represent stimuli extends naturally to segmentation with the same backbone architecture and without any additional training.
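To make the latent space noise trick concrete, the sketch below illustrates one plausible instantiation, assuming a trained PyTorch VAE exposing hypothetical encode and decode methods for a single-channel image; the noise scale, number of noise samples, and the use of k-means to cluster per-pixel responses are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans


def latent_space_noise_trick(vae, image, n_samples=100, sigma=0.05, n_segments=3):
    """Segment an image by clustering pixel responses to latent noise.

    vae    : trained VAE with encode()/decode() methods (assumed interface)
    image  : tensor of shape (1, H, W), a single grayscale image
    """
    with torch.no_grad():
        # Encode the image to its bottleneck activity and get a reference decode.
        z = vae.encode(image.unsqueeze(0))          # latent mean, shape (1, latent_dim)
        base = vae.decode(z).squeeze(0)             # reference reconstruction, (1, H, W)

        # Perturb the bottleneck units with small activity-independent Gaussian
        # noise and record how each output pixel responds to each perturbation.
        deltas = []
        for _ in range(n_samples):
            z_noisy = z + sigma * torch.randn_like(z)
            out = vae.decode(z_noisy).squeeze(0)
            deltas.append((out - base).flatten().numpy())

    # Pixels belonging to the same feature co-vary under latent perturbations;
    # clustering the per-pixel response profiles groups them into segments.
    responses = np.stack(deltas, axis=1)            # (n_pixels, n_samples)
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(responses)
    return labels.reshape(tuple(base.shape[-2:]))   # segmentation map, (H, W)
```

The key design idea, as described above, is that independent latent units control independent generative factors, so pixels that move together under the same latent perturbations are bound into the same segment without any segmentation training.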