Abstract
An essential feature of human visual perception is the segmentation of scenes into separable object representations. A rich body of work shows that the Gestalt principle of common fate plays an important role in this capability, both in the development of object perception in infants and as a grouping cue in the fully developed human visual system. Unlike humans, modern object learning methods in computer vision commonly rely on large-scale supervised training. Recently, however, machine learning models have been proposed that learn to segment and individually represent objects in a scene without supervision. Here, we show that leveraging the Gestalt principle of common fate can improve these unsupervised object learning models. We build on an unsupervised motion segmentation algorithm that implements the principle of common fate by clustering pixels that exhibit similar motion. The resulting candidate segmentations are used to pretrain the components that capture object appearance in recent scene models. Afterwards, we train the full models using their respective end-to-end training schemes. We show that the models quickly learn to detect instances of the objects that were initially identified by their motion, including in static frames where motion information is not available. Compared to the original training scheme, leveraging the Gestalt principle of common fate lets the models learn to detect and represent objects in the scenes much faster. Finally, we revisit the psychophysics literature on perceptual grouping via common fate to explore whether our common-fate-based learning scheme also results in an object notion that is closer to human perception.