Abstract
Recent methods for optical flow estimation achieve remarkable precision and are successfully applied to downstream tasks such as segmenting moving objects. These methods are based on matching deep neural network features across successive video frames. For humans, in contrast, the dominant motion estimation mechanism is believed to rely on spatio-temporal energy filtering. Here, we compare both motion estimation approaches on the task of segregating a moving object from a moving background. We render synthetic videos based on scanned 3D objects and backgrounds to obtain ground-truth motion for realistic scenes. We then transform the videos by replacing the textures with random dots that follow the motion of the original video, so that an individual frame contains no information about the object apart from the motion signal. Humans have been shown to recognize objects in such stimuli from the random dot motion alone (Robert et al., 2023). We compare segmentation methods based on the recent RAFT optical flow estimator (Teed and Deng, 2020) and on the spatio-temporal energy model of Simoncelli and Heeger (1998). Our results show that, when combined with an established segmentation architecture, the spatio-temporal energy approach performs almost as well as RAFT on the original videos. Furthermore, we quantify the amount of segmentation information that can be decoded from both models by finding the optimal non-negative superposition of feature maps for each video. This analysis confirms that both optical flow representations can be used for motion segmentation, with RAFT performing slightly better on the original videos. For the random dot stimuli, however, hardly any information about the object can be decoded from RAFT, whereas the brain-inspired spatio-temporal energy filtering approach is only mildly affected. Based on these results, we explore the use of spatio-temporal filtering for building a more robust model for moving object segmentation.
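
As a rough illustration of the decoding analysis mentioned above, the sketch below fits an optimal non-negative superposition of feature maps to a ground-truth object mask using non-negative least squares. The function name, array shapes, and the squared-error objective are assumptions for illustration only, not the authors' exact procedure or evaluation metric.

```python
import numpy as np
from scipy.optimize import nnls

def decode_mask_nnls(feature_maps, target_mask):
    """Decode a segmentation mask as a non-negative superposition
    of feature maps (illustrative sketch; the paper's exact objective
    is an assumption here).

    feature_maps : array of shape (C, H, W), e.g. motion-energy
                   responses or optical flow features for one frame
    target_mask  : binary array of shape (H, W), ground-truth mask
    """
    C, H, W = feature_maps.shape
    # Each flattened feature map becomes one column of the design matrix.
    A = feature_maps.reshape(C, H * W).T          # shape (H*W, C)
    y = target_mask.reshape(H * W).astype(float)  # shape (H*W,)
    # Non-negative least squares: min_w ||A w - y||^2 subject to w >= 0.
    weights, _residual = nnls(A, y)
    decoded = (A @ weights).reshape(H, W)
    return decoded, weights

# Usage with random placeholder data standing in for real features.
rng = np.random.default_rng(0)
feats = rng.random((16, 32, 32))                   # 16 feature maps
mask = (rng.random((32, 32)) > 0.8).astype(float)  # toy binary mask
decoded, weights = decode_mask_nnls(feats, mask)
```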