Abstract
Human physical scene understanding requires more than simply localizing and recognizing objects — we can quickly adapt our predictions about how a scene will unfold by incorporating objects' latent physical properties, such as the masses of the objects in the scene. What are the underlying computational mechanisms that allow humans to infer these physical properties and adapt their physical predictions so efficiently from visual inputs? One hypothesis is that general intuitive physics knowledge can be learned from enough raw data, instantiated as computational models trained to predict future video frames in large datasets of complex scenes. To test this hypothesis, we evaluated how well two state-of-the-art video models — MCVD (Voleti et al., 2022) and ALOE (Ding et al., 2021) — could approximate human-level physical scene understanding. We measured both model and human performance on Physion++, a novel dataset and benchmark that rigorously evaluates visual physical prediction in humans and machines under circumstances where accurate prediction depends on accurate estimates of the latent physical properties of objects in the scene. Specifically, we tested scenarios where accurate prediction relied on accurate estimates of objects' masses, and these mass values could only be inferred by observing how the objects moved and interacted with other objects and/or fluids. We found that MCVD, which explicitly predicts future states, achieved higher prediction accuracy (60%) than ALOE, which does not predict future states and performed near chance (53%). Yet MCVD’s predictions were not correlated with human predictions (r=0.02), and ALOE’s predictions were only weakly correlated with human predictions (r=0.2). These results show that current deep learning models that succeed in some settings nevertheless fail to achieve human-level physical prediction in others, especially those where inference of latent physical properties is required.