Vision Sciences Society Annual Meeting Abstract  |  August 2023
Volume 23, Issue 9
Open Access
Evaluating physical scene understanding with objects consisting of different physical attributes in humans and machines
Author Affiliations
  • Hsiao-Yu Tung
    Massachusetts Institute of Technology
    Stanford University
  • Mingyu Ding
    University of Hong Kong
  • Zhenfang Chen
    MIT-IBM Watson AI Lab
  • Daniel Bear
    Stanford University
  • Chuang Gan
    University of Hong Kong
    MIT-IBM Watson AI Lab
  • Joshua Tenenbaum
    Massachusetts Institute of Technology
  • Daniel Yamins
    Stanford University
  • Judith Fan
    University of California San Diego
  • Kevin Smith
    Massachusetts Institute of Technology
Journal of Vision August 2023, Vol.23, 5622. doi:https://doi.org/10.1167/jov.23.9.5622
Abstract

Human physical scene understanding requires more than simply localizing and recognizing objects: we can quickly adapt our predictions about how a scene will unfold by incorporating objects' latent physical properties, such as their masses. What are the underlying computational mechanisms that allow humans to infer these properties and adapt their physical predictions so efficiently from visual inputs? One hypothesis is that general intuitive physics knowledge can be learned from enough raw data, instantiated as computational models trained to predict future video frames on large datasets of complex scenes. To test this hypothesis, we evaluated how well two state-of-the-art video models, MCVD (Voleti et al., 2022) and ALOE (Ding et al., 2021), could approximate human-level physical scene understanding. We measured both model and human performance on Physion++, a novel dataset and benchmark that rigorously evaluates visual physical prediction in humans and machines under circumstances where accurate prediction depends on accurate estimates of the latent physical properties of objects in the scene. Specifically, we tested scenarios in which accurate prediction required accurate estimates of objects' masses, and these mass values could be inferred only by observing how the objects moved and interacted with other objects and/or fluids. We found that MCVD, which explicitly predicts future states, achieved higher prediction accuracy (60%) than ALOE, which does not predict future states and performed near chance (53%). Yet MCVD's predictions were uncorrelated with human predictions (r = 0.02), and ALOE's were only weakly correlated (r = 0.2). These results show that current deep learning models that succeed in some settings nevertheless fail to achieve human-level physical prediction in other cases, especially those where latent property inference is required.
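
As a rough illustration of the two measures reported above, the sketch below (in Python, using made-up per-trial data; variable names and values are hypothetical and not taken from the Physion++ benchmark or its codebase) shows how a model's prediction accuracy and its Pearson correlation with mean human judgments could be computed from per-trial outputs.

```python
# Minimal sketch, assuming per-trial model probabilities, binary ground-truth
# outcomes, and mean human judgments. All data below are synthetic.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_trials = 100

# Hypothetical per-trial data.
ground_truth = rng.integers(0, 2, n_trials)   # true 0/1 outcome of each scenario
model_prob   = rng.uniform(0, 1, n_trials)    # model's predicted probability of the outcome
human_mean   = rng.uniform(0, 1, n_trials)    # mean human prediction per trial

# Prediction accuracy: threshold the model's probability at 0.5 and compare
# against the true outcome (accuracies of this kind underlie the 60% vs. 53% figures).
accuracy = np.mean((model_prob > 0.5) == ground_truth)

# Human-model correspondence: Pearson correlation between per-trial model
# probabilities and mean human judgments (as in the r = 0.02 and r = 0.2 figures).
r, p = pearsonr(model_prob, human_mean)

print(f"accuracy = {accuracy:.2f}, human-model r = {r:.2f} (p = {p:.3f})")
```

With real data, a model can score reasonably well on outcome accuracy while remaining uncorrelated with human judgments, which is the dissociation the abstract reports for MCVD.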
