Abstract
Is there a physicist in your visual cortex? Popular models of intuitive physics (our implicit understanding of physical contingencies in complex environments) posit the process of physical inference to be a richly structured simulation: a predominantly cognitive process. In this study, we probe the possibility that at least certain aspects of our intuitive physics may be handled directly by computations in perceptual systems. Assuming deep neural networks to be a reasonable model of inferotemporal visual cortex, we employ a method of comparative psychophysics designed to gauge the similarity of human and machine judgments in a standard intuitive physics task: predicting the stability of randomly arranged block towers. We show that a convolutional neural network whose performance is comparable to that of human observers nonetheless differs in the variables that predict the specific choices it makes, variables we compute directly from the stimuli. Using these ‘features’ as the basis for an ideal observer analysis, we show that human behavior is best predicted by a feature that corresponds directly to the ground-truth stability of the tower, while the machine’s behavior is predicted by a less optimal feature. Training smaller, feedforward networks, we subsequently confirm that this divergence from human behavior reflects not the failure of any specific computation (e.g., an operation the network simply cannot perform) but a difference in feature biases. At the same time, we demonstrate that humans under time pressure tend to behave more like the neural network: their responses are predicted by features that correlate less overall with the ground-truth stability of the tower.
Taken together, these results suggest that at least some portion of the information processing involved in intuitive physics may be handled by the more feedforward elements of the visual system, but that further algorithmic, architectural or training modulations might be necessary to better model the perceptual processing of physical information more generally.