Abstract
Vision does not merely detect and recognize patterns and contours, but makes rich inferences about objects and agents, including their three-dimensional (3D) shapes and configurations. Current modeling approaches based on deep convolutional neural network (DCNN) classifiers can explain aspects of neural processing, but they do not address how perception can be so rich, and they typically do not provide an interpretable functional account of neural computation. To address these shortcomings, we take a different approach based on “efficient inverse graphics” (EIG), instantiating the hypothesis that the goal of visual processing is to invert generative models of how 3D scenes form and project to images. Instead of classification, we use DCNNs to build inference networks that invert generative scene models. EIG meets the functional goal of quickly computing rich 3D percepts and provides an interpretable reverse-engineering account of biological computation in the language of objects and generative models. We tested this approach in body perception: two macaques, EIG, and state-of-the-art DCNN classifiers were shown images of monkey bodies that varied in shape, posture, and viewpoint. EIG is designed to recover these variables from the images in an articulated 3D generative model, whereas the classifiers discriminated between object identities or body postures. Using representational similarity analysis, we compared layer activations of EIG to population-level activity obtained from single-cell recordings in body-selective regions of the inferotemporal cortex. Similarity matrices arising from the EIG layers were highly correlated with the neural similarity matrices. EIG explained neural activity significantly better than the classification networks (p < 0.05), which additionally failed to reproduce the key qualitative patterns observed in the data.
These results provide an integrated account, spanning the neural and cognitive levels of analysis, of how raw sense inputs are transformed in the ventral stream into percepts of objects and agents through the computation of inverse graphics.
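
The comparison between model layers and neural populations described above rests on representational similarity analysis. The following is a minimal sketch of that idea, not the authors' pipeline: it assumes Pearson correlation both for building the similarity matrices and for comparing them (the abstract does not specify either measure), and the function names `similarity_matrix` and `rsa_score`, as well as the synthetic data, are hypothetical.

```python
import numpy as np

def similarity_matrix(responses):
    """Stimulus-by-stimulus similarity matrix.

    responses: array of shape (n_stimuli, n_units), one response pattern
    per stimulus (model layer activations or neural firing rates).
    Assumption: similarity is the Pearson correlation between patterns.
    """
    return np.corrcoef(responses)

def rsa_score(model_acts, neural_resps):
    """Correlate the off-diagonal entries of two similarity matrices."""
    m = similarity_matrix(model_acts)
    n = similarity_matrix(neural_resps)
    # Use only the upper triangle: the matrices are symmetric and the
    # diagonal is trivially 1.
    iu = np.triu_indices_from(m, k=1)
    return np.corrcoef(m[iu], n[iu])[0, 1]

# Synthetic illustration only (no real data): 20 stimuli described by
# 5 hypothetical latent variables, read out by a "model layer" and by a
# noisy "neural population".
rng = np.random.default_rng(0)
latents = rng.normal(size=(20, 5))
model_acts = latents @ rng.normal(size=(5, 64))
neural_resps = latents @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(20, 30))
unrelated_acts = rng.normal(size=(20, 64))

print("shared latents:", rsa_score(model_acts, neural_resps))
print("unrelated:     ", rsa_score(unrelated_acts, neural_resps))
```

In this sketch a model whose activations depend on the same latent variables as the neural responses should score higher than an unrelated one; rank correlations or distance-based dissimilarity matrices are common alternatives to the Pearson choices assumed here.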