Abstract
Knowledge of the 3D structure of objects supports a variety of behaviors that animals depend on in their daily lives, from navigating through their environments to understanding their surroundings at every stop along the way. Multiple pictorial depth cues, including texture, boundary, shading and lighting, support the recovery of 3D shape information. Substantial work has gone into the development of computational models of 3D shape perception from each individual cue, but these efforts have yielded relatively little insight into the underlying neural computations. To address this gap, we use a data-driven approach in which deep convolutional networks (DCNs) are trained to estimate local 3D surface orientations from 2D images of common objects as well as pseudo objects composed of primitive geometric shapes. We leverage modern computer graphics methods to generate large-scale, near photo-realistic datasets of these stimuli under a wide variety of viewing conditions (including multiple materials, light sources, viewpoints, etc.), together with pixel-level 3D shape annotations and, for common objects, category labels. We demonstrate that DCNs learn robust representations of surface orientations. We further systematically investigate the tuning properties of the receptive fields shaped by this learning process, in an effort to characterize the underlying computational strategy used by the networks. Interestingly, we also find that object recognition accuracy is significantly improved when 3D shape prediction is used as an auxiliary task while training for object categorization. These findings provide computational evidence for existing object recognition theories that highlight the role of surfaces in object recognition, and they provide empirical validation for a data-driven approach to modeling visual perception.
Combined with the release of our large dataset of annotated images of 3D objects, we hope that these results will spur renewed interest in 3D approaches to object recognition within both the biological and computer vision communities.
Meeting abstract presented at VSS 2017