The primary goal of visual encoding is to determine the nature and motion of the objects in the surrounding environment. To plan and coordinate actions, we need a functional representation of the three-dimensional (3D) scene layout and of the spatial and depth configuration of the objects within it. The visual information provided to each eye is, however, two-dimensional (2D): the 2D configuration of the visual cues that convey the presence of objects to the brain, or to artificial sensing systems, has an entirely different metric structure from the 3D configuration of the objects themselves and shares none of their physical properties. The cues may change in luminance or color, or be disrupted by reflections or by occlusion from intervening objects. Particular cues such as edge structure, binocular disparity, color, shading, texture, and motion vector fields may carry discordant information about different aspects of an object. Importantly, many of these cues are also sparse, leaving the object structure unspecified across occlusions or gaps where no edge or texture cues carry information about the object shape.
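To make the cue-combination problem concrete, the sketch below shows one standard fusion scheme, inverse-variance weighting of per-cue depth estimates, with NaN entries standing in for pixels where a cue is sparse or absent. The cue names, array shapes, and variance values are illustrative assumptions, not part of any model discussed here.

```python
import numpy as np

def fuse_depth_cues(estimates, variances):
    """Combine per-pixel depth estimates from several cues.

    Standard maximum-likelihood fusion: each cue is weighted by its
    inverse variance; NaN entries mark pixels where a cue is absent
    (e.g., occluded or textureless regions).

    estimates, variances: arrays of shape (n_cues, H, W).
    Returns a fused depth map of shape (H, W); NaN where no cue exists.
    """
    weights = 1.0 / variances
    weights[np.isnan(estimates)] = 0.0        # sparse cues contribute nothing
    est = np.nan_to_num(estimates, nan=0.0)
    total = weights.sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        fused = (weights * est).sum(axis=0) / total
    fused[total == 0] = np.nan                # no cue anywhere: leave a gap
    return fused

# Illustrative example: three cues (disparity, texture, shading) on a 2x2 patch,
# each missing at a different pixel.
disparity = np.array([[1.0, np.nan], [1.2, 1.1]])
texture   = np.array([[1.1, 1.3],   [np.nan, 1.0]])
shading   = np.array([[0.9, 1.2],   [1.3, np.nan]])
est = np.stack([disparity, texture, shading])
var = np.stack([np.full((2, 2), 0.1),   # disparity taken as most reliable
                np.full((2, 2), 0.3),
                np.full((2, 2), 0.5)])
print(fuse_depth_cues(est, var))
```

Inverse-variance weighting is only one possible scheme; it handles discordant cues by averaging rather than arbitrating between them, which is exactly where the reconstruction problem described next becomes acute.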
Thus, a primary requirement of neural or computational representations of object shape is the reconstruction of the 3D configuration and the filling-in of its depth surfaces across regions where the local visual cues provide missing or discrepant information. Computational treatments of object structure tend to take either a low-level or a high-level approach to the problem. Low-level approaches begin with local feature recognition and attempt to build up the object representation by hierarchical convergence, using primarily feedforward logic with some recurrent feedback tuning of the results (e.g., Marr, 1982; Grossberg, Kuhlmann, & Mingolla, 2007). High-level, or Bayesian, approaches begin with a vocabulary of likely object structures and look for evidence in the visual array as to which object might be present (e.g., Huang & Russell, 1998; Rue & Hurn, 1999; Moghaddam, 2001; Stormont, 2007). Both approaches work well for objects with a stable 2D structure (as in a typical laboratory setup), but they are easily confused by a complex 3D scene with poorly defined depth cues (such as a dinner table with transparent glasses and white plates). To manipulate such objects under visual control, a full-fledged visual representation must therefore provide reconstructions that complete the 3D structure of the relevant object surfaces in the visual world.
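As an illustration of the high-level strategy, the sketch below scores a small vocabulary of candidate objects against an observed feature vector using Bayes' rule with an isotropic Gaussian observation model. The object names, priors, predicted feature vectors, and noise level are hypothetical placeholders; the cited Bayesian approaches use far richer structural models than this.

```python
import numpy as np

# Candidate object hypotheses and the image features each one predicts.
# Names, priors, and feature values are illustrative assumptions.
HYPOTHESES = {
    "white_plate":       {"prior": 0.5, "predicted": np.array([0.9, 0.1, 0.2])},
    "transparent_glass": {"prior": 0.3, "predicted": np.array([0.2, 0.8, 0.7])},
    "napkin":            {"prior": 0.2, "predicted": np.array([0.7, 0.3, 0.9])},
}

def posterior_over_objects(observed, noise_sd=0.3):
    """Score each object hypothesis against the observed feature vector.

    Bayes' rule with an isotropic Gaussian observation model:
    P(object | features) is proportional to P(features | object) * P(object).
    """
    scores = {}
    for name, h in HYPOTHESES.items():
        err = observed - h["predicted"]
        log_lik = -0.5 * np.sum((err / noise_sd) ** 2)
        scores[name] = h["prior"] * np.exp(log_lik)
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

# Features close to the plate's prediction yield a posterior favoring the plate.
print(posterior_over_objects(np.array([0.85, 0.15, 0.25])))
```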
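Finally, the surface-completion requirement itself can be illustrated with a minimal filling-in scheme: known depth samples act as boundary conditions while gap pixels relax toward the average of their neighbors, a discrete membrane (Laplace) interpolation. This is a generic sketch of one possible completion mechanism, not the reconstruction method of any of the approaches cited above.

```python
import numpy as np

def fill_depth_gaps(depth, n_iters=500):
    """Fill NaN gaps in a depth map by iterative neighbor averaging.

    Known depths are held fixed as boundary conditions; unknown pixels
    relax toward the mean of their 4-neighbors, converging to a smooth
    (harmonic) surface spanning the gap.
    """
    filled = np.where(np.isnan(depth), np.nanmean(depth), depth)
    missing = np.isnan(depth)
    for _ in range(n_iters):
        # Average of the four neighbors, with edge replication at the border.
        padded = np.pad(filled, 1, mode="edge")
        neighbors = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                     padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        filled[missing] = neighbors[missing]   # update only the gap pixels
    return filled

# Example: a sloping surface with a square occlusion hole in the middle.
rows = np.linspace(1.0, 2.0, 8)
depth = np.tile(rows[:, None], (1, 8))
depth[3:5, 3:5] = np.nan
print(np.round(fill_depth_gaps(depth), 2))
```

In the example the interpolated values continue the surrounding depth gradient across the hole, which is the behavior the filling-in requirement demands, though a membrane fill cannot by itself resolve the discrepant-cue cases described above.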