Abstract
View-invariant object recognition requires binding multiple views of a single object while discriminating between different objects seen from similar views. The capacity for invariant recognition likely arises in part through unsupervised learning, but it is unclear what visual information is used during learning, or what internal object representations are generated. In Study 1, subjects were tested on their ability to recognize novel three-dimensional objects rotated in depth, before and after unsupervised learning in which they saw the objects from a variety of angles. Objects were rendered in one of four visual formats: stereo, shape-from-shading, line drawings, or silhouettes; subjects were trained and tested on the same format. Unsupervised learning produced significant recognition improvement in all conditions but was substantially less effective for silhouettes than for the other three formats (p < 0.01). A computational model of the ventral stream (Yamins, 2014) showed equal improvement across all formats at an intermediate, V4-like stage, indicating that the weaker human learning for silhouettes cannot be attributed to a lack of visual information. However, a higher, IT-like stage of the same model exhibited a learning pattern similar to that of humans. In Study 2, we tested whether learning transfers across formats. Subjects underwent unsupervised learning with objects rendered from shape-from-shading or line drawings and were tested on objects rendered from the same or the other cue. Although subjects’ performance improved significantly after learning in all conditions, test performance was better for shape-from-shading than for line drawings, irrespective of the training cue (p < 0.02). We replicated these findings in a separate study in which training and testing used stereoscopic renderings or line drawings of the objects. Together, these findings indicate that although contours can support learning of invariant representations, structural cues are more effective. Furthermore, our results suggest that learning optimizes internal representations to improve recognition of real 3D objects rather than simply generating associations among trained views.
Meeting abstract presented at VSS 2015