Abstract
View-invariant object recognition requires both binding multiple 2D views of an object and discriminating among different objects that are highly similar. However, it is unclear what information is acquired during unsupervised learning to enable this ability. It has been hypothesized that spatiotemporal continuity between views during learning may be key for binding object views to a single mental representation. We investigated this hypothesis in four experiments testing subjects' ability to discriminate among novel 3D objects across rotation, before and after training under two conditions. In the sequential condition, subjects were presented with 24 views of an object spanning 180° in sequential order, providing spatiotemporal continuity; in the random condition, subjects were presented with the same views in random order. Subjects showed significant improvement after training in discriminating views of 3D objects rotated in the image plane (Experiment 1, n=14, ΔAccuracy=27.6±1.5%) or in depth (Experiment 2, n=20, ΔAccuracy=21.3±2.2%). Surprisingly, we found no differences in performance between sequential and random learning. In Experiment 3, we tested whether implied motion serves as a cue to bind views by comparing training as before to training with masks placed between consecutive images, which reduces the implied motion. We found significant learning effects across all conditions (n=20, ΔAccuracy=21.0±3.4%), but no difference between masked and unmasked presentations. Finally, in Experiment 4 we tested subjects' ability to generalize their learning to new object views. Subjects were trained with seven views spanning 180° and tested on untrained views interpolated between the trained views. Results revealed that sequential learning improved generalization performance significantly more than random learning (ΔAccuracy sequential=18.5±2.6%, ΔAccuracy random=9.2±2.6%, n=26).
Overall, our data show that spatiotemporal information during unsupervised learning is not necessary for view-invariant recognition, but can lead to better generalization when training with a small number of views.
Meeting abstract presented at VSS 2014