Abstract
A 3D object seen from different viewpoints can elicit vastly different retinal images. Differences between views depend on object geometry and initial pose, rendering relative pose estimation computationally challenging. Still, humans can easily judge object identity across views and estimate the relative pose between them. Here, we sought to measure how accurately observers can estimate pose similarity for 3D objects, and how these judgements are influenced by object geometry and by changes in the object's retinal projection. We first mapped out human judgements of relative viewpoints using a multi-arrangement task. On each trial, observers (N=16) were asked to spatially arrange 31 views of one of three novel or three familiar 3D objects by viewpoint similarity. The resulting arrangements broadly matched ground-truth viewpoint differences, with deviations that were consistent across observers (i.e. representational similarity analysis revealed correlations with ground truth below the noise ceiling across objects). We implemented several candidate computational models, based on 2D image features or object geometry, and evaluated their ability to predict human judgements. Strategies using 2D features failed to account for human data. However, a metric based on the union and intersection of visible surface area across views (‘Surface IoU’) predicted human judgements on par with ground truth. To maximise our power to differentiate between candidate strategies, we selected triads of viewpoints for individual objects over which pairs of models strongly disagreed (e.g. where similar changes in viewing angle produced very different changes in image pixels). We presented these triads in a two-alternative forced-choice experiment in which participants judged which of two views appeared closer to a target view.
Across triad judgements and free arrangements, we gathered a rich dataset of human viewpoint perception for many objects and viewpoints that allows us to evaluate the ability of computational models to predict human strategies for judging relative viewpoint.
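As an illustration of the ‘Surface IoU’ idea, the following minimal sketch computes the ratio of the surface area visible in both views to the surface area visible in either view. It is not the authors' implementation: the representation of an object as mesh faces with a boolean visibility flag per view, the area weighting, and the names `surface_iou`, `face_areas`, `view_a` and `view_b` are all assumptions made here for the example. How visibility is determined (e.g. by rendering or ray casting) is left out.

```python
import numpy as np


def surface_iou(visible_a, visible_b, face_areas):
    """Area-weighted intersection over union of visible surface in two views.

    visible_a, visible_b: boolean arrays, one entry per mesh face, marking
        which faces are visible from each viewpoint (assumed to be given).
    face_areas: surface area of each face, in the same order.
    """
    visible_a = np.asarray(visible_a, dtype=bool)
    visible_b = np.asarray(visible_b, dtype=bool)
    face_areas = np.asarray(face_areas, dtype=float)

    # Area seen in both views, and area seen in at least one view.
    intersection = face_areas[visible_a & visible_b].sum()
    union = face_areas[visible_a | visible_b].sum()

    # If nothing is visible in either view, define the similarity as 0.
    return float(intersection / union) if union > 0 else 0.0


# Toy object with four faces (hypothetical areas and visibility flags).
face_areas = np.array([1.0, 2.0, 1.0, 4.0])
view_a = np.array([True, True, False, False])
view_b = np.array([False, True, True, False])

iou = surface_iou(view_a, view_b, face_areas)
print(iou)
```

Under these assumptions, identical views give a value of 1 and views that share no visible surface give 0, so one minus this value could serve as a model dissimilarity to compare against human arrangements.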