Abstract
An intrinsic part of seeing objects is seeing how similar or different they are from one another. This experience requires that objects be mentally represented in a common format over which such comparisons can be carried out. What is that representational format? Objects could be compared in terms of their superficial features (e.g., degree of pixel-by-pixel overlap), but a more intriguing possibility is that objects are compared according to a deeper structure. Here we explore this possibility by asking whether visual similarity is computed using an object's shape skeleton (in particular its medial axis): a geometric transformation that extracts an object's inferred underlying structure. Such representations have proven useful in computer vision research, but it remains unclear how much they actually matter for human visual performance. The present experiments investigated this question. Two spindly shapes appeared side by side, and observers simply indicated whether the shapes were the same or different. Crucially, the two shapes could differ either in their underlying skeletal structure (while holding constant superficial features such as size, orientation, and internal angular separation) or instead in large surface-level ways (without changing their overall skeletal organization). Discrimination was better for skeletally dissimilar shapes: observers told the shapes apart more accurately when they had different skeletons than when they had objectively larger differences in superficial features but retained the same skeletal structure. Conversely, observers had difficulty appreciating even surprisingly large differences when those differences did not reorganize the underlying skeletons. Additional experiments generalized this pattern to realistic 3D volumes whose skeletons were much less readily inferable from the shapes' visible contours: skeletal changes were still easier to detect than all other kinds of changes. These results show how shape skeletons may underlie the perception of similarity, and more generally how they have important consequences for downstream visual processing.
Meeting abstract presented at VSS 2017
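For readers unfamiliar with the medial-axis transform the abstract refers to, the following is a minimal illustrative sketch, not the authors' stimulus-generation pipeline (which the abstract does not describe). It assumes Python with NumPy and scikit-image, whose medial_axis function implements Blum's definition: the skeleton is the set of interior points with more than one closest boundary point, and each skeleton point carries its distance to the boundary.

# Minimal sketch: compute the medial-axis skeleton of a toy binary shape.
# The cross-shaped stimulus below is a hypothetical example, not one of
# the spindly shapes used in the reported experiments.
import numpy as np
from skimage.morphology import medial_axis

# Build a simple "spindly" binary shape: two overlapping bars.
shape = np.zeros((40, 40), dtype=bool)
shape[18:22, 5:35] = True   # horizontal bar
shape[5:35, 18:22] = True   # vertical bar

# medial_axis returns the skeleton, plus (optionally) each skeleton
# pixel's distance to the nearest boundary; skeleton + distances
# together suffice to reconstruct the original shape.
skeleton, distance = medial_axis(shape, return_distance=True)

# Render the shape ('#') with its medial axis ('*') overlaid.
for shape_row, skel_row in zip(shape, skeleton):
    print("".join("*" if s else ("#" if p else " ")
                  for p, s in zip(shape_row, skel_row)))

On this view, superficial edits (e.g., thickening a bar) leave the skeleton's branching structure intact, whereas adding or removing a limb reorganizes it, which is the contrast the experiments exploit.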