Abstract
Background: Deep network models are relatively successful at predicting both human performance and neural activations on object recognition tasks. However, recent work suggests that these models rely largely on local features rather than global shape, whereas humans, although sensitive to local 2D shape (curvature), easily discriminate natural shapes from synthetic shapes with matched curvature statistics. Here we assess two alternative shape models that could account for this human sensitivity to 2D shape beyond local curvature: 1) Pooling – shape information is pooled over a collection of independently coded fragments or parts; 2) Configural – the representation depends on the arrangement of these parts over the entire shape.

Method: We employed a dataset of 2D animal shapes approximated as 120-segment outline polygons, together with local ‘metamers’ – closed contours matching the local curvature statistics of the animal shapes. In a two-interval task, five observers discriminated between a stimulus containing only animal contour fragments and a second stimulus containing only metamer fragments, while fragment length was varied from 2 segments (local) to 120 segments (global). There were two conditions: 1) a single fragment displayed centrally; 2) multiple fragments displayed within a 7.5 deg circular window, with the number of fragments selected to yield a total of 120 turning angles, matching the full-shape condition.

Results: For both single- and multi-fragment conditions, performance rises from chance to near 100% correct as fragment length increases from 2 to 120 segments, reflecting human sensitivity to 2D shape beyond local curvature. Interestingly, there is little difference between the psychometric functions for the single- and multi-fragment conditions (75%-correct thresholds of 24 ± 7 vs. 18 ± 6 segments), indicating very little pooling across fragments.
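The curvature-matching idea behind the metamers can be illustrated with a minimal sketch. The abstract does not specify the generation algorithm; the sketch below merely assumes that matching local curvature statistics amounts to preserving the multiset of a polygon's turning angles while scrambling their arrangement. The function names and the angle-shuffling scheme are illustrative assumptions, and a real metamer generator would additionally have to enforce contour closure and avoid self-intersection:

```python
import numpy as np

def turning_angles(vertices):
    """Exterior turning angle at each vertex of a closed polygon."""
    v = np.asarray(vertices, dtype=float)
    edges = np.roll(v, -1, axis=0) - v                 # edge vectors, wrapping around
    headings = np.arctan2(edges[:, 1], edges[:, 0])    # direction of each edge
    turns = np.diff(np.concatenate([headings, headings[:1]]))
    return (turns + np.pi) % (2 * np.pi) - np.pi       # wrap into (-pi, pi]

def shuffled_metamer_angles(turns, rng):
    """Permute the turning angles: identical local curvature histogram,
    scrambled global arrangement (closure NOT guaranteed; illustration only)."""
    return rng.permutation(turns)

rng = np.random.default_rng(0)
# toy 'shape': a unit square (4 segments) stands in for a 120-segment animal outline
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
turns = turning_angles(square)
metamer = shuffled_metamer_angles(turns, rng)
# total turning of a simple closed polygon is +/- 2*pi
print(round(abs(turns.sum()), 6))  # → 6.283185
```

Because only the order of the turning angles changes, any purely local curvature statistic is identical for the original and the shuffled contour, which is what makes above-chance discrimination evidence for shape coding beyond local curvature.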
Conclusion: This suggests that human shape perception is highly configural, posing a challenge to recent deep learning accounts of object coding.