Abstract
From a 200 ms masked presentation of a minimal scene composed of two separated objects, one above the other, subjects can name both objects and report which one is on top. This capacity poses a challenge to feature-hierarchy models (e.g., HMAX), which achieve translation-invariant recognition by representing an object as a list of ‘positionless’ features. How are the features of each object kept separate, and how could their relative positions be known? Aggelopoulos & Rolls (2005) suggest a solution. They found that IT receptive fields shrink and shift when viewing scenes, allowing the identities and positions of multiple objects to be encoded simultaneously. Their results, however, leave open whether position is encoded retinotopically (as they imply) or in an object-relative manner. We tested this with a human fMRI-adaptation experiment.
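The binding problem raised here can be made concrete with a toy illustration (not drawn from the study, and using invented object names and features): if a scene is coded only as an unordered pool of positionless features, two scenes that differ solely in which object is on top receive identical representations.

```python
# Toy sketch, assuming hypothetical objects and features: a purely
# "positionless" feature-list code cannot distinguish scenes that differ
# only in the spatial arrangement of their objects.
def bag_of_features(scene):
    """Pool features of all objects into one unordered set, discarding position."""
    return frozenset(f for obj in scene for f in obj["features"])

elephant = {"name": "elephant", "features": {"trunk", "ears", "gray"}}
bus = {"name": "bus", "features": {"wheels", "windows", "yellow"}}

elephant_over_bus = [elephant, bus]   # elephant on top
bus_over_elephant = [bus, elephant]   # above/below relation reversed

# The pooled representations are identical, yet observers readily
# report which object is on top.
assert bag_of_features(elephant_over_bus) == bag_of_features(bus_over_elephant)
```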
While fixating, subjects were shown a 200 ms S1 consisting of two separated objects (e.g., elephant over bus), followed by a 300 ms blank and then a 200 ms S2 that defined one of four trial types: a) “Identical,” the same pair in the same relation and position; b) “Translated,” the same pair in the same relation but in a different position; c) “Relation,” the same pair in the same position but with the relation switched (e.g., bus over elephant); and d) “Object,” one of the previous manipulations in which, in addition, one of the objects changed. The task was to detect Object trials.
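The mapping from S1 to S2 for each trial type can be summarized schematically; the following Python sketch uses hypothetical object names and a simplified two-location position scheme (not taken from the study) purely to make the four conditions explicit.

```python
# Schematic of the four S1 -> S2 trial types; object set and position
# labels are illustrative assumptions, not the actual stimuli.
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class Scene:
    top: str       # object shown above
    bottom: str    # object shown below
    position: str  # screen location of the pair, e.g. "left" or "right"

OBJECTS = ["elephant", "bus", "chair", "lamp"]   # hypothetical object set
POSITIONS = ["left", "right"]                    # hypothetical pair positions

def make_s2(s1: Scene, trial_type: str) -> Scene:
    """Derive the S2 scene from S1 according to the trial type."""
    if trial_type == "Identical":
        # Same pair, same relation, same position.
        return s1
    if trial_type == "Translated":
        # Same pair and relation, different screen position.
        new_pos = random.choice([p for p in POSITIONS if p != s1.position])
        return Scene(s1.top, s1.bottom, new_pos)
    if trial_type == "Relation":
        # Same pair and position, above/below relation switched.
        return Scene(s1.bottom, s1.top, s1.position)
    if trial_type == "Object":
        # One of the previous manipulations, plus one object replaced.
        base = make_s2(s1, random.choice(["Identical", "Translated", "Relation"]))
        new_obj = random.choice([o for o in OBJECTS if o not in (base.top, base.bottom)])
        if random.random() < 0.5:
            return Scene(new_obj, base.bottom, base.position)
        return Scene(base.top, new_obj, base.position)
    raise ValueError(f"unknown trial type: {trial_type}")

if __name__ == "__main__":
    s1 = Scene(top="elephant", bottom="bus", position="left")
    for t in ["Identical", "Translated", "Relation", "Object"]:
        print(t, "->", make_s2(s1, t))
```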
Results: There was only a small release from adaptation in the Translated condition but a sizeable release in the Relation condition (retinotopic models predict the opposite ordering). Thus the lateral occipital complex (LOC) may encode an object-relative structural description of scenes. We hypothesize that the same neural mechanism, when engaged on a single object, could form a parts-based structural description supporting our ability to understand shape.
Supported by NSF BCS 04-20794, 04-266415, 05-31177, 06-17699