Martin A. Giese, Vittorio Caggiagno, Falk Fleischer; View-based neural encoding of goal-directed actions: a physiologically inspired neural theory. Journal of Vision 2010;10(7):1095. doi: https://doi.org/10.1167/10.7.1095.
© ARVO (1962-2015); The Authors (2016-present)
The visual recognition of goal-directed movements is crucial for action understanding. Neurons with visual selectivity for goal-directed hand actions have been found in multiple cortical regions. Such neurons are characterized by a remarkable combination of selectivity and invariance: their responses vary with subtle differences between hand shapes (defining different grip types) and with the exact spatial relationship between effector and goal object (as required for a successful grip). At the same time, many of these neurons are largely invariant with respect to the spatial position of the stimulus and the visual perspective. This raises the question of how this combination of spatial accuracy and invariance is accomplished in visual action recognition. Numerous theories in neuroscience and robotics have postulated that the visual system reconstructs the three-dimensional structure of effector and object and then verifies their correct spatial relationship, potentially by internal simulation of the observed action in a motor frame of reference. However, novel electrophysiological data, showing view-dependent responses of mirror neurons, suggest alternative explanations.
METHODS: We propose a novel theory for the recognition of goal-directed hand movements that is based on physiologically plausible mechanisms and makes predictions that are compatible with electrophysiological data. It is based on the following key components: (1) a neural shape recognition hierarchy with incomplete position invariance; (2) a dynamic neural mechanism that associates shape information over time; (3) a gain-field-like mechanism that computes affordance and spatial matching between effector and goal object; (4) pooling of the output signals of a small number of view-specific action-selective modules.
RESULTS: We show that this model is computationally powerful enough to accomplish robust position- and view-invariant recognition on real videos. It correctly reproduces and predicts data from single-cell recordings, for example on the view and temporal-order selectivity of mirror neurons in area F5.
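To make components (3) and (4) more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes Gaussian tuning, a multiplicative gain-field-like interaction, four view-specific modules, and max pooling; the abstract specifies none of these details, and all function names and parameter values are hypothetical.

```python
import math


def gaussian(x, mu, sigma):
    """Unnormalized Gaussian tuning curve with peak value 1."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def gain_field_match(grip_evidence, offset, preferred_offset, sigma=1.0):
    """Gain-field-like matching (assumed multiplicative form).

    Shape evidence for a grip type is scaled by how closely the observed
    effector-object offset matches the offset required for that grip.
    """
    dx = offset[0] - preferred_offset[0]
    dy = offset[1] - preferred_offset[1]
    return grip_evidence * gaussian(math.hypot(dx, dy), 0.0, sigma)


def view_module_response(match, view, preferred_view, sigma=40.0):
    """Response of one view-specific module (assumed Gaussian view tuning)."""
    return match * gaussian(angular_difference(view, preferred_view), 0.0, sigma)


def pooled_response(match, view, preferred_views=(0.0, 90.0, 180.0, 270.0)):
    """Pool a small number of view-specific modules (max pooling assumed)."""
    return max(view_module_response(match, view, p) for p in preferred_views)


if __name__ == "__main__":
    aligned = gain_field_match(0.9, (0.0, 0.0), (0.0, 0.0))
    misaligned = gain_field_match(0.9, (3.0, 0.0), (0.0, 0.0))
    print("aligned grip:", aligned, "misaligned grip:", misaligned)
    for view in (0.0, 45.0, 180.0):
        single = view_module_response(aligned, view, 0.0)
        print("view", view, "single module:", single,
              "pooled:", pooled_response(aligned, view))
```

In this toy setting a single module responds strongly only near its preferred view, whereas the pooled output stays high across views. The match signal still drops when the effector is displaced from the goal object, which is the combination of selectivity and invariance described above.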