In this study we have investigated possible mechanisms for the robust generalization from normal (full-body) articulated motion stimuli to point-light stimuli. We have presented multiple pieces of evidence suggesting that the detection of critical mid-level optic flow features within a specific coarse spatial arrangement might form the basis of this generalization: (I) normal and point-light stimuli share very similar dominant mid-level optic flow features; (II) the presence of these features with the appropriate spatial arrangement induces the percept of a person walking, even though the stimuli do not comply with the kinematics of the human body; and (III) a neural model that exploits these critical features achieves substantial recognition rates, even for degraded point-light stimuli.
Our results seem to contradict a recent psychophysical study (Beintema & Lappe,
2002) that concludes that the motion information in the SPS is so dramatically degraded that its recognition must be based on the reconstruction of body shape. However, a more detailed statistical analysis seems to disprove this assumption.
The amount of local motion information in the SPS can be quantified using an
index of motion quality (cf. Beintema & Lappe,
2002). This quantity was defined as the fraction of dots in the SPS whose motion remains within the 10% range of the veridical motion vectors that would be valid if the dots were not randomly displaced on the skeleton. We computed this index in three different ways: (1) for the full 2D-motion vectors, (2) for the vertical motion components only, and (3) for the horizontal motion components only. In agreement with the study by Beintema and Lappe, we found for the full 2D-motion vectors that less than 2% of the dots remained in the 10% range of the veridical vectors. The same was true if we regarded only the vertical motion components (< 2%). However, the index of motion quality for the horizontal motion components was much higher (7%), indicating a substantially higher amount of horizontal motion information. Our simulation study confirms that this residual horizontal motion information can be exploited for achieving substantial recognition rates, which are close to psychophysical data if at least 4 dots are present in the stimulus. The asymmetric degradation of horizontal and vertical motion components can be easily understood considering the fact that the limbs of a walker are predominantly vertically oriented. A separate analysis of horizontal and vertical motion components seems physiologically feasible by separately reading out neural ensembles (e.g., in area MT) that are tuned to different preferred directions.
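The index of motion quality described above can be illustrated with a minimal sketch. It assumes one possible reading of the "10% range": a dot counts as preserved if the deviation of its motion vector from the veridical vector is at most 10% of the veridical vector's length. The function names, the tolerance parameter, and the toy data are hypothetical and are not taken from the study.

```python
import math


def motion_quality_index(observed, veridical, tolerance=0.10):
    """Fraction of dots whose motion vector deviates from the veridical
    vector by at most `tolerance` times the veridical vector's length.

    Assumption: this relative-error criterion is one possible reading of
    the "10% range"; the exact criterion is not specified in the text.
    """
    if len(observed) != len(veridical) or len(observed) == 0:
        raise ValueError("need two non-empty vector lists of equal length")
    hits = 0
    for (ox, oy), (vx, vy) in zip(observed, veridical):
        deviation = math.hypot(ox - vx, oy - vy)
        if deviation <= tolerance * math.hypot(vx, vy):
            hits += 1
    return hits / len(observed)


def motion_quality_by_component(observed, veridical, tolerance=0.10):
    """Compute the index for full 2D vectors and for each component alone."""
    def only(axis, vectors):
        # Keep one component and zero the other, so the same test applies.
        return [(v[0], 0.0) if axis == 0 else (0.0, v[1]) for v in vectors]

    return {
        "full_2d": motion_quality_index(observed, veridical, tolerance),
        "horizontal": motion_quality_index(
            only(0, observed), only(0, veridical), tolerance),
        "vertical": motion_quality_index(
            only(1, observed), only(1, veridical), tolerance),
    }


# Toy data (hypothetical, not taken from the study): four dots.
veridical = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (1.0, 1.0)]
observed = [(1.05, 0.0), (0.5, 1.0), (2.0, 3.0), (1.0, 1.0)]
indices = motion_quality_by_component(observed, veridical)
```

Under this criterion, a dot whose veridical component is zero counts as preserved only if the observed component is also exactly zero; a different treatment of such dots would change the component-wise indices.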
Our model postulates the existence of neural detectors for opponent motion within adjacent receptive subfields. One might ask if this assumption matches experimental data about motion-selective neurons in the brain. Physiological studies, for example in area MT, have revealed a subpopulation of neurons that have receptive fields with antagonistic surrounds. Some of these neurons show enhanced responses if the direction of the movement in the surround is opposite to the direction of the movement in the center (Allman, Miezin, & McGuinness,
1985; Born,
2000). Opponent motion provides an adequate stimulus for such neurons. In addition, neurons that respond selectively to motion discontinuities have also been found in other areas (e.g., V1 and V2; Marcar, Raiguel, Xiao, & Orban,
2000; Reppas, Niyogi, Dale, Sereno, & Tootell,
1997). In monkey area MT it seems that neurons with reinforcing and antagonistic surrounds form separate populations (Born,
2000; Born & Tootell,
1992), suggesting that they might subserve computationally different functions. It is obvious that neurons with non-antagonistic surrounds are suitable for estimating smooth optic flow. The computational role of the neurons with antagonistic surrounds is less clear, and several hypotheses have been discussed (e.g., segmentation of moving objects from the background, the processing of relative motion, or motion parallax). Our study suggests that such neural detectors might also be useful for the processing of biological motion.
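To make the notion of an opponent motion detector with adjacent subfields concrete, the following is a minimal sketch and not the implementation used in the model. It assumes that each subfield is summarized by the mean horizontal velocity of the local motion signals falling into it, and that the detector responds only when the two subfields signal motion in opposite directions. The function names and the min/max combination rule are illustrative assumptions.

```python
def mean_velocity(velocities):
    """Mean horizontal velocity in a subfield (0.0 if the subfield is empty)."""
    return sum(velocities) / len(velocities) if velocities else 0.0


def opponent_motion_response(subfield_a, subfield_b):
    """Response of a hypothetical opponent motion detector.

    Each argument is a list of horizontal velocities (positive = rightward)
    measured in one of two adjacent subfields. The response is positive only
    if the subfields move in opposite directions, and it is limited by the
    weaker of the two opposing motion signals.
    """
    mean_a = mean_velocity(subfield_a)
    mean_b = mean_velocity(subfield_b)
    # Half-wave rectification separates rightward and leftward signals.
    a_right_b_left = min(max(mean_a, 0.0), max(-mean_b, 0.0))
    a_left_b_right = min(max(-mean_a, 0.0), max(mean_b, 0.0))
    return max(a_right_b_left, a_left_b_right)


# Opposite directions in the two subfields drive the detector.
opponent = opponent_motion_response([1.0, 1.0], [-1.0, -1.0])
# Uniform translation (smooth optic flow) leaves it silent.
uniform = opponent_motion_response([1.0, 1.0], [1.0, 1.0])
```

In this sketch, uniform translation produces no response, whereas motion in opposite directions within the two subfields does. This mirrors the distinction drawn above between neurons suited for estimating smooth optic flow and neurons with antagonistic surrounds.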
Detectors for motion discontinuities, similar to the ones postulated by our model, might also be useful for solving the aperture problem in complex visual scenes. Computational studies show that it is important for the solution of the aperture problem in scenes with multiple moving objects to prevent a combination or smoothing of local motion information across object boundaries (Koch, Marroquin, & Yuille,
1986; Liden & Pack,
1999). Opponent motion detectors may be important for detecting such discontinuities.
Our psychophysical results show that the combination of opponent motion with very coarse positional information is sufficient to induce the percept of a moving person, even in completely naïve subjects. Indeed, the CFS was purposefully designed to minimize other cues. For the detection of a moving human, this limited amount of information seems to be sufficient. For more sophisticated tasks, like identification of gender or emotional content, more detailed information might be required. However, it has also been shown that fine discrimination tasks, like people identification by gait, can be based purely on local motion information (e.g., Giese & Poggio,
2003). In addition, the quantitative comparison between CFS and SPS shows that the detailed form information provided by the SPS does not seem to improve the recognition of walking direction.
The high similarity of the extracted mid-level optic flow features for normal and point-light stimuli was rather unexpected, given that point-light walkers specify a much sparser optic flow field. Even though this study has focused on walker stimuli, the proposed statistical method for the extraction of dominant form and optic flow features applies to any other complex motion stimulus. As an example, we have designed a similar CFS for running (demo provided with this article).
The importance of motion information for the recognition of biological motion has been pointed out by many previous studies (Mather & Murdoch,
1994; Mather et al.,
1992; Troje,
2002). However, the exact nature of the underlying motion features has not so far been clarified, nor have methods been proposed in the psychophysical literature that would allow an identification of such critical features. The detection of mid-level optic flow features with relatively coarse spatial localization provides an elegant explanation for the generalization from normal to point-light stimuli, and also to strongly degraded stimuli like the SPS or the CFS. This explanation seems appealing because it does not require complex computational mechanisms and, in principle, can be implemented with relatively simple neural circuits.
An alternative, although in our view less likely, explanation of our results is a recognition of degraded stimuli based on mechanisms that reconstruct missing information about the body shape (e.g., by fitting articulated models or shape templates to the point positions; see Giese,
in press). A large body of work in computer vision (Aggarwal & Cai,
1999; Curio & Giese,
2005; Gavrila,
1999) shows that such a reconstruction of missing form information from degraded stimuli is possible in principle. However, most of the existing methods are computationally quite expensive. Algorithms that are based on explicit articulated models typically require the solution of high-dimensional nonlinear optimization and search problems because the position, scaling, and posture of the model are a priori unknown. In addition, the postures specified by monocular visual stimuli are often not unique, requiring methods for multi-hypothesis tracking. A particularly difficult problem is the fitting of articulated shape models in the presence of motion clutter. Psychophysical experiments have shown that biological motion recognition is easily accomplished by human subjects in the presence of moving masking dots (Cutting et al.,
1988; Thornton et al.,
1998). In technical systems that fit models to feature positions, motion clutter leads to complex correspondence problems, which have been solved by applying algorithms for search in high-dimensional spaces (Rashid,
1980; Song, Goncalves, & Perona,
2003). Such algorithms typically require many iterative steps. The computational complexity of these methods seems difficult to reconcile with the experimental fact that biological motion recognition in humans and monkeys is very fast, requiring less than 200 ms (Johansson,
1976; Oram & Perrett,
1996). In addition, it remains an open question whether the required algorithms can be implemented with real neurons (cf. Lee & Mumford,
2003).
Our hypothesis of a recognition of point-light stimuli by an analysis of mid-level optic flow features seems compatible with different imaging studies that report activity, which seems selective for point-light biological motion stimuli, in areas that are typically associated with the dorsal processing stream (e.g., Grossman et al.,
2000; Ptito, Faubert, Gjedde, & Kupers,
2003; Vaina, Solomon, Chowdhury, Sinha, & Belliveau,
2001). However, other studies also find selective activation by point-light walkers in areas like the extrastriate body area (EBA) and the fusiform face area (FFA), which are often assigned to the ventral processing stream (Downing, Jiang, Shuman, & Kanwisher,
2001; Grossman & Blake,
2002). Many studies have failed to find selective activation for point-light walkers in the form-selective area LOC (Grossman et al.,
2000; Ptito et al.,
2003; Vaina et al.,
2001). Thus, it remains an open question how exactly form- and motion-selective areas interact during the perception of point-light stimuli.
The importance of opponent motion for the recognition of point-light walkers is suggested by fMRI experiments that show an activation of the kinetic occipital area (KO/V3B) for biological motion stimuli (Santi, Servos, Vatikiotis-Bateson, Kuratate, & Munhall,
2003; Vaina et al.,
2001). This area has previously been associated with the processing of motion edges and moving objects (Dupont et al.,
1997; Orban et al.,
1995). A critical role of opponent motion for the detection of point-light walkers also seems consistent with data from the neurological patient AF, who could perceive biological motion in spite of a lesion in the dorsal pathway (Vaina, Lemay, Bienfang, Choi, & Nakayama,
1990). Detailed investigations of the lesion sites suggest that this patient still has area V3B/KO intact (Vaina & Giese,
2002), so that his perception of opponent motion might not be strongly impaired.
Our psychophysical and computational results suggest that relative limb motion might be important for the recognition of human locomotion. This finding is consistent with psychophysical results in adults (Pinto & Shiffrar,
1999) and infants (Booth, Pinto, & Bertenthal,
2002). In particular, it was shown that infants at the age of about 5 months shift their interest from the absolute and relative motion of individual limbs to the relative motion of contra-lateral limbs (Booth et al.,
2002).
Although in this study we have focused on possible feed-forward mechanisms for achieving a robust recognition of biological motion, we assume that under normal conditions, biological motion recognition is modulated by higher level cognitive representations. Experimental evidence suggests strong influences of top-down processes (Bülthoff, Bülthoff, & Sinha,
1998; Cavanagh, Labianca, & Thornton,
2001; Thornton, Rensink, & Shiffrar,
2002) and potentially representations of biomechanical plausibility (Shiffrar & Freyd,
1990,
1993). In addition, interactions with internal representations of motor programs might play an important role, as suggested by a number of recent psychophysical, neurophysiological, and fMRI studies (Decety & Grezes,
1999; Prinz,
1997; Rizzolatti, Fogassi, & Gallese,
2001; Saygin, Wilson, Hagler, Bates, & Sereno,
2004).
The proposed mechanism (i.e., the detection of critical mid-level motion features) defines a computational hypothesis on how basic visual recognition of normal and impoverished point-light stimuli might be accomplished with high robustness and realistic processing times. However, it seems likely that the human brain integrates a variety of features during biological motion recognition. The proposed critical feature might be particularly important, but more complex tasks like the fine discrimination of actions might require the exploitation of multiple features, or even a modulation of the detection process by high-level cognitive representations.