Abstract
Visual recognition of biological motion recruits form and motion processing supported by both the dorsal and ventral pathways. This neural architecture is mirrored by two-stream convolutional neural networks (CNNs; Simonyan & Zisserman, 2014), which include a spatial CNN that processes static image frames, a temporal CNN that processes optical-flow fields, and a fusion network with fully connected layers that integrates the two streams' recognition decisions. Two-stream CNNs have shown human-like performance in recognizing actions from natural videos. We tested whether the two-stream CNNs could account for several classic findings in the biological motion perception literature. In Simulation 1, we trained the model on actions in skeletal displays and tested its generalization to actions in point-light (PL) displays. We found that only the temporal CNN could recognize PL actions with high performance. Simulation 2 examined whether the two-stream CNNs could discriminate a form-only PL walker in which local motion was eliminated by randomly resampling points along the limbs in each frame (Beintema & Lappe, 2002). The image stream and the fusion network, but not the motion pathway, showed activity patterns similar to those for intact PL walkers, indicating that the model was able to exploit appearance information, such as body structure, to detect human motion. In Simulation 3, we tested whether the two-stream CNNs showed inversion effects (better performance for upright actions than for upside-down actions) across a range of conditions (Troje & Westhoff, 2006). The model showed the inversion effect in most conditions, but not in the spatial-scramble condition. This failure suggests that the purely feed-forward connectivity of the two-stream CNN model limits its ability to serve as a “life detector.” The model would require additional long-range connections in its architecture to pass characteristic local movements (e.g., foot movements in walking) to later layers for efficient detection and recognition.
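The two-stream architecture described above can be sketched as follows. This is a minimal NumPy illustration of the late-fusion idea (per-stream feature extraction, then a shared fully connected classifier); the layer counts, filter sizes, flow-stack depth, and class count are illustrative assumptions, not the configuration used in the simulations.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def conv_features(x, kernels):
    """One 'valid' convolution layer + ReLU + global average pooling.
    x: (H, W, C) input; kernels: (K, k, k, C) filters."""
    K, k, _, C = kernels.shape
    H, W, _ = x.shape
    feats = np.empty(K)
    for i in range(K):
        acc = np.zeros((H - k + 1, W - k + 1))
        for r in range(H - k + 1):
            for c in range(W - k + 1):
                acc[r, c] = np.sum(x[r:r + k, c:c + k, :] * kernels[i])
        feats[i] = relu(acc).mean()  # global average pool
    return feats

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative dimensions (hypothetical, not the actual model's)
H = W = 32; n_classes = 5; K = 8; k = 3
frame = rng.standard_normal((H, W, 3))       # static RGB frame -> spatial stream
flow = rng.standard_normal((H, W, 2 * 10))   # stacked x/y optical flow over 10 frames -> temporal stream

spatial_kernels = rng.standard_normal((K, k, k, 3)) * 0.1
temporal_kernels = rng.standard_normal((K, k, k, 20)) * 0.1
fusion_W = rng.standard_normal((n_classes, 2 * K)) * 0.1

# Each stream extracts its own features; the fusion layer combines them
# into a single action-class distribution.
fused = np.concatenate([conv_features(frame, spatial_kernels),
                        conv_features(flow, temporal_kernels)])
probs = softmax(fusion_W @ fused)
print(probs)  # class probabilities, summing to 1
```

In the full model each stream is a deep CNN trained end to end; the sketch only shows why the two streams can succeed or fail independently, which is what the simulations probe.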
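The stimulus manipulations used in Simulations 2 and 3 can be illustrated with a toy point-light frame generator. The skeleton, dot counts, and coordinate conventions below are hypothetical, chosen only to show the logic of each manipulation; they are not the actual stimuli.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_limb_points(joints, segments, n_dots):
    """Sequential-sampling display (after Beintema & Lappe, 2002):
    on every frame, each dot is placed at a NEW random position along a
    limb segment, preserving body form but destroying local dot motion."""
    pts = []
    for _ in range(n_dots):
        a, b = segments[rng.integers(len(segments))]
        t = rng.random()  # random position along the chosen segment
        pts.append((1 - t) * joints[a] + t * joints[b])
    return np.array(pts)

def invert(points):
    """Upside-down display: flip the figure vertically."""
    out = points.copy()
    out[:, 1] = points[:, 1].max() + points[:, 1].min() - points[:, 1]
    return out

def spatial_scramble(points, jitter=1.0):
    """Spatial scrambling: displace each dot by a random offset (held
    fixed across frames of a sequence), destroying global body structure
    while keeping each dot's local trajectory."""
    return points + rng.uniform(-jitter, jitter, size=points.shape)

# Toy skeleton: 2-D joint positions and the limb segments connecting them
joints = {"hip": np.array([0.0, 1.0]),
          "knee": np.array([0.1, 0.5]),
          "ankle": np.array([0.0, 0.0])}
segments = [("hip", "knee"), ("knee", "ankle")]

frame = sample_limb_points(joints, segments, n_dots=8)
print(frame.shape)  # (8, 2)
```

The key contrast in the abstract maps onto these functions: `sample_limb_points` removes local motion but preserves form (Simulation 2), while `invert` and `spatial_scramble` preserve local dot trajectories but disrupt global orientation or structure (Simulation 3).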
Acknowledgement: This work was supported by NSF grant BCS-1655300.