Abstract
Human-level performance in action recognition has been achieved by two-stream convolutional neural networks (CNNs), which include a spatial CNN for analyzing appearance information in images, a temporal CNN for analyzing optical-flow information in body movements, and a fusion module for integrating the two processes. We examine the contributions of these three modules to the recognition of actions in point-light and skeletal displays.
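To make the architecture concrete, below is a minimal sketch of a two-stream network with late fusion by score averaging. This is an illustrative assumption, not the networks used in the paper: the backbones here are toy placeholders, and `make_stream`, `TwoStreamNet`, and the 20-channel flow stack (10 frames × 2 flow components) are hypothetical choices.

```python
# Hypothetical sketch of a two-stream CNN with late fusion by score
# averaging; the stream backbones are toy stand-ins, not the paper's models.
import torch
import torch.nn as nn

def make_stream(in_channels: int, num_classes: int) -> nn.Sequential:
    """A toy convolutional backbone standing in for one stream."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, num_classes),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes: int, flow_channels: int = 20):
        super().__init__()
        # Spatial stream: appearance features from a single RGB frame.
        self.spatial = make_stream(3, num_classes)
        # Temporal stream: motion features from a stack of optical-flow
        # fields (assumed: 10 frames x 2 flow components = 20 channels).
        self.temporal = make_stream(flow_channels, num_classes)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Fusion module: here, a simple average of per-class scores.
        return (self.spatial(rgb) + self.temporal(flow)) / 2

# Usage: one RGB frame and one stack of flow fields per clip.
model = TwoStreamNet(num_classes=5)
scores = model(torch.randn(1, 3, 64, 64), torch.randn(1, 20, 64, 64))
print(scores.shape)  # torch.Size([1, 5])
```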
In Simulation 1, we trained the two-stream CNNs on raw videos and skeletal displays of human actions and tested whether the model could recognize actions in point-light displays. We found that the final recognition from the fusion module performed worse than the temporal CNN alone, suggesting that the model overweights appearance features extracted by the spatial CNN relative to motion features extracted by the temporal CNN. Simulation 2 used walking actions to examine whether walking direction affects facing-direction discrimination. The temporal CNN showed better discrimination for in-place walking than for backward walking and moonwalking; in contrast, the final decision from the fusion module did not show this effect. Simulation 3 trained the two-stream CNNs to discriminate three types of walking actions, i.e., forward walking, backward walking, and moonwalking. The temporal CNN, which processes motion, achieved higher accuracy (.82) than the final recognition from the fusion module (.60). A further generalization test showed that the temporal CNN is sensitive to the causal direction linking limb movements and body displacements, whereas the final decisions from the fusion module failed to show this sensitivity. We conclude that two-stream CNNs extract important form and motion features from action stimuli; however, the integration of the two processes in the fusion module is suboptimal and cannot account for human performance in action recognition and understanding.
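The comparisons above hinge on scoring each pathway and the fused decision separately on the same test clips. Below is a minimal sketch of such an evaluation, assuming the hypothetical `TwoStreamNet` defined earlier; the `evaluate` helper and its outputs are illustrative assumptions, not the paper's evaluation code or results.

```python
# Hypothetical sketch: score the spatial stream, the temporal stream, and
# the fused decision separately on the same clips (not the paper's code).
import torch

def accuracy(scores: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of clips whose highest-scoring class matches the label."""
    return (scores.argmax(dim=1) == labels).float().mean().item()

@torch.no_grad()
def evaluate(model, rgb, flow, labels):
    spatial_scores = model.spatial(rgb)     # appearance pathway alone
    temporal_scores = model.temporal(flow)  # motion pathway alone
    fused_scores = (spatial_scores + temporal_scores) / 2  # fusion decision
    return {
        "spatial": accuracy(spatial_scores, labels),
        "temporal": accuracy(temporal_scores, labels),
        "fusion": accuracy(fused_scores, labels),
    }
```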