October 2020
Volume 20, Issue 11
Open Access
Vision Sciences Society Annual Meeting Abstract  |   October 2020
Weak integration of form and motion in two-stream CNNs for action recognition
Author Affiliations & Notes
  • Yujia Peng
    University of California, Los Angeles
  • Tianmin Shu
    Massachusetts Institute of Technology
  • Hongjing Lu
    University of California, Los Angeles
  • Footnotes
    Acknowledgements  NSF grant BCS-1655300
Journal of Vision October 2020, Vol.20, 615. doi:https://doi.org/10.1167/jov.20.11.615
      Yujia Peng, Tianmin Shu, Hongjing Lu; Weak integration of form and motion in two-stream CNNs for action recognition. Journal of Vision 2020;20(11):615. https://doi.org/10.1167/jov.20.11.615.

Human-level performance in action recognition has been achieved by two-stream convolutional neural networks (CNNs), which include a spatial CNN that analyzes appearance information in images, a temporal CNN that analyzes optical-flow information in body movements, and a fusion module that integrates the two processes. We examined the contributions of these three modules to the recognition of actions in point-light and skeletal displays. In Simulation 1, we trained the two-stream CNNs on raw videos and skeletal displays of human actions and tested whether the model could recognize actions in point-light displays. The final recognition from the fusion module performed worse than the temporal CNN alone, suggesting that the model overweights appearance features extracted by the spatial CNN relative to motion features from the temporal CNN. Simulation 2 used walking actions to examine whether walking direction affects the discrimination of facing direction. The temporal CNN showed better discrimination for in-place walking than for backward walking and moonwalking; in contrast, the final decision from the fusion module did not show this effect. Simulation 3 trained the two-stream CNNs to discriminate three types of walking actions: forward walking, backward walking, and moonwalking. The temporal CNN achieved higher accuracy (.82) than the final recognition from the fusion module (.6). A further generalization test showed that the temporal CNN is sensitive to the causal direction linking limb movements and body displacements, whereas the final decisions from the fusion module failed to show this sensitivity. We conclude that two-stream CNNs extract important form and motion features from action stimuli; however, the integration of the two processes in the fusion module is suboptimal and cannot account for human performance in action recognition and understanding.
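The overweighting effect described above can be illustrated with a minimal late-fusion sketch. This is not the study's fusion module; it is a toy weighted average of the two streams' class probabilities, and all logits and weights below are hypothetical values chosen for illustration. It shows how giving the appearance (spatial) stream too much weight can override a correct decision carried by the motion (temporal) stream.

```python
import math

def softmax(logits):
    """Convert raw class logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(spatial_logits, temporal_logits, w_spatial=0.5):
    """Weighted late fusion: mix the two streams' class probabilities."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return [w_spatial * a + (1 - w_spatial) * b
            for a, b in zip(p_spatial, p_temporal)]

def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical logits for a 3-way walking-type task
spatial = [2.0, 1.0, 0.5]    # appearance stream weakly favors class 0
temporal = [0.2, 3.0, 0.1]   # motion stream strongly favors class 1

print(predict(fuse(spatial, temporal, w_spatial=0.5)))  # balanced fusion -> 1
print(predict(fuse(spatial, temporal, w_spatial=0.8)))  # appearance-heavy fusion -> 0
```

With balanced weights the motion stream's confident vote wins; with an appearance-heavy weight the fused decision flips to the spatial stream's preference, which is the qualitative pattern the abstract attributes to the fusion module.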

