Vision Sciences Society Annual Meeting Abstract | July 2013
Statistics of spatial-temporal concatenations of features at human fixations in action classification
Author Affiliations
  • Xin Chen
    Brain and Behavior Discovery Institute, Georgia Health Sciences University; Vision Discovery Institute, Georgia Health Sciences University
  • Xiaoyuan Zhu
    Brain and Behavior Discovery Institute, Georgia Health Sciences University; Vision Discovery Institute, Georgia Health Sciences University
  • Weibing Wan
    Department of Automation, Shanghai Jiaotong University
  • Zhiyong Yang
    Brain and Behavior Discovery Institute, Georgia Health Sciences University; Vision Discovery Institute, Georgia Health Sciences University
Journal of Vision July 2013, Vol.13, 520. doi:https://doi.org/10.1167/13.9.520
Abstract

Humans can detect, recognize, and classify a range of actions quickly. What are the spatial-temporal features and computations that underlie this ability? Global representations such as spatial-temporal volumes can be highly informative but depend on segmentation and tracking. Local representations such as histograms of optic flow lack descriptive power and require extensive training. Recently, we developed a model in which any human action is encoded by a spatial-temporal concatenation of natural action structures (NASs), i.e., sequences of structured patches in human actions at multiple spatial-temporal scales. We compiled NASs from videos of natural human actions, examined their statistics, and selected a set of highly informative NASs to use as features for action classification. We found that the NASs obtained in this way achieved significantly better recognition performance than simple spatial-temporal features. To examine to what extent this model accounts for human action understanding, we hypothesized that humans search for informative NASs in this task and performed visual psychophysical studies. We asked 12 subjects with normal vision to classify 500 videos of human actions while tracking their fixations with an EyeLink II eye tracker. We examined the statistics of the NASs compiled at the recorded fixations and found that observers' fixations were sparsely distributed and usually deployed to locations in space-time where concatenations of local features are informative. We then selected a set of NASs compiled at the fixations and used them as features for action classification. The classification accuracy was comparable to human performance and to that of the same model with automatically selected NASs. We concluded that encoding natural human actions in terms of NASs and their spatial-temporal concatenations accounts for aspects of human action understanding.

Meeting abstract presented at VSS 2013
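
As a rough illustration of the kind of pipeline the abstract describes, the Python sketch below (NumPy/scikit-learn) cuts videos into spatial-temporal patches at several scales, keeps the feature dimensions that carry the most mutual information about the action label, and trains a linear classifier. The patch geometry, the mean pooling, the mutual-information selection rule, and the linear SVM are illustrative assumptions, not the authors' NAS model or fixation-based feature selection.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.svm import LinearSVC


    def extract_st_patches(video, patch_hw=16, patch_t=8, stride=8):
        """Cut a video (T x H x W array) into flattened spatial-temporal patches."""
        T, H, W = video.shape
        patches = []
        for t in range(0, T - patch_t + 1, stride):
            for y in range(0, H - patch_hw + 1, stride):
                for x in range(0, W - patch_hw + 1, stride):
                    patches.append(
                        video[t:t + patch_t, y:y + patch_hw, x:x + patch_hw].ravel()
                    )
        return np.asarray(patches)


    def video_descriptor(video, scales=(8, 16, 32)):
        """Concatenate pooled patch responses across spatial scales -- a crude
        stand-in for a spatial-temporal concatenation of structured patches."""
        pooled = [
            extract_st_patches(video, patch_hw=s, stride=s).mean(axis=0)
            for s in scales
        ]
        return np.hstack(pooled)


    def train_action_classifier(videos, labels, n_keep=200):
        """Keep the most label-informative feature dimensions, then fit an SVM."""
        X = np.stack([video_descriptor(v) for v in videos])
        y = np.asarray(labels)
        mi = mutual_info_classif(X, y, random_state=0)
        keep = np.argsort(mi)[-min(n_keep, X.shape[1]):]
        clf = LinearSVC().fit(X[:, keep], y)
        return clf, keep


    # Illustrative usage with random arrays standing in for action videos:
    rng = np.random.default_rng(0)
    videos = [rng.standard_normal((40, 64, 64)) for _ in range(20)]
    labels = [i % 2 for i in range(20)]
    clf, keep = train_action_classifier(videos, labels)

In this sketch, feature selection is done over pooled patch dimensions; the study instead selects informative NASs, either automatically or from the locations of human fixations, which the toy selection step here only loosely mirrors.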
