Abstract
Humans can quickly detect, recognize, and classify a range of actions. What are the spatial-temporal features and computations that underlie this ability? Global representations such as spatial-temporal volumes can be highly informative, but they depend on segmentation and tracking. Local representations such as histograms of optic flow lack descriptive power and require extensive training. Recently, we developed a model in which any human action is encoded by a spatial-temporal concatenation of natural action structures (NASs), i.e., sequences of structured patches in human actions at multiple spatial-temporal scales. We compiled NASs from videos of natural human actions, examined the statistics of these NASs, selected a set of highly informative NASs, and used them as features for action classification. The NASs obtained in this way achieved significantly better recognition performance than simple spatial-temporal features. To examine to what extent this model accounts for human action understanding, we hypothesized that humans search for informative NASs in this task and performed visual psychophysical studies. We asked 12 subjects with normal vision to classify 500 videos of human actions while tracking their fixations with an EyeLink II eye tracker. We examined the statistics of the NASs compiled at the recorded fixations and found that the observers' fixations were sparsely distributed and usually deployed to locations in space-time where concatenations of local features were informative. We then selected a set of NASs compiled at the fixations and used them as features for action classification. The resulting classification accuracy was comparable to human performance and to that of the same model with automatically selected NASs. We concluded that encoding natural human actions in terms of NASs and their spatial-temporal concatenations accounts for aspects of human action understanding.
Meeting abstract presented at VSS 2013
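The abstract does not specify how informativeness of candidate NASs is measured or which classifier is used. As a minimal illustrative sketch only, assuming informativeness is scored by the mutual information between the presence of a candidate spatial-temporal structure and the action-class label, the feature-selection step might look like the following; all function names and the synthetic data are hypothetical and not taken from the original study.

```python
import numpy as np

def mutual_information(presence, labels):
    """Estimate mutual information (bits) between a binary feature
    (presence/absence of a candidate spatial-temporal structure in a
    clip) and the action-class label, from empirical frequencies."""
    mi = 0.0
    for f in (0, 1):
        p_f = np.mean(presence == f)
        if p_f == 0:
            continue
        for c in np.unique(labels):
            p_c = np.mean(labels == c)
            p_fc = np.mean((presence == f) & (labels == c))
            if p_fc > 0:
                mi += p_fc * np.log2(p_fc / (p_f * p_c))
    return mi

def select_informative_features(presence_matrix, labels, k):
    """Rank candidate features (columns) by mutual information with the
    class label and keep the top k -- a stand-in for selecting the most
    informative NASs as classification features."""
    scores = np.array([mutual_information(presence_matrix[:, j], labels)
                       for j in range(presence_matrix.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Hypothetical toy data: 100 action clips, 50 candidate structures
# (binary presence/absence codes), and 5 action classes.
rng = np.random.default_rng(0)
presence = rng.integers(0, 2, size=(100, 50))
labels = rng.integers(0, 5, size=100)
top = select_informative_features(presence, labels, k=10)
print("Indices of the 10 most informative candidate structures:", top)
```

In this sketch, the same scoring could in principle be applied either to automatically compiled NASs or to NASs compiled at recorded fixations, which is the comparison the abstract describes; the actual selection criterion and classifier used in the study are not stated here.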