Abstract
When viewing actions, we not only recognize patterns of body movements, but we also "see" the intentions and social relations of people. However, there are individual differences in the ability to infer intentions from observed actions. Experienced forensic examiners, such as Closed Circuit Television (CCTV) operators, outperform novices in predicting and identifying hostile intentions in complex scenes. Yet it remains unknown which visual features CCTV operators actively attend to when viewing surveillance footage, and whether the attended features differ between experts and novices. In this study, we analyzed the visual content of image patches centered on the gaze fixations of CCTV operators and novices as they viewed the same surveillance footage of activity preceding harmful interactions, along with control footage. First, a visual saliency model was used to examine how salient image features (e.g., luminance, color, motion) guide gaze fixations during viewing of dynamic scenes. We found no group difference between experts and novices in the visual saliency of attended stimuli. We then employed a deep convolutional neural network (DCNN) and extracted features from the penultimate layer of the network for the gaze-centered input. Using machine learning classifiers, these DCNN features distinguished experts from novices, specifically for surveillance videos preceding harmful interactions (e.g., fighting). DCNN features also showed greater inter-subject correlations among CCTV operators than among novices. The results suggest that experts such as CCTV operators attend more efficiently to object-relevant information that is likely associated with social context and relationships between agents. Such differences in actively attending to high-level visual information enable experts to better predict harmful intentions in human activities.
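To make the feature-extraction and classification pipeline summarized above concrete, the following is a minimal sketch of one way it could be implemented: penultimate-layer DCNN features are computed for gaze-centered image patches and fed to a cross-validated classifier separating experts from novices. The choice of backbone (ImageNet-pretrained ResNet-50), classifier (logistic regression), and all variable names are illustrative assumptions, not the authors' implementation; the placeholder data exist only to show the classifier call.

```python
# Sketch (not the authors' code): penultimate-layer DCNN features from
# gaze-centered patches, then an expert-vs-novice classifier.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pretrained backbone with the final classification layer removed, so the
# forward pass returns penultimate-layer activations (2048-d for ResNet-50).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def penultimate_features(patches):
    """patches: list of HxWx3 uint8 arrays cropped around gaze fixations."""
    batch = torch.stack([preprocess(p) for p in patches])
    with torch.no_grad():
        feats = backbone(batch)  # shape: (n_patches, 2048)
    return feats.numpy()

# X: one aggregated feature vector per viewer; y: 1 = CCTV operator, 0 = novice.
# Random placeholders here, purely to illustrate the classification step.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2048))
y = np.array([1] * 10 + [0] * 10)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)  # cross-validated expert-vs-novice accuracy
print(scores.mean())
```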