Abstract
Current state-of-the-art deep networks show superior performance over classic object recognition methods, and in some cases their accuracy surpasses the human level. However, it is not clear to what extent these deep networks mimic the properties of the human visual system. Previous studies have reported resemblance between the human visual system and deep neural networks by visualizing the trained filters at each layer. Other studies, however, show significant differences between human performance and deep networks in shape discrimination, e.g., match-to-sample tasks that require a more complex representation of object properties. It therefore remains unclear how these networks perform when the representation is meant to guide action, such as a hand movement to intercept a ball. To investigate this, we used egocentric screen images from a previously published virtual reality (VR) ball-catching dataset in which subjects attempted to intercept a virtual ball flying in depth. The images were fed to a pre-trained deep network, and the extracted deep features were used to train an SVM regression model to reproduce the ground-truth hand position. For comparison, a second SVM regression model was trained on features computed from the kinematics of the ball's motion, i.e., angular size, velocity, acceleration, and expansion rate. Our results show that the cross-correlation between the activation patterns of the deep features and the kinematic features is highest in the first few layers of the deep network. This suggests that the initial layers of a deep network encode a representation of the visual information similar to that of a target-kinematics model, and thus appropriate for guiding hand movement in a target interception task.
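To make the pipeline concrete, the sketch below shows one way the deep-feature branch could be set up: activations from an early layer of a pre-trained network are flattened per frame and used to fit an SVM regressor on hand position. This is an illustrative assumption, not the authors' code; the choice of torchvision's VGG16, the cut-off layer, scikit-learn's SVR, and all variable names are hypothetical.

```python
# Minimal sketch (assumptions: VGG16 backbone, early-layer cut, scikit-learn SVR).
import numpy as np
import torch
from torchvision import models
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

# Pre-trained network, truncated after an early convolutional block (hypothetical cut point).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
early_layers = vgg.features[:5]

def deep_features(frames: torch.Tensor) -> np.ndarray:
    """frames: (N, 3, 224, 224) egocentric VR frames, already normalized."""
    with torch.no_grad():
        acts = early_layers(frames)               # (N, C, H, W) activations
    return acts.flatten(start_dim=1).numpy()      # one feature vector per frame

# Placeholder data standing in for the VR dataset: frames and 3-D hand positions.
X_frames = torch.rand(100, 3, 224, 224)
y_hand = np.random.rand(100, 3)

# SVM regression from deep features to ground-truth hand position.
svm_deep = MultiOutputRegressor(SVR(kernel="rbf"))
svm_deep.fit(deep_features(X_frames), y_hand)

# The comparison model would be fit the same way, but on kinematic features of the
# ball (angular size, velocity, acceleration, expansion rate) instead of activations.
```

The layer-wise comparison reported in the abstract would then amount to repeating this feature extraction at each layer and cross-correlating the resulting activation patterns with the kinematic features.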