Abstract
Two decades of research on material perception show that humans can estimate material properties in diverse scenes. However, the field still lacks a computational framework that explains these perceptual data. Deep neural networks trained on large datasets have succeeded in object recognition and classification. Will they succeed in estimating the material properties of objects in dynamic scenes? One challenge is that it is difficult to obtain clear-cut labels for material properties. In addition, there is no simple relationship between perceived material attributes and physical measurements. Here we use physics-based cloth simulations as our database and train neural networks with a variety of architectures to investigate the computational mechanisms underlying the estimation of the mechanical properties of cloth. We hypothesize that the network architecture that best matches human performance is the one most likely to be adopted by the human visual system. The dataset contains 1764 animations of moving cloth rendered with 7 bending-stiffness values, 6 mass values, and 6 textures in 3 dynamic scenes. First, we fine-tune a pre-trained ResNet (a convolutional neural network, CNN) to extract features from static frames of maximum deformation, using the ground-truth simulation parameters as labels. Second, we connect a long short-term memory network (LSTM) in series with the CNN and train on the videos with the same labels. Third, we use multidimensional scaling (MDS) to obtain perceptual clusters of these stimuli and use those clusters as labels to train the same architecture (CNN + LSTM) on the videos. Finally, we build a two-stage network in which the first stage performs unsupervised clustering of the data to obtain labels, and the second stage uses those labels to learn a discriminative model that classifies the videos. We use Fisher vectors computed from dense video trajectories to represent the data in the first stage. We find that the two-stage network outperforms the other architectures and that its results are closest to human perception.
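To make the CNN + LSTM architecture concrete, the sketch below shows one way such a video classifier could be assembled. It is not the authors' implementation: the ResNet-18 backbone, PyTorch framework, hidden size, sequence length, and number of output classes are all illustrative assumptions, and the training labels (ground-truth parameters or perceptual clusters) would be supplied externally.

```python
# Minimal sketch (assumed details, not the paper's code) of a CNN + LSTM video classifier:
# a ResNet backbone extracts per-frame features, an LSTM summarizes the frame sequence,
# and a linear head predicts a material-property class (e.g., a bending-stiffness level).
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTMClassifier(nn.Module):
    def __init__(self, num_classes=7, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)   # pre-trained weights could be loaded here
        backbone.fc = nn.Identity()                # keep the 512-d pooled features
        self.cnn = backbone
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, video):                      # video: (batch, frames, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))      # per-frame CNN features: (b*t, 512)
        feats = feats.view(b, t, -1)
        _, (h, _) = self.lstm(feats)               # temporal summary of the frame sequence
        return self.head(h[-1])                    # class scores over assumed label set

# Usage example: two 16-frame clips at 224x224 resolution
logits = CNNLSTMClassifier()(torch.randn(2, 16, 3, 224, 224))
```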