Abstract
Although ‘glossiness’ is an optical property of materials and ‘softness’ is a mechanical one, there is an intriguing perceptual connection between the two: both specular reflections and shape deformations produce distinctive motion patterns. Observers are generally excellent at determining properties of moving surfaces. However, under certain circumstances, reflections and deformations can be confused, with rigidly transforming mirrors appearing non-rigid and deforming matte-textured objects occasionally appearing somewhat shiny. Here, we investigated whether similar broad successes and specific confusions also arise in an unsupervised recurrent neural network (PixelRNN) trained to predict video sequences. We previously found that such networks reproduce several key gloss perception phenomena, including the ‘sticky-reflection’ effect (Doerschner et al., 2011, Current Biology), wherein reflections that move with the surface (instead of sliding across it) appear like matte texture markings. We generated 8,000 20-frame movies of objects with diverse appearances and motions by varying shape, glossiness, soft-body properties (including rigid objects), illumination environment, texture map, and rotational direction and speed. After training, the PixelRNN could synthesize accurate video predictions up to 10 frames into the future. t-SNE visualization of the embedding of new stimuli in the network’s internal representation revealed a clear clustering and disentanglement of the different object types, as confirmed by linear logistic classification of the stimuli according to their surface glossiness and deformability. Over the course of just the first three frames, the accuracy of softness classification rose dramatically from chance to 99%.
When participants compared the glossiness and rigidity of pairs of stimuli from the test set, we found that human judgements correlated with predictions derived from the embedding of stimuli in the network’s representation. These findings demonstrate that unsupervised predictive learning can disentangle softness and glossiness properties of objects, much like humans, without any explicit training about the distal object properties.