Abstract
A challenge in vision science is understanding how the visual system parses the retinal image to represent intrinsic properties of scenes, such as surface reflectance and lighting. Deep learning networks have provided successful new approaches to inferring intrinsic images, and here we investigate these networks as models of human lightness constancy. We examined two state-of-the-art architectures for intrinsic image decomposition (Yu & Smith, 2019; Li et al., 2020), trained on photorealistic images of synthetic scenes. To compare network and human performance, we measured the networks' estimates of surface reflectance using Mondrian patterns embedded in an indoor scene. A reference patch was shown under a fixed illuminant, and multiple test patches were shown under five different illumination levels. At each illumination level, we rendered the test patch at 17 reflectance levels and interpolated the networks' reflectance estimates across these levels to find the test reflectance that matched the reference, thus probing the networks' lightness constancy. We repeated this procedure for three different reference reflectances. We also tested human observers in a corresponding lightness matching task, using the same stimuli presented on a virtual reality display. Human observers showed good lightness constancy, with an average constancy index (CI) of 0.81 across all stimuli. Their performance was also consistent, with CI standard deviations of about 0.10 across both reflectance and lighting conditions. The deep learning networks, however, showed poor reflectance constancy, with an average CI of 0.19. Qualitative analysis suggests that the networks often misinterpreted lighting changes as reflectance changes. The networks were also less consistent than humans, with CI standard deviations of 0.34 and 0.21 across reflectance and lighting conditions, respectively. These results show that the networks we tested do not fully model human lightness constancy.
We will discuss potential strategies to address this shortcoming, as well as proposals for further benchmarking such networks.
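The matching procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' analysis code. The function names are hypothetical, and the 17 test reflectance values are illustrative. It assumes the network's reflectance estimates rise monotonically with the rendered test reflectance. It also uses one common constancy index definition, which the abstract does not itself specify: 1 for a perfect reflectance match and 0 for a match that equates luminance.

```python
import numpy as np


def interpolate_match(test_reflectances, test_estimates, reference_estimate):
    """Return the rendered test reflectance whose estimate equals the reference estimate.

    Hypothetical helper. Assumes estimates rise monotonically with rendered
    reflectance, so the inverse mapping can be linearly interpolated.
    np.interp clamps to the end points if the reference estimate falls
    outside the range of test estimates.
    """
    estimates = np.asarray(test_estimates, dtype=float)
    reflectances = np.asarray(test_reflectances, dtype=float)
    order = np.argsort(estimates)  # np.interp needs increasing x-coordinates
    return float(np.interp(reference_estimate, estimates[order], reflectances[order]))


def constancy_index(match, reference_reflectance, reference_illuminance, test_illuminance):
    """Hypothetical CI: 1 for a perfect reflectance match, 0 for a luminance match.

    This is an assumed definition, not necessarily the one used in the study.
    """
    if test_illuminance == reference_illuminance:
        raise ValueError("CI is undefined when test and reference illuminants are equal")
    perfect_match = reference_reflectance
    # Test reflectance that would give the test patch the reference patch's luminance.
    luminance_match = reference_reflectance * reference_illuminance / test_illuminance
    return 1.0 - abs(match - perfect_match) / abs(luminance_match - perfect_match)


if __name__ == "__main__":
    # 17 hypothetical test reflectances; the study's actual levels are not given here.
    test_reflectances = np.linspace(0.05, 0.85, 17)
    reference_reflectance, ref_light, test_light = 0.4, 1.0, 2.0

    # A toy "luminance-driven" model whose estimate scales with illuminance,
    # i.e. it attributes the whole lighting change to reflectance.
    estimates = test_reflectances * test_light / ref_light
    reference_estimate = reference_reflectance  # reference estimated under ref_light
    match = interpolate_match(test_reflectances, estimates, reference_estimate)
    print(match, constancy_index(match, reference_reflectance, ref_light, test_light))
```

For this toy model, the script prints a match near 0.2 and a CI close to 0. This is the extreme form of the error in which lighting changes are read as reflectance changes. A model that reported rendered reflectance exactly would instead give a match of 0.4 and a CI of 1. Per-condition CIs computed this way could then be averaged, and their standard deviations taken across reflectance and lighting conditions.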