Abstract
We previously showed that deep learning models that estimate intrinsic image components (albedo and illuminance) outperform classic models on lightness constancy tasks. Here, we examine what cues these models rely on. We considered two cue types: natural features such as shadows and shading, and artifacts of ray-tracing software, which typically produces residual rendering noise that varies with local illumination. We rendered training, validation, and test sets via (1) ray tracing (Blender/Cycles) with 128 samples per pixel (high residual noise); (2) the same as (1) but with 1024 samples per pixel (low noise); and (3) Blender's EEVEE renderer (a rasterization engine that produces no such noise). (Noise artifacts are also found in other ray-tracing renderers, including Mitsuba.) Networks trained on EEVEE images showed similar performance on all three test sets (and performed much better than classic models), whereas networks trained on Cycles performed best on Cycles test images and worst on EEVEE images. To assess dependence on naturalistic cues, we tested the networks on images with various scene elements removed: (1) cast shadows on the floor; (2) shading; (3) all shadows and shading. In condition (3), no naturalistic lighting cues were available, yet models trained on Cycles retained partial, though reduced, constancy. These models were also almost unaffected by the removal of shadows and shading alone (less than a 10% decrease in constancy). In contrast, networks trained on EEVEE showed a 50% decrease in constancy when floor shadows were removed, and had the lowest constancy in condition (3). These results show that widely used ray-tracing methods typically produce artifacts that networks can exploit to achieve lightness constancy. When these artifacts are avoided, networks rely on more naturalistic lighting cues and still exhibit human levels of constancy. Thus, deep networks provide a promising starting point for image-computable models of human lightness and color perception.
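
For concreteness, the three rendering conditions can be configured through Blender's Python API roughly as in the minimal sketch below. This is an illustrative sketch only, not the authors' actual pipeline; the function name, output paths, and the assumption that a scene is already loaded are hypothetical.

    import bpy

    def render_condition(engine, samples, out_path):
        """Render the current scene under one rendering condition.
        A sketch: assumes a scene is already set up and loaded."""
        scene = bpy.context.scene
        scene.render.engine = engine  # 'CYCLES' or 'BLENDER_EEVEE'
        if engine == 'CYCLES':
            # Per-pixel sample count controls residual Monte Carlo noise:
            # 128 samples -> high-noise condition, 1024 -> low-noise condition.
            scene.cycles.samples = samples
        scene.render.filepath = out_path
        bpy.ops.render.render(write_still=True)

    # Hypothetical usage for the three training/test sets.
    render_condition('CYCLES', 128, '/tmp/cycles_128.png')      # high residual noise
    render_condition('CYCLES', 1024, '/tmp/cycles_1024.png')    # low residual noise
    render_condition('BLENDER_EEVEE', None, '/tmp/eevee.png')   # rasterization, no such noise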