Abstract
Human visual perception captures the 3D shape of objects. While convolutional neural networks (CNNs) resemble some aspects of human visual processing, there is a well-documented performance gap between CNNs and humans on shape processing tasks. A new deep learning approach, 3D neural fields (3D-NFs), has driven remarkable recent progress in 3D computer vision. 3D-NFs encode the geometry of objects in a coordinate-based representation (e.g., input: an xyz coordinate; output: the volume density and RGB color at that position). Here, we investigate whether humans and 3D-NFs display similar behavior on 3D match-to-sample tasks. In each trial, a participant sees a rendered sample image of a manmade object and must choose between a target image showing the same object from a different viewpoint and a lure image showing a different object. We trained 3D-NFs that take an image as input and output a rendered image of the depicted object from a new viewpoint (multi-view loss). A model trial is scored as correct if the 3D-NFs computed from the sample and target images are more similar to each other than the 3D-NFs computed from the sample and lure images. In Experiment 1 (n=120), 3D-NF behavior is more similar to human behavior than is the behavior of standard object-recognition CNNs, regardless of whether lure objects are (a) from a different category than the target, (b) from the same category as the target, or (c) chosen so that their 3D-NF is as similar as possible to that of the target. In Experiment 2 (n=200), we create 5 difficulty conditions using 25 CNNs. Again, we find remarkable agreement between 3D-NF and human behavior, with both largely unaffected by the CNN-defined conditions. In Experiment 3 (n=200), we replicate Experiment 2 using algorithmically generated shapes with no category structure. Overall, 3D-NFs and humans show similar patterns of behavior for 3D shape judgements, suggesting that 3D-NFs are a promising framework for investigating human 3D shape perception.
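The scoring rule for model trials can be illustrated with a minimal sketch. It assumes that each image's 3D-NF is summarized as a fixed-length vector and that similarity is measured by cosine similarity; the abstract specifies neither, so both choices, along with the function names `cosine_similarity`, `trial_is_correct`, and `accuracy`, are hypothetical and used only for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def trial_is_correct(sample, target, lure):
    """A trial counts as correct when the sample representation is more
    similar to the target representation than to the lure representation."""
    return cosine_similarity(sample, target) > cosine_similarity(sample, lure)

def accuracy(trials):
    """Fraction of correct trials; each trial is a (sample, target, lure) triple."""
    return float(np.mean([trial_is_correct(s, t, l) for s, t, l in trials]))
```

Under these assumptions, model accuracy in a condition is the fraction of (sample, target, lure) triples scored as correct, which can then be compared with human accuracy on the same trials.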