Abstract
It is rare that humans are required to recognize objects without a surrounding context. Previous research has shown that modifying scene information can reduce the speed and accuracy of object recognition in human observers. Although convolutional neural networks (CNNs) can attain near-human-level performance on simple object recognition tasks, it remains unclear whether these models of biological vision continue to reflect human abilities when objects appear in complex scenes. Here, we investigated the impact of visual clutter and semantic congruency on object recognition accuracy in humans and CNNs. Eighteen undergraduate students and four CNNs implemented in PyTorch were shown 384 greyscale images, each consisting of a target object superimposed on a background scene. We manipulated the level of visual clutter, defined as the amount of texture, pattern, or excess information in an image, and the semantic congruency, defined as whether the object-scene pairing was realistic. The eight target categories consisted of animals (bear, bison, elephant, owl) and common indoor objects (lamp, teapot, vacuum, vase), which were presented in either outdoor nature scenes or indoor scenes. Separate participants rated the scenes on their degree of clutter, and the scenes were sorted into low- and high-clutter sets. We found that human observers performed significantly worse with increased clutter, whereas CNN performance was unaffected by clutter. Interestingly, the CNNs showed significantly better classification accuracy for congruent than for incongruent object-scene pairings, while the human observers did not. However, human participants did show a congruency bias effect, choosing a congruent over an incongruent category on a significant proportion of trials in which they reported low confidence. Our findings reveal notable deviations between human and CNN object classification performance and indicate that CNN models do not process background scene context in the same way that humans do.