Abstract
Convolutional neural networks (CNNs) have led to a major advance in machine-based object classification performance and are sometimes described as approaching near-human levels of recognition accuracy. CNNs are also of great interest because they provide a biologically plausible model of object recognition; however, to what extent does their performance resemble that of human observers? The goal of this study was to evaluate the similarity and robustness of human and CNN performance by presenting objects under challenging viewing conditions with varying levels of visual white noise. For the CNN, we relied on the AlexNet model (Krizhevsky et al., 2012) and images obtained from the 2012 ImageNet Large Scale Visual Recognition Challenge. We pre-selected 16 of the 1000 object categories (8 animate and 8 inanimate) to compare performance between humans and the machine. Participants were briefly presented with each of 800 object images just once, at a randomly determined signal-to-noise ratio (SNR), and asked to identify which of the 16 categories was shown. To compare human and CNN performance, we normalized and fitted the performance data with a modified sigmoid function to determine the threshold SNR needed to reach 50% accuracy on this identification task. Human observers required ~25% signal in the images to reach threshold levels of recognition accuracy, whereas AlexNet required greater than 60% signal on average. Moreover, humans were generally better at recognizing inanimate than animate objects, while the CNN showed no clear categorical advantage. These results suggest that human recognition is much more robust to visual noise than current CNNs, and that people may be able to infer diagnostic features of objects at much lower levels of SNR.
Meeting abstract presented at VSS 2017
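
The abstract does not specify the exact noise-mixing procedure or the parameterization of the modified sigmoid, so the following is only a minimal illustrative sketch, assuming images normalized to [0, 1], a pixelwise linear blend of signal and uniform white noise, a logistic psychometric function with a chance-level floor of 1/16 (for the 16-way choice) and a free lapse rate, and SciPy's curve_fit for the fit. The helper names (add_white_noise, fit_threshold) are hypothetical, not taken from the study.

```python
import numpy as np
from scipy.optimize import curve_fit

def add_white_noise(image, snr):
    """Blend an image (values in [0, 1]) with uniform white noise.

    snr is the proportion of signal (0-1); the remainder of each pixel's
    value comes from noise. This linear mixture is only an illustrative
    stand-in for the study's actual noise manipulation.
    """
    noise = np.random.rand(*image.shape)
    return snr * image + (1.0 - snr) * noise

def psychometric(snr, threshold, slope, lapse, n_categories=16):
    """Accuracy as a function of SNR: a logistic rising from chance
    (1/n_categories) to 1 - lapse, with its midpoint at 'threshold'."""
    chance = 1.0 / n_categories
    upper = 1.0 - lapse
    return chance + (upper - chance) / (1.0 + np.exp(-slope * (snr - threshold)))

def fit_threshold(snr_levels, accuracy, criterion=0.5):
    """Fit the psychometric function to (SNR, accuracy) data and return
    the SNR at which the fitted curve reaches 'criterion' accuracy."""
    p0 = [0.4, 10.0, 0.02]                      # threshold, slope, lapse
    params, _ = curve_fit(psychometric, snr_levels, accuracy, p0=p0,
                          bounds=([0.0, 0.0, 0.0], [1.0, 100.0, 0.5]))
    threshold, slope, lapse = params
    # Invert the fitted logistic to find the SNR giving 'criterion' accuracy.
    chance = 1.0 / 16
    y = (criterion - chance) / ((1.0 - lapse) - chance)
    return threshold + np.log(y / (1.0 - y)) / slope

# Example usage with made-up data: accuracy measured at several SNR levels.
snr_levels = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0])
accuracy = np.array([0.08, 0.15, 0.55, 0.80, 0.92, 0.95, 0.97])
print(fit_threshold(snr_levels, accuracy))       # estimated 50%-correct SNR
```

Under these assumptions, the same fitting routine can be applied to the human and AlexNet accuracy-by-SNR curves, and the resulting 50%-correct thresholds compared directly (e.g., ~0.25 for humans versus >0.60 for the CNN, as reported above).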