Abstract
Deep Neural Networks (DNNs) have recently been put forward as computational models for feedforward processing in the human and monkey ventral streams. Not only do they achieve human-level performance in image classification tasks, but recent studies have also found striking similarities between DNNs and ventral stream processing in terms of the learned representations (e.g. Cadieu et al., 2014, PLOS Comput. Biol.) and the spatial and temporal stages of processing (Cichy et al., 2016, arXiv). To obtain a more precise understanding of the similarities and differences between current DNNs and the human visual system, here we investigate how classification accuracy depends on image properties such as colour, contrast, and the amount of additive visual noise, as well as on image distortions produced by the Eidolon Factory. We report results from a series of image classification (object recognition) experiments on both human observers and three DNNs (AlexNet, VGG-16, GoogLeNet). We used experimental conditions favouring single-fixation, purely feedforward processing in human observers (a short presentation time of t = 200 ms followed by a high-contrast mask); additionally, exactly the same images from 16 basic-level categories were used for human observers and DNNs. Under non-manipulated conditions the DNNs indeed outperformed human observers (96.2% versus 88.5% correct; colour, full-contrast, noise-free images). However, human observers clearly outperformed the DNNs under all image-degrading manipulations: most strikingly, DNN performance breaks down severely with even small amounts of random visual noise. Our findings underscore how robust the human visual system is to various image degradations, and indicate that there may still be marked differences in the way the human visual system and the three tested DNNs process visual information. We discuss which differences between known properties of the early and higher visual system and DNNs may be responsible for the behavioural discrepancies we find.
Meeting abstract presented at VSS 2017
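
For illustration only, below is a minimal sketch of how contrast reduction and additive noise could be applied to an image before feeding it to a pretrained network; it is not the authors' actual pipeline. It assumes torchvision's pretrained ImageNet models, an illustrative input file "example.jpg", and arbitrarily chosen contrast and noise levels; the mapping of ImageNet outputs onto the 16 basic-level categories used in the study is omitted here.

```python
# Hypothetical sketch: degrade an image (contrast scaling, additive Gaussian noise)
# and classify it with a pretrained DNN. File name, parameter values, and model
# choice are illustrative assumptions, not the study's actual procedure.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),          # a greyscale condition could add T.Grayscale(num_output_channels=3)
])
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

def degrade(img, contrast=1.0, noise_sd=0.0):
    """Scale contrast around mid-grey and add Gaussian pixel noise, clipped to [0, 1]."""
    out = 0.5 + contrast * (img - 0.5)
    out = out + noise_sd * torch.randn_like(out)
    return out.clamp(0.0, 1.0)

# Older torchvision versions use pretrained=True instead of the weights argument.
model = models.vgg16(weights="IMAGENET1K_V1").eval()   # likewise models.alexnet / models.googlenet

img = preprocess(Image.open("example.jpg").convert("RGB"))
for contrast, noise_sd in [(1.0, 0.0), (0.3, 0.0), (1.0, 0.1)]:
    x = normalize(degrade(img, contrast, noise_sd)).unsqueeze(0)
    with torch.no_grad():
        top1 = model(x).argmax(dim=1).item()
    print(f"contrast={contrast}, noise_sd={noise_sd} -> ImageNet class index {top1}")
```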