Abstract
Deep learning convolutional neural networks (CNNs) have shown impressive results on many computer vision tasks. They have also performed well in modelling human vision, leading some to suggest that they are inherently good models of human visual processing. Since CNNs are classifiers, they typically recast problems as classification and excel when performance is measured as an accuracy score. Less well studied is a CNN's ability to model the errors, mistakes and other incongruities that people make when interpreting their visual world. We tested the ability of CNNs to model human data for cognitive and neural phenomena that highlight peculiarities of human vision: specifically, whether a CNN trained on upright faces and houses can model results from the face inversion effect (FIE) and the impact of transcranial magnetic stimulation (TMS) on face and object recognition. We gathered data from 19 participants performing a matching task for faces or houses. Behavioural conditions included upright and inverted stimuli; TMS conditions included rOFA, rOPA or Sham stimulation. Human accuracy scores showed a typical FIE, and our TMS manipulation reduced the FIE by impairing identification accuracy for upright face pairs (although we did not replicate the expected double dissociation reported by Pitcher et al. (2011) and Dilks et al. (2013)). We trained a series of CNNs on upright faces and houses to match human matching accuracy and then tested them on the same inverted stimuli shown to human participants. While we could easily match human performance on upright faces, none of the networks showed the FIE when tested on inverted stimuli; in fact, the only interaction produced by a CNN solution was a house inversion effect. We further modified our CNN solutions by perturbing the weights of the mid-network layers to simulate the virtual lesioning of the TMS conditions. Again, the lesioned CNNs were unable to match the human TMS results.
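As a rough illustration of the virtual-lesioning manipulation described above, a minimal sketch of perturbing one layer's weights might look like the following. This is not the exact procedure used in the study: the noise model (Gaussian, scaled to the layer's own weight spread), the `severity` parameter, and the toy three-layer network are all assumptions for illustration.

```python
import numpy as np

def lesion_weights(weights, severity=0.5, rng=None):
    """Simulate a 'virtual lesion' by adding Gaussian noise to a layer's
    weight matrix. The noise standard deviation is `severity` times the
    standard deviation of the weights themselves, so severity=0 leaves
    the layer intact. `severity` is a hypothetical knob, not a parameter
    from the study."""
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(0.0, severity * weights.std(), size=weights.shape)
    return weights + noise

# Toy example: perturb only the middle layer of a three-layer stack,
# analogous to lesioning mid-network layers while leaving others intact.
rng = np.random.default_rng(42)
layers = [rng.normal(size=(4, 4)) for _ in range(3)]
original_mid = layers[1].copy()
layers[1] = lesion_weights(layers[1], severity=0.5)
```

In a real experiment the perturbed network would then be re-run on the same matching task to see whether accuracy drops in a pattern comparable to the TMS conditions.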