Abstract
Humans constantly process scene information from their environment, requiring quick and accurate decision-making and behavioural responses. Despite the importance of this process, it remains unknown which cortical representations underlie this function. Additionally, to date there is no unifying model of scene categorization that can predict neural and behavioural correlates as well as their relationship. Here, we approached these questions empirically and via computational modelling using deep neural networks. First, to determine which scene representations are suitably formatted for behaviour, we collected electroencephalography (EEG) data and reaction times from human subjects during a scene categorization task (natural vs. man-made) and an orthogonal task (fixation cross colour discrimination). Then, we linked the neural representations with reaction times in within-task and cross-task analyses using the distance-to-hyperplane approach, a multivariate extension of signal detection theory. We observed that neural data and categorization reaction times were correlated between ~100 ms and ~200 ms after stimulus onset, even when the neural data were from the orthogonal task. This identifies when post-stimulus representations suitably formatted for behaviour emerge. Second, to provide a unified model of scene categorization, we evaluated a recurrent convolutional neural network in terms of its capacity to predict a) human neural data, b) human behavioural data, and c) the brain-behaviour relationship. We observed similarities between the network and humans on all levels: the network correlated strongly with humans with respect to both neural representations and reaction times. In terms of the brain-behaviour relationship, EEG data correlated with network reaction times between ~100 ms and ~200 ms after stimulus onset, mirroring the results of the empirical analysis. Altogether, our results provide a unified empirical and computational account of scene categorization in humans.
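For illustration, the sketch below shows one way the distance-to-hyperplane approach can be implemented: per EEG timepoint, a linear classifier is trained on natural vs. man-made scenes, each trial's cross-validated signed distance to the decision hyperplane is computed, and those distances are correlated with reaction times. This is a minimal sketch, not the authors' implementation; the array shapes, variable names, scikit-learn-based pipeline, and the negative-correlation prediction (larger distance, faster response) are all illustrative assumptions.

```python
# Minimal sketch of a distance-to-hyperplane analysis (illustrative
# assumptions throughout; synthetic data stand in for real EEG).
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_trials, n_channels, n_timepoints = 200, 64, 120   # hypothetical EEG dimensions
eeg = rng.standard_normal((n_trials, n_channels, n_timepoints))
labels = rng.integers(0, 2, n_trials)               # 0 = natural, 1 = man-made
rts = rng.uniform(0.3, 0.9, n_trials)               # reaction times in seconds

brain_behaviour = np.empty(n_timepoints)
for t in range(n_timepoints):
    X = eeg[:, :, t]
    # Out-of-sample signed distance of each trial to the decision
    # hyperplane, via cross-validated decision_function values.
    dist = cross_val_predict(LinearSVC(), X, labels,
                             cv=5, method="decision_function")
    # Flip the sign for the "natural" class so that larger values
    # always mean stronger evidence for the correct category.
    signed = np.where(labels == 1, dist, -dist)
    # Framework prediction (assumed here): representations farther from
    # the hyperplane go with faster responses, i.e. a negative correlation.
    brain_behaviour[t], _ = spearmanr(signed, rts)

print("peak |correlation| at timepoint", np.argmax(np.abs(brain_behaviour)))
```

In this framing, the per-timepoint correlation trace is what would be inspected for the ~100-200 ms post-stimulus window reported above; with real data, the classifier would be trained within-task or cross-task depending on the analysis.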