Abstract
What kind of information does the human brain use when we perceive a natural scene? We investigated this question using representations from 20 task-specific deep neural networks, each trained to perform a different visual task on natural scenes. Using whole-brain data from two of the largest fMRI datasets of natural scene viewing, NSD (Natural Scenes Dataset) and BOLD5000, we built voxelwise encoding models that use the representations learned for each task to predict brain responses to the viewed scenes. Our results show that networks trained on 2D and 3D tasks explain distinct variance in the brain; in particular, high-level visual processing is better explained by 3D representations. Moreover, network models that learned to focus on different image regions to perform their tasks predicted distinct receptive fields along the visual pathway. In aggregate, the individual brain prediction maps from each task representation enabled us to recover a landscape of how task-related information is processed across the brain. More generally, we suggest that using representations from a pool of task-driven deep neural networks provides a means of combining the power of deep learning in extracting complex representations with the interpretability needed to better explain complex processes in the human brain.
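To make the voxelwise encoding approach concrete, the sketch below shows one common way such a model can be fit: regularized linear regression from the activations of a single task network to per-voxel fMRI responses, scored by held-out prediction accuracy. The feature matrix X, response matrix Y, ridge penalty grid, and function name are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of a voxelwise encoding model (assumed setup, not the
# authors' exact pipeline): ridge regression from one task network's
# features to fMRI voxel responses, scored on held-out images.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def fit_encoding_model(X, Y, alphas=np.logspace(-1, 5, 7), test_size=0.2, seed=0):
    """X: (n_images, n_features) activations from one task-specific network.
       Y: (n_images, n_voxels) fMRI responses to the same images.
       Returns per-voxel prediction accuracy (Pearson r) on held-out images."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=test_size,
                                              random_state=seed)
    model = RidgeCV(alphas=alphas)   # one shared penalty fit jointly to all voxels
    model.fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    # Pearson correlation between predicted and observed responses, per voxel
    Y_hat_z = (Y_hat - Y_hat.mean(0)) / Y_hat.std(0)
    Y_te_z = (Y_te - Y_te.mean(0)) / Y_te.std(0)
    return (Y_hat_z * Y_te_z).mean(0)

# Repeating this fit for each of the 20 task networks and comparing the
# resulting per-voxel scores yields task-by-voxel prediction maps of the
# kind summarized in the abstract.
```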