Abstract
Task-optimized deep neural networks (DNNs) have been shown to yield impressively accurate predictions of brain activity in the primate visual system. For most networks, layer depth generally aligns with position along the V1-V4 hierarchy. This result has been construed as evidence that V1-V4 instantiates hierarchical computation. To test this interpretation, we analyzed the Natural Scenes Dataset, a massive dataset consisting of 7T fMRI measurements of human brain activity in response to up to 30,000 natural scene presentations per subject. We used this dataset to directly optimize DNNs to predict responses in V1-V4, flexibly allowing features to distribute across layers in any way that improves prediction of brain activity. Our results challenge three aspects of hierarchical computation. First, we find only a marginal advantage of jointly training on V1-V4 relative to training independent DNNs on each of these brain areas. This suggests that data from different areas offer largely independent constraints on the model. Second, the independent DNNs do not show the typical alignment of network layer depth with visual areas. This suggests that alignment may arise for reasons other than computational depth. Finally, we performed transfer learning between the DNN features learned on each visual area. We show that features learned on anterior areas (e.g., V4) generalize poorly to the representations found in more posterior areas (e.g., V1). Together, these results indicate that the features represented in V1-V4 do not necessarily bear hierarchical relationships to one another. Overall, we suggest that human visual areas V1-V4 do not serve solely as a pre-processing stream for generating higher visual representations, but may also operate as a parallel system of representation that can serve multiple independent functions.