Abstract
Humans can easily and quickly identify objects, an ability thought to be supported by category-selective regions in lateral occipital cortex (LO) and ventral temporal cortex (VTC). However, prior evidence for this claim has not distinguished whether category-selective regions represent objects or simply represent complex visual features regardless of spatial arrangement, i.e., texture. If category-selective regions directly support object perception, one would expect human performance in discriminating objects from textures with scrambled object features to be predicted by the representational geometry of category-selective regions. To test this claim, we leveraged an image synthesis approach that provides independent control over the complexity and spatial arrangement of visual features. In a conventional categorization task, we indeed find that BOLD responses from category-selective regions predict human behavior. However, in a perceptual task where subjects discriminated real objects from synthesized textures containing scrambled features, visual cortical representations failed to predict human performance. Whereas human observers were highly sensitive at detecting the real object, visual cortical representations were insensitive to the spatial arrangement of features and were therefore unable to identify the real object amidst feature-matched textures. We find the same insensitivity to feature arrangement, and the same inability to predict human performance, in a model of macaque inferotemporal cortex and in ImageNet-trained deep convolutional neural networks. How then might these texture-like representations support object perception? We found that an image-specific linear transformation of visual cortical responses yielded a representation that was more selective for natural feature arrangement, demonstrating that the information necessary to support object perception is accessible, though it requires additional neural computation.
Taken together, our results suggest that the role of visual cortex is not to explicitly encode a fixed set of objects but rather to provide a basis set of texture-like features that can be infinitely reconfigured to flexibly identify new object categories.