Abstract
The presence of category-selective areas in ventral temporal cortex (VTC) of humans and other primates has been used to support modular theories of perception containing separable components for the processing of categories such as faces and text. However, substantial evidence supports a non-modular, distributed account of processing containing topographic, graded specialization. Whether or not the developed system is best characterized as modular, a theory of its development is required. We performed small-scale abstract and large-scale visual recognition simulations to understand the development of specialization in tasks with varying degrees of functional overlap. Abstract autoencoder simulations revealed a small benefit from sharing hidden representations across orthogonal input domains – that is, from avoiding modularity. However, when the autoencoder was required to simultaneously encode inputs from both domains, it developed fully modular representations. By varying the fraction of inputs coming from a single domain or multiple domains, we could precisely control the degree of developed modularity. We next examined a deep convolutional neural network trained to recognize objects and faces. A fully shared network performed slightly better than architecturally modular networks matched in total units. Further, the shared network developed substantial but graded specialization for objects and faces, with many units demonstrating domain-preferential mean responses and category-invariant information, while retaining such properties for the non-preferred domain. In ongoing work with a map-like deep convolutional recurrent neural network, we find that a simple and biologically plausible scaling of connection noise or probability with axon distance may be sufficient to produce localized face-selective clusters.
Our modeling approach demonstrates that graded, localized specialization may emerge from optimizing hidden representations for multiple tasks under architectural constraints, and that such graded specialization may be preferable to modularity even in the abstract scenario of representing orthogonal patterns. Our results thus weaken the case for full-fledged modularity in visual recognition.