The main reason for the second objection (that intermediate representations obviously could not exist) is that simply combining signals from each layer to make more complex representations would require far too many computational units (e.g., neurons). For instance, if we were to combine in a pairwise fashion the outputs of all V1 neurons in area V2, then V2 would need to contain on the order of the square of the number of neurons in V1, which it clearly does not. Similarly, the next stage would need some factorial combination of the cells in V2, and we quickly find a combinatorial explosion of entities that need to be encoded. This is clearly a problem that needs to be considered and addressed, but it is not necessarily a reason to discard the entire endeavor. The circuits leading up to the primary visual cortex certainly take essentially this approach. The representation of spatiotemporal contrast found in ganglion and lateral geniculate nucleus (LGN) cells is constructed by a center-surround organization of excitatory and inhibitory inputs from photoreceptors via a complex intermediate circuit involving horizontal, bipolar, and amacrine cells (Dowling & Boycott, 1966). Similarly, by combining the outputs of these units we could create oriented receptive fields with inhibitory and excitatory regions, as found in V1 simple cells (Hubel & Wiesel, 1962). People do not seem to have raised the combinatorial explosion issue for these steps; no one claims that there are too many possible combinations for V1 to represent combined LGN outputs. Possibly, that is because V1 has more neurons than the LGN, but note that V1 contains approximately 140 million neurons (Wandell, 1995) compared with the approximately 1.5 million retinal ganglion cells (Hecht, 2001), an increase of less than a hundredfold, which is insufficient to handle the sort of exponential increase implied by the combinatorial explosion. Furthermore, the step from photoreceptor to ganglion cell constitutes a reduction of processing units from approximately 4.6 million (Curcio, Sloan, Kalina, & Hendrickson, 1990) to approximately 1.5 million retinal ganglion cells (Hecht, 2001). It seems curious, then, that we are willing to accept that LGN outputs are combined by V1 to make a new, more complex representation but unwilling to accept that some similar process occurs beyond that. Indeed, given that neurons work by combining, with a mixture of excitation and inhibition, the outputs of the preceding layer (and potentially feedback from higher layers), it seems that some form of combination of outputs must occur. The problem, then, is “simply” to understand what form of combination that takes: what the representation looks like, how many units' signals are combined in any one step, and how the combinatorial explosion is averted.
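
To make the scale of the mismatch concrete, the arithmetic behind the argument above can be sketched in a few lines of Python. This is only an illustration under a deliberately naive assumption that is not a claim about real circuitry: each downstream unit is taken to encode exactly one unordered combination of k inputs. The function name `units_needed` is hypothetical, and the cell counts are the approximate figures quoted in the text.

```python
from math import comb

# Approximate counts quoted in the text.
N_GANGLION = 1_500_000    # retinal ganglion cells (Hecht, 2001)
N_V1 = 140_000_000        # V1 neurons (Wandell, 1995)


def units_needed(n_inputs: int, k: int) -> int:
    """Units required if each unit encodes one unordered combination
    of k inputs (an illustrative assumption, not a model of cortex)."""
    return comb(n_inputs, k)


# Exhaustive pairwise combination of ganglion-cell outputs.
pairwise = units_needed(N_GANGLION, 2)

# Actual expansion from ganglion cells to V1, and how far short it falls.
expansion = N_V1 / N_GANGLION
shortfall = pairwise / N_V1

print(f"pairwise combinations: {pairwise:.3e}")
print(f"actual expansion:      {expansion:.1f}x")
print(f"shortfall factor:      {shortfall:.0f}x")
```

Counting unordered pairs gives n(n - 1)/2 rather than n squared, but the growth is quadratic either way: exhaustive pairwise coding of 1.5 million inputs would need about 1.1 trillion units, several thousand times the number of neurons available in V1. Raising k makes the gap far worse, which is why the open questions are how many signals are combined at each step and which combinations are actually represented.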