**Abstract**:

**Abstract**
**The** **first two areas of the primate visual cortex (V1, V2) provide a paradigmatic example of hierarchical computation in the brain.** **However, neither the functional properties of V2 nor the interactions between the two areas are well understood. One key aspect is that the statistics of the inputs received by V2 depend on the nonlinear response properties of V1. Here, we focused on divisive normalization, a canonical nonlinear computation that is observed in many neural areas and modalities. We simulated V1 responses with (and without) different forms of surround normalization derived from statistical models of natural scenes, including canonical normalization and a statistically optimal extension that accounted for image nonhomogeneities. The statistics of the V1 population responses differed markedly across models. We then addressed how V2 receptive fields pool the responses of V1 model units with different tuning. We assumed this is achieved by learning without supervision a linear representation that removes correlations, which could be accomplished with principal component analysis. This approach revealed V2-like feature selectivity when we used the optimal normalization and, to a lesser extent, the canonical one but not in the absence of both. We compared the resulting two-stage models on two perceptual tasks; while models encompassing V1 surround normalization performed better at object recognition, only statistically optimal normalization provided systematic advantages in a task more closely matched to midlevel vision, namely figure/ground judgment. Our results suggest that experiments probing midlevel areas might benefit from using stimuli designed to engage the computations that characterize V1 optimality.**

*flexible*surround model because the surround normalizes the center only to the degree that the inputs to the center and surround are inferred to be statistically dependent. We compared this to canonical surround normalization, which corresponds to approximate inference: The center and surround are assumed always dependent and thus always divisively normalize each other. In addition, we included a descriptive model of complex cells that was not matched to scene statistics and did not involve any normalization.

*flexible*divisive normalization explains a large range of experimental findings on V1 surround modulation, including some that are not captured by the canonical divisive normalization (Coen-Cagli et al., 2012). In the next section, we formalize this intuition with a Bayesian mixture model: Neural responses still reflect inferences in the model, but now there are two steps of inference. For any given input image, the model uses Bayesian inference first to estimate the extent to which the RF outputs are dependent or independent and then to estimate the neural response, which includes surround normalization only in the former case, not the latter.

*flexible surround*divisive normalization model, the

*canonical surround*divisive normalization model, and the complex cell energy model (denoted

*no surround*in the following). We start by describing the bank of linear filters we used to model center and surround RFs; then we describe the typical dependencies between RF outputs to natural images and provide details of the generative model. Then we present, for each of the three models above, the equations for the neural response to a given input image, and we explain how they correspond to optimal or suboptimal inferences. We note that the V1 modeling is not novel to this paper, and all the technical aspects were covered in detail in Schwartz et al. (2006) and Coen-Cagli et al. (2009, 2012). Nonetheless, for completeness, we provide in this section (and illustrate in Figure 2), the equations that are necessary to understand the mechanics of the V1 models; we also provide further derivations in Appendices I and II.

*θ*(possible values are 0°, 45°, 90°, 135°), phase

*φ*(possible values are 0° and 90° to form a quadrature pair), and position

*x,y*(possible values are minus six, zero, or six pixels, corresponding to a three by three grid with a six-pixel separation, centered at

*x*= 0,

*y*= 0), and we denote the output of the generic RF by

*z*. In this paper, we distinguish RFs at the center of the input image (

_{θ,φ,x,y}*x*= 0,

*y*= 0) and in the surround. For convenience, we denote the generic center RF output by and the generic surround RF output by:

**k**and

**s**

^{(θ)}for the center and surround positions, respectively. Let us also introduce two vectors of Gaussian variables, denoted by

*κ*and

*σ*

^{(θ)}, with the same dimensionality as

**k**and

**s**

^{(θ)}, respectively; the vector formed by concatenating

*κ*and

*σ*

^{(θ)}has zero mean and covariance

**C**

*. Let us then introduce a positive, scalar random variable denoted by*

_{κσ}*ν*(which we will refer to as the

*mixer*). The GSM describes the RF outputs as the variables obtained by the following multiplication:

*κ*and

*σ*

^{(θ)}are fully described by their covariance matrix

**C**

*, the joint probability distribution of the elements of*

_{κσ}**k**and

**s**

^{(θ)}now have also a higher-order dependence because the multiplication by the common mixer

*ν*scales all the variances simultaneously (for this reason, this specific form of higher-order dependence is sometimes also called variance dependence). A limitation of Equation 5 is that it assumes stationary dependencies between RF outputs. However, natural images are highly nonstationary, and more sophisticated models have been introduced to account for the fact that different parts of natural images could produce different RF dependencies (Guerrero-Colon et al., 2008; Hammond & Simoncelli, 2008; Karklin & Lewicki, 2005, 2009; Kivinen et al., 2007; Schwartz et al., 2006). In particular, Coen-Cagli et al. (2009, 2012) proposed that while image patches entirely contained within a single object may be well described by Equation 5, patches that sit at the boundary between different objects produce more independent center and surround RF outputs that are better described by where

*ν*

_{k}, ν_{s}are independent mixer variables, and the Gaussian variables

*κ*,

*σ*

^{(θ)}can be correlated only within each group (with covariance matrices

**C**

*and*

_{κ}**C**

*, respectively) but not across groups. We formalized this intuition using a MGSM (Coen-Cagli et al., 2009), with one mixture component accounting for dependent RFs (Equation 5), the other mixture component accounting for independent RFs (Equation 6), and a weighted sum of the two components accounting for the generic, intermediate case.*

_{σ}^{1}We provide in Appendix I the analytical derivation of the generative model.

**k**,

**s**

^{(θ)}}

**C**

*) and patches that do not (*

_{κσ}**C**

*,*

_{κ}**C**

*) and (b) the prior probability that any given natural image patch does actually contain a global dependence. We maximized the log-likelihood numerically; details are provided in Appendix II and Coen-Cagli et al. (2009). For the training set, we used 75,000 patches randomly sampled from a database of five images (Boats, Goldhill, Mountain, Einstein, Crowd) often used for image compression benchmarks and available online at http://neuroscience.aecom.yu.edu/labs/schwartzlab/code/standard.zip. The patches were 21 by 21 pixels, large enough to cover center and surround RFs.*

_{σ}*θ*and phase

*φ*as essentially reversing the operation that generated the higher-order dependence between

*k*and the other RFs, thus computing an estimate

_{θ,φ}*κ̄*

_{θ}_{,Φ}of the corresponding local Gaussian variable

*κ*. In principle, because the generative model is multiplicative, estimating the local Gaussian amounts to the reverse process, i.e., division by the mixer. Thus, the framework based on image statistics naturally encompasses, and extends, divisive normalization. However, for any given visual input, only the RF outputs are observed whereas it is not known whether such outputs were generated according to Equation 5 (dependent center and surround) or Equation 6 (independent center and surround) nor the actual value of the mixer variable. Therefore, one has to use Bayesian inference: First, we need to compute the probability that the observed RF outputs are indeed statistically dependent—we denote such probability by

_{θ,φ}*q*(

**k**

*,*

**s**

^{(θ)}—to emphasize that it is a function of the observed RF outputs. Second, we need to estimate the values of the mixers under the two conditions (

*ν*for the dependent case,

*ν*for the independent case) and perform the division. Eventually, we combine estimates for the two phases to obtain the complex cell response:

_{k}*flexible*surround divisive normalization (Figure 2B): where the normalization terms in the denominator are given by

*q*(

**k**,

**s**)) is derived in Appendix I, but the basic intuition is illustrated in Figure 2B: Images that are very similar (or different) in the center and surround elicit values of

*q*close to one (or zero).

**C**

*optimized to natural scenes is the fact that it produces a form of collinear facilitation in the V1 responses. Both the flexibility of the normalization pool and collinear facilitation, play an important role in shaping the statistics of the V1 responses and the selectivity that emerges downstream, which we address in the next section.*

_{κσ}*canonical*surround divisive normalization (Cavanaugh et al., 2002; Heeger, 1992). This corresponds to approximate inference in the MGSM, in which the approximation consists of (a) assuming that

*q*(

**k**

*,*

**s**

^{(θ)}) = 1 or, in other words, that center and surround RFs are always dependent and therefore to be normalized and (b) that

**C**

_{κ}*is proportional to the identity matrix rather than being optimized to natural scenes. The estimate of the Gaussian component in this case amounts simply to*

_{σ}*no surround*). In this case, we simply replace, in Equation 7, the Gaussian estimate

*κ̄*

_{θ}_{,Φ}by the corresponding RF output

*k*:

_{θ,φ}**R**(after subtracting the sample mean), a 1,296 by 10,000 matrix. PCA assumes that the correlations between the dimensions of

**R**arise from mixing linearly a set of independent sources

**Z**via an orthonormal matrix of weights

**W**, such that

**R = W · Z**and

**Z**=

**W**

^{⊤}· R. We take the columns of

**W**to define the RFs of the V2 units and

**Z**their responses to the visual inputs. The weight matrix

**W**is usually obtained by the eigenvector decomposition of the data covariance: We computed the 1,296 by 1,296 sample covariance matrix ∑ =

**RR**

^{⊤}and performed its eigenvector decomposition. The columns of

**W**are the eigenvectors of Σ or the principal components (PCs). The collection of weights of each V1 unit to a given PC thus defined the RF of the V2 unit corresponding to that PC.

*z*scoring) as is often done in the computer vision literature (Figures 6B and 7B); this however did not change the relationship between the three models.

*p*value for the hypothesis that the weight is different from zero): This measure correlated with the PC rank significantly more for the flexible surround (68% c.i. of the median correlation [.37 .41]) than for canonical ([.24 .25]) and no surround ([.20 .27]).

*intelligent*normalization. In that case, though, the normalization is trained with supervision to optimize discrimination as opposed to the unsupervised training of the MGSM.

*Journal of the Optical Society of America A**,*2

*,*284–299. [CrossRef]

*, 10 (10), 1313–1321. [CrossRef] [PubMed]*

*Nature Neuroscience**(pp. 217–234). Cambridge, MA: MIT Press.*

*Sensory communication**, 31 (43), 15310–15319. [CrossRef] [PubMed]*

*Journal of Neuroscience*

*Vision Research**,*37 (23), 3327–3338. [CrossRef] [PubMed]

*(pp. 2559–2566). San Francisco, CA: IEEE.*

*Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR'10)**, 13 (1), 51–62.*

*Nature Reviews Neuroscience**, 88 (5), 2530–2546. [CrossRef] [PubMed]*

*Journal of Neurophysiology**. 15 (pp. 215–223 ). Fort Lauderdale, FL: JMLR W&CP.*

*Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (aistats) Vol*

*Nips**,*(pp. 369–377). Cambridge, MA: MIT Press.

*, 8 (3), e1002405. [CrossRef] [PubMed]*

*PLoS Computational Biology**. Cambridge, MA: MIT Press.*

*Theoretical neuroscience: Computational and mathematical modeling of neural systems*

*Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004) Workshop**,*(pp. 178–186). Washington, DC: IEEE.

*Journal of Vision**,*7 (8): 2, 1–9, http://www.journalofvision.org/content/7/8/2, doi:10.1167/7.8.2. [PubMed] [Article] [CrossRef] [PubMed]

*Nature Neuroscience**,*16 (7), 974–981. [CrossRef] [PubMed]

*(pp. 505–512). Cambridge, MA: MIT Press.*

*Nips*

*Journal of Comparative Neurology**,*201

*,*519–539. [CrossRef] [PubMed]

*(pp. 565–568). San Diego: IEEE.*

*Proceedings of the 15th IEEE International Conference on Image Processing**, 17 (11), 2089–2101. [CrossRef] [PubMed]*

*IEEE Transactions on Image Processing**, 50 (22), 2295–2307. [CrossRef] [PubMed]*

*Vision Research*

*Visual Neuroscience**,*9

*,*181–198. [CrossRef] [PubMed]

*, 20 (RC61), 1–6. [PubMed]*

*Journal of Neuroscience*

*Trends in Cognitive Sciences**,*11

*,*428–434. [CrossRef] [PubMed]

*, 42 (12), 1593–1605. [CrossRef] [PubMed]*

*Vision Research*

*BMC Neuroscience**,*6 (1), 12–24. [CrossRef] [PubMed]

*, 24 (13), 3313–3324. [CrossRef] [PubMed]*

*Journal of Neuroscience**(pp. 2146–2153). Kyoto, Japan: IEEE.*

*Proceedings of the International Conference on Computer Vision (ICCV'09)**, 457 (7225), 83–86. [CrossRef] [PubMed]*

*Nature*

*Neural Computation**,*17

*,*397–423. [CrossRef] [PubMed]

*(pp. 1–8). Rio de Janeiro, Brazil: IEEE.*

*Proceedings of the International Conference on Computer Vision (ICCV'07)*

*Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004)**,*Washington, DC: IEEE.

*. Cambridge, MA: MIT Press.*

*Advances in neural information processing systems 20**19 ( pp. 897–904). Cambridge, MA: MIT Press.*

*Advances in neural information processing systems**( pp. 1–5). Nashville, TN: IEEE.*

*Symposium on Computational Intelligence for Multimedia Signal and Vision Processing**, 70 (10–12), 1920–1924. [CrossRef]*

*Neurocomputing**, 89 (1–3), 273–279. [CrossRef] [PubMed]*

*Biosystems*

*Nature**,*381

*,*607–609. [CrossRef] [PubMed]

*, 18 (2), 381–414. [CrossRef] [PubMed]*

*Neural Computation**(pp. 786–792).*

*Nips**, 4 (1), e27. [CrossRef] [PubMed]*

*PLoS Computational Biology**, 40 (1), 49–70. [CrossRef]*

*International Journal of Computer Vision**(pp. 614–627). Graz, Austria: Springer.*

*Proceedings of the European Conference on Computer Vision*

*Nature Neuroscience**,*2

*,*1019–1025. [CrossRef] [PubMed]

*Journal of Neuroscience**,*22 (13), 5639–5651. [PubMed]

*, 46 (6), 945–956. [CrossRef] [PubMed]*

*Neuron*

*Journal of Vision**,*1 (13): 25, 1–24, http://www.journalofvision.org/content/13/1/25, doi:10.1167/1.13.25. [PubMed] [Article] [CrossRef]

*Journal of Vision**,*9 (4): 19, 1–20, http://www.journalofvision.org/content/9/4/19, doi:10.1167/9.4.19. [PubMed] [Article] [CrossRef] [PubMed]

*, 18 (11), 2680–2718. [CrossRef] [PubMed]*

*Neural Computation**, 4 (8), 819–825. [CrossRef] [PubMed]*

*Nature Neuroscience*

*Journal of Physiology Paris**,*97 (4–6), 453–474. [CrossRef]

*(pp. 1–8). Anchorage, AK: IEEE.*

*Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR ‘08)**(pp. 1273–1280). Cambridge, MA: MIT Press.*

*Nips**, 102 (4), 2069–2083. [CrossRef] [PubMed]*

*Journal of Neurophysiology**, 38 (2), 587–607. [CrossRef]*

*IEEE Transactions on Information Theory*

*Annual Review of Neuroscience**,*24 (1), 1193–1216. [CrossRef] [PubMed]

*Neural Computation**,*60–103.

*, 7 (7), e39857. [CrossRef] [PubMed]*

*PLoS One*

*Neural Computation**,*24

*,*867–894. [CrossRef] [PubMed]

*, 11 (1), 89–123. [CrossRef]*

*Applied and Computational Harmonic Analysis*

*Journal of Neuroscience**,*30

*,*2102–2114. [CrossRef] [PubMed]

*Journal of Neuroscience**,*30

*,*6482–6496. [CrossRef] [PubMed]

*Neuron**,*47

*,*143–153. [CrossRef] [PubMed]

*, 17, 301–334. [CrossRef]*

*Network: Computation in Neural Systems*

*Journal of Neuroscience**,*20

*,*6594–6611. [PubMed]

^{1}Note that the binary mixture is an approximation to the full complexity of natural images because image patches for which the center and surround are independent may contain different structures in the surround (e.g., occlusions or adjacency or transparency involving multiple objects and textures). More complex models have addressed such heterogeneity (e.g., Karklin & Lewicki, 2009; Schwartz et al., 2006). Here instead we have chosen to explore the consequences of a much simpler, and biologically more plausible, implementation that can still capture a vast repertoire of effects of V1 surround modulation.

**k, s**

^{(θ)}) as a probabilistic mixture of two base distributions. In the section on models of statistical dependencies between V1 RFs, we introduced the two mixture components, in which RFs are defined by Equations 5 and 6, respectively, but we did not write down explicitly the MGSM distribution. To do so formally, we introduce a binary

*assignment*variable

*ξ*with prior distribution

*p*(

*ξ*), such that

*ξ = ξ*

_{1}corresponds to the first mixture component and

*ξ = ξ*

_{2}to the second; the joint distribution of

**k, s**

^{(θ)}can be written as

**k, s**

^{(θ)}, given the mixers, are Gaussian by definition of GSM, and the mixers are assumed Rayleigh distributed, which leads to the analytical solution of the integrals: where is the modified Bessel function of the second kind;

*m*=

_{k}*dim*(

**k**);

*m*=

_{s}*dim*(

**s**

^{(θ)});

*m*=

*m*+

_{k}*m*; the terms

_{s}*λ*(

**k**,

**s**

^{(θ)}) and

*λ*(

**k**) are defined in Equations 9 and 10 respectively; and

*p*(

*ξ*|

_{1}**k**,

**s**

^{(θ)}) that the surround shares a common mixer with the center is inferred (in the main text, we denoted it, for simplicity,

*q*(

**k**,

**s**

^{(θ)})); this is proportional via Bayes rule to Equation 16. Then, we compute the expected value of the Gaussian component

*κ̄*

_{θ}_{,Φ}

*κ̄*(which we relate to V1 neural responses via Equation 7). where the conditional expected values under

_{θΦ}*ξ*=

*ξ*

_{1}and

*ξ*=

*ξ*

_{2}are obtained directly by solving the integrals:

*p*(

*ξ*), and the covariances Θ

**C**

*,*

_{κσ}**C**

*,*

_{κ}**C**

*}. To find the optimal parameter values, we used a generalized expectation maximization algorithm. Expectation maximization amounts to maximizing a lower bound on the log-likelihood of the data (the RF outputs to a large collection of natural image patches as described in the section on models of statistical dependencies between V1 RFs) by iterating two steps until convergence: First, given the current parameters' values, approximate the true posterior distribution of the hidden variable*

_{σ}*p*(

*ξ*|

**k, s**

^{(θ)}); second, maximize the so called complete-data log-likelihood, namely, the expectation of log [

*p*(

**k, s**

^{(θ)},

*ξ*)] under the posterior of

*ξ*. In the E step, we use Bayes rule to compute the estimate,

*Q*, of the posterior distribution over

*ξ*, given the observed RF outputs and the previous estimates of the parameters, which we denote by

*p*(

*ξ*

_{1})

*and Θ*

^{old}*:*

^{old}*∂f*/

*∂p*(

*ξ*

_{1}) = 0, we obtain

*p*(

*ξ*

_{1})

^{★}

*f*] = Q(ξ

_{1}). For the other terms, we obtained analytical expressions for the gradient, e.g., (similar expressions hold for the other partial derivatives) but could not solve the maximization analytically and adopted a numerical procedure to maximize

*f*with reference to the covariance matrices. We used partial M steps for each covariance matrix separately (i.e., expectation conditional maximization).