The tilt illusion is a paradigmatic example of contextual influences on perception. We analyze it in terms of a neural population model for the perceptual organization of visual orientation. In turn, this is based on a well-founded treatment of natural scene statistics, known as the Gaussian Scale Mixture model. This model is closely related to divisive gain control in neural processing and has been extensively applied in the image processing and statistical learning communities; however, its implications for contextual effects in biological vision have not been studied. In our model, oriented neural units associated with surround tilt stimuli participate in divisively normalizing the activities of the units representing a center stimulus, thereby changing their tuning curves. We show that through standard population decoding, these changes lead to the forms of repulsion and attraction observed in the tilt illusion. The issues in our model readily generalize to other visual attributes and contextual phenomena, and should lead to more rigorous treatments of contextual effects based on natural scene statistics.

*which* sensory stimuli are treated collectively, and *what* consequences there are for each stimulus in being either aggregated with, or separated from, the others (e.g., Behrmann, Kimchi, & Olson, 2003; Wertheimer, 1923). One central assumption is that the logic of this organization depends on the statistics of natural images, through phylogenetic and/or ontogenetic programming (e.g., Attneave, 1954; Barlow, 1961; Brunswik & Kamiya, 1953; Elder & Goldberg, 2002; Geisler, Perry, Super, & Gallogly, 2001; Krüger, 1998; Sigman, Cecchi, Gilbert, & Magnasco, 2001). In this paper, we consider how the tilt illusion may arise as a downstream consequence of perceptual organization.

*away* from the orientation of a surround grating that forms its visual context (Figure 1a, left). Under particular circumstances, one can also observe a weaker but consistently attractive effect (called the indirect tilt illusion) toward the surround orientation (Figure 1a, right). The tilt illusion is an appealing target for elucidating perceptual organization since its psychological (e.g., Goddard, Clifford, & Solomon, 2008; Wenderoth & Johnstone, 1988; Westheimer, 1990, and see Figures 1a and 1b) and neural (e.g., Cavanaugh, Bair, & Movshon, 2002; Felsen, Touryan, & Dan, 2005; Gilbert & Wiesel, 1990; Li, Thier, & Wehrhahn, 2000; Sengpiel, Sen, & Blakemore, 1997, and see Figure 1c) bases have been extensively probed. The illusion has also been a focus of theoretical interest (e.g., Clifford, Wenderoth, & Spehar, 2000; Gibson & Radner, 1937; Schwartz, Hsu, & Dayan, 2007; Series, Lorenceau, & Frégnac, 2003; Solomon & Morgan, 2006). Indeed, it has long been known that simple, mechanistic changes to tuning curves at the neural level might underlie observed contextual biases in the tilt illusion (and also in aftereffects, a related phenomenon associated with adaptation, e.g., Jin, Dragoi, Sur, & Seung, 2005; Kohn, 2007; Kohn & Movshon, 2004; Teich & Qian, 2003). However, why the surround stimulus might lead to these tuning curve changes, and the relationship between this and the statistics of natural scenes, are not clear.

*learning* the parameters of the GSM model (e.g., as in Schwartz et al., 2006b). The new goal here is rather to study the implications of the GSM framework, for both the neural and perceptual levels. This work sets the basis for future studies incorporating learning (and adaptation) of the model parameters from natural scenes. We expect this framework to be applicable to studying a wealth of other neural and perceptual contextual data (e.g., Clifford & Rhodes, 2005; Li, 1999; Schwartz et al., 2007) in a more rigorous manner based on scene statistics.

$f_c$ by a gain control signal $\gamma_{\text{gain}}$ (Albrecht et al., 1984; Heeger, 1992), giving the ratio $f_c / \gamma_{\text{gain}}$. Here $f_c$ is computed from the unit's receptive field tuning function $F_c$, sometimes called its filter ($c$ standing for center), and the stimulus; $\gamma_{\text{gain}}$ is determined by the rectified (or squared) feedforward activations of other units in center and surround locations. In most models, $\gamma_{\text{gain}}$ also incorporates an additive constant to prevent the denominator from tending to zero. The output of the model is nonlinear due to both the rectification and division. This form of model is often denoted “divisive normalization” or “divisive gain control.”

$f_c$ of multiple units that arises because of the particular statistical structure of natural scenes. The main idea is that filters with similar orientation preferences, representing nearby spatial locations in a scene, have striking statistical dependencies, and that divisive gain control can reduce these dependencies (e.g., Schwartz & Simoncelli, 2001b). Figure 2a shows an example of a natural scene taken from the Berkeley database (Martin, Fowlkes, Tal, & Malik, 2001, which we chose for reasons that will become clear later); and Figure 2b shows the characteristic form of the statistical dependency.

$f_c$, given the activation of a non-overlapping vertically oriented filter $f_s$ in a surrounding, contextual, location. The filter activations, $f_c$ and $f_s$, are computed by convolving each receptive field tuning function ($F_c$ and $F_s$) with a set of natural scenes (where the receptive fields are spatially displaced relative to one another), and accumulating joint statistics. The figure shows that the magnitudes of $f_c$ and $f_s$ are coordinated in a rather straightforward manner (Schwartz & Simoncelli, 2001b).

*hidden* in that they are not transparently evident in the observed activations of the filters; rather, a (nonlinear) operation must be applied to the image to estimate or recognize them. In our case, we consider a generative model with a particular functional form that gives rise to the observed statistical correlations, creating coordinated values equivalent to $f_c$ and $f_s$ (of Figure 2b). As we will see, the nonlinear recognition operation involves the equivalent of a form of divisive gain control. We propose that the resulting nonlinear model should form the basis for the neural unit model.

$l_c$ and $l_s$, that have the same form of statistical coordination as the filter activations $f_c$ and $f_s$ (which are observed in natural scenes between filters corresponding to a center and surround location, as in Figure 2b). According to the GSM:

$$l_c = g_c\,v$$

That is, $l_c$ is given by multiplying together two independent random variables. One, $g_c$, is drawn from a Gaussian distribution, and represents the local form at a point in the image. The other, $v$, is positive, and is called a mixer. Similarly, the model generates $l_s$ according to:

$$l_s = g_s\,v$$

Here $g_s$ is the local form of the surround (another Gaussian variable, which we assume to be independent of $g_c$), multiplied by the *same* common mixer $v$. Since the center and surround share a common mixer, the generated filter activations are statistically coordinated. For instance, the samples of $g_s$ and $g_c$ are independent, but if the corresponding sample of $v$ is low, the absolute values of the products $|l_s| = |g_s v|$ and $|l_c| = |g_c v|$ will tend to be low together, and similarly if $v$ is high, these will have a larger probability of being high together. For many such samples, we plot the coordinated statistics of the magnitudes of $l_c$ and $l_s$ generated according to the GSM model (Figure 2c). Note the similarity to the observed empirical statistics of the absolute filter activations $|f_c|$ and $|f_s|$ in Figure 2b.
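The coordination produced by a shared mixer can be illustrated with a small simulation. The sketch below is only an illustration, not the procedure used to produce Figure 2. It assumes a Rayleigh-distributed mixer (the text above only requires $v$ to be positive) and unit-variance Gaussians; the helper names `sample_gsm` and `pearson` are hypothetical.

```python
import math
import random

def sample_gsm(n_samples, seed=0):
    """Draw (l_c, l_s, g_c, g_s) tuples from a two-filter GSM with a shared mixer."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Assumption: Rayleigh mixer, v = sqrt(-2 ln U); the text only asks for v > 0.
        v = math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        g_c = rng.gauss(0.0, 1.0)  # local Gaussian form at the center
        g_s = rng.gauss(0.0, 1.0)  # local Gaussian form in the surround (independent)
        samples.append((g_c * v, g_s * v, g_c, g_s))
    return samples

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

samples = sample_gsm(20000)
# Magnitudes of the filter activations share the mixer, so they are correlated.
corr_l = pearson([abs(s[0]) for s in samples], [abs(s[1]) for s in samples])
# Magnitudes of the underlying Gaussian components are not.
corr_g = pearson([abs(s[2]) for s in samples], [abs(s[3]) for s in samples])
print(f"corr(|l_c|, |l_s|) = {corr_l:.3f}")
print(f"corr(|g_c|, |g_s|) = {corr_g:.3f}")
```

Under these assumptions the first correlation is clearly positive and the second is close to zero, which is the qualitative pattern described for the generated activations and for the Gaussian components once the mixer is removed.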

$g_c = l_c / v$ and $g_s = l_s / v$: dividing $l_c$ and $l_s$ by the common mixer variable, we are left with two Gaussian variables that are independent (Figure 2d). We assume that the goal of a model unit in the center location is to perform exactly this operation, calculating $g_c$ by estimating and dividing out $v$. This operation depends on the assumption that $l_c$ and $l_s$ are indeed statistically coordinated, an assumption that we will later consider in greater detail.

$g_c$ directly and analytically, given the linear filter activations $l_c$ and $l_s$. The expected value $E[g_c \mid l_c, l_s]$ is taken to be the nonlinear response corresponding to a center model unit, and amounts to a form of divisive gain control. It is given by (Schwartz et al., 2006b; see also the Appendix for a detailed derivation):

Here $l = \sqrt{l_c^2 + l_s^2 + k}$, and $k$ is a small additive constant, which sets a minimal gain when the filter activations are zero, preventing the denominator from vanishing. We have thus far assumed a single surround filter; if there are more surround filters, then we add their squared activations to the gain. The Appendix gives the full form of Equation 4 and its derivation, including the constant of proportionality, which depends on $l$ and on $n$, the number of center and surround filters comprising $l$, but not directly on the individual values of $l_c$ and $l_s$. As we will see in the Methods section, the expectation in Equation 4 will form the basis for our neural unit model.

$E[g_c \mid l_c, l_s]$ as a function of the surround activation $l_s$. Specifically, we fix the parameters and the value of $l_c$ in Equation 4, and show how $E[g_c \mid l_c, l_s]$ changes as a function of $l_s$. Figure 3 shows that the estimated model response $E[g_c \mid l_c, l_s]$ decreases as $l_s$ increases. This results in surround suppression, whereby a stronger surround (and hence stronger divisive gain control) decreases the response of the unit.
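The suppression can be checked numerically without the closed-form expression. The sketch below approximates $E[g_c \mid l_c, l_s] = l_c\,E[1/v \mid l_c, l_s]$ by integrating over the mixer on a grid. It rests on several assumptions that are not stated in the passage above: a Rayleigh prior on $v$, unit-variance Gaussians, two filters in the gain pool, and the additive constant $k$ entering alongside the squared activations. The function name `posterior_mean_g` and the grid settings are hypothetical.

```python
import math

def posterior_mean_g(l_c, l_s, k=0.125, v_max=12.0, n_grid=4000):
    """Approximate E[g_c | l_c, l_s] = l_c * E[1/v | l_c, l_s] on a grid over v."""
    l_sq = l_c ** 2 + l_s ** 2 + k  # squared gain signal, including the constant k
    dv = v_max / n_grid
    num = 0.0
    den = 0.0
    for j in range(n_grid):
        v = (j + 0.5) * dv  # midpoint rule
        # Unnormalized posterior over v for two filters:
        # assumed Rayleigh prior v*exp(-v^2/2) times likelihood v^-2 * exp(-l_sq/(2 v^2)).
        w = math.exp(-0.5 * v * v - 0.5 * l_sq / (v * v)) / v
        num += w / v
        den += w
    return l_c * num / den

l_c = 1.0
surround_levels = (0.0, 0.5, 1.0, 2.0)
responses = [posterior_mean_g(l_c, l_s) for l_s in surround_levels]
for l_s, r in zip(surround_levels, responses):
    print(f"l_s = {l_s:.1f}  ->  E[g_c | l_c, l_s] ~ {r:.3f}")
```

With the center activation held fixed, the estimate falls monotonically as the surround activation grows, because a larger gain signal shifts the posterior over $v$ toward larger values.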

$l_s$ only affects $g_c$ if center and surround share the same gain pool. In the model, we set the gain pool to provide the nature of coordination that is actually seen in natural scenes. Analysis of natural scene statistics suggests that filters of similar orientations at different spatial positions will be coordinated to a degree that depends on spatial and orientation differences (Schwartz & Simoncelli, 2001b), an expectation embodied in recent work on the GSM and related models (Karklin & Lewicki, 2005; Schwartz et al., 2006b). Here, we consider this scheme of perceptual organization, and its implications for neural processing and perception in the tilt illusion. We first assume that center and surround filters of the same orientation share a common mixer, and so are in the same gain control pool. Later we relax this assumption.

$\phi_{ci}$ is the preferred orientation of center unit $i$ in the population; $\theta_c$ is a given center stimulus orientation, and $\phi_{ci} - \theta_c$ is the circular difference (modulo 180°, since orientation is circular over 180° rather than 360°) between the preferred orientation of the center unit and the center stimulus orientation. The term $\omega$ (set to 22° in our simulations) determines the width of the tuning curves.

$l_{ci}$ (as in Equation 5), as well as filters in the surround locations, whose linear front-end activations $l_{si}$ in response to the surround stimulus are determined as in Equation 5 according to the difference between the surround stimulus orientation and the surround filter preferred orientation:

$\phi_{si}$ is the preferred orientation of surround unit $i$ in the population; $\theta_s$ is a given surround stimulus orientation, and $\phi_{si} - \theta_s$ is the circular difference between the preferred orientation of the surround unit and the surround stimulus orientation. The width of the surround tuning $\omega$ is set to 22°, as for the center tuning function. We simplify the treatment of the surround, notably by ignoring its spatial extent and parameterizing surround units only by their preferred orientations. We assume that there are multiple surround filters with preferred orientation $\phi_{si}$, each responding equally to the surround stimulus. Inclusion of more surround filters in the simulations will lead to stronger surround influence on the nonlinear gain control. We also assume, based on the statistical coordination described above, that the gain control pool for unit $i$ is set by center and surround filter activations with the same orientation preference $\phi_{si} = \phi_{ci}$.
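The front-end activations can be sketched as follows. The exact parameterization is an assumption: the code uses a Gaussian of the circular orientation difference with peak value 1 and treats $\omega$ as the standard deviation in degrees. The helper names `circ_diff` and `tuning` are hypothetical.

```python
import math

def circ_diff(a_deg, b_deg):
    """Signed circular difference a - b for orientations, wrapped to [-90, 90)."""
    return (a_deg - b_deg + 90.0) % 180.0 - 90.0

def tuning(pref_deg, stim_deg, omega=22.0):
    """Idealized Gaussian tuning of a linear filter activation (peak of 1 assumed)."""
    d = circ_diff(pref_deg, stim_deg)
    return math.exp(-d * d / (2.0 * omega ** 2))

# Population of preferred orientations, shared by center and surround units.
prefs = [float(p) for p in range(0, 180, 10)]
l_c = [tuning(p, 20.0) for p in prefs]  # center stimulus at 20 deg
l_s = [tuning(p, 0.0) for p in prefs]   # surround stimulus at 0 deg

for p, c, s in zip(prefs, l_c, l_s):
    print(f"pref = {p:5.1f}  l_c = {c:.3f}  l_s = {s:.3f}")
```

Pairing `l_c[i]` with `l_s[i]` mirrors the assumption that the gain pool for unit $i$ is formed by center and surround filters with the same preferred orientation.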

$i$ in the population is determined by the divisive gain control process. Specifically, the output for each center unit $i$ is given by the GSM model of Equation 4, resulting in estimates $E[g_{ci} \mid l_{ci}, l_{si}]$. For convenience, we re-write this equation for each unit $i$ in the population, this time including the constant of proportionality (see the Appendix):

$l_{ci}$ and $l_{si}$ are given by Equations 5 and 6; the divisive term is $l = \sqrt{l_{ci}^2 + l_{si}^2 + k}$; $n = 2$ (because we assume 1 center filter, and the influence of the surround filters is set to 1 in our simulations, which could be thought of as multiple filters with an overall weighting of 1); the additive constant is set to $k = 0.125$; and the last factor in the product (in brackets) involves what is known as a modified Bessel function of the second kind (see also Grenander & Srivastava, 2002).

$i$ in Equation 7 as simply $g_i$, thus dropping the expected value and the $c$ subscript corresponding to center (since the perceptual task only demands decoding at this center location).

$\omega$; the number of center and surround filters $n$; and the additive constant $k$. These parameters affect the strength of the results, but not their qualitative nature.

$g_i$ of the population of center units into an estimate of the angle of the center stimulus, we use a standard population vector decoding scheme (Georgopoulos et al., 1986). According to this, the center angle is given by

$$\hat{\theta}_c = \tfrac{1}{2}\,\operatorname{angle}\Big(\sum_i g_i\,\mathbf{u}(2\phi_i)\Big)$$

where $\phi_i$ is the preferred angle of unit $i$, and $\mathbf{u}(\phi)$ is a two-dimensional unit vector pointing in the direction of $\phi$ (and the doubling takes account of the orientation circularity).
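A minimal sketch of population vector decoding with doubled angles follows. The function name `population_vector_decode` and the example population (a symmetric Gaussian profile with an assumed 22° width, peaked at 20°) are illustrative choices rather than the simulation code behind the figures.

```python
import math

def population_vector_decode(responses, prefs_deg):
    """Decode an orientation in [0, 180) from unit responses and preferred angles."""
    # Doubling the angles maps the 180-degree orientation circle onto a full circle.
    x = sum(g * math.cos(math.radians(2.0 * p)) for g, p in zip(responses, prefs_deg))
    y = sum(g * math.sin(math.radians(2.0 * p)) for g, p in zip(responses, prefs_deg))
    # Halve the angle of the summed vector to return to orientation space.
    return (0.5 * math.degrees(math.atan2(y, x))) % 180.0

prefs = [float(p) for p in range(180)]
# Hypothetical symmetric population response peaked at 20 deg.
responses = [
    math.exp(-(((p - 20.0 + 90.0) % 180.0 - 90.0) ** 2) / (2.0 * 22.0 ** 2))
    for p in prefs
]
decoded = population_vector_decode(responses, prefs)
print(f"decoded angle: {decoded:.3f} deg")
```

For a response profile that is symmetric about the stimulus orientation, the decoded angle coincides with that orientation; any asymmetry introduced by gain control shifts the estimate.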

$l_{si} = 0$) affecting the linear activation of these units (from Equation 5) are the same.

*X* axis). With the surround, the response is asymmetric about 20° because of the suppressive effects of the gain control on the units near 0°. Using population vector decoding, the inferred angle is 22.4° (filled, black arrow on the *X* axis). The difference between the perceived and presented angle of the center stimulus is therefore 2.4°, repulsively biased away from the 0° surround. This realizes the direct tilt illusion (e.g., Wenderoth & Johnstone, 1988).
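The chain from tuning curves through gain control to decoding can be sketched end to end. This is a rough illustration under stated assumptions, not a reproduction of the simulation: it uses Gaussian tuning with unit peak, a Rayleigh prior on the mixer, numerical integration in place of the closed form of Equation 7, and units spaced every 5°. All function names are hypothetical, and the sketch is not expected to return exactly the 22.4° reported above.

```python
import math

def circ_diff(a_deg, b_deg):
    """Signed circular orientation difference, wrapped to [-90, 90)."""
    return (a_deg - b_deg + 90.0) % 180.0 - 90.0

def tuning(pref_deg, stim_deg, omega=22.0):
    """Assumed Gaussian front-end tuning with unit peak."""
    d = circ_diff(pref_deg, stim_deg)
    return math.exp(-d * d / (2.0 * omega ** 2))

def posterior_mean_g(l_c, l_s, k=0.125, v_max=12.0, n_grid=2000):
    """Grid approximation of l_c * E[1/v | l_c, l_s] (Rayleigh mixer prior assumed)."""
    l_sq = l_c ** 2 + l_s ** 2 + k
    dv = v_max / n_grid
    num = den = 0.0
    for j in range(n_grid):
        v = (j + 0.5) * dv
        w = math.exp(-0.5 * v * v - 0.5 * l_sq / (v * v)) / v
        num += w / v
        den += w
    return l_c * num / den

def decode(responses, prefs_deg):
    """Population vector decoding with doubled angles."""
    x = sum(g * math.cos(math.radians(2.0 * p)) for g, p in zip(responses, prefs_deg))
    y = sum(g * math.sin(math.radians(2.0 * p)) for g, p in zip(responses, prefs_deg))
    return (0.5 * math.degrees(math.atan2(y, x))) % 180.0

prefs = [float(p) for p in range(0, 180, 5)]
theta_c, theta_s = 20.0, 0.0
l_c = [tuning(p, theta_c) for p in prefs]
l_s = [tuning(p, theta_s) for p in prefs]

# Without a surround, the surround activations are zero for every unit.
alone = decode([posterior_mean_g(c, 0.0) for c in l_c], prefs)
# With the surround, each unit is normalized by its same-orientation surround filter.
with_surround = decode([posterior_mean_g(c, s) for c, s in zip(l_c, l_s)], prefs)

print(f"decoded without surround:    {alone:.2f} deg")
print(f"decoded with 0 deg surround: {with_surround:.2f} deg")
```

Units tuned between the surround and center orientations receive the strongest suppression, so the decoded angle moves away from the surround, which is the direction of the direct tilt illusion.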

*within*); the statistics follow the same correlation structure as in Figure 2b. By contrast, Figures 7d and 7e show the statistical relationship between vertically-oriented filters taken from patches of scenes *across* segmentation boundaries. It is apparent that the coordination is greatly reduced, implying that the surround filter does not provide evidence about the mixer ($v$) for the center, and therefore should not form part of the gain pool for the center. This statistical coordination within versus across segments in natural scenes has not been previously shown empirically, nor incorporated in neural models. As before, we have included in the caption of Figure 7 the technical details for creating the scene statistics figure, but these are not critical for understanding the methods for the simulations which follow below.

$l_{ci}$ and $l_{sj}$. However, the basic effect, which we use in a simple and abstract form, is the obvious one that the closer the center and surround angles, the more likely they are to be part of the same visual object. Figures 7c and 7f show a coarse version of this distribution for the *within* and *across* conditions taken from the Berkeley segmentation data themselves.

$i$ with preferred angle $\phi_{ci}$ representing the center stimulus, treats a surround stimulus of angle $\theta_s$ as being part of the same visual object with probability:

$\lambda$ is a parameter that controls the steepness of this selection (set to $\lambda =$

$i$ would be $E[g_{ci} \mid l_{ci}]$, given just the observation at the center $l_{ci}$ (i.e., the gain signal comprises only the center filter activation $l_{ci}$). If center and surround are coordinated, then, as before, the activity would be $E[g_{ci} \mid l_{ci}, l_{si}]$, given both observations (i.e., the gain signal is given by both the center filter activation $l_{ci}$ and the activation of the surround filters $l_{si}$). We assume that the net response is just the average of these two means, weighted by their respective posterior probabilities:

$$p\,E[g_{ci} \mid l_{ci}, l_{si}] + (1 - p)\,E[g_{ci} \mid l_{ci}]$$
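The weighting can be sketched as a simple mixture. The function `mixed_response` follows the description above directly. The form of the same-object probability is an assumption made purely for illustration, since it is not reproduced here: `same_object_probability` uses a Gaussian falloff in the circular orientation difference with a hypothetical width `lam`, chosen only so that nearby orientations are more likely to be grouped.

```python
import math

def same_object_probability(pref_deg, theta_s_deg, lam=30.0):
    """ASSUMED stand-in for Equation 9: Gaussian falloff with hypothetical width lam."""
    d = (pref_deg - theta_s_deg + 90.0) % 180.0 - 90.0  # circular difference
    return math.exp(-d * d / (2.0 * lam ** 2))

def mixed_response(p_same, g_with_surround, g_center_only):
    """Average of the two conditional means, weighted by their probabilities."""
    return p_same * g_with_surround + (1.0 - p_same) * g_center_only

# Illustrative values only: a unit whose response is 0.8 when the surround is
# in its gain pool and 1.2 when the gain pool contains the center alone.
for pref in (0.0, 30.0, 60.0, 90.0):
    p = same_object_probability(pref, theta_s_deg=0.0)
    print(f"pref = {pref:4.1f}  p = {p:.3f}  net = {mixed_response(p, 0.8, 1.2):.3f}")
```

When `p_same` is 1 the unit behaves as in the earlier simulations with the surround in its gain pool, and when it is 0 the surround has no influence on the unit.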

$p$ set according to Equation 9. As before (compare to Figure 5b), the presence of the surround stimulus reduces the peak height of the tuning curve when the preferred orientation is equal to that of the surround. However, when the preferred orientation is farther from the surround, the response is actually enhanced.

- the center units share gain pools with surround units having the same preferred orientations;
- when the surround stimulus is far from the center stimulus, it only weakly activates the surround units with similar preferred orientations to the most activated center units;
- critically, if a surround unit has a weak activation and is in the gain pool for a center unit, then it provides evidence that the common mixer variable $v$ is small, and that therefore the $g_{ci}$ underlying $l_{ci}$ is large.

$v$ for the center, and therefore does not boost $g_{ci}$. This boosting therefore waxes as the difference between center and surround orientations grows, but then wanes, since the weighting term, $p$, gets smaller when the difference is very large. Although this trend has not been documented systematically in neural data, boosting for far angle separations of center and surround has been observed physiologically (e.g., Levitt & Lund, 1997). Interestingly, the mechanistic disinhibition model of Dragoi and Sur (2000) shows a similar trend of waxing and waning.

*X* axis). With the surround, the response is asymmetric about 70° (but in exactly the opposite direction of Figure 6a), leading to an inferred angle of 69.41° (filled, black arrow on the *X* axis). The difference between the perceived and presented center angle is now −0.59°. This attraction to the surround orientation is an example of the (weaker) indirect tilt illusion (e.g., Wenderoth & Johnstone, 1988).

$p$ in Equation 10 leads to a stronger waxing and waning of tuning curve height for surround orientations far from the preferred orientation of the neuron (of Figure 8), and thus to stronger attraction. Also, the orientations at which these happen are influenced by the widths of the idealized Gaussian tuning curves in Equations 5 and 6 (which, as noted in the Methods section, were chosen as 22° in the simulations). Changes in the tuning widths themselves due to the GSM model are more pronounced as the additive constant $k$ in Equation 7 decreases. We do not obtain shifts in tuning curve, although repulsive shifts could arise by incorporating mechanisms similar to the neural adaptation model of Wainwright, Schwartz, and Simoncelli (2002). The strength of the bias in attraction and repulsion is also influenced by the choice of decoding method (Jin et al., 2005). Although the qualitative results are similar, as indeed might be expected, a decoding rule based on the maximally activated unit in the population leads to stronger biases than the population vector method.

$n = 3$ and $k = 0.2$ (see also Figure 3, dashed line); center and surround tuning widths of 20° and 27° respectively; and $\lambda =$

*X* axis from repulsion to attraction is different. We account for this in the model with the change in tuning widths of the front-end center and surround tuning functions.

*why* these tuning curve changes occur in the face of contextual stimuli. Functional models of tilt perception have included variants of efficient coding (e.g., Bednar & Miikkulainen, 2000; Clifford et al., 2000; Wainwright, 1999) and Bayesian approaches (e.g., Dayan et al., 2003; Stocker & Simoncelli, 2006), some, but not all, of which can accommodate attraction as well as repulsion. Repulsion and attraction have also been modeled in other domains, such as disparity (e.g., Lehky & Sejnowski, 1990). The main differences between our framework and previous functional modeling approaches are that (a) our model is based on a modern treatment of statistical coordination in natural scene statistics, whereas few previous frameworks have been tied to natural scene statistics models at all; and (b) our model encompasses both the neural unit level tuning curve changes and the effects of these on perception.

$\mu = 0$ and standard deviation $\sigma$ (where, for simplicity, we have set $\sigma = 1$ throughout the main text and in the simulations).

$g_c$ and $g_s$; and activations $l_c$ and $l_s$. Here, for convenience, we develop the more general formulation. We assume there are $n$ filters in the gain pool, with activations $l_1 \ldots l_n$. We label the local Gaussian components $g_1 \ldots g_n$. The single mixer variable is given by $v$. We would like to estimate the distribution $P[g_1 \mid l_1 \ldots l_n]$; and the mean $E[g_1 \mid l_1 \ldots l_n]$.

$P[g_1]$, is just a Gaussian distribution, given by Equation A2. The second term is given by $P$

$g_1$ is now fixed in Equation A3. Including the constant of proportionality, such that the distribution sums to 1, we obtain:

$g_1 \ldots g_n$ are mutually independent. Therefore, the values of the other filter responses $l_2 \ldots l_n$ only provide information about the underlying hidden variable $v$. This results in:

$n = 1$ case. Estimating the third term we obtain:

$l =$

$n, l)$, depends on the number of filters $n$ and on $l$.

$l_c$, corresponding to the center; and the other $n - 1$ surround filters (all with equal activations) $l_s$. Using this formalism, Equation A11 becomes:

$l =$

$k$, a small constant, which sets a minimal gain when the filter activations are zero. The constant $k$ acts as a placeholder for model error. In practice, as in divisive gain control frameworks, this constant is a free parameter in our model. In addition, as noted above, we set $\sigma = 1$ in all simulations.