Abstract
Biological vision systems are adept at combining cues to maximize the reliability of object boundary detection, but given a set of colocalized edge detectors operating on different sensory channels, how should their responses be combined to compute overall edge probability? To approach this question, we collected joint responses of red-green and blue-yellow edge detectors both on and off edges, using a human-labeled image database as ground truth (D. Martin, C. Fowlkes, D. Tal, & J. Malik, 2001). From a Bayesian perspective, the rule for combining edge cues is linear in the individual cue strengths when the ON-edge and OFF-edge joint distributions (1) have statistically independent cues and (2) lie in an exponential ratio to each other. Neither condition held in the color edge data we collected, and the function $P(\mathrm{ON} \mid \mathrm{cues})$, dubbed the “combination rule,” was correspondingly complex and nonlinear. To characterize the statistical dependencies between edge cues, we developed a generative model (“saturated common factor,” SCF) that provided good fits to the measured ON-edge and OFF-edge joint distributions. We also found that a divisive normalization scheme derived from the SCF model transformed raw edge detector responses into values with simpler distributions that satisfied both preconditions for a linear combination rule. A comparison to another normalization scheme (O. Schwartz & E. Simoncelli, 2001) suggests that apparently minor details of the normalization process can strongly influence its performance. Implications of the SCF normalization scheme for cue combination in biological sensory systems are discussed.
Introduction
Detecting object boundaries is one of the central problems of natural vision. The problem is all the more pressing given that some of the most important objects (predators and prey) are often well camouflaged. Even without intentional camouflage, an object's surface properties can accidentally match those of its background, causing border contrast to weaken or disappear. The problem is compounded by poor lighting conditions, shadows, haze, etc. To combat these problems, biological visual systems are adept at combining cues from multiple sources, including depth, motion, luminance, color, and texture, to maximize the reliability of boundary detection under the widest variety of circumstances (Rivest & Cavanagh, 1996; Badcock & Westheimer, 1985; Gray & Regan, 1997; McGraw, Whitaker, Badcock, & Skillen, 2003).
The problem of cue combination in the context of boundary detection may be simply stated: Given a set of $N$ local edge detectors applied at a given location in the scene, each tuned to a different sensory cue, how should their responses be combined mathematically to arrive at the ON-edge probability $P(\mathrm{ON} \mid \mathrm{cues})$?
Within a Bayesian framework, optimal cue combination depends heavily on the statistics of natural scenes (Grenander, 1996; Knill, 1998; Mamassian, Landy, & Maloney, 2002; Grzywacz & Yuille, 1990), and in the particular case of edge detection, on the conditional joint distributions $P(\mathrm{cues} \mid \mathrm{ON})$ and $P(\mathrm{cues} \mid \mathrm{OFF})$. Previous authors have collected ON-edge and OFF-edge conditional distributions from natural images. Using an adaptive binning method, Konishi, Yuille, Coughlan, and Zhu (2003) measured the joint statistics of brightness and color gradients both on and off edges in a human-labeled image database, and then used Bayes' rule to compute the empirical posterior probability $P(\mathrm{ON} \mid \mathrm{cues})$ at every point in an image. The method yielded good edge-detection results. Fine, MacLeod, and Boynton (2003) gathered joint statistics of the luminance, red-green, and blue-yellow color differences between pixels in natural images as a function of their spatial separation. By adopting the assumption that nearby pixels most often lie on the same “stuff,” whereas pixels taken from different images most likely lie on different materials, they estimated same-surface vs. different-surface conditional distributions of their color-difference measurements using unlabeled images, analogous to the collection of ON-edge and OFF-edge conditional statistics discussed above. Then, using Bayes' rule, they derived an empirical cue-combination rule that computed the probability that two pixels lie on the same surface using differences in the three color channels as cues. The surface locality assumption and the Bayesian approach were validated in the sense that the resulting cue-combination rule provided fairly good predictions of human subjects' performance on an image segmentation task.
Direct tabulation of boundary statistics to arrive at an optimal cue-combination rule, as in the above-mentioned approaches, has the advantage of simplicity and involves few assumptions. The main disadvantage is that an ever-larger space of boundary cue joint statistics must be tabulated each time a new cue is added to the mix. Furthermore, tables of statistics do not increase our understanding of the origin and statistical properties of natural boundaries, nor of the functional form of the optimal combination rule. It is difficult, therefore, to exploit such an approach to make inferences about the circuitry and operations that biological sensory systems may use to combine cues.
Overview of the approach
Given the natural ties between cue-combination theory and natural image statistics, in the next two sections we review ideas and methods from both to frame the problem of optimal boundary detection in natural images. We turn first to cue-combination theory, to learn what makes cue combination easy. The key insight is that when classifying a cue vector into categories (e.g., ON-edge vs. OFF-edge), class-conditional independence of the cues greatly simplifies the process by satisfying the most demanding precondition for a linear combination rule. We then turn to previous work in natural image statistics, from which we learn not to expect colocalized edge cues to be class-conditionally independent, far from it. We nonetheless gain insight from this work as to why colocalized cues are contaminated with higher-order correlations, and what kinds of (divisive normalization) operations can help to rid cues of their undesirable dependencies.
In the body of the work we describe our new contributions as follows: (1) We measure in a controlled setting the joint statistics of two colocalized edge cues operating in different color channels; (2) we propose a simple generative model (called “saturated common factor,” SCF) that helps explain the specific form of the joint distributions we see and especially the higher-order correlations; (3) we invert the SCF model mathematically to arrive at a simple, biologically plausible divisive normalization scheme; (4) we test the SCF normalization scheme on the labeled database and find that the normalized edge-detector responses are not only class-conditionally independent but also satisfy the other critical condition for a linear combination rule; (5) we compare the SCF normalization scheme to another scheme that is similar in spirit but based instead on a sum-of-squares normalization term (Schwartz & Simoncelli, 2001), finding that the SCF normalization is more effective at eliminating the higher-order dependencies between edge detector channels; and (6) we point out the relationships between the SCF and other similar models and flesh out interpretations and predictions pertaining to the neural circuitry needed for sensory cue combination.
When a linear combination rule is best
Previous approaches to cue combination have often included assumptions that lead to a linear combination rule (Yuille & Bulthoff, 1996; Ernst & Banks, 2002; Olman & Kersten, 2004), such as the assumption that the cues are class-conditionally independent (CCI) Gaussian variables. In the context of two edge detectors operating separately in red-green and blue-yellow color channels, the CCI assumption means that (1) in cases where an edge is present, the magnitude of the red-green change across the boundary provides no information about the magnitude of the blue-yellow change, and (2) the same holds in cases when there is no boundary present. (Note that when the CCI assumption holds for a set of cues, overall statistical independence between the cues does not hold, since a large response measured within any individual channel increases the probability that a boundary is present, which leads to higher expected values for all of the other boundary cues.)
Given multiple estimators $\hat{s}_i$ of a continuous target variable $S$, the CCI Gaussian assumption leads to a cue-combination rule that is linear, with weights depending on the individual cue “reliabilities”:
$$\hat{S} = \sum_i w_i\, \hat{s}_i, \tag{1}$$
where
$$w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2}, \tag{2}$$
and $\sigma_i^2$ is the variance of distribution $P(S \mid \hat{s}_i)$ (Yuille & Bulthoff, 1996). Similar results can be obtained for cues $\{r_1, r_2, \ldots, r_N\}$ that provide evidence for a binary target variable $T$. For example, $T$ might indicate the presence or absence of an edge at a given image location/orientation. As is the case for the Perceptron model (Bishop, 1996), Bayes' rule tells us that
$$P(T \mid r_1, \ldots, r_N) = \frac{1}{1 + \dfrac{P(r_1, \ldots, r_N \mid \bar{T})\, P(\bar{T})}{P(r_1, \ldots, r_N \mid T)\, P(T)}}. \tag{3}$$
Using the assumptions that the cues are CCI Gaussian random variables, with each cue having the same variance in class $T$ and $\bar{T}$ (despite different means), it can be shown that
$$P(T \mid r_1, \ldots, r_N) = f\!\left(\sum_i w_i r_i + w_0\right), \tag{4}$$
where the $w_i$'s are related to cue variances, $w_0$ is a constant offset, and $f(x)$ is a logistic function. It is easy to see that the optimal combination rule remains linear even under the weaker assumption that $P(r_1, r_2, \ldots \mid T)$ and $P(r_1, r_2, \ldots \mid \bar{T})$, though potentially violating the CCI condition, are arbitrary nonspherical Gaussian distributions with the same covariance matrix (Jacobs, 1995). This is because a single linear transformation of the cues, built into the combination rule (Equation 4), can eliminate the dependencies in both distributions simultaneously.
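As a minimal illustration of the reliability-weighted rule for independent Gaussian estimators, the Python sketch below computes inverse-variance weights and the resulting combined estimate. The function name `combine_estimates` and the example numbers are hypothetical and are not taken from the study.

```python
def combine_estimates(estimates, variances):
    """Linearly combine independent Gaussian cue estimates.

    Each weight is the cue's reliability (1 / variance) divided by the
    summed reliability of all cues, so the weights add up to 1.
    Returns (combined_estimate, weights, combined_variance).
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    weights = [rel / total for rel in reliabilities]
    combined = sum(w * s for w, s in zip(weights, estimates))
    # For independent Gaussian cues the combined variance is 1 / total.
    return combined, weights, 1.0 / total


# Hypothetical example: the second cue is four times noisier than the first,
# so it receives one quarter of the first cue's weight.
combined, weights, variance = combine_estimates([10.0, 20.0], [1.0, 4.0])
```

Because the weights depend only on the individual variances, adding a further cue requires only one extra reliability term, which is the extensibility argument made for linear rules.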
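We observed that a linear combination rule still applies for CCI sensory cues, even if non-Gaussian, as long as the ratio of the class-conditional distributions for each individual cue is a decaying exponential function of the cue's value:
$$\frac{P(r_i \mid \bar{T})}{P(r_i \mid T)} \propto e^{-\alpha_i r_i}. \tag{5}$$
This is the case since under the CCI assumption,
$$\frac{P(r_1, \ldots, r_N \mid \bar{T})}{P(r_1, \ldots, r_N \mid T)} = \prod_i \frac{P(r_i \mid \bar{T})}{P(r_i \mid T)} \propto e^{-\sum_i \alpha_i r_i}, \tag{6}$$
where the $\alpha_i$'s are constants. Plugging Equation 6 into Equation 3, we have
$$P(T \mid r_1, \ldots, r_N) = f\!\left(\sum_i \alpha_i r_i - \theta\right), \tag{7}$$
where $\theta$ is a constant offset that absorbs the class priors and $f$ is again a logistic function. This is the familiar “neural” operation consisting of a weighted sum of inputs followed by a sigmoidal activation function. A number of simple class-conditional distributions satisfy the ratio requirement of Equation 5, including any two Gaussian distributions with the same covariance (the special case mentioned above) and any two exponential distributions. Note that the exponential ratio assumption of Equation 5 is equivalent to the assumption that each cue by itself has a sigmoidal target function:
$$P(T \mid r_i) = \frac{1}{1 + A\, e^{-\alpha_i r_i}}, \tag{8}$$
where $A$ is a constant.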
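The claim that an exponential likelihood ratio yields a logistic function of a weighted sum can be checked numerically. The sketch below, a minimal example rather than the authors' code, assumes two exponential class-conditional distributions with hypothetical rates `rate_on` and `rate_off`, computes the posterior once by direct application of Bayes' rule and once as a logistic function of the summed cues, and lets the two be compared.

```python
import math


def exp_pdf(x, rate):
    """Density of an exponential distribution with the given rate (mean 1/rate)."""
    return rate * math.exp(-rate * x)


def posterior_bayes(cues, rate_on, rate_off, prior_on):
    """P(ON | cues) by direct application of Bayes' rule, assuming CCI cues."""
    joint_on = prior_on
    joint_off = 1.0 - prior_on
    for r in cues:
        joint_on *= exp_pdf(r, rate_on)
        joint_off *= exp_pdf(r, rate_off)
    return joint_on / (joint_on + joint_off)


def posterior_logistic(cues, rate_on, rate_off, prior_on):
    """The same posterior written as a logistic function of a weighted sum."""
    alpha = rate_off - rate_on  # shared weight on every cue
    offset = (math.log(prior_on / (1.0 - prior_on))
              + len(cues) * math.log(rate_on / rate_off))
    return 1.0 / (1.0 + math.exp(-(alpha * sum(cues) + offset)))


# Hypothetical cue values; the rates and prior are illustrative only.
example = ([0.2, 0.5], 1.0 / 0.563, 1.0 / 0.086, 0.03)
difference = abs(posterior_bayes(*example) - posterior_logistic(*example))
```

The two functions agree up to floating-point error because the log of the likelihood ratio of two exponential densities is linear in the cue value.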
In addition to their simplicity from a computational perspective, linear combination rules have the advantage that they make it particularly easy to incorporate new cues as they become available. Linear combination rules account for human performance in a variety of visual tasks, including orientation discrimination (Rivest, Boutet, & Intriligator, 1997), estimation of depth (Landy, Maloney, Johnston, & Young, 1995), luminance (Maloney, 1999, 2002; Maloney & Yang, 2004), motion (Derrington & Badcock, 1985; Ledgeway & Smith, 1994; Scott-Samuel & Georgeson, 1999; Wilson & Kim, 1994), texture localization (Landy & Kojima, 2001), and surface orientation (Knill, 1998). Results inconsistent with linear combination rules have also been found (Bulthoff & Mallot, 1998; Porrill, Frisby, Adams, & Buckley, 1999; Saunders & Knill, 2001; Frome, Buck, & Boynton, 1981), telling us that biological systems are certainly capable of managing more complex combination schemes when necessary.
About the class-conditional independence assumption: It's not likely to hold
The optimal rule for combining cues is simplest when the detectors to be combined are class-conditionally independent. However, previous work on the statistics of natural images tells us that nearby filter responses often show pronounced higher-order correlations that cannot be eliminated by a linear operation such as a whitening transform (Wainwright & Simoncelli, 2000; Schwartz & Simoncelli, 2001; Liang, Simoncelli, & Lei, 2000; Zetzsche & Röhrbein, 2001; Parra, Spence, & Sajda, 2000; Fine et al., 2003; Karklin & Lewicki, 2003, 2005). In particular, given that a strong response is measured in one filter, other nearby filters typically show larger variances. This “bowtie”-shaped dependency can arise from several factors that scale filter response distributions up and down on a region-by-region basis, including lighting factors (e.g., variations in light intensity and/or the angles between surfaces and light sources), variations in surface texture across the scene, and atmospheric conditions such as mist or haze that reduce the contrast of more distant objects.
A basic assumption of this work is that regional common-factor-induced variability in filter responses is undesirable because it is conflated with the more important filter variations tied to the underlying physical structure of the scene. Various authors have shown that higher-order correlations between nearby filters can be suppressed by divisively normalizing filter responses using a locally computed energy-like measurement. Wainwright and Simoncelli (2000) proposed a Gaussian Scale Mixture model of image formation, in which contrast factors multiplicatively scale the original Gaussian-distributed wavelet coefficients. The process is reversed by divisive normalization, allowing the raw sensory variables to be recovered. Parra et al. (2000) proposed a similar model in which filter coefficients are drawn from a spherical random distribution whose overall scale is modulated, and Zetzsche and Röhrbein (2001) proposed a divisive normalization scheme based on the observation that natural signals are often separable in polar coordinates. More recently, Karklin and Lewicki (2003, 2005) showed that the multiplicative factors operating on filter responses can themselves be viewed as spatially varying variance images, and efficiently encoded using learned basis functions tailored to specific image contexts.
It is interesting to note that the divisive normalization schemes associated with these analyses of natural image statistics are very similar, and in some cases identical, to the divisive operations proposed to implement contrast gain control in neural circuits (Reichardt & Poggio, 1979; Ohzawa, Sclar, & Freeman, 1982; Bonds, 1989; Nelson, 1991; Albrecht & Geisler, 1991; Geisler & Albrecht, 1992, 1997; Heeger, 1992; Carandini, Heeger, & Movshon, 1997; Bonin, Mante, & Carandini, 2005).
Given that none of these previous studies were primarily concerned with classification tasks, such as edge detection using optimal combinations of filter values, they focused on overall rather than class-conditional dependencies among filter values. Nonetheless, based on these earlier studies we hypothesized that our color edge data set would exhibit a similar type of higher-order dependency in the class-conditional (ON-edge and OFF-edge) distributions, and that some form of divisive normalization would be needed to eliminate these dependencies as a precursor to the cue-combination stage.
Results
Color edge statistics
The color opponent space
Three hundred images from the Corel database for which human segmentations were available (Martin, Fowlkes, Tal, & Malik, 2001) were used to gather color edge statistics (Figure 1). Just as for cone responses in natural scenes (Ruderman, Cronin, & Chiao, 1998), the RGB values in the image database were highly correlated (Figure 2A). Since this redundancy, left unchecked, would propagate directly through to the color edge statistics, we ran a fast fixed-point ICA algorithm (Hyvarinen & Oja, 1997) on a random sample of 1.5 million pixels from the database. This led to an uncorrelated color-opponent space of familiar form (Wandell, 1995; Ruderman et al., 1998) with $O_1$ and $O_2$ corresponding to red-green and blue-yellow opponent axes, respectively, and $O_3$ related to pixel intensity:
with scaling factors
introduced to simplify the matrix.
Data in the two color-opponent channels clustered near the origin (Figure 2B). A histogram equalization step (Figure 2C) was used to spread the two color-opponent values uniformly from 0 to 1 on their respective axes (Figure 2D). In the remainder of the work, we focus on edge statistics within the normalized RG and BY color channels shown in Figure 2D.
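One simple way to realize a histogram equalization step of this kind is a rank transform, in which each value is replaced by its position in the empirical cumulative distribution. The Python sketch below is an assumption about how such a step could be implemented, not a description of the procedure used in the study; the function name `equalize` is hypothetical.

```python
def equalize(values):
    """Spread values uniformly over (0, 1) by replacing each with its normalized rank.

    The value with rank k (0-based) among n samples maps to (k + 0.5) / n,
    which approximates the empirical cumulative distribution function.
    Tied values receive distinct, adjacent ranks in input order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    equalized = [0.0] * n
    for rank, index in enumerate(order):
        equalized[index] = (rank + 0.5) / n
    return equalized


# Hypothetical opponent-channel samples clustered near zero with one outlier.
channel = [3.0, -1.0, 0.5, 10.0]
uniform = equalize(channel)
```

After the transform, the ordering of the samples is preserved but their spacing no longer reflects the heavy clustering near the origin.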
Computing oriented edges
To suppress JPEG artifacts, a Gaussian filter ($\sigma = 2$ pixels) was used to smooth the color-opponent images (Figures 3A and 3B). A custom “pairwise difference” (PD) oriented edge detector (Figure 3C) was applied separately to the smoothed RG and BY color channels, giving rise to raw edge responses $r_1$ and $r_2$, respectively, at each pixel (Figure 3D). A second type of edge detector based on an oriented Gabor filter was also used in some experiments (Figure 11).
Statistics of RG and BY edges
In agreement with previous measurements of spatial contrast values in natural images (Figure 4A) (Bell & Sejnowski, 1995; Balboa & Grzywacz, 2003; Wainwright, Simoncelli, & Willsky, 2000), the marginal distributions of $r_1$ and $r_2$ were S-shaped on a log scale (Figure 4B).
Human-labeled image contours (Martin et al., 2001) were used to sort PD responses into ON-edge and OFF-edge classes (see Figure 1 caption for details), and the class-conditional distributions were collected (Figure 5). Despite the fact that the RG and BY values at each pixel were nearly uncorrelated ($r = 0.08$, see Figure 2D), the values of $r_1$ and $r_2$ were moderately correlated within both ON-edge and OFF-edge classes ($r = 0.36$ and $0.43$, respectively). Furthermore, as expected from previous studies (Wegman & Zetzsche, 1990; Wainwright & Simoncelli, 2000; Schwartz & Simoncelli, 2001; Parra et al., 2000; Zetzsche & Röhrbein, 2001), the variance of either variable increased with the value of the other (Figure 5, lower panels).
A second feature common to the two joint distributions was the transition of contour shapes from relatively straight diagonal contours near the origin, to round contours in the intermediate ranges, to square contours far from the origin (Figure 5, upper panels).
A generative model
We searched for a simple generative model that would (1) help explain how colocalized edge detectors operating in independent sensory channels could give rise to the particular form of correlated joint distributions shown in Figure 5, and (2) provide an avenue to recover the underlying edge variables, which we presumed to be class-conditionally independent. We proposed that the responses $r_1, r_2, \ldots, r_N$ of a collection of edge detectors applied at image location $(x, y)$ and orientation $\theta$ are generated by a four-step process (Figure 6):

1. $P(\mathrm{ON})$ denotes the prior probability that a physical edge exists at any given location/orientation, with $P(\mathrm{OFF}) = 1 - P(\mathrm{ON})$. In the first stage of the generative model, the ON or OFF class is chosen at random according to this prior probability, measured to be 3% in the labeled images.

2. If the ON-edge class is selected, the raw edge magnitudes $e_1, \ldots, e_N$ in the $N$ sensory channels are drawn independently from an exponential distribution with mean
$$\langle e_{\mathrm{ON}} \rangle = 1/q_{\mathrm{ON}}, \tag{10}$$
that is,
$$P(e_i \mid \mathrm{ON}) = q_{\mathrm{ON}}\, e^{-q_{\mathrm{ON}} e_i}. \tag{11}$$
Edge magnitudes evaluated at OFF-edge sites are also assumed to be independent across sensory channels and exponentially distributed, though with a smaller mean value
$$\langle e_{\mathrm{OFF}} \rangle = 1/q_{\mathrm{OFF}}. \tag{12}$$
This difference in expected value reflects the fact that human contour labels were generally well aligned with local edge structures; the larger mean values for ON-edge responses based on human labeling can be clearly seen in Figure 5.

3. A “common factor” $C$ representing the local lighting conditions, texture, or other modulatory influence is drawn from a third exponential distribution with mean $\langle C \rangle = 1/p$, giving
$$P(C) = p\, e^{-pC}. \tag{13}$$
The factor $C$ multiplies the raw edge strengths in all $N$ sensory channels to give the scaled sensory responses
$$R_i = C\, e_i. \tag{14}$$

4. Finally, the scaled sensor responses are passed through a saturating nonlinearity with knee $K$ to give the measured edge detector responses
$$r_i = g(R_i) = \frac{R_i}{R_i + K}. \tag{15}$$
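The four steps above can be sketched as a small sampler. This is a minimal illustration under stated assumptions, not the authors' implementation: the saturating nonlinearity is taken to be $g(R) = R/(R + K)$, the form implied by the inverse $h(r) = Kr/(1 - r)$ given later in the text, and the default parameter values are those listed in Table 1. The function name `sample_scf` is hypothetical.

```python
import random


def sample_scf(n_channels=2, prior_on=0.03, q_on=1.0 / 0.563,
               q_off=1.0 / 0.086, p=1.0, knee=1.0, rng=random):
    """Draw one set of edge-detector responses from the SCF generative model.

    Returns (is_on, common_factor, raw_edges, responses).
    """
    # Step 1: choose the ON or OFF class according to the prior.
    is_on = rng.random() < prior_on

    # Step 2: independent exponential edge magnitudes with mean 1/q.
    q = q_on if is_on else q_off
    raw_edges = [rng.expovariate(q) for _ in range(n_channels)]

    # Step 3: one exponential common factor (mean 1/p) scales every channel.
    common_factor = rng.expovariate(p)
    scaled = [common_factor * e for e in raw_edges]

    # Step 4: saturating nonlinearity with knee K maps [0, inf) into [0, 1).
    responses = [big_r / (big_r + knee) for big_r in scaled]
    return is_on, common_factor, raw_edges, responses


# A seeded generator makes the draw reproducible.
is_on, c, e, r = sample_scf(rng=random.Random(0))
```

Because the same `common_factor` multiplies every channel, the sampled responses are statistically dependent within each class even though the raw edge magnitudes are drawn independently.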
By deriving an expression for the cumulative joint distribution of $r_1$ and $r_2$ from this “saturated common factor” model, and computing the mixed partial derivative $\partial^2 / \partial r_1 \partial r_2$ evaluated using either $q = q_{\mathrm{ON}}$ or $q_{\mathrm{OFF}}$, we arrived at a parameterized expression for the class-conditional joint distributions $P(r_1, r_2 \mid \mathrm{ON})$ and $P(r_1, r_2 \mid \mathrm{OFF})$ (see “18” section). Maximum likelihood fits of the model to the empirical joint distributions of Figure 5 are shown in Figure 7. The model-generated distributions show both the gradual transitions from diagonal to square contours and the increasing variance of the conditional distributions.
Surprisingly, only a single parameter was needed to generate each of the plots of Figure 7. Whereas each two-dimensional plot nominally depends on the three parameters $p$, $K$, and either $q_{\mathrm{ON}}$ or $q_{\mathrm{OFF}}$, the SCF model is sensitive only to the product $p \cdot q \cdot K$ (see “19” section). It is thus possible to fix both $p$ and $K$ to 1 and maximize the likelihood of the data varying only $q$. The values of $q_{\mathrm{ON}}$ and $q_{\mathrm{OFF}}$ found by this procedure are shown in Table 1. $P(\mathrm{ON})$ was determined by estimating the fraction of pixels covered by human labels in the image database, and played no role in the fits shown in Figure 7.
Table 1. Parameters of the saturated common factor model used to generate the distributions in Figure 7.

Description                               Parameter      Value
Prior probability of edge                 P(ON)          0.03
Mean raw ON-edge response ⟨e_ON⟩          1/q_ON         0.563
Mean raw OFF-edge response ⟨e_OFF⟩        1/q_OFF        0.086
Mean of common factor ⟨C⟩                 1/p (fixed)    1
Knee of sensor saturation function g()    K (fixed)      1
The optimal combination rule
Given the prior probability $P(\mathrm{ON})$ and the joint distributions of colocalized edge detector responses for both ON-edge and OFF-edge classes, the combination rule $P(\mathrm{ON} \mid r_1, r_2)$ follows directly from the application of Bayes' rule (Equation 3). Contour and surface plots of the combination rule derived from the empirically tabulated likelihood functions (Figure 5) are shown in Figure 8A. For comparison, the combination rule derived from the modeled likelihoods (Figure 7) is shown in Figure 8B. The two functions show a similar progression of contour shapes moving away from the origin, reminiscent of the progression of contour shapes in the underlying class-conditional likelihood functions. One notable feature of the combination rule is that near the origin, where the two cues are both weak, the more diagonally oriented contours mean the edge probability is closer to a function of the sum (or average) of the two cues, whereas the square-shaped contours far from the origin indicate that the edge probability there is governed roughly by the MAX of $r_1$ and $r_2$. The fact that the combination rule expressed in terms of the unprocessed edge detector responses $r_1$ and $r_2$ defies simple description is in keeping with the fact that the raw edge values satisfy neither of the preconditions for a linear combination rule of the form given in Equation 7. We conjectured that an appropriate normalization of $r_1$ and $r_2$, obtained by inverting the SCF model, could recover the two exponentially distributed underlying physical edge variables $e_1$ and $e_2$ if they exist, that is, if the edge responses obtained from the human-labeled images were in fact generated by an SCF-like process.
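Deriving a combination rule from tabulated likelihoods amounts to applying Bayes' rule bin by bin. The sketch below illustrates the idea with plain nested lists; the function name `posterior_table`, the bin layout, and the example counts are all hypothetical, and any binning or smoothing choices made in the study are not reproduced here.

```python
def posterior_table(counts_on, counts_off, prior_on):
    """Compute P(ON | bin) for every bin of two equally shaped 2D histograms.

    counts_on and counts_off hold raw counts of (r1, r2) pairs collected
    on and off edges. Each is normalized to a likelihood table before
    Bayes' rule is applied. Bins that are empty in both tables yield NaN.
    """
    total_on = sum(sum(row) for row in counts_on)
    total_off = sum(sum(row) for row in counts_off)
    posterior = []
    for row_on, row_off in zip(counts_on, counts_off):
        out_row = []
        for n_on, n_off in zip(row_on, row_off):
            joint_on = prior_on * n_on / total_on
            joint_off = (1.0 - prior_on) * n_off / total_off
            evidence = joint_on + joint_off
            out_row.append(joint_on / evidence if evidence > 0 else float("nan"))
        posterior.append(out_row)
    return posterior


# Hypothetical one-row histograms with equal priors for easy checking.
table = posterior_table([[1, 3]], [[3, 1]], prior_on=0.5)
```

A table of this kind grows exponentially with the number of cues, which is the practical limitation of direct tabulation noted in the Introduction.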
Recovering the CCI edge variables: First expand, then divide
Given the form of the SCF model, recovery of the underlying variables $e_i$ from measured edge detector responses involves two steps. First, each variable is expanded through the function $h(r) = g^{-1}(r) = Kr/(1 - r)$, with $K = 1$, to undo the effect of the compressive nonlinearity that we assume has acted on the edge-detector outputs. Applying the expansive nonlinearity unbends the contours found in the class-conditional joint distributions of $r_1$ and $r_2$, giving rise to the straight diagonal contours in the joint distributions of the intermediate variables $R_1$ and $R_2$ (Figure 9A). The contours of $P(R_1, R_2 \mid \mathrm{OFF})$ are not, however, equally spaced as they would be in a log plot for two independent exponential variables. Thus, the expansive nonlinearity leads to a simpler relationship between the two edge detector values, but does not remove their higher-order statistical dependencies.
The second step in inverting the SCF model is to divide the intermediate variables $R_i$ by the most probable common factor, approximated as
$$\hat{C} = \frac{-N + \sqrt{N^2 + 4 p\, q_{\mathrm{OFF}} \sum_{i=1}^{N} R_i}}{2p}, \tag{17}$$
where $N = 2$ is the number of available cues (see “20” section for derivation of Equation 17). This approximation leads in turn to an approximation of the most probable distal edge value
$$e_i \approx \frac{R_i}{\hat{C}} = \frac{h(r_i)}{\hat{C}}. \tag{18}$$
For analytical tractability, Equation 17 approximates the most probable value of $C$ given $R_1$ and $R_2$ rather than its (presumably larger) mean value. To adjust for the underestimate, we added a constant $\varepsilon = 0.5$ to the denominator of Equation 18. The primary effect of the constant was to expand the range of $\{e_1, e_2, \ldots\}$ tuples produced by the normalization to include points near the origin. Divide-by-zero errors that occur when $r_1 = r_2 = 0$ were also avoided.
Equation 18 is similar in spirit to previously proposed normalization schemes, in that the raw filter responses are first passed through an expansive nonlinearity and then divisively normalized by a term involving a sum of the expanded variables. The details are different, however. In particular, the function $h(r)$ appears in lieu of the more common squaring nonlinearity (Heeger, 1992; Bonin et al., 2005; Wainwright & Simoncelli, 2000; Parra et al., 2000; Schwartz & Simoncelli, 2001; Zetzsche & Röhrbein, 2001); the divisive factor here is also sensitive to the number of available cues through the value of $N$.
The common factor can be rewritten as
$$\hat{C} = \frac{N}{2p}\left(\sqrt{1 + \frac{4 p\, q_{\mathrm{OFF}}}{N}\,\bar{R}} - 1\right), \qquad \bar{R} = \frac{1}{N}\sum_{i=1}^{N} R_i. \tag{19}$$
This form makes it clear that when $N$ grows large, the most probable common factor simplifies to
$$\hat{C} \approx q_{\mathrm{OFF}}\,\bar{R}. \tag{20}$$
This means that when many independent sensory cues are available, the average of the expanded cues $\bar{R}$ converges to the common multiplier, up to the constant of proportionality $q_{\mathrm{OFF}}$. When the number of cues is small, however, the sparse prior for $C$ causes $\hat{C}$ to grow sublinearly with increasing values of $\bar{R}$.
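The expand-then-divide procedure can be sketched in a few lines of Python. The estimator used here is an assumption: it is the positive root of $pC^2 + NC - q_{\mathrm{OFF}}\sum_i R_i = 0$, which is what maximizing $P(C)\prod_i P(R_i \mid C)$ gives when $C$ has an exponential prior with rate $p$ and each $R_i = C e_i$ has an exponentially distributed $e_i$ with rate $q_{\mathrm{OFF}}$. The function names are hypothetical, and the defaults ($q_{\mathrm{OFF}} = 1/0.086$, $p = K = 1$, $\varepsilon = 0.5$) follow the values quoted in the text.

```python
import math


def expand(r, knee=1.0):
    """Expansive nonlinearity h(r) = K r / (1 - r); requires 0 <= r < 1."""
    return knee * r / (1.0 - r)


def estimate_common_factor(expanded, q_off, p=1.0):
    """Most probable common factor under the assumptions stated above.

    Solves p*C**2 + N*C - q_off*sum(R) = 0 for its non-negative root.
    """
    n = len(expanded)
    return (-n + math.sqrt(n * n + 4.0 * p * q_off * sum(expanded))) / (2.0 * p)


def scf_normalize(responses, q_off=1.0 / 0.086, p=1.0, knee=1.0, eps=0.5):
    """Expand each response, then divide by the estimated common factor plus eps."""
    expanded = [expand(r, knee) for r in responses]
    c_hat = estimate_common_factor(expanded, q_off, p)
    return [big_r / (c_hat + eps) for big_r in expanded]


# Hypothetical responses from two colocalized detectors.
e1, e2 = scf_normalize([0.2, 0.6])
```

The constant `eps` keeps the denominator positive when every response is zero, so the all-zero input maps cleanly to all-zero output.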
Testing the normalization
To test the quality of the estimates of $C$ (via Equation 17) when only two color edge values are available, two channels of data ($r_1$ and $r_2$) were generated by drawing from the SCF model using the parameters of Table 1. A scatter plot of $C$ vs. $\hat{C}$ is shown in Figure 9B. The correlation between estimated and actual values of $C$ was $r = 0.75$. When a sum-of-squares normalizer (Equation 23) was used in lieu of Equation 17, estimates of $C$ were degraded ($r = 0.46$), though this was partly to be expected given that the data in this case were known to be generated by the SCF model.
A more demanding test of the normalization formula would involve edge data from natural images, where the underlying generating process is unknown. Again using the estimate of $C$ given by Equation 17, we tested whether the normalization of Equation 18 would lead to CCI exponentially distributed edge values $e_1$ and $e_2$, a premise of the SCF model. The joint distributions of the postnormalization edge variables are shown in Figures 9C–9F. The roughly evenly spaced diagonal contours (Figures 9C and 9E) and the nearly overlapping conditional probability slices (Figures 9D and 9F) confirm that the higher-order correlations present in the measured filter values $r_1$ and $r_2$ have been mostly eliminated, and that the resulting CCI edge variables are approximately exponentially distributed. As can be seen by the greater residual dependencies in the ON-edge data (compare panels D and F in Figure 9), the parameters of Table 1 were optimized to model OFF-edge statistics since those data accounted for 97% of the total data set.
Given that the two requirements for a linear combination rule are approximately met by $e_1$ and $e_2$, that is, class-conditional independence and an exponential ratio of the two class-conditional distributions, we expected from Equation 7 that the combination rule derived empirically from the normalized edge variables $e_1$ and $e_2$ would be a linear combination of the two variables passed through a sigmoidal output function
$$P(\mathrm{ON} \mid e_1, e_2) = f\big(s\,(e_1 + e_2 - t)\big), \qquad f(x) = \frac{1}{1 + e^{-x}}, \tag{21}$$
where the $e_i$ are computed as in Equation 18, and assuming the cues have equal means both on and off edges, the sigmoid steepness and threshold parameters would be
$$s = q_{\mathrm{OFF}} - q_{\mathrm{ON}}, \qquad t = \frac{1}{s}\,\ln\!\left[\frac{P(\mathrm{OFF})}{P(\mathrm{ON})}\left(\frac{q_{\mathrm{OFF}}}{q_{\mathrm{ON}}}\right)^{2}\right]. \tag{22}$$
The diagonal, sigmoidally spaced contours of the empirical combination rule shown in Figure 10 confirm this.
Comparison to another divisive normalization scheme and generality
We asked whether the details of the SCF normalization scheme are critical for eliminating the higher-order statistical dependencies between colocalized edge detectors, or whether any similar scheme would perform in a roughly equivalent way. We also asked whether good performance of the SCF normalization scheme might be tied to the particular edge-detection filter we used up to this point (the PD filter), or whether the scheme would also work well when using a more conventional edge-detecting filter.
Previously proposed schemes for divisive normalization of sensory signals have often involved an energy-like computation in which measured filter values are squared and summed over a neighborhood, in effect a measure of local contrast (Heeger, 1992; Parra et al., 2000; Schwartz & Simoncelli, 2001; Zetzsche & Röhrbein, 2001; Carandini et al., 1997; Bonin et al., 2005). For example, Schwartz and Simoncelli (2001) proposed a normalization of the form
in which the $w_j$ and $\sigma$ are constants. We compared the SCF normalization with the Schwartz-Simoncelli (SS) equation, which we extended to allow for an arbitrary exponent $k$:
Lacking a likelihood function comparable to that derived for the SCF model, the parameter $k$ was optimized by systematic search with visual examination of the results. Results for SS normalization applied to the PD edge-detector data set are shown in Figure 11A. In comparison to Figures 9C–9F, the postnormalized ON-edge and OFF-edge distributions using the SS normalization show greater residual higher-order dependencies, and lead to a combination rule with pronounced nonlinear interactions between the normalized edge variables. A similar pattern is seen when comparing the SCF and SS normalization schemes using a Gabor filter as the edge-detection operator (Figure 11B).
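For comparison with the SCF scheme, a sum-of-powers normalizer in the spirit of the SS equation can be sketched as follows. The exact form used in the study is not recoverable from the text, so this is an assumption: each response is raised to the power $k$ and divided by a weighted sum of all responses raised to the same power plus $\sigma^k$, which reduces to the familiar squared-response form when $k = 2$. The function name `ss_normalize` and the example values are hypothetical.

```python
def ss_normalize(responses, weights, sigma, k=2.0):
    """Divisive normalization by a weighted sum of powers (assumed form).

    normalized_i = |r_i|**k / (sum_j w_j * |r_j|**k + sigma**k)
    """
    if len(responses) != len(weights):
        raise ValueError("need one weight per response")
    pool = sum(w * abs(r) ** k for w, r in zip(weights, responses)) + sigma ** k
    return [abs(r) ** k / pool for r in responses]


# Hypothetical responses with unit weights and sigma = 1.
normalized = ss_normalize([1.0, 2.0], [1.0, 1.0], sigma=1.0, k=2.0)
```

Unlike the SCF normalizer, the pooling term here has no dependence on the number of cues beyond the sum itself, which is one of the details the comparison in the text turns on.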
Overall, the SCF normalization leads to a simpler pattern of results, including a more complete elimination of the higherorder dependencies, a nearly exponential appearance of the postnormalized variables, and an approximately linear combination rule. This suggests that the SCF model, by assuming underlying exponentially distributed variables, an exponentially distributed common factor, and a compressive nonlinearity applied separately to each channel, provides a fairly good representation of the process leading from physical edges to measured edge detector responses.
Color edge detection using the SCF normalization
Examples of images processed using the SCF-derived cue-combination scheme (Equations 18, 21, 22, and Figure 8C) are shown in Figure 12. All three color channels were used (i.e., including the intensity channel). The lightness value at each pixel is tied to the maximum value of $P(\mathrm{ON} \mid r_1, r_2, r_3)$ over the 8 cue-combined orientation channels analyzing that pixel.
Discussion
The ability to make optimal use of available cues for the detection of object boundaries is critical for survival. We have pointed out that the problem of combining multiple cues for edge detection lies at a crossroads between theories of optimal cue combination and natural image statistics. Our approach was motivated by the idea that biological and machine vision systems should not only strive to combine edge cues optimally, but for reasons of parsimony, they should be structured to allow easy incorporation of additional cues whenever they become available in the course of evolution. One means to achieve easy extensibility is to design the system so that edge cues can be combined linearly. In this way, new cues that become available can simply be added into the existing mix with appropriate coefficients.
A Bayesian formulation of the problem tells us that a linear combination rule is optimal when the cues are class-conditionally independent and satisfy certain distributional assumptions (Jacobs, 1995; see “When a linear combination rule is best” section). In color edge detection, class-conditional independence would mean that in the class of all true edges (and similarly for the class of all nonedges), the red-green edge score carries no information about the blue-yellow or luminance edge scores, and vice versa. Statistical independence is a frequent topic of discussion in connection with sparse V1-like image representations, including results showing that V1-like oriented filters emerge from independent components analysis (Olshausen & Field, 1996; Bell & Sejnowski, 1995). In actuality, however, when filter responses are collected from nearby locations in natural images, strong statistical dependencies are observed between them (Wainwright & Simoncelli, 2000; Schwartz & Simoncelli, 2001; Liang et al., 2000; Zetzsche & Röhrbein, 2001; Parra et al., 2000; Wainwright, Schwartz, & Simoncelli, 2001; Fine et al., 2003; Karklin & Lewicki, 2003, 2005).
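To make the linear-rule condition cited above (Jacobs, 1995) concrete, the following sketch computes the Bayesian posterior for cues that are class-conditionally independent and exponentially distributed in both classes, and shows that its log-odds are linear in the cue values. The rate parameters and the prior are hypothetical values chosen for illustration, not fitted values from the data set.

```python
import numpy as np

def posterior_on(r, rate_on, rate_off, prior_on):
    """P(ON | r) for class-conditionally independent exponential cues.

    Each cue i has density rate * exp(-rate * r_i) within a class, and the
    joint likelihood within a class is the product over cues.
    """
    r = np.asarray(r, dtype=float)
    like_on = np.prod(rate_on * np.exp(-rate_on * r))
    like_off = np.prod(rate_off * np.exp(-rate_off * r))
    joint_on = prior_on * like_on
    return joint_on / (joint_on + (1.0 - prior_on) * like_off)

def log_odds(p):
    return np.log(p / (1.0 - p))

# Hypothetical parameters: OFF-edge responses are weaker on average,
# so their exponential rates are larger than the ON-edge rates.
rate_on = np.array([1.0, 2.0])
rate_off = np.array([3.0, 5.0])
prior_on = 0.1

base = log_odds(posterior_on([0.0, 0.0], rate_on, rate_off, prior_on))
step_1 = log_odds(posterior_on([1.0, 0.0], rate_on, rate_off, prior_on)) - base
step_2 = log_odds(posterior_on([0.0, 1.0], rate_on, rate_off, prior_on)) - base
both = log_odds(posterior_on([1.0, 1.0], rate_on, rate_off, prior_on)) - base
```

Under these assumptions the log-odds increase by (rate_off − rate_on) per unit of each cue, and the effect of the two cues together is the sum of their separate effects, so the posterior is a fixed sigmoid applied to a weighted sum of the cues.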
What is the source of these dependencies in the context of color edge detection? Even assuming the luminance, red-green, and blue-yellow values at each pixel are statistically independent in natural images (they are not), it does not follow that edge detectors operating separately on the three color channels would be statistically independent. There are at least two reasons for this. First, the appearance and disappearance of physical edges within the overlapping receptive fields of two or more colocalized detectors will induce a dependency between them. This is because an edge is a space-occupying object whose occurrence will as a rule boost edge detector responses in all color channels simultaneously. Isoluminant, iso-RG, or iso-BY edges are counterexamples to this rule, but such cases are relatively rare. Second, even when the joint statistics of two or more colocalized edge detectors are conditioned on the presence or absence of a true edge, thus factoring out the spatial edge–induced correlation, numerous studies of natural image statistics have shown that regional lighting, geometric, or texture variations can introduce higher-order correlations between nearby detectors (Wegman & Zetzsche, 1990; Wainwright & Simoncelli, 2000; Schwartz & Simoncelli, 2001; Parra et al., 2000; Zetzsche & Röhrbein, 2001; Karklin & Lewicki, 2003, 2005). A prime example is the modulatory effect of lighting intensity (e.g., sun vs. shade) on the outputs of all of the spatial filters within a local neighborhood.
In point of fact, therefore, colocalized edge detectors exhibit both (1) a correlation arising from their spatially overlapping receptive fields and co-responsiveness to most physical edges (this is a “good” kind of correlation, akin to the consensus one seeks from a panel of medical experts who use different methods to diagnose the same disease) and (2) a “bad” kind of higher-order correlation induced by lighting, geometric, or texture-related factors that modulate the spatial filters in a region up and down together, hence the term “common factor.” It is these undesirable correlations that lead a set of colocalized edge detectors to violate the CC independence assumption. And according to cue-combination theory, this makes edge cues in their unnormalized form more difficult to combine.
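The effect of a shared multiplicative factor can be illustrated with a small simulation. This is a sketch under simplifying assumptions: two independent, unit-mean exponential variables stand in for the underlying edge values, an independent unit-mean exponential variable stands in for the common factor, and no compressive nonlinearity is applied. The function names and parameter values are hypothetical.

```python
import numpy as np

def simulate_cues(n, common_factor=True, seed=0):
    """Draw n pairs of cue magnitudes, optionally scaled by a shared factor."""
    rng = np.random.default_rng(seed)
    e = rng.exponential(1.0, size=(n, 2))      # independent underlying values
    if common_factor:
        c = rng.exponential(1.0, size=(n, 1))  # one factor shared by both cues
    else:
        c = np.ones((n, 1))
    return c * e

def magnitude_correlation(x):
    """Pearson correlation between the two cue magnitudes."""
    return np.corrcoef(x[:, 0], x[:, 1])[0, 1]

corr_with = magnitude_correlation(simulate_cues(200_000, common_factor=True))
corr_without = magnitude_correlation(simulate_cues(200_000, common_factor=False))
```

Although the underlying values are independent by construction, the shared factor alone is enough to make the two cue magnitudes rise and fall together; for these particular distributions the expected correlation is 1/3.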
We conclude that an extensible boundary-detecting system should set up a first layer of circuitry that attempts to eliminate the higher-order correlations between edge cues using an appropriate divisive normalization scheme, while leaving the good, target-related correlations intact. A number of divisive normalization schemes have been previously proposed in addition to the one we describe here. Assuming the post-normalization cues satisfy the required distributional assumptions (see “When a linear combination rule is best” section), they may simply be added to obtain a measure of the overall edge probability at any given location. With this scenario in mind, we developed a simple generative model to explain certain qualitative features of the joint distributions of two color edge channels operating on and off edges in complex indoor and outdoor scenes. We found that the SCF model with the parameters in Table 1 generated good facsimiles of our empirical data based on human-labeled contours, including the progression from diagonal to square contours seen in the ON-edge and OFF-edge distributions (Figure 5A), and the general form of the higher-order correlation between the two edge variables (Figure 5B). A more stringent validation of the generative model lay in its ability to selectively eliminate the undesirable higher-order statistical dependencies between colocalized edge detectors. To verify this, we inverted the SCF model to arrive at a two-stage normalization scheme that is similar in spirit to, but different in its details from, previously proposed normalization schemes. When we applied the SCF-derived normalization scheme to the RG and BY edge responses measured in the human-labeled data set, we recovered nearly independent exponentially distributed edge variables, and correspondingly, a linear combination rule. That the SCF-normalized color edge values would be class-conditionally independent, exponentially distributed, and would lead to a linear combination rule, was not a foregone conclusion. For comparison, we applied an alternative (sum-of-squares) normalization scheme and found that it was less well suited to normalize and combine the edge cues in our data set under two different assumed edge detector models (PD and Gabor) (Figure 11). This suggests that the details of the normalization scheme can be quite important, and that the SCF model may provide a particularly good description of the joint response statistics of colocalized edge detectors in natural images.
Appropriateness of the human-labeled data set
The human-labeled boundaries in the Martin et al. (2001) data set were used to sort local oriented edge-detector responses into ON-edge vs. OFF-edge classes. The appropriateness of this data set for our purposes might be questioned, given that the human labelers clearly focused their attention on major object boundaries rather than local edges (Figure 1). Many strong local edge responses would have been missing from the ON-edge class and misclassified as OFF-edge data. In addition, our automated edge-sorting algorithm included in the ON-edge class any response from an edge filter lying underneath and roughly aligned with a label from any of the 5 human subjects. This criterion was probably too liberal, leading to the inappropriate assignment of some weak filter responses to the ON-edge class. The combined effect of these two choices means that our ON-edge distribution probably contained an excess of weak edge values and a shortage of strong ones, and vice versa for the OFF-edge distribution. The prior probability of an edge P(ON) could also have been biased away from its true value, but it is difficult to say in which direction.
Despite these nonideal properties of our data set, it seems unlikely that a change in the human labeling strategy or an improvement in our mode of collecting ON-edge vs. OFF-edge data from the labeled images would change our results in any fundamental way. Recall that the SCF-derived normalization, whose parameters were fit based on the human-labeled data set, led to post-normalized color edge values that were close to exponentially distributed and class-conditionally independent. Moreover, judging by the comparison to another (sum-of-squares) normalization method (Figure 11), the simplification that we observed both in the joint distributions and in the combination rule after applying the SCF normalization was not an inevitable outcome. It seems unlikely that the failure of human subjects to label many bona fide local edges, and the overly inclusive criterion we used to define the ON-edge class based on the human labels, would conspire to simplify the statistics of the data set in such a way that the distributional assumptions of the SCF model would be accidentally satisfied. Nonetheless, it would be interesting to apply the SCF model and its corresponding normalization scheme to a labeled data set that emphasizes local edges rather than major object boundaries (Wilson, Ing, & Geisler, 2006).
Are ON-edge and OFF-edge structures fundamentally different?
It is at first surprising that the joint distributions in Figure 5 show a similar kind of higher-order correlation in both ON-edge and OFF-edge data. This observation clashes with the intuitive notion that edges constitute different “stuff” than nonedges, and should therefore react differently to the modulatory influences that generate higher-order correlations between colocalized edge detectors. However, the natural world does not in fact contain a clean dichotomy between edges and nonedges, any more than it contains a clean dichotomy between chairs and nonchairs or any other natural category. Rather, an edge detector presented with an image patch asks the question, “How good an edge do I see at my preferred location, orientation, and scale?” The answer is continuous-valued rather than categorical, a measure of the match between the complex, variegated local image structure and the edge detector's canonical multidimensional tuning curve. The key observation is that a local image structure rated as a poor edge by one particular detector, and thus relegated to that detector's OFF-edge distribution, will often qualify as a good edge at a nearby location, orientation, and/or scale. It follows that all edge-detector responses evoked by a local image structure, including both the strong responses evoked in well-matched edge channels and the weak responses evoked in poorly matched channels, are subject to the same regional scaling factors that generate higher-order correlations between colocalized edge channels. This may explain why higher-order correlations are seen even in the OFF-edge distribution for any given filter. Put another way, an OFF-edge distribution might be more appropriately labeled “NOT-ON-MY-edge.”
When the use of human labels as ground truth isn't circular
In a statistical approach to cue combination, a kind of circularity would seem to exist when using a cue-combining edge-detection system, such as a human observer, to sort data into ON-edge and OFF-edge ground truth categories with the intention to use that ground truth to evaluate the performance of another cue-combining system that is specifically designed to be a model of the first. In the limit in which the model system becomes equivalent to the ground truth–generating system in design and sophistication, the likelihood functions P(cues∣ON) and P(cues∣OFF) would degenerate to reflect the deterministic assignment of inputs to ON-edge and OFF-edge categories by the ground truth classifier. That is, the ON-edge cue distribution would have nonzero probability domains that are exactly complementary to those of the OFF-edge distribution (unlike the case of Figure 5, which contains no regions of zero probability), and the combination rule, unlike the rules shown in Figure 8, would be a binary-valued function that perfectly replicates the ground truth classifications. The reason that this degeneracy does not occur in the present context is that the human-drawn labels used to classify pixels into ON-edge and OFF-edge categories are the product of an extremely sophisticated long-range contour processing network residing within the human visual system. This ultra-high-end contour-processing network has far more information about object contour structure available to it than is available to any collection of local edge detectors, and far more sophisticated machinery for processing that information. The inability of a set of cues to perfectly reproduce the ground truth classifications, as indicated in our case by the heavily overlapping ON-edge and OFF-edge distributions, attests to the noncircularity of the approach.
Rationale for the compressive nonlinearity in the SCF model
A key feature of the SCF model is the compressive nonlinearity g(x) applied to each filter channel after it has been scaled by the common factor. One rationale for including this nonlinearity in the generative model for edges is that the dynamic range of distal edge magnitudes that can occur within natural scenes far exceeds the dynamic range of the physical devices (the neurons or camera pixels) that are designed to represent those values. For example, consider the border between two dark matte surfaces with luminances differing by only 5% vs. the border between a dark matte surface and a bright specular surface whose luminances can easily differ by a factor of 100. The need to cope with such a large range of edge magnitudes, in some cases within the same scene, means that edge-representing variables within a typical physical edge-detection system will for practical reasons carry range-compressed signals.
Beyond the need to cope with dynamic range limitations, applying a compressive transform to a low-level sensory variable can have representational advantages as well. For example, a logarithmic function has the convenient property that just-noticeable differences in stimulus intensity will be proportional to the absolute stimulus magnitude (subject to certain assumptions); this is the basis of the Weber-Fechner psychophysical scaling law. A compressive transform may also simplify the joint distributions of two or more sensory variables: Ruderman et al. (1998) found that a logarithmic transform of retinal cone responses led to a simpler, more symmetrical joint distribution of pixel data in a PCA-derived color-opponent space like that found in the retina. Compressive transforms can also increase coding efficiency in sparsely distributed signals (Laughlin, 1981). Thus, on both practical and representational grounds, it seems reasonable to include an explicit compressive nonlinearity in a generative model for low-level visual cues. As a cautionary note in interpreting neural data, however, the fact that a neuron's response saturates with increasing stimulus intensity does not necessarily signify that the system has applied an explicit compressive nonlinearity to an underlying sensory variable. For example, a sublinear stimulus-response curve can arise as a byproduct of a divisive normalization of the very kind we are concerned with here (e.g., see Bonin et al., 2005). In particular, following the Bayesian logic of the SCF and other similar models, a sublinear response to increasing stimulus intensity may reflect (1) the system's assumption that the measured stimulus intensity is the product of a true feature value and a contaminating multiplicative factor, (2) the Bayesian inference that an increase in stimulus intensity is partly due to an increase in the contaminating factor, and (3) the fact that division by an ever-increasing factor drags the curve down into the sublinear range. Viewed in this way, the cell's sublinear stimulus-response curve simply reflects the system's best guess as to the distal feature's true value given the magnitude of the conflated stimulus presented as input. The situation could be more complicated than this, however, since the above argument does not rule out that an overt compressive nonlinearity has been applied after the division operation, for range compression or other purposes.
Comparison to other normalization schemes
The inclusion of an explicit compressive nonlinearity in the generative (i.e., forward) direction of the SCF model underscores an interesting difference between it and related models. In virtually all normalization schemes, including the SCF normalization, the input variables feeding into the common factor estimate (i.e., the denominator) are run through an expansive nonlinearity before they are summed (Equation 17). In the SCF model, the purpose of this operation is transparent: The expansive function h(x) is the inverse of the compressive function g(x) that is presumed to have acted on the variables prior to their arrival as inputs. In other normalization schemes, the expansive nonlinearity is most often a squaring operation (Heeger, 1992; Carandini et al., 1997; Parra et al., 2000; Schwartz & Simoncelli, 2001; Zetzsche & Röhrbein, 2001; Bonin et al., 2005). The rationale for the squaring nonlinearity, however, is not that it inverts a square-root operation that was applied earlier in the generative process. Rather, the square can be traced to the idea that the common factor modulates the variance or “energy” of the population response (Heeger, 1992), or to the assumption that the underlying sensory variables are Gaussian distributed (Wainwright & Simoncelli, 2000).
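The relationship between a compressive function g and its expansive inverse h can be made concrete with a short sketch. The two compressive forms used here, log(1 + x) and x/(x + K), are the examples of sublinear functions named elsewhere in this article; whether either matches the exact g of the SCF model (Equations 12–16) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def g_log(x):
    """Compressive: g(x) = log(1 + x)."""
    return np.log1p(x)

def h_log(r):
    """Expansive inverse of g_log: h(r) = exp(r) - 1."""
    return np.expm1(r)

def g_sat(x, K=1.0):
    """Compressive and saturating: g(x) = x / (x + K), with values in [0, 1)."""
    return x / (x + K)

def h_sat(r, K=1.0):
    """Expansive inverse of g_sat: h(r) = K * r / (1 - r), for 0 <= r < 1."""
    return K * r / (1.0 - r)

x = np.array([0.0, 0.1, 1.0, 10.0, 100.0])
roundtrip_log = h_log(g_log(x))
roundtrip_sat = h_sat(g_sat(x, K=2.0), K=2.0)
```

In both cases applying h to a compressed response recovers the uncompressed value, which is the role the expansive function plays before the inputs are summed in the denominator. A squaring operation, by contrast, would invert only a square-root compression.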
Linking the compressive nonlinearity to MAX-like behavior
Whatever its source or purpose, a compressive nonlinearity profoundly affects the joint distributions of the measured sensory variables as well as the form of the optimal combination rule (Figure 13). It also points to an unexpected connection between compressed sensory variables and combination rules with MAX-like behavior. Consider a cue-combination system whose input variables have been compressed by a sublinear function g(x), where g(x) can be thought of as a power function with a gradually decreasing exponent; both g(x) = log(1 + x) and g(x) = x/(x + K) have this property, whereas g(x) = x^{1/2} does not, since the exponent (1/2) is constant. If the combination rule is a function of a sum of the original uncompressed variables,
P(T∣x_1, x_2, …, x_N) = f(x_1 + x_2 + … + x_N),
then the combination rule plotted in the space of the original variables x_i will have straight diagonal contours; this is true by definition. The output function f(x) changes only the contour spacing. In contrast, when the combination rule is expressed in terms of the sensory variables after they have been compressed, that is, where each original coordinate x_i has been replaced by g(x_i), then the combination rule will have iso-response contours more like those in Figures 8, 13B, and 13C (but not 13D), progressing from close to diagonal near the origin to squarish as any one of the compressed variables grows large. Such a “LINMAX” progression of contours can be modeled as a family of curves of constant p-norm,
‖x‖_p = (∣x_1∣^p + ∣x_2∣^p + … + ∣x_N∣^p)^{1/p},
with steadily increasing values of p. Depending on the range of p values encountered, which depends on the function g(x), the combination rule is approximately a function of the Manhattan (city-block) length of the input vector (p = 1) when the inputs are all weak; the Euclidean length (p = 2) in intermediate zones; and the MAX of the inputs (p = ∞) when any one or more of the input variables grows large.
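The progression from sum-like to MAX-like contours can be checked numerically. The sketch below assumes the saturating form g(x) = x/(x + 1) and a combination rule that depends only on the sum s of the uncompressed variables. For each s it compares two points on the same iso-response contour in compressed coordinates, a balanced pair (a, a) and a single cue (b, 0), and solves (a^p + a^p)^{1/p} = b for the exponent p that a p-norm contour through both points would need. The function names are hypothetical.

```python
import math

def g(x):
    """Assumed compressive function g(x) = x / (x + 1)."""
    return x / (x + 1.0)

def effective_p(s):
    """p-norm exponent matching the contour for uncompressed sum s.

    (a, a) and (b, 0) lie on the same contour when both have uncompressed
    sum s, so a = g(s / 2) and b = g(s). Setting 2**(1/p) * a = b gives
    p = ln 2 / ln(b / a).
    """
    a = g(s / 2.0)
    b = g(s)
    return math.log(2.0) / math.log(b / a)

sums = [0.01, 0.1, 1.0, 10.0, 100.0]
exponents = [effective_p(s) for s in sums]
```

For this g, b/a = (s + 2)/(s + 1), so the exponent approaches 1 as s goes to 0 and grows without bound as s becomes large, which is the LINMAX progression described above.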
Thus one effect of the compressive nonlinearity is to make a single strong cue more powerful than several weak cues when the cue vectors are equated for length, usually taken to mean the Manhattan (1-norm) or Euclidean length. In the context of edge detection, a LINMAX rule can be understood intuitively in this way: When all of the available edge detectors at a point in an image are responding weakly, corresponding to the linear range of g(x) along each cue dimension, the combination rule P(ON∣r_1, r_2, …) depends on the average of the detector responses (we may say this since the average is proportional to the sum). Averaging is an appropriate strategy when the goal is to suppress independent noise. In the MAX-like regime, on the other hand, a strong response in any single edge channel corresponds to an overridingly strong distal edge value in that channel, owing to the range compression that has previously occurred. Assuming that an edge is almost certain to be present at an image location based on a single strong cue, the edge probability has little room to increase as other cues are added. The MAX-like regime is thus a result of a probability ceiling effect.
The difference between linear and MAX-like modes of cue combination becomes exaggerated as the number of cues increases. For example, if 5 cues are available and each has a value ranging from 0 to 1, a cue vector (1, 0, 0, 0, 0) is as strong a combination as cue vector (1, 1, 1, 1, 1) according to a MAX rule, while a linear rule would greatly overestimate the strength of the second input in this case. Likewise, a vector of balanced weak cues (0.05, 0.05, 0.05, 0.05, 0.05) would be as powerful a cue combination as (0.25, 0, 0, 0, 0) for a linear rule, but would have its strength greatly underestimated by a pure MAX rule. The lesson here is that when input variables have been range compressed, a fixed combination rule that depends inalterably on either the sum or the MAX of the cues will yield poor estimates of edge probability for some cue combinations. This provides the rationale for first applying an expansive nonlinearity to the input variables (the first stage of the SCF normalization) to convert a LINMAX combination rule into a simpler linear one.
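The arithmetic behind these comparisons is easy to tabulate. The sketch below scores a few five-cue vectors under a pure sum rule and a pure MAX rule; the dictionary keys are hypothetical labels added for readability.

```python
# Five-cue vectors, each cue in the range 0 to 1.
cue_vectors = {
    "one strong": [1.0, 0.0, 0.0, 0.0, 0.0],
    "all strong": [1.0, 1.0, 1.0, 1.0, 1.0],
    "balanced weak": [0.05, 0.05, 0.05, 0.05, 0.05],
}

# Score every vector under both fixed rules.
scores = {
    name: {"sum": sum(vec), "max": max(vec)}
    for name, vec in cue_vectors.items()
}
```

A MAX rule cannot distinguish "one strong" from "all strong" (both score 1), whereas a sum rule rates the second five times higher. Conversely, the balanced weak vector sums to 0.25 but has a MAX of only 0.05, so the two rules disagree by a factor of five about its strength.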
The connection between range compression of sensory cues and LINMAX combination rules has interesting implications for neural integration of sensory signals, and for cue combination at the perceptual level. Range compression through a logarithmic or similar transform is a common feature of biological sensory systems, suggesting that a LINMAX pattern of summation of sensory cues might be found at the single neuron level. This possibility could be tested in a cortical neuron that receives input from two or more independent sensory channels, for example, color-luminance cells in primate V1 (Johnson, Hawken, & Shapley, 2001). Assuming an idealized LINMAX combination rule and a well-behaved output function (e.g., corresponding to a typical f-I curve), the firing rate of such a cell should increase roughly linearly with the superposition of weak sensory cues, but proportionally less as progressively stronger cues are combined, and should show response saturation for even a single strong cue. As a corollary, wherever multiple cues are integrated, one strong cue should be a more effective stimulus than multiple weak cues, assuming the cue vectors are equated using standard 1-norm or 2-norm measures.
As a caveat, given uncertainties in the precise form of g(x) and the output function f(x), the strongest predictions regarding a neuron's response to multiple cues derive from the shapes of the iso-response contours. This is because the contours of the combination rule can be determined experimentally without specific knowledge of either f or g, as long as the input cue values r_1, r_2, …, r_N are known and can be manipulated. More problematic are predictions framed in terms of a neuron's “summation arithmetic” (i.e., whether summation of responses to two or more stimuli is linear, sublinear, MAX-like, etc.). Response magnitudes depend in nontrivial ways on both f and g and interactions between them, and these functions may be difficult to determine in specific cases. Furthermore, summation arithmetic can be profoundly affected by competitive interactions among stimuli presented simultaneously within a cell's receptive field, including division-like normalizing operations of the kind under consideration here. In contrast to measures of summation arithmetic, the iso-response contours of a cue-combination cell can be identified without knowing f or g. Iso-response contours can even remain stable after the input cues have been divisively normalized, as long as the contours in both class-conditional joint distributions start out all the same shape. (This holds, for example, in the case of diagonal contours, as shown in the transition from Figure 9B to Figure 9C.)
The arithmetic of response summation for two stimuli presented separately and together has been examined in the monkey striate and extrastriate cortex, though almost always in the context of spatial summation of stimuli presented at different classical RF locations (Movshon, 1978; Reynolds, Chelazzi, & Desimone, 1999; Gawne & Martin, 2002; Lampl, Ferster, Poggio, & Riesenhuber, 2004). Though the applicability of these data to cue combination is tenuous, the data nonetheless contain useful lessons. As previously mentioned, pair interactions are often competitive, meaning that the response to a pair of stimuli lies between the responses to the two individual stimuli. Competitive interactions, incidentally, can be symptomatic of a divisive normalization process. Beyond the frequent reports of competition, however, no single or simple pattern of spatial summation appears to apply to all neurons or all stimuli. Reynolds et al. (1999) found that the combined response to a pair of stimuli in V2 and V4 was the average of the two individual responses, on average. But they also showed that a wide variety of actual outcomes underlay the “average on average” property, ranging from MIN and below to MAX and above. Lampl et al. (2004) measured subthreshold summation in V1 neurons, concluding that summation was a MAX on average, that is, the combined response was about equally often larger or smaller than the maximum of the two individual responses. Other authors have reported a wide range of summation outcomes as well, with MAX-like summation in a subset of the cases (Gawne & Martin, 2002; Avillac, Ben Hamed, & Duhamel, 2007). The lack of a clear pattern in these data as a whole could reflect the concern expressed above that predictions of response magnitudes and response summation arithmetic are beset by uncertainties, whereas predictions based on the shapes of iso-response contours may prove easier to interpret. These issues will have to be resolved through further experimentation, including experiments specifically designed to test cue-combination rules.
If neurons do exhibit LINMAX summation, perception should perhaps follow a similar pattern. Taking perceptual salience of an edge as a measure of P(ON∣cues), a LINMAX rule implies that the salience of an edge should increase substantially with the superposition of a set of weak cues, but with a progressively lower exponent as the cues grow stronger (Figure 14).
The normalization pool: How many filters is best?
For simplicity, we applied the SCF model to decorrelate and combine just two edge cues. Estimates of overall edge probability would likely be improved by including additional colocalized edge cues, most obviously the intensity channel, which is the first principal component of natural image spectra. The performance improvement expected with the inclusion of additional cues would flow from two sources. First, the larger the number of valid cues that are available, the greater the mutual information between the cue vector and the edge indicator variable. In the limit where many highly informative cues are available, P(ON∣cues) would approach a binary-valued function indicating complete certainty about the presence or absence of an edge in all situations. This type of performance improvement stems from the progressive divergence of the two class-conditional distributions P(cues∣ON) and P(cues∣OFF) as new valid cue dimensions are added. The improvement would obtain whether or not a common factor was included in the generative model for edges. In the realistic case where a common factor exists, a second source of improved performance arises from better estimates of the common factor using the larger number of available cues (see Equations 21, 22, and surrounding text).
These two distinct sources of improvement in edge-detection performance point to a dissociation between the use of cues to decide whether an edge is present based on local spatial structure, and the use of cues to estimate the common factor; different sets and numbers of filters could contribute to these two processes. For example, whereas in our experiments both computations were based on just two colocalized edge cues, the common factor estimate could in principle be based on a much larger pool of edge detector outputs from the surrounding neighborhood. The question thus arises: What is the optimal collection of filters from which to estimate the common factor modulating two or more colocalized edge channels? Wainwright et al. (2000) have suggested that the higher-order correlations among filters in natural images may have a hierarchical structure, involving separable contributions at different spatial scales. Karklin and Lewicki (2003, 2005) showed that the distribution of common factors (i.e., filter variances) across space and orientation could be modeled as a linear combination of variance basis functions, just as basis functions are used to generate image patches in conventional applications of ICA. In both of these approaches, the “scale parameter” at any given point/orientation derives from a substantial block of image data, and potentially the entire image. As the region from which a local scale parameter is computed increases in size, however, two effects could actually worsen estimates, especially if, for reasons of biological computational parsimony, they rely on relatively simple averaging methods such as Equation 21. First, common factors are regional, but the regions needn't be large. Thus, the spatial scale of the pool should match the scale over which lighting conditions, textural variations, and other common factor–inducing processes typically change. Considering that texture changes occur on the scale of object surfaces, which can be small and have sharp borders, and lighting conditions can also vary on a fine spatial scale (consider dappled sunlight under a forest canopy), pressure exists to keep simple average-based common factor estimates quite localized, as was done in our experiments. A second effect is that the statistical structure of natural images produces spatial correlations between filters that are not straightforward to factor out. For example, unlike two colocalized, co-oriented edge detectors in different color channels, which are roughly independent both ON and OFF edges, two filters at 180° shifted phases, or at orthogonal orientations, or in colinear arrangements, can have strong positive or negative correlations as the case may be, both ON and OFF edges. These departures from independence in the surrounding pool of filters will inevitably complicate the common factor estimate, perhaps even pushing the necessary computations beyond the capabilities of a realistic neural circuit. This again highlights the potential advantage of restricting the pool to contain only those filters whose outputs are, as a group, related to the unknown common factor in an easy-to-compute way. That a small number of filter values can provide a good estimate of the common factor is supported by the fact that the SCF model was able to eliminate most of the higher-order correlations between RG and BY edges in the color edge data set using only the two cues themselves. Further research will be needed to determine how neural circuits have managed the tradeoff between the need for effective normalization processes and the need for tractable neural algorithms.
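The benefit of a larger pool under idealized conditions can be illustrated with a simulation. This is a sketch under strong assumptions that the discussion above explicitly questions for real images: every filter in the pool shares exactly the same common factor, the underlying values are independent unit-mean exponentials, and the compressive nonlinearity has already been inverted. The estimator is a plain average over the pool, used here as a stand-in for a simple averaging method and not as a reproduction of Equation 21; all names are hypothetical.

```python
import numpy as np

def common_factor_error(pool_size, n_trials=20_000, seed=0):
    """Mean relative error of an average-based common factor estimate."""
    rng = np.random.default_rng(seed)
    c = rng.exponential(1.0, size=n_trials)               # true common factor
    e = rng.exponential(1.0, size=(n_trials, pool_size))  # unit-mean values
    responses = c[:, None] * e                            # uncompressed outputs
    c_hat = responses.mean(axis=1)                        # average over the pool
    return float(np.mean(np.abs(c_hat - c) / c))

error_pool_2 = common_factor_error(2)
error_pool_32 = common_factor_error(32)
```

In this idealized setting the error shrinks as the pool grows. The simulation deliberately omits the two complications raised above, namely a common factor that varies across the pooled region and statistical dependencies among the pooled filters, either of which would erode the advantage of a large pool.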
Appendix A
Computing the joint density of two edge cues from the SCF model
We begin with the cumulative joint distribution of r_1 and r_2:
The joint density of r_1 and r_2 can be obtained through the partial derivative of the cumulative distribution:
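As a sketch of what these two steps look like, suppose, as stated in this appendix, that R_i = C e_i with C and e_i independent exponential variables, and that the measured response is r_i = g(R_i). The rate symbols p and q, the monotonic inverse h = g^{-1}, and the integral form below are a reconstruction for illustration only and are not guaranteed to match the exact form of Equations A1 and A2.

```latex
\begin{aligned}
F(r_1, r_2) &= \Pr\bigl(g(C e_1) \le r_1,\; g(C e_2) \le r_2\bigr) \\
            &= \int_0^\infty p\,e^{-pC}\,
               \bigl(1 - e^{-q\,h(r_1)/C}\bigr)\bigl(1 - e^{-q\,h(r_2)/C}\bigr)\,dC, \\[4pt]
P(r_1, r_2) &= \frac{\partial^2 F}{\partial r_1\,\partial r_2}
             = h'(r_1)\,h'(r_2)\int_0^\infty p\,e^{-pC}\,\frac{q^2}{C^2}\,
               e^{-q\,[h(r_1) + h(r_2)]/C}\,dC .
\end{aligned}
```

The first line conditions on the common factor C, given which the two channels are independent; differentiating once with respect to each response then yields the joint density, with the factors h'(r_i) arising from the change of variables through the compressive nonlinearity.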
Fitting the SCF model to the data using only one parameter per joint distribution
To fit the generative model (Equations 12–16), we maximized the likelihood,
in which the joint density function P() is defined as in Equation A2. Notice that R_i = C * e_i, and C and e_i are both exponentially distributed random variables. An exponential variable with mean λ is equivalent to a constant λ times an exponentially distributed variable with mean 1, so that the distribution of the product C * e is invariant as long as the product of the two means, 1/(p * q), remains constant (or likewise the reciprocal of the mean, p * q). This extra degree of freedom makes it possible to fix p and vary only q. Furthermore, given that the saturating nonlinearity g() in the SCF model can be expressed as g(x/K), where the parameter K simply scales the input variable, the flexibility provided by K can be subsumed by the scaling factor attached to C * e. The distribution of the output g(C * e) is therefore invariant as long as the product p * q * K remains constant. This allowed us to fix p = K = 1 during the optimization process while varying only q.
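The reparameterization argument above can be checked numerically. The following is a minimal sketch, not the authors' code: the function `saturate` is a hypothetical stand-in for g() (the actual nonlinearity is defined with the SCF model in the main text), p and q are treated as exponential rates (means 1/p and 1/q), and the parameter values are arbitrary. It builds the outputs g(C * e / K) from the same unit-mean exponential draws under two parameter settings with equal product p * q * K and confirms that they coincide.

```python
import random

def saturate(x):
    """Hypothetical saturating nonlinearity standing in for g(); an assumption."""
    return x / (1.0 + x)

def scf_outputs(p, q, K, unit_c, unit_e):
    """Return g(C * e / K), with C ~ Exp(rate p) and e ~ Exp(rate q).

    An exponential variable with rate p is a unit-mean exponential draw
    divided by p, so both variables are built from shared unit-mean draws.
    """
    return [saturate((uc / p) * (ue / q) / K) for uc, ue in zip(unit_c, unit_e)]

rng = random.Random(0)
unit_c = [rng.expovariate(1.0) for _ in range(10000)]
unit_e = [rng.expovariate(1.0) for _ in range(10000)]

# Arbitrary illustrative parameters with p * q * K = 6.
full = scf_outputs(p=0.5, q=3.0, K=4.0, unit_c=unit_c, unit_e=unit_e)
# Same product, but with p = K = 1 and everything absorbed into q.
reduced = scf_outputs(p=1.0, q=6.0, K=1.0, unit_c=unit_c, unit_e=unit_e)

max_gap = max(abs(a - b) for a, b in zip(full, reduced))
print("largest difference between the two parameterizations:", max_gap)
```

Because the two settings give the same outputs sample by sample (up to floating-point rounding), no data set could distinguish them, which is why only one of the three parameters needs to be varied.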
We minimized the negative log-likelihood:
Filter response distributions collected ON and OFF edges were fitted separately using 120,000 data points in each case, resulting in the parameters q_ON and q_OFF as shown in Table 1.
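A one-parameter fit of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for Table 1: the saturating nonlinearity is omitted, so that r_i = C * e_i with C exponential at rate p = 1 and e_i exponential at rate q; the joint density is obtained by numerically integrating over C rather than from Equation A2; the data are synthetic; and a simple grid search stands in for the optimizer. The names `joint_nll` and `fit_q` are hypothetical.

```python
import numpy as np

def joint_nll(q, r1, r2, p=1.0, n_grid=400):
    """Negative log-likelihood of paired responses under a simplified model.

    Assumed model (no saturation): given C, each r_i is exponential with
    rate q / C, and C ~ Exp(rate p). The joint density is therefore
    integral over C of  p e^{-pC} (q/C)^2 e^{-q (r1 + r2) / C}.
    """
    log_c = np.linspace(np.log(1e-8), np.log(1e2), n_grid)
    c = np.exp(log_c)
    step = log_c[1] - log_c[0]
    s = (r1 + r2)[:, None]
    # Integrand times C (change of variable to log C), computed in log space.
    log_vals = np.log(p) - p * c + 2.0 * np.log(q) - np.log(c) - q * s / c
    vals = np.exp(log_vals)
    # Trapezoid rule on the uniform log C grid.
    density = step * (vals.sum(axis=1) - 0.5 * (vals[:, 0] + vals[:, -1]))
    return -np.sum(np.log(density))

def fit_q(r1, r2, candidates):
    """Return the candidate q with the smallest negative log-likelihood."""
    nll = np.array([joint_nll(q, r1, r2) for q in candidates])
    return candidates[int(np.argmin(nll))], nll

# Synthetic data from the assumed model with a known q.
rng = np.random.default_rng(0)
n, q_true = 2000, 2.0
common = rng.exponential(1.0, n)               # shared factor C, rate p = 1
r1 = common * rng.exponential(1.0 / q_true, n)  # NumPy takes the mean, 1/q
r2 = common * rng.exponential(1.0 / q_true, n)

candidates = np.linspace(1.0, 4.0, 31)
q_hat, nll = fit_q(r1, r2, candidates)
print("true q:", q_true, "estimated q:", q_hat)
```

In this simplified setting the estimate should land near the generating value; fitting ON-edge and OFF-edge samples separately would simply mean calling `fit_q` once per sample set.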
Computing the most probable common factor
We start with an expression for the posterior probability:
Using Equations 12 through 16 and the parameters from Table 1, and by setting the partial derivative with respect to C to 0, we have
Equation A6 has no straightforward explicit solution. However, given that all variables in Equation A6 are positive, we note that it can only be equal to 0 if the two terms in parentheses have opposite signs:
The solution C to Equation A7, and hence to Equation A6, lies in between the solutions of the following two equations:
which are
Given that in natural images q_OFF > q_ON and P(OFF) ≫ P(ON), we expect the solution to Equation A6 to be close to C_2,
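The bracketing argument can be illustrated with a small numerical sketch. It rests on assumptions that are not taken from the paper: the saturating nonlinearity is dropped, so that given C each cue is exponential with rate q/C; the prior on C is exponential with rate p = 1; and the values of q_ON, q_OFF, P(ON), and the two responses are hypothetical (they are not the Table 1 values). Under these assumptions each single-class term has its mode at the positive root of p*C^2 + 2*C - q*(r_1 + r_2) = 0, and the mode of the ON/OFF mixture posterior, found here by grid search, should fall between the two single-class modes.

```python
import math

def single_class_mode(q, s, p=1.0):
    """Mode of P(C) * P(r1, r2 | C) for one class in the simplified model.

    Setting d/dC [-p*C - 2*log(C) - q*s/C] = 0 gives p*C^2 + 2*C - q*s = 0.
    """
    return (-1.0 + math.sqrt(1.0 + p * q * s)) / p

def unnormalized_posterior(c, s, q_on, q_off, prior_on, p=1.0):
    """P(C) times the ON/OFF mixture likelihood of the summed responses s."""
    def likelihood(q):
        return (q / c) ** 2 * math.exp(-q * s / c)
    prior_c = p * math.exp(-p * c)
    mixture = prior_on * likelihood(q_on) + (1.0 - prior_on) * likelihood(q_off)
    return prior_c * mixture

# Hypothetical values chosen so that q_off > q_on and P(OFF) >> P(ON).
q_on, q_off, prior_on = 1.0, 4.0, 0.05
s = 0.8 + 1.5  # r1 + r2 for one illustrative pair of responses

c_on = single_class_mode(q_on, s)
c_off = single_class_mode(q_off, s)

# Log-spaced grid search over C from 1e-3 to 1e2.
grid = [10 ** (-3 + 5 * k / 4000) for k in range(4001)]
c_map = max(grid, key=lambda c: unnormalized_posterior(c, s, q_on, q_off, prior_on))

print("ON-only mode:", c_on, "OFF-only mode:", c_off, "mixture mode:", c_map)
```

The grid search is deliberately run over a range much wider than the bracket, so that finding the maximum inside it is a check on the argument rather than a consequence of the search limits.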
Acknowledgments
Thanks to Gary Holt, co-designer of the PD filter, for useful discussions in the early phases of this work. Thanks to Fritz Sommer, Allan Yuille, the anonymous reviewers, and the editor for critical comments on the manuscript, to Elizabeth Johnson for the cone-isolating stimulus method, and to Chait Ramachandra for invaluable technical assistance. This work was funded through grants from the Office of Naval Research, Army Research Office, National Science Foundation, and National Institutes of Health.
Commercial relationships: none.
Corresponding author: Bartlett W. Mel.
Address: Second Sight Medical Products, Inc., Sylmar, CA, USA.
References
Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast-response function of simple cells in the visual cortex. Visual Neuroscience, 7, 531–546.
Avillac, M., Ben Hamed, S., & Duhamel, J. R. (2007). Multisensory integration in the ventral intraparietal area of the macaque monkey. Journal of Neuroscience, 27, 1922–1932.
Badcock, D. R., & Westheimer, G. (1985). Spatial location and hyperacuity: The centre/surround localization contribution function has two substrates. Vision Research, 25, 1259–1267.
Balboa, R. M., & Grzywacz, N. M. (2003). Power spectra and distribution of contrasts of natural images from different habitats. Vision Research, 43, 2527–2537.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bishop, C. M. (1996). Neural networks for pattern recognition. Oxford: Oxford University Press.
Bonds, A. B. (1989). Role of inhibition in the specification of orientation selectivity of cells in the cat striate cortex. Visual Neuroscience, 2, 41–55.
Bonin, V., Mante, V., & Carandini, M. (2005). The suppressive field of neurons in lateral geniculate nucleus. Journal of Neuroscience, 25, 10844–10856.
Bülthoff, H. H., & Mallot, H. A. (1998). Integration of depth modules: Stereo and shading. Journal of the Optical Society of America A, Optics and Image Science, 5, 1749–1758.
Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17, 8621–8644.
Derrington, A. M., & Badcock, D. R. (1985). Separate detectors for simple and complex grating patterns? Vision Research, 25, 1869–1878.
Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415, 429–433.
Fine, I., MacLeod, D. I. A., & Boynton, G. M. (2003). Surface segmentation based on the luminance and color statistics of natural scenes. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 20, 1283–1291.
Frome, F. S., Buck, S. L., & Boynton, R. M. (1981). Visibility of borders: Separate and combined effects of color differences, luminance contrast, and luminance level. Journal of the Optical Society of America, 71, 145–150.
Gawne, T. J., & Martin, J. M. (2002). Responses of primate visual cortical V4 neurons to simultaneously presented stimuli. Journal of Neurophysiology, 88, 1128–1135.
Geisler, W. S., & Albrecht, D. G. (1992). Cortical neurons: Isolation of contrast gain control. Vision Research, 32, 1409–1410.
Geisler, W. S., & Albrecht, D. G. (1997). Visual cortex neurons in monkeys and cats: Detection, discrimination, and identification. Visual Neuroscience, 14, 897–919.
Grenander, U. (1996). Elements of pattern theory. Baltimore: Johns Hopkins University Press.
Gray, R., & Regan, D. (1997). Vernier step acuity and bisection acuity for texture-defined form. Vision Research, 37, 1717–1723.
Grzywacz, N. M., & Yuille, A. L. (1990). A model for the estimate of local image velocity by cells in the visual cortex. Proceedings of the Royal Society of London B: Biological Sciences, 239, 129–161.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197.
Hyvarinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Jacobs, R. A. (1995). Methods for combining experts' probability assessments. Neural Computation, 7, 867–888.
Johnson, E. N., Hawken, M. J., & Shapley, R. (2001). The spatial transformation of color in the primary visual cortex of the macaque monkey. Nature Neuroscience, 4, 409–416.
Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network, 14, 483–499.
Karklin, Y., & Lewicki, M. S. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17, 397–423.
Knill, D. C. (1998). Ideal observer perturbation analysis reveals human strategies for inferring surface orientation from texture. Vision Research, 38, 2635–2656.
Konishi, S., Yuille, A. L., Coughlan, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 57–74.
Lampl, I., Ferster, D., Poggio, T., & Riesenhuber, M. (2004). Intracellular measurements of spatial integration and the MAX operation in complex cells of the cat primary visual cortex. Journal of Neurophysiology, 92, 2704–2713.
Landy, M. S., & Kojima, H. (2001). Ideal cue combination for localizing texture-defined edges. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 18, 2307–2320.
Landy, M. S., Maloney, L. T., Johnston, E. B., & Young, M. (1995). Measurement and modeling of depth cue combination: In defense of weak fusion. Vision Research, 35, 389–412.
Laughlin, S. (1981). A simple coding procedure enhances a neuron's information capacity. Zeitschrift für Naturforschung C: Biosciences, 36, 910–912.
Ledgeway, T., & Smith, A. T. (1994). Evidence for separate motion-detecting mechanisms for first- and second-order motion in human vision. Vision Research, 34, 2727–2740.
Liang, Y., Simoncelli, E. P., & Lei, Z. (2000). Color channels decorrelation by ICA transformation in the wavelet domain for color texture analysis and synthesis. Computer Vision and Pattern Recognition, 1, 1606–1611.
Liow, Y. T. (1991). A contour tracing algorithm that preserves common boundaries between regions. CVGIP, 3, 313–321.
Maloney, L. T. (1999). Physics-based approaches to modeling surface color perception (pp. 387–422). Cambridge: Cambridge University Press.
Maloney, L. T., & Yang, J. N. (2004). The illuminant estimation hypothesis and surface color perception. Oxford: Oxford University Press.
Mamassian, P., Landy, M. S., & Maloney, L. T. (2002). Bayesian modelling of visual perception. In R. Rao, B. Olshausen, & M. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 13–36). Cambridge: MIT Press.
Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. International Conference on Computer Vision. Vancouver, BC, Canada.
Nelson, S. B. (1991). Temporal interactions in the cat visual system: I. Orientation-selective suppression in the visual cortex. Journal of Neuroscience, 11, 344–356.
Ohzawa, I., Sclar, G., & Freeman, R. D. (1982). Contrast gain control in the cat visual cortex. Nature, 298, 266–268.
Olman, C., & Kersten, D. (2004). Classification objects, ideal observers & generative models. Cognitive Science, 28, 227–239.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Parra, L., Spence, C., & Sajda, P. (2000). High-order statistical properties arising from the nonstationarity of natural signals. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems (13, pp. 782–792). Cambridge: MIT Press.
Porrill, J., Frisby, J. P., Adams, W. J., & Buckley, D. (1999). Robust and optimal use of information in stereo vision. Nature, 397, 63–66.
Reichardt, W., & Poggio, T. (1979). Figure-ground discrimination by relative movement in the visual system of the fly. Biological Cybernetics, 35, 81–100.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. Journal of Neuroscience, 19, 1736–1753.
Rivest, J., Boutet, I., & Intriligator, J. (1997). Perceptual learning of orientation discrimination by more than one attribute. Vision Research, 37, 273–281.
Rivest, J., & Cavanagh, P. (1996). Localizing contours defined by more than one attribute. Vision Research, 36, 53–66.
Ruderman, D. L., Cronin, T. W., & Chiao, C. (1998). Statistics of cone responses to natural images: Implications for visual coding. Journal of the Optical Society of America A, 15, 2036–2045.
Saunders, J. A., & Knill, D. C. (2001). Perception of 3D surface orientation from skew symmetry. Vision Research, 41, 3163–3183.
Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.
Scott-Samuel, N. E., & Georgeson, M. A. (1999). Does early non-linearity account for second-order motion? Vision Research, 39, 2853–2865.
Wainwright, M. J., Schwartz, O., & Simoncelli, E. P. (2001). Natural image statistics and divisive normalization: Modeling nonlinearity and adaptation in cortical neurons. In R. Rao, B. Olshausen, & M. Lewicki (Eds.), Statistical theories of the brain. Cambridge: MIT Press.
Wainwright, M. J., & Simoncelli, E. P. (2000). Scale mixture of Gaussians and the statistics of natural images. In S. Solla, T. Leen, & K. R. Muller (Eds.), Advances in neural information processing systems (pp. 855–861). Cambridge: MIT Press.
Wainwright, M. J., Simoncelli, E. P., & Willsky, A. S. (2000). Random cascades of Gaussian scale mixture and their use in modeling natural images with application to denoising. Proceedings of the 7th International Conference on Image Processing (pp. 260–263). Vancouver, BC, Canada.
Wandell, B. A. (1995). Foundations of vision. Sunderland, MA: Sinauer Associates, Inc.
Wegman, B., & Zetsche, C. (1990). Statistical dependence between orientation filter outputs used in a human-vision-based image code. Visual Communications and Image Processing, 1360, 909–922.
Wilson, H. R., & Kim, J. (1994). A model for motion coherence and transparency. Visual Neuroscience, 11, 1205–1220.
Wilson, J. A., Ing, A. D., & Geisler, W. S. (2006). Chromatic differences within surfaces and across surface boundaries [Abstract]. Journal of Vision, 6(6), 559.
Yuille, A., & Bülthoff, H. H. (1996). Bayesian decision theory and psychophysics. In D. Knill & W. Richards (Eds.), Perception as Bayesian inference (pp. 123–161). Cambridge: Cambridge University Press.
Zetzsche, C., & Röhrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350.