Figure 3 shows the empirical distribution of cue responses at a single scale (r = 5% contour length) for 50,000 points sampled from the human-labeled boundaries. We plot only the distributions for positive values of each cue; because every boundary point contributes two values of equal magnitude and opposite sign, the distributions of negative values are identical with the roles of figure and ground reversed. Note that the marginal distribution of contour orientations is not uniform: the greater prevalence of horizontal (LowerRegion = 1) and vertical (LowerRegion = 0) boundaries is consistent with previous results on the statistics of brightness edges in natural images (Switkes, Mayer, & Sloan, 1978).
These histograms show that figural regions in natural scenes tend to be smaller, more convex, and to lie below the ground regions. For example, when the two regions have equal size, Size(p) = log(Area₁/Area₂) = 0, they are equally likely to be figure; when one region is larger, Size(p) > 0, it is more common that the larger region is ground. All three cues uniformly differentiate figure and ground on average, in agreement with psychophysical demonstrations of the corresponding Gestalt cues (Kanizsa & Gerbino, 1976; Metzger, 1953; Rubin, 1921; Vecera et al., 2002). At 5% contour length, we estimate the mutual information (Cover & Thomas, 1991) between each cue and the true label to be 0.047, 0.075, and 0.018 bits for Size, LowerRegion, and Convexity, respectively.
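For readers who want to reproduce this kind of estimate, the following is a minimal sketch assuming a simple histogram-binning estimator; the bin count, array names, and usage lines are illustrative choices rather than details from the paper:

```python
import numpy as np

def mutual_information(cue, label, n_bins=32):
    """Histogram estimate of I(cue; label) in bits, for a continuous cue
    response and a binary figure/ground label. The bin count is an
    illustrative choice, not taken from the paper."""
    cue = np.asarray(cue, dtype=float)
    label = np.asarray(label, dtype=int)
    edges = np.histogram_bin_edges(cue, bins=n_bins)
    joint = np.empty((n_bins, 2))
    for y in (0, 1):                       # ground = 0, figure = 1
        joint[:, y], _ = np.histogram(cue[label == y], bins=edges)
    p_xy = joint / joint.sum()             # joint P(cue bin, label)
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal over cue bins
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal over labels
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

# Hypothetical usage with the Size cue, Size(p) = log(Area1/Area2):
# mi_size = mutual_information(np.log(area1 / area2), is_figure)
```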
To further gauge the relative power of these three cues, we framed the problem of figure–ground assignment as a discriminative classification task: “With what accuracy can a cue predict the correct figure–ground labeling?”
For individual cues, it is clear from Figure 3 that the optimal strategy is to always report the smaller, more convex, or lower region as figure. To combine multiple cues, we fit a logistic function, which takes a linear combination of the cue responses at point p, arranged into a vector c(p) along with a constant offset, and applies a sigmoidal nonlinearity. The classifier outputs a value in [0, 1] that is an estimate of the likelihood that a segment is figural; in the classification setting, we declare a segment to be figure if this likelihood is greater than 0.5. The model parameters β were fit using iteratively reweighted least squares to maximize the likelihood of the training data (Hastie, Tibshirani, & Friedman, 2001). We also considered models that attempted to exploit nonlinear interactions between the cues, such as logistic regression with quadratic terms and nonparametric density estimation, but found no significant gains in performance over the simple linear model.
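A minimal sketch of this cue-combination scheme follows, using synthetic cue responses and scikit-learn's logistic solver in place of the authors' iteratively reweighted least squares fit; all array names and coefficients are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# c(p) per point: [Size, LowerRegion, Convexity]; y = 1 if that side is
# figure. The cue responses below are synthetic placeholders.
rng = np.random.default_rng(0)
C = rng.normal(size=(50_000, 3))
y = (C @ [0.8, 1.0, 0.4] + rng.normal(size=50_000) > 0).astype(int)

# P(figure | c(p)) = sigmoid(beta . c(p) + beta0); sklearn's solver
# stands in for the paper's iteratively reweighted least squares fit.
model = LogisticRegression().fit(C, y)
p_figure = model.predict_proba(C)[:, 1]    # likelihood in [0, 1]
pred = p_figure > 0.5                      # declare figure above 0.5
print("correct classification rate:", (pred == y).mean())
```

With real cue vectors c(p) in place of the synthetic C, the printed rate corresponds to the correct classification rate reported below.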
Figure 4 shows the correct classification rate as a function of the analysis window radius for different combinations of cues. Values in the legend give the best classification rate achieved by each combination. The performance figures suggest that all three cues are predictive of figure–ground, with Size being the most powerful, followed by LowerRegion and Convexity. Combining the LowerRegion and Size cues yields better performance than either alone, indicating that each carries independent information. Adding Convexity when Size is already in use yields smaller gains because the two cues are closely related: a locally smaller region tends to be locally convex.
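The sweep summarized in Figure 4 can be sketched as follows; the radii, synthetic cue responses, and the assumption that separability grows with window size are illustrative stand-ins for the real per-radius measurements:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
cue_names = ["Size", "LowerRegion", "Convexity"]
y = rng.integers(0, 2, size=5_000)

for r in (2, 5, 10, 25):                   # window radius, % contour length
    # Placeholder cues whose separability grows with the radius; real
    # responses would be recomputed from the boundaries at each r.
    C = rng.normal(size=(5_000, 3)) + (0.2 + 0.02 * r) * y[:, None]
    for k in (1, 2, 3):
        for subset in combinations(range(3), k):
            acc = cross_val_score(LogisticRegression(), C[:, list(subset)],
                                  y, cv=3, scoring="accuracy").mean()
            names = "+".join(cue_names[i] for i in subset)
            print(f"r={r:>2}%  {names:<26}  acc={acc:.3f}")
```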
We found that increasing context past 25% contour length did not further improve the model performance. In fact, computing the relative Size, Convexity, and LowerRegion at the level of whole segments (100% context) yielded lower correct classification rates of 56.9%, 55.4%, and 59.5%, respectively. One explanation for the worse performance of global Size and Convexity is that natural scenes typically involve many interacting objects and surfaces. Object A may occlude object B, creating a contour whose local convexity cue is consistent with the figure–ground layout. However, the global convexity of the region composing A may well be affected by its relation to other objects C, D, E, and so forth, in a manner that is largely independent of the figure–ground relation between A and B.
At the most informative window radius, our combined model achieved a 74% correct classification rate, falling short of the human labeling consistency (96%). This gap is likely due to several sources of information absent from our local model that could have been exploited by human subjects viewing a whole image during labeling. First, integration of local noisy measurements along a contour should yield a consistent label for the entire contour; our feed-forward approach does not assume that grouping of contours has taken place before figure–ground binding begins. Second, we exclude junctions from our analysis. Junctions embody important information about the depth ordering of regions; however, they are quite difficult to detect locally in natural scenes (McDermott, 2004). Third, human subjects have access to important nonlocal and high-level cues such as symmetry (Bahnsen, 1928), parallelism (Metzger, 1953), and familiarity (Peterson, 1994; Rubin, 1921), which we have not considered here. Lastly, our model utilizes only the shape or configuration of the abutting regions, with no regard to the luminance content associated with each one. This ignores important local photometric evidence such as terminators signaling occlusion (von der Heydt & Peterhans, 1989) and cues to three-dimensional geometry such as texture and shading.