We investigated perceptual segmentation in the context of a perceived-orientation task. Stimuli were dot clusters formed by the union of a large elliptical sub-cluster and a secondary circular sub-cluster. We manipulated the separation between the two sub-clusters, their common dot density, and the size of the secondary sub-cluster. As the separation between sub-clusters increased, the orientation perceived by observers shifted gradually from the global principal axis of the entire cluster to that of the main sub-cluster alone. Thus, with increasing separation, the dots within the secondary sub-cluster were assigned systematically lower weights in the principal-axis computation. In addition, this shift occurred at smaller separations for higher dot densities—consistent with the idea that reliable segmentation is possible with smaller separations when the dot density is high. We propose that the visual system employs a robust statistical estimator in this task and that data points are weighted differentially based on the likelihood that they arose from a separate generative process. However, unlike in standard robust estimation, weights based on residuals are insufficient to characterize human segmentation. Rather, these must be computed based on more comprehensive generative models of dot clusters.

*aperture problem*). However, such pooling can also yield highly biased estimates if motion signals from two different objects are mixed—e.g., if along with the moving object of interest, the pooling also includes motion signals from a stationary background or a nearby object moving in the opposite direction (e.g., McDermott & Adelson, 2004; McDermott, Weiss, & Adelson, 2001). Similarly, object localization (e.g., for guiding saccades) requires that the visual information for one object be segregated from visual information corresponding to other objects before an estimate of location is computed (Cohen, Schnitzer, Gersch, Singh, & Kowler, 2007; Denisova, Singh, & Kowler, 2006; Melcher & Kowler, 1999; Vishwanathan, Kowler, & Feldman, 2000). Thus, the estimation of any visual property presupposes perceptual segmentation, the division into “perceptual groups.”

^{1}(Lánský, Yakimoff, & Radil, 1987; Lánský, Yakimoff, Radil, & Mitrani, 1989; Yakimoff, 1981; Yodogawa, 1985). The visual system's reliance on the principal axis has been demonstrated for dot clusters sampled from different distributions—including a uniform distribution on rectangular regions (Yakimoff, 1981) and bivariate Gaussian distributions (Lánský et al., 1987, 1989). There are also, however, secondary influences, such as a small bias toward the cardinal and the ±45° directions (Lánský et al., 1989). Not surprisingly, the precision of observers' orientation estimates decreases with increasing spread of the dots (i.e., decrease in correlation). However, Lánský et al. (1987) found that observers' performance in estimating orientation remained above chance even for correlations as low as 10%.

^{2}(Blum, 1973). Although the medial axis may indeed play a role, it is again noteworthy that Burbeck and Zauberman's results are also consistent with a principal-axis computation—a possibility they did not consider (see Cohen & Singh, 2006).

*μ* and variance *σ*^{2} of a Gaussian distribution. We can derive estimators that are unbiased and have minimum variance. If we were certain that the distribution of the data is Gaussian, it would be difficult to justify not using such estimators.

*ɛ* of trials, the data are drawn from the non-Gaussian distribution, and this contamination may manifest itself by the presence of evident outliers in the data.

*ɛ*. Robust statistics consists of methods that allow us to derive estimators that are close to optimal for an uncontaminated distribution but that are resistant to small amounts of contamination *ɛ* (Hampel, 1974; Huber, 1981). The “goodness” of each possible estimator is evaluated not only for the base distribution *f*(*x*) but also for a neighborhood of contaminated distributions (1 − *ɛ*)*f*(*x*) + *ɛh*(*x*), where *h*(*x*) is any distribution and 0 < *ɛ* ≤ *ɛ*_{0}. A typical choice of robust statistic is the one with the best “worst case” performance in this neighborhood (Huber, 1981).

*μ,* the mean of a data sample is the minimum-variance unbiased estimator of *μ* (Lehman, 1983, p. 84). If up to 1–2% of the data could be extreme values drawn from a second, unknown distribution, then a 10% trimmed mean (where we discard the smallest 5% of the sample and the largest 5% of the sample) would typically be a lower-variance estimator than the mean. If the outliers resulting from contamination are much larger in absolute value than the typical samples from the Gaussian, the 10% trimmed mean could have much lower variance than the ordinary mean. But we can do even better (in terms of lower variance) than the 10% trimmed mean if we use a Huber M-estimator that assigns different weights to points depending on how extreme they are (Maronna, Martin, & Yohai, 2006, pp. 25–29). We will illustrate such an estimator in the General discussion section.
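
The advantage of a trimmed mean under contamination is easy to check by simulation. The sketch below is illustrative only: the 2% contamination rate, the outlier scale, and the sample size are our assumptions, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def trimmed_mean(x, prop=0.05):
    """Mean after discarding the smallest and largest `prop` of the sample."""
    x = np.sort(x)
    k = int(len(x) * prop)
    return x[k:len(x) - k].mean()

n, reps, eps = 100, 5000, 0.02           # sample size, repetitions, contamination rate
means, tmeans = [], []
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n)          # base Gaussian data
    bad = rng.random(n) < eps            # ~2% of points become extreme outliers
    x[bad] = rng.normal(0.0, 20.0, bad.sum())
    means.append(x.mean())
    tmeans.append(trimmed_mean(x))

print(np.var(means), np.var(tmeans))     # the trimmed mean's variance is far lower
```

With no contamination the ordering reverses slightly: trimming a purely Gaussian sample costs a little efficiency, which is why robust estimators are described as close to, not exactly, optimal for the uncontaminated case.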

^{3}In terms of the influence function, one would thus expect that with increasing separation between the two sub-clusters, the influence function for high-density dot clusters would peak and fall sooner (i.e., at smaller spatial separations) than that for low-density dot clusters—because the high-density sub-clusters presumably require a smaller spatial separation to be reliably segmented. Our experiment manipulates both the separation between the two sub-clusters and their (common) dot density.

- The separation Δ between the two sub-clusters, as measured by the orthogonal distance of the center of the small circle from the major axis of the ellipse. The four values of Δ used were 1.52, 2.61, 3.7, and 4.79 dva.
- The mean density of dots in the sub-clusters. The three values used were 0.63, 1.26, and 1.89 dots per square dva. These density values determined the number of dots to be sampled within the elliptical and circular regions on any given trial.
- The diameter of the small circular region: 1.74 and 2.4 dva.
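
For concreteness, stimuli of this kind could be generated as follows. This is a sketch under stated assumptions: dots are placed uniformly within each region, dot counts are Poisson-sampled from area × density, and the ellipse semi-axes `a` and `b` are placeholders, since the exact ellipse dimensions are not restated in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_uniform_ellipse(a, b, n, center=(0.0, 0.0)):
    """n points uniform over an ellipse with semi-axes a, b (in dva)."""
    r = np.sqrt(rng.random(n))                  # sqrt gives uniform area density
    theta = rng.random(n) * 2.0 * np.pi
    return np.column_stack([a * r * np.cos(theta) + center[0],
                            b * r * np.sin(theta) + center[1]])

def make_stimulus(a, b, circle_diam, delta, density):
    """Main elliptical sub-cluster plus a secondary circular sub-cluster.

    delta   -- orthogonal distance of the circle's center from the
               ellipse's major axis (the x-axis here)
    density -- dots per square dva, common to both regions
    """
    n_ell = rng.poisson(density * np.pi * a * b)       # dot count = area x density
    rad = circle_diam / 2.0
    n_cir = rng.poisson(density * np.pi * rad ** 2)
    main = sample_uniform_ellipse(a, b, n_ell)
    secondary = sample_uniform_ellipse(rad, rad, n_cir, center=(0.0, delta))
    return main, secondary

main, secondary = make_stimulus(a=4.0, b=1.5, circle_diam=2.4,
                                delta=3.7, density=1.26)
```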

*x*-axis in each plot thus corresponds to the prediction of the “full-segmentation” hypothesis discussed above (recall Figure 3), namely, the orientation settings that would be predicted if observers treated the smaller sub-cluster as completely separate, and ignored it entirely in computing the principal axis of the dot cluster. The dashed lines in each plot depict the predictions of the “no-segmentation” hypothesis. The three dashed lines correspond to the three different levels of dot density and were computed based on the actual (sampled) dot configurations shown during the course of the experiment. These are thus the mean orientation settings that would be predicted if observers treated each cluster as a single, unsegmented perceptual unit and computed its principal axis by weighting all dots equally.

*x*-axis of each plot)—indicating that the dots within the small cluster are ignored almost entirely in these conditions.

*α*. For each trial, we determined the value of the partial weight *α* (to the dots within the smaller sub-cluster) that would yield the observed orientation setting on that trial. We did this by computing the orientation setting as a function of *α* numerically for many finely spaced values of *α* and selecting the value of *α* that led to (or best approximated) the observed setting on that trial. These estimated values of the weights *α* were then averaged within each given condition. Figure 8 plots these mean weights as a function of spatial separation, dot density, and secondary-cluster size. A value of 0 along the *y*-axis in Figure 8 corresponds to the prediction of the “full-segmentation” hypothesis, whereas a value of 1 corresponds to the prediction of the “no-segmentation” hypothesis, i.e., the principal axis of the entire cluster.
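
The per-trial estimation of *α* amounts to a one-dimensional grid search over a weighted principal-axis computation. A minimal sketch of this procedure (the function names and grid resolution are our own choices, not the authors'):

```python
import numpy as np

def principal_axis_orientation(points, weights):
    """Orientation (radians, mod pi) of the weighted principal axis."""
    w = weights / weights.sum()
    mu = (points * w[:, None]).sum(axis=0)       # weighted centroid
    d = points - mu
    cov = (w[:, None] * d).T @ d                 # weighted covariance matrix
    evals, evecs = np.linalg.eigh(cov)
    v = evecs[:, np.argmax(evals)]               # leading eigenvector
    return np.arctan2(v[1], v[0]) % np.pi

def best_alpha(main_pts, sec_pts, observed_theta, n_grid=1001):
    """Partial weight alpha on the secondary sub-cluster whose weighted
    principal axis comes closest to the observed orientation setting."""
    pts = np.vstack([main_pts, sec_pts])
    best, best_err = 0.0, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid):
        w = np.concatenate([np.ones(len(main_pts)),
                            np.full(len(sec_pts), alpha)])
        theta = principal_axis_orientation(pts, w)
        # angular error between two orientations (mod pi)
        err = abs((theta - observed_theta + np.pi / 2) % np.pi - np.pi / 2)
        if err < best_err:
            best, best_err = alpha, err
    return best
```

Here *α* = 0 recovers the principal axis of the main sub-cluster alone (full segmentation) and *α* = 1 recovers the principal axis of the entire cluster (no segmentation).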

*α* assigned to dots within the smaller cluster are statistically indistinguishable from 1. But with further increase in spatial separation (Δ = 3.7 and 4.79 dva), the orientation settings approach the “full-segmentation” prediction (i.e., the partial weights *α* approach 0).

*α* for the highest dot density (magenta curves) are consistently lower—i.e., closer to the “full-segmentation” prediction—than those for the lowest dot density (black curves). This is consistent with the prediction that, given a particular level of spatial separation between the two sub-clusters, dot clusters with higher dot density are more likely to be perceived as segmented. (In other words, dot clusters with higher dot density require a smaller spatial separation in order to be reliably segmented.)

*α* in a differentially weighted principal-axis computation, the data no longer show an effect of the size of the secondary sub-cluster. Therefore, the difference between the left and the right subplots in Figure 7 can be attributed entirely to the fact that a larger secondary sub-cluster necessarily exerts a greater influence on the principal axis of the overall cluster than a smaller one.

*z*-scores. These *z*-scores were then converted to weights by a weighting function (the Huber-*k* loss function shown in Figure 9). The weighting function assigns full weight to *z*-scores between −*k* and +*k* and a lower weight to more extreme *z*-scores, decreasing with the absolute magnitude of the *z*-score.^{4} We selected the value *k* = 1.5 for this example, but performance is not sensitive to *k*. The principal axis was then recomputed by PCA, but now weighting each point according to the weights assigned by the Huber-*k* loss function.

*z*-scores near 0 and weights near 1, and will exert considerable influence on the final estimate of the principal axis. It seems plausible that the visual system would segment both the farther cluster and the closer one, ignoring both, but a residual-based algorithm cannot do so easily.
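
The residual-based estimator just described can be sketched as iteratively reweighted PCA. The MAD-based scale estimate and the number of reweighting passes are our assumptions; the text specifies only the Huber weighting of residual *z*-scores.

```python
import numpy as np

def huber_weight(z, k=1.5):
    """Full weight for |z| <= k, decaying as k/|z| for more extreme z-scores."""
    z = np.abs(z)
    return np.where(z <= k, 1.0, k / z)

def robust_principal_axis(points, k=1.5, n_iter=10):
    """Principal axis via iteratively reweighted PCA.

    Each pass: fit a weighted principal axis, z-score the orthogonal
    residuals (MAD-based robust scale), then reweight points with the
    Huber rule. With n_iter=1 this is ordinary (equal-weight) PCA.
    """
    w = np.ones(len(points))
    for _ in range(n_iter):
        wn = w / w.sum()
        mu = (points * wn[:, None]).sum(axis=0)
        d = points - mu
        cov = (wn[:, None] * d).T @ d
        evals, evecs = np.linalg.eigh(cov)
        axis = evecs[:, np.argmax(evals)]
        resid = d @ np.array([-axis[1], axis[0]])     # signed orthogonal distance
        mad = np.median(np.abs(resid - np.median(resid)))
        z = resid / max(1.4826 * mad, 1e-12)          # robust z-scores
        w = huber_weight(z, k)
    return axis, w
```

A distant offset sub-cluster receives large residual *z*-scores and is downweighted; but, as noted above, a sub-cluster that happens to lie near the fitted axis receives weights near 1 regardless of how clearly it is segmented perceptually.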

*f*(*x*; *μ, σ*^{2}), with unknown mean *μ* and unknown variance *σ*^{2}. Suppose that the experimenter's goal is to estimate the unknown parameter *μ,* the mean of the distribution. Given this information, we can derive the estimator *T*(*X*_{1},…, *X*_{n}) that is unbiased (the expected value of the estimator satisfies *E*[*T*(*X*_{1},…, *X*_{n})] = *μ*) and that has the smallest variance among all unbiased estimators.

*T*(*X*_{1},…, *X*_{n}) = (1/*n*)∑_{i}*X*_{i}, the sample mean. If the data are instead drawn from a *t*-distribution with 3 degrees of freedom, the roles of the median and the mean reverse, and now the median has lower variance than the mean by about 40%.

*ɛ*, but with a small probability *ɛ* from a second unknown distribution, *h*(*x*). The resulting distribution is a mixture distribution: (1 − *ɛ*)*f*(*x*; *μ, σ*^{2}) + *ɛh*(*x*). Our goal is still to estimate *μ, σ*^{2}, but now we have the added complication that part of our data may be drawn from an unknown second distribution *h*(*x*).

*μ,* the mean and the median are in competition as estimators. In the uncontaminated case, for large samples, the median has a variance that is more than 60% larger than that of the mean. However, with even a small amount of contamination, the median can have lower variance than the mean.
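
This reversal is easy to verify by simulation. In the sketch below, a *t*-distribution with 3 degrees of freedom stands in for heavy-tailed (contamination-like) data; the sample size and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

n, reps = 100, 20000
gauss = rng.normal(size=(reps, n))               # uncontaminated Gaussian samples
t3 = rng.standard_t(df=3, size=(reps, n))        # heavy-tailed samples

var_mean_g = gauss.mean(axis=1).var()            # sampling variance of the mean
var_med_g = np.median(gauss, axis=1).var()       # sampling variance of the median
var_mean_t = t3.mean(axis=1).var()
var_med_t = np.median(t3, axis=1).var()

# Gaussian: mean beats median; heavy tails: median beats mean.
print(var_med_g / var_mean_g, var_med_t / var_mean_t)
```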

*ɛ*. A more precise criterion might be to rate each possible statistical estimator by its worst performance for any choice of the unknown contamination *h*(*x*).

*influence function* (Hampel, 1974). We restrict attention to contaminated distributions that are Dirac (impulse) functions at location *x,* i.e., mixtures (1 − *ɛ*)*f*(*x*′) + *ɛδ*(*x*′ − *x*), where *δ*(*x*) denotes a Dirac generalized function (Bracewell, 2000, chap. 5). The contaminated distribution above is readily interpreted. With probability *ɛ,* the value *x* is substituted for the sample value that would have been drawn from *f*(*x*). We denote any estimator by *T*_{n}(*X*_{1},…, *X*_{n}). If we increase the sample size *n,* we can compute the limit of *T*_{n}, denoted *T*_{∞}, and we can write *T*_{∞}(*f*(*x*)) as shorthand for the limit as sample size increases when samples are drawn from the distribution *f*(*x*). If, for example, *f*(*x*) is Gaussian with mean *μ,* and *T*_{n}(*X*_{1},…, *X*_{n}) = (1/*n*)∑_{i}*X*_{i} is the sample mean, then *T*_{∞}(*f*(*x*)) = *μ*: the limit of the mean of the sample converges to the population mean as sample size increases.

*T*_{n}(*X*_{1},…, *X*_{n}). The influence function we report in the text is an empirical version of this theoretical limit.

*T*_{n}(*X*_{1},…, *X*_{n}) has an influence function that goes to 0 as ∣*x*∣ increases or is at least bounded. In contrast, we can show that for the ordinary mean *T*_{n}(*X*_{1},…, *X*_{n}) = (1/*n*)∑_{i}*X*_{i}, the influence function *T*′_{x}(*f*) = *x* − *μ* increases without bound as *x* increases without bound. The mean is not a robust estimator.
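
The contrast between bounded and unbounded influence can be demonstrated with a finite-sample influence computation: append a single contaminating value at *x* and scale the resulting change in the estimate. The scaling convention below is one common choice, not the authors' specific procedure.

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_influence(estimator, sample, x):
    """Scaled change in the estimate when one contaminating value x is added."""
    n = len(sample)
    return (n + 1) * (estimator(np.append(sample, x)) - estimator(sample))

sample = rng.normal(size=1001)
xs = [1.0, 10.0, 100.0, 1000.0]
infl_mean = [empirical_influence(np.mean, sample, x) for x in xs]
infl_median = [empirical_influence(np.median, sample, x) for x in xs]

# The mean's influence tracks x without bound; the median's stays bounded.
print(infl_mean)
print(infl_median)
```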

^{2}Blum's (1973) medial-axis transform treats each shape as the union of maximally inscribed disks. The locus of the centers of these disks defines the medial axis, which provides a skeletal description of the shape.

^{3}Surprisingly, however, subjects appear not to be very good at taking into account sample size in making intuitive cognitive judgments concerning whether or not two samples are drawn from the same population (see Obrecht, Chapman, & Gelman, 2007).

^{4}This robust estimator is an example of a common method adopted from robust regression. Essentially any weighting curve that smoothly decays to 0 would lead to better performance than a non-robust estimator. The resulting estimator would be robust to some extent but would not have the qualitative properties that human observers exhibit. The choice of an “optimal” robust estimator depends not just on the assumed distribution of the data but also on possible small deviations from that distribution, and typically there is no one correct choice.

*t* tests: Lay use of statistical information. Psychonomic Bulletin & Review, 14, 1147–1152.