The central result here is that efficiencies were high in both centroid tasks, but substantially lower in the mean-size task. Based on previous literature, these results were expected for the mean-size task and the equi-weighted centroid task. The simulations reported by Myczek and Simons (
2008) suggested that the estimate of the mean size of a group of items is obtained with low postfilter efficiency. Also, previous research from our lab (Drew et al.,
2010; Sun et al.,
2016) found that equi-weighted centroids could be estimated with high postfilter efficiency. The surprising result is that locating the centroid while weighting items in proportion to their size can also be done with high postfilter efficiency. This is surprising because one might expect that postfilter efficiency the in size-weighted centroid task would be no better than the lesser of the two obtained in the mean-size task and the equi-weighted centroid task. Our results show that observers achieved almost identical high efficiencies in the two centroid tasks and that the postfilter efficiency was much lower in the mean-size task. However, this implies, counterintuitively, that a summary statistical representation based on a combination of two distinct kinds of information—location and size—appears to be substantially easier for observers than a summary statistical representation based on only one of these components (size).
These results suggest that estimation of mean size is different and perhaps more difficult for observers than a centroid task that also involves size information. First, the high efficiencies achieved in the size-weighted centroid task show that both location and size information are accurately registered for most, if not all, of the squares. Second, the influence-function analysis suggests that although observers can achieve a weighting rule that accurately gauges the sizes of display squares in the size-weighted centroid task, they are unable to achieve such a weighting rule in the mean-size task.
Figure 11 shows the RMSE for the mean-size task in degrees of visual angle, broken out by number of items on the abscissa. Each of the colored lines connecting the X plotting symbols reflects the data from one observer. The black circles show the mean error, for each numerosity, averaged across observers. The black solid line is the best linear fit. The data for the three- and nine-item conditions are a “raw” version of the data used as the basis for the postfilter efficiency analysis; this is a raw summary because it does not depend on the influence-function analysis. The singleton data were not included in the postfilter efficiency analysis. Given that the stimulus items ranged in size from 0.22° to 0.99°, the standard deviation of 0.17°, 95% CI [0.15°, 0.19°], of the singletons suggests that observers were able to perceive and then recall a single size fairly accurately with the adjustment procedure used here.
Because the spacing on the abscissa of
Figure 11 is logarithmic, it appears that the mean-size error increases linearly with the logarithm of the numbers of items—slope = 0.033, 95% CI [0.020, 0.046],
t(7) = 5.927,
p = 0.001, BF = 63.97. What makes this observation striking is that it suggests that something other than the misperception of the sizes of the items must be contributing to the error observed in the three- and nine-item conditions. We reach this conclusion because the mean-size error due to misperception of item sizes would be expected to decrease as 1 over the square root of the number of observations (items). Under the extreme assumption that all singleton error is due to size misperception, the dashed black line shows the predicted RMSE. Another possibility is that rather than being due to size misperception, the error in the mean-size task arises from “late” sources—i.e., error depending on processes that come after the mean-size estimate has been created. Two examples of late sources of error are memory errors that result from having to keep a perceived mean size in memory while making the response and reproduction errors that arise because of problems correctly reproducing the correctly remembered mean size. One characteristic of late error is that it should not depend on the number of items included in the mean. Thus, an alternative but equally extreme model based on the assumption that all size error arises from late sources predicts that the dashed line in
Figure 11 should be flat. However, neither size misperception errors, late errors, nor some combination of the two predict the observed increase in the RMSE with an increasing number of items. This argument suggests that there is some other component of error in the mean-size task that produces the observed increase in RMSE with
n.
One clue that at least some of the error in the mean-size task results from the misperception of size is that, not surprisingly, the variability of the error increased with the size of the item being reproduced. To quantify this, a Markov-chain Monte Carlo simulation was used to fit a three-parameter model to the singleton data from the mean-size task. The three parameters were bias (the amount that an observer systematically over- or underestimated the size of the item) and the two parameters of a linear model for the standard deviation of the size response error (an additive term and a slope). This analysis showed that there may have been a slight bias—in this case, a tendency for observers to underestimate the true item size, −0.042°, 95% CI [−0.092°, 0.008°], t(7) = −1.979, p = 0.088, BF = 0.80—but the evidence for this is weak. There was evidence for an additive component of the standard deviation of the error, 0.062°, 95% CI [0.031°, 0.094°], t(7) = 4.691, p = 0.002, BF = 21.53, and even stronger evidence that the standard deviation of the size error also increased as the size of the item being estimated increased, 0.106, 95% CI [0.072, 0.140], t(7) = 7.300, p = 0.00016, BF = 182.9. One way to get a sense of the relative importance of the additive and multiplicative contributions to the standard deviation is to compare the contribution of the multiplicative component for an average-size item (0.485°) with that of the additive component: 0.485° × 0.106/0.062° = 1.19. This suggests that the additive and multiplicative components contribute about equally to the standard deviation of the size estimation error for the singletons, with the multiplicative component possibly being slightly stronger.
In the singleton task, size responses were strongly correlated with item size (
r = 0.86). That correlation, along with the previous comparison showing that the multiplicative component made a substantial contribution to the overall error in size judgments for singletons, gives us confidence that observers were able to perceive the size differences of the stimuli used and report sizes using the response method employed in this experiment. Another window on the accuracy with which the item sizes could be perceived in the stimulus displays is provided by a comparison of the results in the size-weighted and equi-weighted centroid tasks. This comparison was done by extending the postfilter efficiency analysis (see the description in the Analysis subsection under
Methods) to allow for the perturbation of item sizes. For each observer, the analysis of the data from the size-weighted centroid task used the estimated postfilter efficiency from the equi-weighted centroid task as a fixed value determining what proportion of the items in a stimulus cloud would be retained after the simulated decimation process. In addition, in this expanded analysis the size of each stimulus item was randomly perturbed prior to computing the simulated centroid judgment. The size perturbations were drawn from a Gaussian distribution with mean 0 and a standard deviation that depended on item size. The MATLAB optimization function fmincon() was used to estimate the slope and intercept of a linear function relating the standard deviation of item perturbation to item size so that the centroid response error produced in the simulation matched that produced by the observer in the size-weighted centroid task.
Starting with the approximation that the centroid response error in the equi-weighted task does not reflect the size variation of the stimulus items, if one also accepts the assumption that additional centroid response error observed in the size-weighted task is only due to incorporating size information into the centroid judgments (and not, for example, the recruitment of some completely different centroid judgment process), then the size error estimated by this expanded analysis provides an upper bound on the variability in misperception of size for these stimuli. This is an upper bound because all of the additional centroid response error in the size-weighted task is ascribed to size misperception; however, it seems plausible that some of the additional error is introduced by the process of forming a size-weighted centroid.
For the size-weighted centroid of three items, this elaboration of our postfilter efficiency analysis estimated the additive component of the size misperception error to be 0.053°, 95% CI [0.033°, 0.074°], t(7) = 6.085, p = 0.0005, BF = 72.78; for nine items it was 0.044°, 95% CI [0.033°, 0.055°], t(7) = 9.366, p = 0.0000, BF = 692.5. Because there is only weak evidence for a difference between these estimates, Δ = 0.010°, 95% CI [−0.006°, 0.026°], t(7) = 1.490, p = 0.180, BF = 1.314, we will consider their average, 0.049°, 95% CI [0.034°, 0.063°], t(7) = 7.837, p = 0.0001, BF = 265.4. The slope relating the size misperception error to item size for the three-item task was 0.050, 95% CI [0.002, 0.099], t(7) = 2.445, p = 0.044, BF = 2.073; for nine items it was 0.038, 95% CI [−0.010, 0.085], t(7) = 1.859, p = 0.105, BF = 1.102. Because there is only weak evidence for a difference between these estimates, Δ = 0.013, 95% CI [−0.069, 0.094], t(7) = 0.369, p = 0.723, BF = 0.356, we will consider their average, 0.044, 95% CI [0.018, 0.070], t(7) = 4.045, p = 0.005, BF = 11.473. What is striking here is that the estimate of the additive component of the size misperception error computed in this way is similar to that estimated previously for the singleton trials in the mean-size task—0.049° versus 0.062°, Δ = 0.014°, 95% CI [−0.025°, 0.053°], t(7) = 0.850, p = 0.423, BF = 0.451—but the slope of the multiplicative component is substantially smaller: 0.049 versus 0.106, Δ = 0.057, 95% CI [0.022, 0.093], t(7) = 3.804, p = 0.007, BF = 8.975. We interpret this as evidence that the information about the size of the stimulus items in the size-weighted centroid task is more accurate than that incorporated into the mean-size judgments.
If, as these analyses of the of the size-weighted centroid task suggest, the sizes (and locations) of up to nine items can be perceived accurately and incorporated effectively into a centroid judgment, why are the mean-size judgments so inefficient? The foregoing analysis suggests that, at least in part, this reflects degradation in the quality of the size information available to the mean-size calculation. However, the data summarized in
Figure 11 suggest that the problem goes further than this. One possibility is that the calculation of the mean size itself is a substantial source of error. The fact that size information can be used effectively in the size-weighted centroid task suggests that the brain has processes that can accurately perceive and calculate with this information, but apparently the mean-size responses do not tap these processes. Ours is not the only demonstration that comparing the mean size of a set of items with the size of a single item could be problematic; Chong and Treisman (
2003) have found reduced thresholds when asking observers to compare the mean size of two stimulus arrays, even when they were presented sequentially. One speculation about the source of this difference between the centroid and mean-size tasks is that the centroid judgments may be produced by a mechanism in the dorsal visual pathway, whose purpose is to guide movements (Goodale & Milner,
1992). In this interpretation, mean-size judgments result from a ventral mechanism that either has poor access to size information or combines that information inefficiently.
An issue that presents a potential complication for the interpretation of these results is that, depending on the task, observers may be registering size in different ways. Because it is a reproduction task, the mean-size task requires observers to register and then produce their judgment using absolute sizes. By contrast, in the size-weighted centroid task they could be using relative sizes; it is possible to perform this task perfectly well with size information that preserves only the proportional sizes of the stimuli. We should point out, however, both that there is nothing in our results that suggests that observers were, in fact, using relative size estimates in the size-weighted centroid task and that we unaware of any literature that shows that using such relative sizes would be easier than actually using absolute sizes. Also, as discussed previously in the analysis of the singleton data from the mean-size task, there is evidence that suggests that, at least in this case, observers were able to perceive and report absolute size with good accuracy.
A secondary result is that there was no effect on performance due to the two types of squares used in this experiment. Both influence functions and efficiencies were very similar for both outlined and filled squares. These findings suggest that observers are actually using the sizes of the squares to make their judgments and are not being influenced by the luminance of the screen (e.g., using mean luminance to make their estimation).
With the aim of exploring if there are systematic, individual differences across tasks, we conducted a correlation analysis of the efficiencies for all four variants of the three tasks—i.e., the variants due to stimulus type and set size. These correlations, averaged over stimulus type and set size, are summarized in
Table 1. There was a strong, positive correlation of the postfilter efficiency estimates both within (i.e., across the variants) and across the two centroid tasks, suggesting that the differences in postfilter efficiency across observers in these tasks reflect a common mechanism. In contrast, there was little or no correlation among the variants of the mean-size tasks or between them and the centroid tasks. Given that there are large postfilter efficiency differences across observers and the variants of the mean-size task (ranging from 0.2 to almost 0.9), these correlations close to 0 suggest two separate conclusions. First, the postfilter efficiency variations across observers in the mean-size task derive from a different source than those in the size-weighted centroid task. Even more troubling for the use of the mean-size task to estimate of the amount of size information available to an observer is the lack of correlation across its variants, which suggests that any variation across observers in their ability to make mean-size judgments is swamped by other, unrelated sources of error. Of course, since these correlations are being computed based on only eight observers, these estimates are not precise; however, the differences are large enough to suggest that there is an effect here worth considering.