Here, we show the exact quantities involved in calculating the accuracy measurements and the corresponding binomial 95-percent confidence intervals in Figure 4. Each psychophysical measurement (trial) contributing to a data point (the classification accuracy of a given age group or model at a certain distortion level) is treated as an independent Bernoulli trial, in which a “success” corresponds to a correct classification and a “failure” to a misclassification. Accordingly, we assume that the number of “successes” is a random variable \(X\) following a binomial distribution \(\mathcal {B}(n,p)\), where \(n\) is the number of observations and \(p\) is the probability of “success”. We estimate \(p\) by the sample proportion \(\hat{p} = X/n\), which is the reported accuracy: the number of correctly classified images divided by the total number of trials. We calculate the binomial 95-percent confidence interval for each data point with the following formula (Wald method):
\begin{eqnarray*}
\hat{p}\pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},
\end{eqnarray*}
where \(z\) is the \(1-\alpha/2\) quantile of the standard normal distribution for a target error rate \(\alpha\). For a standard two-tailed 95-percent confidence interval, \(\alpha = 0.05\) (0.025 in each tail) and thus \(z \approx 1.96\). To give an example (row indicated by an arrow), consider the classification accuracy of 4- to 6-year-olds on salt-and-pepper noise images (Difficulty = 0.35). Out of 310 collected trials (\(n\)), 4- to 6-year-olds classified 95 images correctly (\(X\)). The classification accuracy (sample proportion) is therefore \(\hat{p}=95/310=0.306\). We then use \(\hat{p}\) as an estimator of the true population accuracy \(p\), such that \(X\sim \mathcal {B}(310,0.306)\). According to the above formula, the binomial confidence interval can now be calculated as
\begin{eqnarray*}
0.306\pm 1.96\sqrt{\frac{0.306(1-0.306)}{310}},
\end{eqnarray*}
resulting in a half-width of about 0.051 and thus a 95% confidence interval of approximately [0.255, 0.357] around the estimated accuracy of 0.306. Comparing the confidence intervals of two observers pairwise indicates whether their classification accuracies differ significantly. For example, the classification accuracy of the ResNeXt model (row indicated by an arrow) for heavily distorted salt-and-pepper noise images is 0.075, with a reported confidence interval of [0.048, 0.102]. Since the intervals [0.255, 0.357] and [0.048, 0.102] do not overlap, we conclude that the classification accuracy differs significantly between 4- to 6-year-olds and the ResNeXt model.
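As a minimal sketch of the calculation described above, the following Python snippet computes a Wald interval from a success count and a trial count, and checks two intervals for overlap. The function names `wald_interval` and `intervals_overlap` are illustrative assumptions and not part of the original analysis. The ResNeXt interval is entered as given in the text, because the number of trials behind it is not stated here. The snippet uses the unrounded sample proportion, so its bounds may differ in the third decimal from a calculation based on the rounded value 0.306.

```python
import math
from statistics import NormalDist


def wald_interval(successes, trials, confidence=0.95):
    """Return (p_hat, lower, upper) for a two-sided Wald binomial interval.

    Illustrative helper (hypothetical name); no clipping to [0, 1] is applied.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials                   # sample proportion X / n
    alpha = 1.0 - confidence                     # target error rate
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95 percent
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat, p_hat - half_width, p_hat + half_width


def intervals_overlap(a, b):
    """Return True if the closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Worked example from the text: 95 correct classifications in 310 trials.
p_hat, lower, upper = wald_interval(95, 310)
print(f"accuracy = {p_hat:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")

# ResNeXt interval as given in the text (trial count not stated there).
resnext_interval = (0.048, 0.102)
print("overlap with ResNeXt:", intervals_overlap((lower, upper), resnext_interval))
```

Non-overlap of two 95-percent intervals is used here as the decision criterion, mirroring the comparison in the text. It is a conservative check rather than a formal test of the difference between two proportions.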