The selection of the observers was relatively ad hoc: With different generating parameters, different results might have been obtained.
The estimated fractions of effective trial numbers can easily be compared to data from real psychophysical experiments. We analyzed data from five observers who detected luminance increments in three different high-level contexts (data kindly provided by Maertens & Wichmann, 2010). All observers showed at least some evidence for learning. The estimated ν was between 0.37 and 0.91, corresponding roughly to the quartile range of our simulated learning observer (quartile range of 0.37 to 1). Correction of inference would increase the credible intervals by 5% to 64%. We also analyzed data from six listeners in an auditory experiment (kindly provided by Schoenfelder & Wichmann, 2010). Listeners had to detect sine tones masked by band-limited noise. In total, each listener performed between 30,000 and 40,000 trials; of these, we analyzed the first 1,600 trials and estimated ν values between 0.19 and 0.62. This coincided roughly with the quartile range of our simulated betabinomial observer, which was 0.17 to 0.77. Correction of inference would increase the credible intervals by 27% to 129%. Note, however, that in both of these studies the data we analyzed were excluded from the final analyses because of the observed nonstationarities. These data indicate that the simulated nonstationarities we used were at least roughly comparable to empirically observed nonstationarities. We also expect stronger nonstationarities (i.e., lower ν) for children, patients, or elderly observers.
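For illustration, the reported widening of the credible intervals is consistent with rescaling the number of trials n to an effective number νn, so that interval widths grow roughly by a factor of 1/√ν. The following minimal sketch (in Python) reproduces the quoted percentages under this assumption; the function name and the closed-form scaling are ours and stand in for, rather than reproduce, the actual correction procedure.

```python
import numpy as np

def interval_widening(nu):
    """Hypothetical helper: factor by which a credible interval widens when
    the effective number of trials is reduced from n to nu * n, assuming
    interval width scales with 1 / sqrt(effective number of trials)."""
    return 1.0 / np.sqrt(nu)

# nu ranges reported above for the two empirical data sets
for label, nu_lo, nu_hi in [("luminance increments", 0.37, 0.91),
                            ("masked sine tones", 0.19, 0.62)]:
    widen_lo = 100 * (interval_widening(nu_hi) - 1)  # weakest nonstationarity
    widen_hi = 100 * (interval_widening(nu_lo) - 1)  # strongest nonstationarity
    print(f"{label}: intervals widen by {widen_lo:.0f}% to {widen_hi:.0f}%")
```

Under this assumption the script prints widenings of roughly 5% to 64% and 27% to 129%, matching the figures quoted above.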
Figure 4 indicates that only part of the space has actually been explored with the current parameter settings. All simulated data for the betabinomial observer scatter around a well-defined curve that increases monotonically with trial number. However, not all types of nonstationarity need to lie on this curve. For instance, an observer with weakly nonstationary behavior might have been tested in 500 trials, yet the nonstationarity might be weak enough that the deviance test detects it in only 30% of the cases. The presented data do not provide much information about how severe the misestimation of credible intervals is in such a case. We therefore repeated the simulations for the betabinomial observer, varying the strength of the nonstationarity to target specific areas of the plot in Figure 4: These areas corresponded to rejection probabilities of approximately 30%, 60%, and 90%, in order to cover the whole range of rejection probabilities. For a binomial observer, deviance asymptotically follows a χ²-distribution with degrees of freedom equal to the number of data blocks minus the number of parameters. The M parameter of the betabinomial observer was adapted to yield rejection rates of 30%, 60%, and 90% under this asymptotic distribution.
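To make this procedure concrete, the following sketch estimates the deviance-test rejection rate for a betabinomial observer by Monte Carlo, using the asymptotic χ² criterion described above; M would then be adjusted until the rejection rate is near the desired 30%, 60%, or 90%. This is our own minimal reconstruction, not the simulation code: we parameterize the beta distribution by its mean and a concentration parameter M, and we treat the generating probabilities as known (so the degrees of freedom equal the number of blocks) rather than fitting a psychometric function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def deviance(k, n, p):
    """Binomial deviance of block counts k (successes out of n trials per
    block) with respect to predicted probabilities p, using 0*log(0) = 0."""
    k = np.asarray(k, float)
    p = np.asarray(p, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(k > 0, k * np.log(k / (n * p)), 0.0)
        t2 = np.where(k < n, (n - k) * np.log((n - k) / (n * (1 - p))), 0.0)
    return 2.0 * (t1 + t2).sum()

def rejection_rate(M, p, n_blocks=10, n_per_block=50, n_sim=2000, alpha=0.05):
    """Monte Carlo estimate of how often the deviance test rejects a
    betabinomial observer.  For simplicity the generating probability p is
    treated as known, so the degrees of freedom equal the number of blocks."""
    crit = stats.chi2.ppf(1 - alpha, df=n_blocks)
    rejections = 0
    for _ in range(n_sim):
        # Block-wise success probabilities: Beta with mean p, concentration M
        # (smaller M means stronger nonstationarity, i.e., more overdispersion).
        q = rng.beta(p * M, (1 - p) * M, size=n_blocks)
        k = rng.binomial(n_per_block, q)
        if deviance(k, n_per_block, np.full(n_blocks, p)) > crit:
            rejections += 1
    return rejections / n_sim

# Adjust M (here simply by inspection) until the rejection rate is near a target.
for M in (10, 40, 200):
    print(M, rejection_rate(M, p=0.75))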
Figure 9 illustrates the results of these additional simulations: The left panel shows the actual rejection probabilities observed in the simulations. The targeted values are not hit precisely, but the observed rejection probabilities now cover the whole range between 0 and 1 as intended. The middle panel presents uncorrected coverages. It is clear that nonstationarities that are easy to detect (90% desired rejection rate) result in very severe underestimation of credible intervals. Furthermore, nonstationarities that typically pass the deviance test (30% desired rejection rate; actually closer to 10%) still result in appreciable underestimation of credible intervals: The credible intervals had to be scaled by a factor of 1.2 to obtain the correct size. Thus, these data sets were very hard to distinguish, by means of diagnostic tests, from data generated by a stationary observer, yet their credible intervals were about 20% too small, with only 85–90% coverage instead of the intended 95%. The right panel presents coverages after correction of inference. Similar to Figure 7, coverages are now very close to the desired 95%. This illustrates two points: First, nonstationarity may have an impact on credible intervals even if it is not detected by any formal test. Second, correction of inference also works for these weakly nonstationary data sets.
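To make the coverage figures concrete: coverage here is the fraction of simulated data sets whose credible interval contains the generating parameter value, and the reported factor of 1.2 is the widening needed to restore the nominal 95%. A minimal sketch of such a check, with hypothetical arrays of interval bounds from the simulations, could look as follows.

```python
import numpy as np

def coverage(lower, upper, true_value):
    """Fraction of simulated data sets whose credible interval
    contains the generating parameter value."""
    return np.mean((lower <= true_value) & (true_value <= upper))

def required_scale(lower, upper, true_value, target=0.95, step=0.01):
    """Smallest factor by which the intervals must be widened about their
    midpoints so that the empirical coverage reaches the target level."""
    c = 1.0
    while c < 10.0:
        mid = 0.5 * (lower + upper)
        half = 0.5 * (upper - lower) * c
        if coverage(mid - half, mid + half, true_value) >= target:
            return c
        c += step
    return np.nan

# `lower` and `upper` would hold the 95% credible-interval bounds obtained
# from each simulated data set; `true_value` is the generating parameter.
```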