Measuring sensitivity is at the heart of psychophysics. Often, sensitivity is derived from estimates of the psychometric function. This function relates response probability to stimulus intensity. In estimating these response probabilities, most studies assume stationary observers: Responses are expected to be dependent only on the intensity of a presented stimulus and not on other factors such as stimulus sequence, duration of the experiment, or the responses on previous trials. Unfortunately, a number of factors such as learning, fatigue, or fluctuations in attention and motivation will typically result in violations of this assumption. The severity of these violations is yet unknown. We use Monte Carlo simulations to show that violations of these assumptions can result in underestimation of confidence intervals for parameters of the psychometric function. Even worse, collecting more trials does not eliminate this misestimation of confidence intervals. We present a simple adjustment of the confidence intervals that corrects for the underestimation almost independently of the number of trials and the particular type of violation.

*Absolute sensitivity*is inversely related to the horizontal shift of the psychometric function along the stimulus intensity axis. To refer to the “horizontal shift,” psychophysicists often use the word “threshold,” being aware of the fact that theories that assume a hard threshold cannot explain decision behavior in psychophysical tasks (Swets, 1961). The psychometric function also provides information about sensitivity to differences or

*relative sensitivity*: If the psychometric function is very steep, this indicates that an observer can discriminate small stimulus differences; if the psychometric function is very shallow, the observer can only discriminate relatively coarse differences. Different procedures have been proposed to determine thresholds. A number of authors have proposed adaptive procedures (Alcalá-Quintana & García-Pérez, 2005; Cornsweet, 1962; Dixon & Mood, 1948; García-Pérez & Alcalá-Quintana, 2007; Kontsevich & Tyler, 1999; Levitt, 1970; Watson & Pelli, 1983); others have used the method of constant stimuli (Treutwein & Strasburger, 1999; Ulrich & Miller, 2004; Zychaluk & Foster, 2009). One approach that has proven particularly successful consists of (a) recording a block of trials for each stimulus intensity of interest (Blackwell, 1952) and (b) fitting a parametric model to the response counts in each block (Kuss, Jäkel, & Wichmann, 2005; Wichmann & Hill, 2001a).

*w*(see Figure 2) decreased to 48% of its initial level. Averaged over the largest experiment, the parameters of the learning observer's psychometric function were the same as for the binomial observer. More details about this observer can be found in Appendix A.2. Sample psychometric functions for this observer are shown in Figure 1b.

^{1}Figure 1c illustrates this point: The shading of a point represents the probability of sampling the response probability on the

*y*-axis given that a stimulus with the intensity on the

*x*-axis is presented. The mode of the Beta distribution at each stimulus level was selected equal to the psychometric function of the binomial observer. However, the standard deviation of the response counts was increased by approximately 15%. More details about this observer can be found in Appendix A.3.

*θ*= (

*m*,

*w*,

*γ*,

*λ*). The first two parameters

*m*and

*w*determine the shape of the psychometric function;

*γ*is the guessing rate and

*λ*is the lapse rate. The sigmoidal function

*F*: → [0, 1] is required to be monotonically increasing. The meaning of these parameters is illustrated in Figure 2.

^{2}Depending on the sigmoidal function, the first two parameters,

*m*and

*w*, have different interpretations.

*z*(

*α*) = 2log(1/

*α*− 1). In this case,

*m*is the horizontal position of the sigmoid, the point at which the sigmoid is halfway between its lower asymptote and its upper asymptote. In many studies,

*m*is taken to be the observers’ threshold estimate. The second parameter

*w*is the width of the interval in which the sigmoid rises from

*α*to 1 −

*α*. This is the width of the “interesting” region of the sigmoid, where changes in intensity result in large behavioral changes. The width is measured in units of stimulus intensity. We selected

*α*= 0.1. The logistic function was used as the generating sigmoid for all simulated observers presented in this paper. In addition, it is the standard model for our analyses.

*m*is related to the horizontal position of the sigmoid, while

*w*is inversely related to the width of the interval on which the function rises. However, this relation is less obvious in this case: The

*m*parameter does not correspond to a pure horizontal shift, and the

*w*parameter changes the shape of the function instead of simply scaling the sigmoid along the

*x*-axis.

^{3}

*z*(

*α*) = log(−log(

*α*)).In this case, the interpretation of the parameters

*m*and

*w*is again precisely the same as for the logistic function:

*m*is the horizontal position of the sigmoid and

*w*is the width of the interval in which the sigmoid rises from

*α*to 1 −

*α*.

*m*∼

*w*∼ Gamma(1.01, 2000),

*λ*∼ Beta(2, 50) (

*γ*∼ Beta (1, 10)). Both fitting procedures can be used to obtain estimates of the psychometric function as well as credible intervals (i.e., confidence intervals for bootstrap sampling and posterior intervals for Bayesian inference). Details about the fitting procedures can be found in 4.

^{4}for parameters of the psychometric function match the desired 95%. This can be quantified in terms of coverage: The coverage of a credible interval is the probability that the credible interval contains the true value.

^{5}To estimate this probability, we simulated each combination of observer and fitting procedure 1000 times. From these 1000 repetitions, we calculated the fraction of simulations for which the generating parameters (of the observer) were actually within the estimated credible intervals. Thus, all reported coverages refer to fractions of 1000 repetitions in which the estimated credible intervals contained the generating parameters of the observer. This was repeated for different numbers of blocks and different block sizes. In total, we determined 3 (observers: binomial, betabinomial, learning) × 3 (psychometric function models: logistic, Weibull, reversed Gumbel) × 2 (tasks: yes–no or 2AFC) × 3 (inference paradigms: parametric bootstrap, nonparametric bootstrap, Bayesian inference) × 1000 (repetitions) = 52,000 credible interval estimates for each combination of block size (10, 20, or 40 trials per block) and number of blocks (4, 6, 8, 12, or 24 blocks), totaling 780,000 estimated credible intervals with 40 to 960 trials per psychometric function.

*m*. Wichmann and Hill (2001b) deal with this problem by exploring the likelihood function in a neighborhood of the maximum likelihood estimate and extending their confidence limits accordingly. Here, we see that an alternative solution might be estimation of Bayesian posterior intervals: Bayesian posterior intervals have the desired coverage of 95% already after 80 trials (40 trials for threshold only). Although bootstrap confidence intervals for the width

*w*of the interval on which the psychometric function rises have coverages close to 95%, the underestimation of the confidence intervals is also obvious in this case. The nonparametric bootstrap procedure does not assume any dependencies between different blocks in the resampling process. Thus, one would expect the nonparametric bootstrap to be more robust with respect to violations of the binomiality assumption that goes into the fitting process. However, this is not the case. The two bootstrap procedures do not differ much with respect to their coverages.

*γ*. Coverages for the simulated yes–no task showed a similar pattern as for the 2AFC task: Coverage for the binomial observer was close to 95% in the Bayesian setting and approached 95% with increasing number of trials for the two bootstrap procedures; coverages for the two nonstationary observers were clearly too low. Although coverages for the two bootstrap procedures approached 95%, it remained between 90% and 95% for above 400 trials (200 trials if only width is considered) and did not reach 95% for any number of simulated trials. Importantly, coverage determined for the learning observer was no better than for a betabinomial observer. This highlights the fact that there is no ordering between the different observers: The learning observer does not generally result in better credible intervals than the betabinomial observer. This clearly depends on the parameters used for our simulations of the simulated observers. The choice of parameters is justified in detail in the Discussion section. In experiments with 40 trials per block, coverages as low as 55% (scaling factor nearly 3) were observed in the worst case.

*F*(see Equation 1) for generating and analyzing the data. The generating sigmoid was, as always, the logistic. The analyzing sigmoidal functions were the Weibull cumulative distribution function (see Equation 3) and the Gumbel on reversed

*x*-axis (see Equation 4). If the sigmoidal function used to analyze the data differed from the sigmoidal function used to generate the data, coverage was slightly worse for low numbers of trials (10% less coverage with less than 100 trials per psychometric function). All other observations were still valid: coverage was around 70–90% for nonstationary observers, it was worse with smaller blocks, and Bayesian inference resulted in more realistic credible intervals than the two bootstrap procedures. Results of all other analyses reported here were virtually the same. Therefore, we only report results for those cases in which the analytical form of the generating and analyzing psychometric functions was the same.

*p*-value was calculated, which is the fraction of simulation runs for which the deviance on the simulated data set was larger than the deviance on the original data set (Gelman & Meng, 1996). If the Bayesian

*p*-value is close to 0 or 1, the data set is unlikely to result from the fitted joint distribution with respect to its deviance. Because the joint distribution of data and model parameters was fitted based on the assumption of a binomial observer, this represents evidence for nonstationarity. We quantified the power of this test by analyzing how well the test discriminated between a binomial and a betabinomial observer.

*p*-value for the correlations between deviance residuals and recording sequence.

*i*on the estimated model parameters, we perform two fits. The first fit is performed on the complete data set, while the second fit is performed on a modified data set in which the

*i*th data block has been omitted. If the fitted model parameters differ between the complete data set and the modified data set, this is evidence that the

*i*th data block has an undue effect on the estimated model parameters. We can extend this strategy to obtain a continuous measure of influence. To this end, we scaled the difference between the complete data fit and the modified data fit such that a difference of 1 means “the modified data fit is exactly on the 95% confidence limit of the complete data fit.” This scaling has been applied to each parameter in isolation. Subsequently, the influence of a single data block was the maximum influence it had on any parameter in isolation.

*p*and

*q*, the Kullback–Leibler divergence of

*p*under

*q*is, in general, not equal to the Kullback–Leibler divergence of

*q*under

*p*. We quantified the influence of block

*i*by the Kullback–Leibler divergence of the complete data posterior distribution under the modified data posterior distribution. Thus, high influence is assigned to blocks for which the modified data posterior distribution has a lot of mass in areas in which the complete data posterior distribution does not have mass. Such blocks either result in a shift of the posterior distribution or make the posterior distribution considerably sharper. A way how Kullback–Leibler divergence can be derived from samples from the posterior is outlined in 5.

*p*-value. This value does not necessarily lead to the predefined

*α*level of 5%. In fact, it does not in this case. The true

*α*level for decisions based on posterior predictive simulations of the deviance is at about 2.5% for trial numbers larger than 200. This implies that the test could be more sensitive if a less conservative criterion on the Bayesian

*p*-value was chosen. The Bayesian

*p*-value incorporates the probabilities that the observed data are unexpectedly good or that the observed data are unexpectedly bad. The observed 2.5% true

*α*level matches well with the fact that typically the fit of the observed data was worse than that for the simulated data and, thus, the test was, in fact, a one-sided test.

*N*= 960 trials. Similar to the results observed in the maximum likelihood setting, designs with small blocks yielded higher power for the detection of a learning observer. Thus, the tests for nonstationarity gave very similar results no matter whether they were generated based on maximum likelihood simulation or posterior predictive simulation.

*N*< 300) statistical tests to detect nonstationarities have insufficient power. Here, we present a procedure for correcting credible intervals that have been derived assuming a stationary binomial observer although the observer actually was not stationary. The basic idea is simple: nonstationarity always includes dependencies between trials. If we recorded

*n*trials in an experiment, these dependencies imply that the number of

*independent*trials we recorded is less than

*n*. Here, we show that based on the residuals of a psychometric function that was estimated assuming stationarity, it is possible to estimate the fraction of trials that were actually independent.

*n*

_{ i }of trials in the

*i*th block is reduced to

*νn*

_{ i }trials, with some

*ν*∈ (0, 1]. If

*ν*= 1, then the trials seem to be sufficiently independent such that no reduction in the trial number is necessary. To estimate

*ν*, we assume that the data vary around the estimated psychometric function according to a Beta distribution and choose

*ν*to maximize the corresponding likelihood function. Details of this procedure are given in 6. To illustrate the validity of

*ν*, we plotted the total effective number of trials

*νn*over the total number of recorded trials in Figure 6. For a binomial observer, the effective number of trials

*νn*is in close agreement with the number of recorded trials. Note, however, that the effective number of trials is slightly underestimated even for the binomial observer: Ideally,

*ν*should be one and all the data points for the binomial observers should be on the dotted diagonal. For the learning observer (light blue symbols) and for the betabinomial observer (red symbols), the plotted curves are clearly below the diagonal. This indicates that the effective number of trials is less than the number of recorded trials.

*ν*, we can apply one of two different procedures to correct our inference for violations of the independence assumption:

*c*and the parameter estimate

*c*=

*θ*

^{(i)}. This correction is applied after the inference has already been performed. We will, therefore, refer to it as

*correction of inference*.

*k*

_{ i }and the number of trials

*n*

_{ i }in the

*i*th block by a factor

*ν*:

*x*] denotes the largest integer

*n*such that

*n*≤

*x*. This correction is applied to the data before inference has been performed. We will, therefore, refer to it as

*correction of data*.

*ν*can be close to zero. In that case, we might end up with blocks that effectively contain no trials anymore. Even worse, we might even find that no data are left at all from our experiment. This did indeed happen in a number of our simulation runs (up to 10% for the betabinomial observer, about 3% for the learning observer, and about 1% or below for the binomial observer).

^{6}In these cases, correction of data was not applicable.

*ν*from each simulated data set and corrected credible intervals accordingly. Figure 7 shows the results of the correction of inference. For bootstrap-based inference and with increasing trial number, the true coverage of 95% confidence intervals approaches the nominal coverage of 95%. However, serious underestimation of confidence intervals still occurs with less than approximately 500 trials per psychometric function—the problem, thus, remains severe for typical psychophysical experiments. Even worse, the underestimation of confidence intervals remains visible even for the highest trial number tested. The situation is different for Bayesian inference, however: For Bayesian inference, the true coverage is very close to 95% even for relatively small trial numbers. For the binomial observer, posterior intervals are slightly too broad resulting in coverages >95%, in accordance with the slight underestimation of the effective number of trials that was observed for the binomial observer in Figure 6.

*ν*was between 0.37 and 0.91 corresponding roughly to the quartile range of our simulated learning observer (quartile range of 0.37 to 1). Correction of inference would increase the credible intervals by 5% to 64%. We also analyzed data from 6 listeners from an auditory experiment (kindly provided by Schoenfelder & Wichmann, 2010). Listeners had to detect sine tones that were masked by band-limited noise. In total, each listener performed between 30,000 and 40,000 trials. Out of these trials, we analyzed the first 1600 trials and estimated

*ν*-values between 0.19 and 0.62. This coincided roughly with the quartile range of our simulated betabinomial observer, which was 0.17 to 0.77. Correction of inference would increase the credible intervals by 27% to 129%. Note, however, that in both of these studies, the data we analyzed were excluded from the final analysis of the data due to the observed nonlinearities. These data indicate that the simulated nonstationarities we used were at least roughly comparable to empirically observed nonstationarities. We also expect that stronger nonstationarities (i.e., lower

*ν*) will be observed for children, patients, or elder people.

*χ*

^{2}-distribution with the degrees of freedom equal to the number of data blocks minus the number of parameters. The

*M*parameter of the betabinomial observer was adapted to yield rejection rates of 30%, 60%, and 90% for this asymptotic distribution.

^{7}Figure 9 illustrates the results of these additional simulations: The left panel shows the actual rejection probabilities that were observed in the simulations. The targeted values are not hit precisely, but the observed rejection probabilities now cover the whole range between 0 and 1 as intended. The middle panel presents uncorrected coverages. It is clear that nonstationarities that are easy to detect (90% desired rejection rates) result in very severe underestimation of credible intervals. Furthermore, nonstationarities that typically pass the deviance test (30% desired rejection rates; actually closer to 10%) still result in significant underestimation of credible intervals: The credible intervals had to be scaled by a factor of 1.2 to obtain the correct size. Thus, these data sets were very hard to tell apart from being generated by a stationary observer using diagnostic tests while still leading to 20% too small confidence intervals with only 85–90% coverage instead of the intended 95%. The right panel presents coverages after correction of inference. Similar to Figure 7, coverages are now very close to the desired 95%. This illustrates two points: First, nonstationarity may have an impact on credible intervals even though the nonstationarity is not detected in any formal test. Second, correction of inference also works for these weakly nonstationary data sets.

*not*argue that these models are correct nor that they are the only mechanisms that could generate nonstationary behavior. Thus, we do not select among different competing models because we do not know the competitors. The tests presented here do not require knowledge of alternative models. They can detect violations of the model's assumptions without presenting an alternative. We also believe that in combination with the proposed correction method for credible intervals, these tests provide a reasonable way for dealing with nonstationary observers: The tests are simple and come with virtually no additional computational burden over the Monte Carlo techniques that are used to derive the credible intervals. The correction procedure is simple and intuitive. In addition, it provides a graded measure of the severity of the violations of the assumptions.

*K*different stimulus intensities

*x*

_{1}, …,

*x*

_{ K }. Stimulus intensity

*x*

_{ i }was given by

*y*

_{ i }s were selected to linearly span the range from 0.01 to 0.99, independent of the value of

*K*.

*k*

_{ i }were sampled from a binomial distribution

*k*

_{ i }∼ Binom(

*n*

_{ i },

*ψ*

_{ i }), where

*γ*is the guessing rate of the observer. For

*n*-AFC tasks,

*γ*was fixed to 1/

*n*. For a yes–no task, we set

*γ*= 0.02. The lapse rate

*λ*was always set to

*λ*= 0.02. The sigmoid function

*F*was a logistic function as parameterized in Kuss et al. (2005):

*m*= 4 and the width of the rising interval of the psychometric function

*w*= 2. The value

*z*(

*α*) = 2log(1/(

*α*− 1)) was set such that

*w*is the width of the interval on which

*F*rises from 0.1 to 0.9.

*m*and

*w*were both functions of the trial number. Initially, we used

*m*

_{0}= 4.7 and

*w*

_{0}= 2.7. The parameters on trial

*t*were then recursively defined by

*u*is either

*m*or

*w*and

*E*

_{ m }= 3.3,

*E*

_{ w }= 1.3, and

*τ*= 40. The observer improved to an asymptotic performance level that is given by

*m*

_{∞}=

*E*

_{ m },

*w*

_{∞}=

*E*

_{ w }. The speed at which performance approaches the asymptotic level is quantified by

*τ*: The higher the value of

*τ*, the smaller is the improvement in performance on each trial and the slower the performance approaches the asymptotic level. To analyze coverage for the learning observer, the generating parameters were averaged over time.

*k*

_{ i }for block

*i*were sampled from a binomial distribution

*k*

_{ i }∼ Binom(

*n*

_{ i },

*p*

_{ i }). In contrast to the binomial observer, the success probability

*p*

_{ i }was a random variable. Thus,

*p*

_{ i }was sampled from a Beta distribution:

*α*− 1 successes and

*β*− 1 misses have been observed. Thus, it is a reasonable choice to select

*α*

_{ i }and

*β*

_{ i }such that

*M*:=

*α*+

*β*: The higher the value of

*M*, the lower the variance of the Beta distribution. We selected

*M*= 10 for our simulations.

*λ*was constrained to the interval [0, 0.1]. In yes–no tasks, the guessing rate

*γ*was also constrained to this interval.

*i*is given by Ψ(

*x*

_{ i };

*i*is given by

*k*

_{ i }/

*n*

_{ i }and trials for that block are then generated from a binomial distribution. For each random data set, a new psychometric function was fitted. We generated 2000 bootstrap samples to obtain an approximation of the sampling distribution of the parameters. From these samples, 95% bias corrected and accelerated confidence intervals (Davison & Hinkley, 1997; Hill, 2001; Wichmann & Hill, 2001b) were derived.

*f*and

*g*. In our current application, we know

*f*and

*g*only up to an unknown scaling factor. That is, we could write

*Z*

_{f}. The same holds for

*g*. The Kullback–Leibler divergence of

*f*under

*g*is given by where denotes expectation under

*f*. The expectation on the right-hand side cannot be evaluated directly because we do not know the scaling factors. We write

*F*(

*θ*) = −log

*θ*),

*G*(

*θ*) = −log

*θ*) and can then use importance sampling to obtain (Bishop, 2006, p. 554)

*θ*

^{(ℓ)}, ℓ = 1, …,

*N*are samples from

*f*.

*F*(

*θ*

^{(ℓ)}) −

*G*(

*θ*

^{(ℓ)}) with each sample

*θ*

^{(ℓ)}in order to approximate the Kullback–Leibler divergence of

*f*under

*g*.

*k*

_{ i }is the number of correct responses that were obtained from the

*n*

_{ i }trials presented at stimulus level

*x*

_{ i }.

*i*goes from 1 to

*K*, where

*K*is the number of blocks. We used the abbreviation Ψ

_{ i }= Ψ(

*x*

_{ i }∣

*ν*= 1, Equations D1 and D2 are equivalent. However, we can now find a

*ν** ∈ (0, 1] such that the log likelihood

*f*: (0, 1) → is the density of the Beta distribution. This

*ν** will then be the estimated fraction of effectively independent trials.

*νn*

_{ i }is less than 1, [

*νn*

_{ i }] will be zero.

*χ*

^{2}-distribution (Wichmann & Hill, 2001a).

*r*. Journal of Statistical Software, 34, 1–24.