In psychophysical studies, the psychometric function is used to model the relation between physical stimulus intensity and the observer’s ability to detect or discriminate between stimuli of different intensities. In this study, we propose the use of Bayesian inference to extract the information contained in experimental data to estimate the parameters of psychometric functions. Because Bayesian inference cannot be performed analytically, we describe how a Markov chain Monte Carlo method can be used to generate samples from the posterior distribution over parameters. These samples are used to estimate Bayesian confidence intervals and other characteristics of the posterior distribution. In addition, we discuss the parameterization of psychometric functions and the role of prior distributions in the analysis. The proposed approach is exemplified using artificially generated data and in a case study for real experimental data. Furthermore, we compare our approach with traditional methods based on maximum likelihood parameter estimation combined with bootstrap techniques for confidence interval estimation and find the Bayesian approach to be superior.

*threshold*. These so-called

*adaptive methods*vary the stimulus strength based on previous responses of the observer; adaptive methods can be divided into nonparametric (Garcia-Perez, 1998; Rose, Teller, & Rendleman, 1970; Taylor, 1971; Wetherill & Levitt, 1965) and parametric (Alcalá-Quintana & Garcia-Pérez, 2004; King-Smith & Rose, 1997; Kontsevich & Tyler, 1999; Madigan & Williams, 1987; Pelli, 1987; Pentland, 1980; Snoeren & Puts, 1997; Watson & Pelli, 1983; Watt & Andrews, 1981), the latter including some methods that are explicitly Bayesian (Alcalá-Quintana & Garcia-Pérez, 2004; Kontsevich & Tyler, 1999; Watson & Pelli, 1983). For a review of some of the most common adaptive methods, see Treutwein (1995).

*parametric psychometric function*

*F*(

*x,θ*) parameterized by

*θ*, which maps the stimulus intensity

*x*to the [0,1] interval. This function is commonly chosen to have a sigmoidal form like cumulative density functions of various probability distributions. We will discuss several common choices below in Parameterization and prior distributions.

*n*AFC experimental setting, there is a

*chance probability π*

_{c}that the observer “guesses” the correct answer independent of the stimulus. This probability of making the correct guess is usually

*π*

_{c}= 1/

*n*, where

*n*is the number of possible choices (the

*n*in

*n*AFC. In a long sequence of experimental trials, the observer occasionally

*lapses*(i.e., makes a random choice independent of the stimulus). In vision experiments an obvious example is blinking while the stimulus is presented. This probability of lapsing

*π*

_{l}is a nuisance parameter, but it is necessary to take its effect into account in statistical modeling as shown by Wichmann and Hill (2001a, 2001b).

*F*to the probability of giving the correct answer in a single

*n*AFC stimulus presentation. Given the stimulus intensity

*x*, the event of correct discrimination is a Bernoulli variable with probability of success equal to where

*F*(

*x,θ*) characterizes the change of discriminability as a function of the stimulus intensity. The model comes in the form of a mixture of two Bernoulli distributions, which is again a Bernoulli distribution. With probability

*π*

_{l}the observer lapses and has chance

*π*

_{c}to guess the correct answer. With probability (1 −

*π*

_{l}) the observer does not lapse and has a chance of (1 −

*π*

_{c})

*F*(

*x*,θ)+

*π*

_{c}, which is

*F*(

*x,θ*) scaled to the [

*π*

_{c},1] interval, to give the correct answer.

*x*

_{1},…,

*x*

_{k}} of distinct stimulus intensities are used in an experiment, which allows a more compact representation. By aggregating the trials for identical stimulus intensities, we compress the data to a set of triples

*D*= {(

*x,N,n*)

_{i}|

*i*= 1,…,

*k*} such that at contrast

*x*

_{i}we conducted

*N*

_{i}trials and observed

*n*

_{i}correct responses. Because

*n*

_{i}is a sum of Bernoulli variables, it has a binomial distribution where Ψ is given by Equation 1. Equation 2 describes the assumed generative model of the data (i.e., the

*sampling distribution*). Furthermore, read as a function of

*θ*and

*π*

_{l}for observed

*D*, we refer to it as the

*likelihood*of the binomial mixture model.

*π*

_{l}(Wichmann & Hill, 2001a). Furthermore, this model is easy to analyze and efficient to implement. Nevertheless, in data analysis one should always be aware of the model’s assumptions, and the conclusions drawn from an analysis should obviously not be trusted more than the assumptions they are based on, whether we apply Bayesian inference or any other statistical method. For example, the assumption that for a given stimulus intensity the Bernoulli trials all have the same probability of success ignores adaptation processes, learning, and other forms of nonstationarity. For well-trained psychophysical observers this assumption is justified (Blackwell, 1952), but for naïve observers it may not always hold true, and the residuals of the fit have to be examined in detail (Wichmann & Hill, 2001a). A second potential worry concerns the choice or assumption of a particular parametric form of

*F*. Typically, there are few a priori reasons to favor one sigmoidal function over another. In practice, however, the estimates of threshold and slope of the psychometric function rarely differ significantly from one

*F*to another (Wichmann & Hill, 2001b).

*inference*about the parameters

*θ*of a psychometric function

*F*(

*x,θ*) and the lapse probability

*π*

_{l}.

*p*(

*D*|

*ϕ*) be a statistical description of this model where

*D*denotes observable data and

*ϕ*are model parameters. In a nutshell, the problem is that the

*true*generating parameter

*ϕ**is hidden, but by observing data we can reduce our uncertainty about its value. In the Bayesian framework, probability distributions over parameter values are used to describe beliefs and uncertainties about the parameter value in the data-generating process.

*prior*distribution

*p*(

*ϕ*) represents beliefs about the value of the true parameter

*ϕ**previous to an inference step. By inference we refer to the process of integrating the information contained in observed data

*D*and the prior

*p*(

*ϕ*) into a

*posterior*distribution

*p*(

*ϕ*|

*D*). The posterior is obtained according to Bayes’ rule:

*ϕ**are weighted proportionally to their compatibility with the observed data. The weighting is given by the likelihood function, which is

*p*(

*D*|

*ϕ*) as a function of

*ϕ*for given

*D*. Prior and posterior are probability distributions describing two states relative to an inference step and correspond to potentially different beliefs about the value of the parameter that generated the data. For details, the reader is referred to O’Hagan (1994) and Jaynes (2003), to mention only two textbooks on Bayesian statistics.

*F*(

*x,θ*). In psychophysical studies data

*D*= {(

*x,N,n*)

_{i}|

*i*= 1,…,

*k*} are collected to learn about

*θ*and

*π*

_{l}. Again Bayes’ rule describes how the observed data consistently reduce the uncertainty about the underlying value of

*θ*and

*π*

_{l}. Formally, the posterior is obtained according to Bayes’ rule where

*p*(

*θ*) and

*p*(

*π*

_{l}) are prior distributions,

*p*(

*D*|

*θ,π*

_{l},

*π*

_{c}) acts as the likelihood, and

*p*(

*θ,π*

_{l}|

*D,π*

_{c}) is the posterior. The posterior distribution summarizes all information contained in the observations and the prior about

*θ*and

*π*

_{l}. Unfortunately, solving the integral in the denominator appears to be analytically intractable such that the posterior cannot be computed in closed form. Even if we could compute the posterior, the distribution would be of a nonstandard type, and we would be unable to work with it analytically. We therefore have to use approximative techniques to describe the information presented by the posterior.

*decision*problem in which the decision maker minimizes an expected risk, where the expectation is taken with respect to the posterior distribution (Jaynes, 2003). The risk function characterizes the

*loss*associated with a discrepancy between the point estimate and the unknown true parameter value. For example, the expected absolute error is minimized by the median of the posterior distribution (MED). Likewise, minimizing the expected squared error leads to choosing the mean of the posterior (MEAN).

*p*(

*θ*)

*p*(

*π*

_{l}) is taken to be constant (i.e., a flat prior), the maximum likelihood (ML) estimator is derived as a special case.

*γ*∈(0,1), the notion of a confidence region is conceptually different in frequentist and Bayesian statistics (DeGroot & Schervish, 2002).

*γ*confidence region simply as a region in which the true parameter values are believed to lie with probability

*γ*. This can be stated because the parameters are random variables, and we can express our degree of belief for any statement regarding the parameters by evaluating the statement under the posterior distribution.

*γ*. This means that if the experiment is repeated infinitely many times, the proportion of computed regions containing the true value would be

*γ*. For a particular data set, it is not possible to state a probability assignment that the true parameter lies in a computed confidence interval.

*D*have been observed and we want to compute the posterior according to Bayes’ rule Equation 3. A common situation is that we can evaluate the likelihood

*p*(

*D*|

*ϕ*) and the prior

*p*(

*ϕ*) for every possible value of

*ϕ*, but we cannot compute or work with the posterior analytically. MCMC methods sidestep this problem by generating samples from the posterior

*p*(

*ϕ*|

*D*) using only evaluations of the

*unnormalized*posterior

*q*(

*ϕ*|

*D*) =

*p*(

*D*|

*ϕ*)

*p*(

*ϕ*). The idea behind this is that the samples characterize the posterior distribution sufficiently well. In particular, statistics of the samples can be used to approximate properties of the posterior distribution. For example, the mean of the samples is an approximation to the mean of the posterior distribution.

*ϕ*

_{1},

*ϕ*

_{2},…,

*ϕ*

_{n}is generated such that the distribution of

*ϕ*

_{n}asymptotically becomes identical to the posterior as the length of the sequence

*n*increases. In the MCMC terminology, the sequence is called a

*chain*and each element is referred to as a

*state*. In practice the chain is generated for a finite length

*n*and the state

*ϕ*

_{n}is interpreted as a sample of the posterior. The procedure is repeated until enough samples are obtained such that the characteristics of the posterior distribution can be well approximated by statistics of the generated samples.

*ϕ*

_{t}is the current state. To find a valid consecutive state

*ϕ*

_{t+1}, a candidate value

*ϕ′*is proposed from a

*proposal distribution p*(

*ϕ′*|

*ϕ*

_{t}), for example, a Gaussian distribution centered at

*ϕ*

_{t}. The decision whether

*ϕ′*is accepted as consecutive state depends on the ratio

*ϕ′*is accepted if

*π*> 1; otherwise, the probability of acceptance is equal to

*π*. The first fraction in the definition of

*π*captures whether the proposed state is in a region of higher posterior density than the current state. The second term captures the reversibility of the proposed state transition in case the proposal distribution is asymmetric (for the Gaussian example mentioned above this term is equal to 1). Because

*ϕ*

_{t+1}depends only on

*ϕ*

_{t}and not on the history of states, the resulting chain is called a Markov chain. See Figure 1 for an example.

*hybrid*Monte Carlo sampling, which is also known as Hamiltonian sampling, as described by Neal (1993) and MacKay (2003). New states are proposed using a procedure that can be understood as a discrete simulation of Hamiltonian dynamics. The sampling scheme requires additional parameters to be set, namely the number of steps (so-called

*leapfrog*steps) and the step sizes used in the discrete simulation.

*γ*)/2 and (1 +

*γ*)/2 empirical quantiles of the samples as an

*approximate Bayesian γ confidence interval*.

*F*(

*x,θ*) be the psychometric function and

*F*

^{−1}its inverse. In the analysis of psychophysical data, a common interest is to locate the

*threshold m*=

*F*

^{−1}(0.5) and a range for which the detectability varies with the stimulus intensity. A common way of characterizing the sensitivity of an observer is the (inverse) slope of the psychometric function at the threshold location. Another way of describing the range of interest is the

*width w*defined as

*w*=

*F*

^{−1}(1−

*α*) −

*F*

^{−1}(

*α*). This is the length of the interval between

*F*

^{−1}(

*α*), the stimulus intensity at which

*F*(

*x,θ*) =

*α*, and

*F*

^{−1}(1 −

*α*) for some small

*α*. As default we use

*α*= 0.1, such that

*w*is the range in which

*F*ranges from 0.1 to 0.9. (see Figure 2a). The parameterization of the psychometric function in terms of threshold and width has been proposed by Alcalá-Quintana and Garcia-Pérez (2004). An advantage of this parameterization is that

*w*comes in the scale of the stimulus itself, whereas the value of the slope is usually difficult to interpret. Furthermore, the width is more general than the slope in the sense that it can be used in models for which the slope at a particular point does not determine the entire psychometric function.

*F*can be parameterized such that

*θ*= [

*m,w*]. Note that the approach described in this study is not limited to this particular parameterization of

*F*. Many of the functions used to model

*F*also appear in statistical generalized linear models (GLMs) in which they are called response functions (McCullagh & Nelder, 1989).

*probit*response, can be parameterized as where

*z*(

*α*) = Φ

^{−1}(1−

*α*)−Φ

^{−1}(

*α*) using the quantile function Φ

^{−1}(inverse of cdf) of the standard normal distribution. The resulting functions often appear very similar to the logistic (see Figure 2b).

*log-log*model. We use the parameterization where

*z*(

*α*) = ln(−ln(

*α*)). The Gumbel function is asymmetric. For small

*x*the function is similar to the logistic function but approaches 1 faster as the stimulus intensity gets larger. The asymmetry can be reversed, and we obtain the reversed Gumbel function where again

*z*(

*α*) = ln(−ln(

*α*)).

*w*. Instead, we parameterize the function by threshold location

*m*and slope at threshold . So we use the following parametric form for the Weibull function and for the reversed Weibull function. Both Weibull functions are defined for

*x*> 0 and tend to 0 as

*x*→0, which makes them conceptually appealing in many psychophysical settings.

*every*parameter value is equally likely to the scientist. However, this is typically

*not*what the scientists intend: What they want is to allow every model or shape of the model to be equally likely, that is, they want a flat prior in model space. Typically, a flat prior on parameters is, unfortunately, not flat in model space. For psychometric functions this is illustrated in Figure 3. Here we show that using flat priors on the parameters

*θ*strongly favors very steep psychometric functions in Figure 3a, whereas if we simply re-parameterize our psychometric function, this tendency to favor steep psychometric functions disappears in Figure 3b.

*π*

_{l}, a uniform distribution on [0,1], indeed represents maximal uncertainty. The hypotheses that every experimental observation was independent of the stimulus or that no lapse occurred are equally likely. That might reflect the uncertainties of a scientist under certain circumstances, but in general the notion of a lapse implies a rare event. Note that a flat prior on the lapse probability allows the model to explain all the data as a sequence of lapses, which intuitively minimizes the credibility of every observation. So if the scientist can safely assume that the lapse rate of an observer in a given task is small, the observations become more informative about the psychometric function and so its parameters can be better identified. On the other hand, excluding the potential existence of outliers forces the model to explain every single observation such that a single observation can become decisive. Note also that in the procedures described by Wichmann and Hill (2001a, 2001b) the parameter similar to the lapse rate is constrained during the ML optimization.

*π*

_{l}takes values in the unit interval, and therefore the Beta distribution

*p*(

*π*

_{l}|

*α,β*) = Beta(

*π*

_{l}|

*α,β*) is a convenient choice (see Figure 4a). For

*α*= 1 and

*β*= 1 the uniform distribution on [0,1] is a particular case.

*θ*, parameterizing the psychometric function as described above is advantageous because the parameters have a more intuitive interpretation. For convenience, we specify the priors independently

*p*(

*θ*) =

*p*(

*θ*

_{1})

*p*(

*θ*

_{2}). Especially for the location, parameter

*m*, a normal (Gaussian) distribution, is a convenient choice if its value is unconstrained. By setting the standard deviation to increasingly large values, the prior becomes vaguer. For parameters that are known to be strictly positive, for example, the width

*w*or the slope

*s*, the gamma or the log-normal distribution can be used. If

*x*is log-normal distributed, log(

*x*) follows a normal distribution. See Figure 4b for examples of gamma and log-normal probability density functions. This section sketched only a small selection of possible densities one can use for specifying priors on the parameters of psychometric functions. If common distributions are not sufficient to model the prior, mixtures of distributions can also be used.

*F*is a Gumbel function with threshold location

*m*= 5, width

*w*= 3 (for

*α*= 0.1), and lapse probability

*π*

_{l}= 0.05. For

*k*= 6 stimulus intensities

*x*

_{i}, corresponding to the

*F*values equal to 0.1, 0.3, 0.6, 0.74, 0.84, and 0.94, we generate

*N*

_{i}= 60 samples respectively, which sums to 360 Bernoulli trials in total.

*μ*= 2 and

*SD*

*σ*= 10, which expresses very little information about

*m*. On the width we put a log-normal prior distribution log

*N*(1,1) (see Figure 4b). Using hybrid MCMC sampling we simulate a Markov chain of 2000 samples from the posterior with 100 leapfrog steps and step sizes (0.5,0.1,0.2), which were chosen to obtain an acceptance rate of approximately 80% and very little autocorrelation between samples.

*x,θ,π*

_{l}) are depicted in Figure 5a. By taking samples from the MCMC chain and plotting the corresponding Ψ(

*x,θ,π*

_{l}), we obtain Figure 5b. Each of the sampled functions represents a hypothesis about the underlying generative function valid under the posterior. The functions are relatively close for large values of

*x*but show rather large differences for smaller stimulus intensities. This can be interpreted such that the experimental observations for low stimulus intensities do still support a rather wide range of hypotheses about the width of the psychometric function.

*m*and

*w*are generated, while the assumed

*π*

_{l}value varies according to the prior.

*m*, the samples shown in Figure 6b suggest that the threshold location is well inferred from the data. The prior, which was a wide normal, is approximately constant in the plotted region, and we observe that the data were very informative. The posterior samples of the width

*w*illustrated in Figure 6c show that the data were informative about

*w*but the remaining uncertainty is still relatively large. Note that the function samples given in Figure 5b already indicated that the posterior still supports a rather wide range of hypotheses on

*w*.

*parametric*bootstrap (Efron & Tibshirani, 1993). In parametric bootstrap methods

*i*= 1,…,

*B*, artificial or synthetic data sets

*D*

_{i}* are generated from one’s fitted model using the maximum likelihood estimates , and, if appropriate, of the parameters as generating parameters for the synthetic data sets. For each synthetic dataset

*D*

_{i}*, we obtain another set of maximum likelihood estimates and thus a bootstrap distribution of

*i*= 1,…,

*B*values for each parameter. From these distributions we obtain

*bias-corrected and accelerated confidence*intervals (Efron, 1987), currently considered state of the art in bootstrap techniques. (For the bootstrap experiments, we used the psignifit software implementation of the method sketched above and described in more detail by Wichmann and Hill, 2001b.)

*N*and the lapse parameter

*π*

_{l}. The Gumbel function as described above with

*m*= 5 and

*w*= 3 and same sample locations were used. The data set size N takes the values 90, 360, and 900 and lapse rate is set to either 0.05 or 0.15. For each of the six conditions, we generated 1000 data sets.

*p*(

*m*) =

*N*(2,10) and

*p*(

*w*) = log

*N*(1,1).

*π*

_{l}= 0.05, we used a Beta(2,50) prior for

*π*

_{l}. To make similar information available to the bootstrap method, we constrained the

*π*

_{l}parameter to [0,0.1] during maximum likelihood estimation. For data sets generated with

*π*

_{l}= 0.15, we used Beta(2,20) and [0,0.25]. The use of different prior distributions is necessary because either method-Bayesian inference and bootstrapping-is highly sensitive to the prior information on

*π*

_{l}. This is not a problem, however, because it appears realistic to assume that an experimenter is to some degree aware of the lapse rate and asymptotic performance of a subject during experiments resulting in different prior beliefs.

*m*and

*w*. We compare the MAP estimate, the MCMC sample mean (MEAN) and median (MED), the maximum likelihood (ML) estimate, and the constrained ML (CML) estimate computed by psignifit. Each line in the following table states the median of the absolute errors |

*m*−

*m**| of these point estimates in 1000 repeated experiments for the different values of

*N*and

*π*

_{l}.

N | π_{l} | MAP | MEAN | MED | ML | CML |
---|---|---|---|---|---|---|

90 | 0.05 | 0.301 | 0.316 | 0.289 | 0.349 | 0.336 |

90 | 0.15 | 0.370 | 0.426 | 0.353 | 0.418 | 0.401 |

360 | 0.05 | 0.147 | 0.141 | 0.136 | 0.166 | 0.165 |

360 | 0.15 | 0.232 | 0.175 | 0.179 | 0.241 | 0.230 |

900 | 0.05 | 0.102 | 0.088 | 0.090 | 0.109 | 0.110 |

900 | 0.15 | 0.183 | 0.131 | 0.139 | 0.166 | 0.155 |

*m*, the sample median MED consistently shows good accuracy. Note that the MED minimizes the expected absolute error so this result conforms to the theory.

N | π_{l} | MAP | MEAN | MED | ML | CML |
---|---|---|---|---|---|---|

90 | 0.05 | 0.906 | 1.331 | 1.024 | 1.446 | 1.363 |

90 | 0.15 | 0.905 | 2.113 | 1.128 | 1.717 | 1.655 |

360 | 0.05 | 0.479 | 0.517 | 0.478 | 0.629 | 0.635 |

360 | 0.15 | 0.559 | 0.616 | 0.574 | 0.828 | 0.826 |

900 | 0.05 | 0.314 | 0.334 | 0.320 | 0.400 | 0.401 |

900 | 0.15 | 0.464 | 0.431 | 0.405 | 0.502 | 0.495 |

*w*is best estimated by the MAP followed by the MED. The errors decrease with sample size and increase for the large lapse rate.

*over-conservative*statements, whereas smaller values indicate

*over-confidence*. With the frequency we also report the median width of the computed confidence intervals.

m | w | ||||
---|---|---|---|---|---|

N | π_{l} | Accuracy | Width | Accuracy | Width |

90 | 0.05 | 0.783 | 1.422 | 0.903 | 9.049 |

90 | 0.15 | 0.707 | 1.558 | 0.911 | 18.258 |

360 | 0.05 | 0.863 | 0.757 | 0.883 | 3.184 |

360 | 0.15 | 0.795 | 0.915 | 0.865 | 4.148 |

900 | 0.05 | 0.848 | 0.488 | 0.859 | 1.934 |

900 | 0.15 | 0.818 | 0.682 | 0.867 | 2.573 |

m | w | N | π_{l} | Accuracy | Width | Accuracy | Width | |||
---|---|---|---|---|---|---|---|---|---|---|

90 | 0.05 | 0.911 | 1.765 | 0.933 | 8.103 | |||||

90 | 0.15 | 0.922 | 2.340 | 0.981 | 11.882 | |||||

360 | 0.05 | 0.918 | 0.750 | 0.931 | 2.959 | |||||

360 | 0.15 | 0.926 | 0.884 | 0.937 | 3.745 | |||||

900 | 0.05 | 0.919 | 0.457 | 0.916 | 1.694 | |||||

900 | 0.15 | 0.901 | 0.585 | 0.916 | 2.209 |

*m*, the approximated Bayesian confidence intervals exhibit accuracy close to the desired 90% for all six conditions. The bootstrap confidence intervals appear to be over-confident, especially for small data sets and high lapse rates. For small data sets

*N*= 90, the width of the Bayesian confidence regions is larger, whereas for larger data set sizes, the Bayesian confidence intervals exhibit higher accuracy but smaller interval width.

*w*parameter, both the bootstrap and the Bayesian confidence regions were found to be relatively accurate. The Bayesian confidence regions tend to be conservative, especially for

*N*= 90 and

*π*

_{l}= 0.15, whereas the median width over the confidence regions is consistently smaller.

*α*= 2 and

*β*= 50 (see Figure 4a). This prior also expresses our belief that observers are unlikely to perform errorlessly. The mean of the prior distribution is given by

*α*/(

*α*+

*β*) ≈ 4% and the mode is (

*α*− 1)/(

*α*+

*β*− 2) = 2%. Thinking about the stimulus, we can derive a conservative prior on the threshold location. At 100% contrast, a sine wave can clearly be seen, but at about 10%, the task becomes difficult. This is the range in which we would expect the threshold. At 1% the task seems almost impossible. A reasonable prior on log-contrast threshold therefore has a maximum at −1 (10%). We can take a Gaussian with this mean. Even though −2 corresponding to 1% and 0 corresponding to 100% seem to be unlikely threshold values, we do not want to rule out these hypotheses a priori. Therefore,

*SD*1 seems to be a conservative choice.

*α*= 2 and

*β*= 1.5, we found it to be a good description of our beliefs. The mean is given by

*α*/

*β*= 1.33 and the mode by (

*α*− 1)/

*β*= 0.66. The standard deviation that is given by is large enough to support values even bigger than 2.

*θ*we found that even relatively small data sets are often informative enough to overrule the prior. However, a Bayesian analysis should always report prior and posterior distributions, and the latter should always be interpreted relative to the prior given the model.