**Psychometric functions (PFs) quantify how external stimuli affect behavior, and they play an important role in building models of sensory and cognitive processes. Adaptive stimulus-selection methods seek to select stimuli that are maximally informative about the PF given data observed so far in an experiment and thereby reduce the number of trials required to estimate the PF. Here we develop new adaptive stimulus-selection methods for flexible PF models in tasks with two or more alternatives. We model the PF with a multinomial logistic regression mixture model that incorporates realistic aspects of psychophysical behavior, including lapses and multiple alternatives for the response. We propose an information-theoretic criterion for stimulus selection and develop computationally efficient methods for inference and stimulus selection based on adaptive Markov-chain Monte Carlo sampling. We apply these methods to data from macaque monkeys performing a multi-alternative motion-discrimination task and show in simulated experiments that our method can achieve a substantial speed-up over random designs. These advances will reduce the amount of data needed to build accurate models of multi-alternative PFs and can be extended to high-dimensional PFs that would be infeasible to characterize with standard methods.**

This mapping from stimulus to response probabilities is known as the *psychometric function*. Because the psychometric function is not directly observable, it must be inferred from multiple observations of stimulus–response pairs. However, such experiments are costly due to the large numbers of trials typically required to obtain good estimates of the psychometric function. Therefore, a problem of major practical importance is to develop efficient experimental designs that can minimize the amount of data required to accurately infer an observer's psychometric function.

On each trial, the observer is presented with a stimulus **x** ∈ ℝ^d and selects a response from among *k* discrete choices. We will assume the stimulus is represented internally by some (possibly nonlinear) feature vector **ϕ**(**x**), which we will write simply as **ϕ** for notational simplicity.

The multinomial logistic (MNL) model describes the probability *p_i* of each possible outcome *i* in terms of the feature vector **ϕ** and a vector of weights **w**_i according to

*p_i* = exp(**ϕ** · **w**_i) / Σ_j exp(**ϕ** · **w**_j). (Equation 1)

Because these probabilities are invariant under a common shift of all weight vectors, we fix **w**_k = **0**, so that the denominator in Equation 1 becomes 1 + Σ_{j=1}^{k−1} exp(**ϕ** · **w**_j). We consider **ϕ** to be a known function of the stimulus **x**, even when the dependence is not written explicitly. For example, we can consider a simple form of feature embedding in which the weights for the *i*th choice comprise a constant offset *b_i* for the *i*th choice and linear weights **a**_i governing sensitivity to **x**. The resulting choice probability has the familiar form *p_i* ∝ exp(*b_i* + **a**_i · **x**). More complex dependencies can be captured by including nonlinear functions of **x** in the feature vector **ϕ**(**x**) (Knoblauch & Maloney, 2008; Murray, 2011; Neri & Heeger, 2002). Dependencies on the trial history, such as the previous stimulus or reward, may also be included as additional features in **ϕ** (see, e.g., Bak, Choi, Akrami, Witten, & Pillow, 2016).

We assume that each stimulus dimension *x_α* is standardized, so that its expectation over the stimulus space is zero and its variance is unity; the offset *b* is then the expectation value of the log probability over all possible stimuli. The unit-variance condition means that the effect of moving a certain distance along one dimension of the weight space is comparable to moving the same distance along another dimension, again averaged over all possible stimuli. In other words, we are justified in using the same unit along all dimensions of the weight space.
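As a concrete illustration of Equation 1 under the linear embedding, the choice probabilities can be computed in a few lines of code (a minimal sketch in our own notation, not the authors' implementation; the function and variable names are ours):

```python
import numpy as np

def mnl_choice_probs(x, b, A):
    """MNL choice probabilities p_i ∝ exp(b_i + a_i·x), i.e., Equation 1
    with the simple feature embedding phi(x) = (1, x).

    x : stimulus vector, shape (d,)
    b : constant offsets b_i, shape (k,)
    A : linear weights a_i, shape (k, d)
    """
    v = b + A @ x              # linear predictor for each choice
    v = v - v.max()            # common shift for numerical stability
    p = np.exp(v)
    return p / p.sum()         # softmax normalization

# Example: 3 choices, 2-D stimulus
p = mnl_choice_probs(np.array([0.5, -1.0]),
                     np.array([0.0, 0.2, -0.1]),
                     np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]))
```

Setting the last choice's offset and weights to zero reproduces the identifiability convention **w**_k = **0**, since the probabilities are unchanged by a common shift.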

Trials on which the observer fails to respond are called *omissions* or *violations* and are typically discarded from analysis. However, failure to take these trials into account may bias the estimates of the PF if they are more common for some stimuli than others (see Figure 2B). We therefore model omission as an additional choice *k* + 1 in addition to the *k* valid responses. One could even consider different kinds of omissions separately—for example, allowing choice *k* + 1 to reflect fixation-period violations and choice *k* + 2 to reflect failure to report a choice during the response window. Henceforth, we will let *k* denote the total number of choices including omission, as illustrated in Figure 1B.

Observers also make occasional stimulus-independent errors, known as *lapses* or *button-press errors*, which may reflect lapses in concentration or memory of the response mapping (Kuss et al., 2005; Wichmann & Hill, 2001a). Lapses are most easily identified by errors on easy trials—that is, trials that should be performed perfectly if the observer is paying attention.

We incorporate lapses using a mixture model. On each trial, a Bernoulli random variable *a* ∼ Ber(*λ*) governs whether the observer lapses: With probability *λ* the observer lapses (i.e., ignores the stimulus), and with probability 1 − *λ* the observer attends to the stimulus. If the observer lapses (*a* = 1), the response is drawn according to a fixed probability distribution {*c_i*}, *i* = 1, …, *k*; if the observer attends (*a* = 0), the response is selected according to the MNL model. Under this model, the conditional probability of choosing option *i* given the stimulus can be written as

*p_i* = (1 − *λ*) *q_i* + *λ c_i*,

where *q_i* is the lapse-free probability under the classical MNL model (Equation 1).

Rather than fitting the lapse probabilities directly, we parameterize them by real-valued parameters *u_i*, such that *λ c_i*, the conditional probability of choosing the *i*th option due to a lapse, is written

*λ c_i* = exp(*u_i*) / (1 + Σ_j exp(*u_j*)),

where *u_i* is proportional to the log probability of choosing option *i* due to lapse. The lapse-conditional probabilities *c_i* of each choice and the total lapse probability *λ* are respectively

*c_i* = exp(*u_i*) / Σ_j exp(*u_j*),  *λ* = Σ_j exp(*u_j*) / (1 + Σ_j exp(*u_j*)).

Because each *u_i* lives on the entire real line, fitting can be carried out with unconstrained optimization methods, although adding reasonable constraints may improve performance in some cases. The full parameter vector of the resulting model is **θ** = {**w**, **u**}, which includes *k* additional lapse parameters beyond the MNL weights. A useful simplification is to assume that lapse responses are uniform across choices, *c_i* = 1/*k*. For this simplified uniform-lapse model we need only a single lapse parameter *u*.

Note that we have unified the parameterizations of the lapse rate (deviation of the upper asymptote of the PF from 1; in this case, *λ* − *λ c_i*) and the guess rate (deviation of the lower asymptote from 0; in this case, *λ c_i*), which have often been modeled separately in previous works with two-alternative responses (Schütt, Harmeling, Macke, & Wichmann, 2016; Wichmann & Hill, 2001a, 2001b). Here they are written in terms of a single family of parameters **θ** that includes both the weight and lapse parameters, rather than treating the lapse separately. In particular, our parameterization (Equation 3) has the advantage that there is no need to constrain the support of the lapse parameters *u_i*. These parameters' relationship to the lapse probabilities *c_i* takes the same (softmax) functional form as the MNL model, placing both sets of parameters on an equal footing.
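The lapse parameterization above is easy to sketch in code (our own variable names; a toy illustration rather than the paper's implementation):

```python
import numpy as np

def lapse_probs(u):
    """Map unconstrained lapse parameters u_i to lapse-conditional
    probabilities: lambda * c_i = exp(u_i) / (1 + sum_j exp(u_j))."""
    e = np.exp(u)
    lam_c = e / (1.0 + e.sum())        # lambda * c_i for each choice
    return lam_c, lam_c.sum()          # and the total lapse probability lambda

def pf_with_lapses(q, u):
    """Mixture PF: p_i = (1 - lambda) * q_i + lambda * c_i."""
    lam_c, lam = lapse_probs(u)
    return (1.0 - lam) * q + lam_c

# With all u_i at their maximal value 0 and k = 4 choices,
# each lambda * c_i equals 1/(k + 1) = 0.2
lam_c, lam = lapse_probs(np.zeros(4))
```

Note that the mixture always returns a valid distribution: if the lapse-free probabilities *q_i* sum to 1, so do the *p_i*.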

_{i}*p*(

*), which captures prior uncertainty about the model parameters*

**θ***, and a likelihood function*

**θ***s*= 1, …,

*t*consists of stimulus–response pairs observed up to the current time bin

*t*.

*σ*. This choice of prior is appropriate when the regressors {

**x**} are standardized, since any single weight can take values that allow for a range of PF shapes along that axis, from flat (

*w*= 0) to steeply decreasing (

*w*= −2

*σ*) or increasing (

*w*= +2

*σ*). We used

*σ*= 3 in the simulated experiments in Results. For the lapse parameters

*λc*is bounded between 0.001 and 1/2. We set the lower range constraint below 1/

_{i}*N*, where

*N*= 100 is the number of observed trials in our simulations, since we cannot reasonably infer lapse probabilities with precision finer than 1/

*N*. The upper range constraint gives maximal lapse probabilities of 1/(

*k*+ 1) if all

*u*take on the maximal value of 0. Note that our prior is uniform with respect to the rescaled lapse parameters

_{i}*y*to be a scalar taking values in the set {1, … ,

*k*}, it is more convenient to use a “one-hot” or “1-of-

*k*” representation, in which the response variable

**y**for each trial is a vector of length

*k*with one 1 and

*k*− 1 zeros; the position of the 1 in this vector indicates the category selected. For example, in a task with four possible options per trial, a response vector

**y**= [0 0 1 0] indicates a trial on which the observer selected the third option.
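The one-hot encoding is trivial to implement; for instance (a sketch, with names of our choosing):

```python
import numpy as np

def one_hot(choice, k):
    """Encode a scalar choice in {1, ..., k} as a length-k indicator vector."""
    y = np.zeros(k, dtype=int)
    y[choice - 1] = 1          # choices are 1-indexed, arrays are 0-indexed
    return y

y = one_hot(3, 4)              # third of four options
```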

Here *p_i*(**x**, **θ**) denotes the probability of response *i* given the stimulus **x** and parameters **θ**. In the lapse-free case the log likelihood is concave in **θ**, which guarantees that numerical optimization of the log likelihood will find a global optimum. With a finite lapse rate, however, the log likelihood is no longer concave (see Appendix A).

The posterior is proportional to the product of the prior and the likelihood, up to a normalization constant that depends on the data up to trial *t* and not on **θ**. Because this constant has no tractable analytic form, we rely on two alternate methods for obtaining a normalized posterior distribution.

The first method is based on the maximum a posteriori (MAP) estimate of **θ**. This vector, given by the mode of the log posterior, can be found by numerical optimization. When the log posterior is concave (i.e., when the model does *not* contain lapses), numerical optimization is guaranteed to find a global maximum of the posterior. Log concavity also strengthens the rationale for using the Laplace approximation, since the true and approximate posterior are both log-concave densities centered on the true mode (Paninski et al., 2010; Pillow, Ahmadian, & Paninski, 2011). When the model incorporates lapses, these guarantees no longer apply globally.

The Laplace approximation models the posterior as a Gaussian centered on the mode of the negative log posterior *f*(**θ**). The mean of the approximate posterior is obtained from setting ∇*f*(**θ**) = **0**, and its covariance is the inverse Hessian of *f* at the mode. A point estimate of **θ** often suffices for summarizing the PF, although for adaptive stimulus selection we will be interested in the full shape of the posterior.

The second method represents the posterior with samples obtained by Markov-chain Monte Carlo (MCMC); enough samples *M* are required to obtain an accurate approximation to the posterior.

On each trial we present a stimulus **x** and observe the outcome **y**. After *t* trials, the expected gain in information from a stimulus **x** is equal to the mutual information between **y** and the model parameters **θ**, given the data observed so far. This is the conditional *mutual information*:

I(**θ**; **y** | **x**, D_t) = H(**θ** | D_t) − E_**y**[H(**θ** | **y**, **x**, D_t)],

where the expectation is over the distribution of responses **y** given the stimulus **x** and dataset D_t. The first term is the entropy of the posterior over **θ** given the data up to trial *t*, and the second term is the expected entropy of the posterior over **θ** given **y**, *after* having observed **x** and **y**, averaged over draws of **y** from the posterior-predictive distribution. Written this way, the mutual information can be seen as the expected reduction in posterior entropy from a new stimulus–response pair. Moreover, the mutual information between **θ** and **y**, conditioned on the stimulus, can equivalently be expanded as

I(**θ**; **y** | **x**, D_t) = H(**y** | **x**, D_t) − E_**θ**[H(**y** | **x**, **θ**)],

the difference between the entropy of **y** conditioned on **x** (with **θ** integrated out) and the average entropy of **y** given **x** and **θ**, averaged over the posterior distribution of **θ**. This dual expansion of the mutual information has also been used by Kujala and Lukka (2006).

If *t* is the latest trial and *t* + 1 is the upcoming one, the optimal stimulus is the information-maximizing (“infomax”) solution:

**x**_{t+1} = argmax_**x** I(**θ**; **y** | **x**, D_t),

where the maximization is over the set of candidate stimuli {**x**}. Since each evaluation of the mutual information involves a high-dimensional integral over parameter space and response space, this is a highly computationally demanding task. In the next sections, we present two algorithms for efficient infomax stimulus selection based on each of the two approximate inference methods described previously.

Under the Laplace approximation, the posterior covariance *C* is equal to the inverse Hessian of the negative log posterior, and the expected posterior at trial *t* + 1 is given by the updated covariance after observing a new pair (**x**, **y**). To evaluate the updated covariance exactly, we would have to re-optimize the posterior over **θ** for each possible response **y** for any candidate stimulus **x**, which would be computationally infeasible. We therefore use a fast approximate method for obtaining a closed-form update for *C*, following an approach developed by Lewi et al. (2009); see Appendix C for details. Note that this approximate sequential update is only used for calculating the expected utility of each candidate stimulus by approximating the posterior distribution at the next trial. For obtaining the MAP estimate of the current model parameter **θ**_t, numerical optimization needs to be performed using the full accumulated data.

We must also average over responses **y** that are likely under the posterior-predictive distribution. This is done in two steps, by separating the integral in Equation 19 into a sum over responses and an integral over parameters. For each response **y**, the outer integral over the parameter **θ** is in general challenging, especially when the parameter space is high dimensional. In the case of the standard MNL model that does not include lapse, we can exploit the linear structure of the model to reduce this to a lower dimensional integral over the space of the linear predictor, which we evaluate numerically using Gauss–Hermite quadrature (Heiss & Winschel, 2008). (This integral is 1-D for classic logistic regression and has *k* − 1 dimensions for MNL regression with *k* classes; see Appendix C for details.) When the model incorporates lapses, the full parameter vector **θ** no longer interacts with the data solely through a linear projection onto **w**. In this case, our method with Laplace approximation may suffer from reduced accuracy due to the fact that the posterior may be less closely approximated by a Gaussian.

A practical alternative is to maximize the *partial* information carried by the weight parameters alone, holding the lapse parameters fixed at their current estimates of **θ**. Thus for Laplace-based infomax exclusively, the partial covariance over the weights is used in place of the full covariance.

For MCMC-based infomax, we use the dual form of the mutual information, written in terms of entropies of the response **y** instead of a reduction in the posterior entropy. This will make it straightforward to approximate integrals needed for mutual information by Monte Carlo integrals involving sums over samples. Also note that we are back in the full parameter space; we no longer treat the lapse parameters separately, as we did for the Laplace-based infomax.

Given a chain of posterior samples at trial *t*, we can evaluate the mutual information using sums over “potential” terms computed once per sample for each candidate stimulus in {**x**}. The computational cost of this approach is therefore linear in the number of samples, and the primary concern is the cost of obtaining a representative sample from the posterior.
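In code, a sample-based estimate of the dual form might look as follows (a sketch with our own names; `prob_matrix` holds p(**y** | **x**, **θ**^(m)) for each of M posterior samples, however those probabilities are computed):

```python
import numpy as np

def mutual_info_mc(prob_matrix, eps=1e-12):
    """Monte Carlo estimate of I(theta; y | x) from posterior samples.

    prob_matrix : shape (M, k); row m holds p(y = i | x, theta_m).
    Uses the dual form: H(mean_m p_m) - mean_m H(p_m).
    """
    p_bar = prob_matrix.mean(axis=0)                     # marginal predictive
    h_marg = -np.sum(p_bar * np.log(p_bar + eps))        # H(y | x)
    h_cond = -np.sum(prob_matrix * np.log(prob_matrix + eps), axis=1).mean()
    return h_marg - h_cond                               # expected info gain

def select_infomax(prob_tensor):
    """Index of the candidate stimulus with maximal mutual information.

    prob_tensor : shape (n_candidates, M, k)
    """
    return int(np.argmax([mutual_info_mc(P) for P in prob_tensor]))

# A stimulus about which the samples disagree carries information;
# one about which they agree carries none.
P_agree = np.tile([[0.9, 0.1]], (2, 1))
P_disagree = np.array([[0.9, 0.1], [0.1, 0.9]])
```

This makes the intuition of the dual expansion concrete: information is high exactly where the posterior samples make conflicting predictions about the response.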

In each simulated experiment, a stimulus **x** was selected on each trial and the observer's response **y** was sampled from a “true” psychometric function, whose lapse parameters *λ c_i* were set to either zero (the lapse-free case) or a constant value of 0.05, resulting in a total lapse probability of *λ* = 0.2 across the four choices (Figure 5B). We compared performance of our adaptive algorithms with a method that selected a stimulus uniformly at random from the grid on each trial. We observed that the adaptive methods tended to sample more stimuli near the boundaries between colored regions on the stimulus space (Figure 5C), which led to more efficient estimates of the PF compared to the uniform stimulus-selection approach (Figure 5D). We also confirmed that the posterior entropy of the inferred parameters decreased more rapidly with our adaptive stimulus-sampling algorithms in all cases (Figure 5E and 5F). This was expected because our algorithms explicitly attempt to minimize the posterior entropy by maximizing the mutual information.

Estimation error was evaluated over the stimulus grid {**x**_i} and the four possible responses {*j*}. For MAP-based inference, estimated probabilities were given by the model evaluated at the MAP parameter estimate.

Our simulations used a total lapse probability of *λ* = 0.2, which is more typical in the case of less sophisticated animals such as rodents (see, e.g., Scott, Constantinople, Erlich, Tank, & Brody, 2015); lapse rates in well-designed human psychophysics tasks are typically lower.

In terms of computation time, performance was limited by the chain length *M* in each MCMC chain rather than by the model complexity. Using the standard implementation for the Metropolis–Hastings sampler in Matlab, a time per trial of approximately 0.1 s was achieved with chains shorter than *M* ≈ 200, which was good enough to represent the posterior distributions for our simulated examples (Figure 7C and 7D, bottom panels), although we note that longer chains are required to sample a more complex posterior distribution, and this particular length *M* should not be taken as the benchmark in general.

In this analysis, each algorithm selects on each trial from the set of stimuli {**x**_i} not yet incorporated into the model. This selection takes place without access to the actual responses {**y**_i}. We update the posterior using the stimulus **x**_i and the response **y**_i it actually elicited during the experiment, then proceed to the next trial. We can then ask whether adding the data according to the proposed reordering would have led to faster narrowing of the posterior distribution than other orderings.

The infomax reorderings narrowed the posterior more rapidly than the original order of presentation (*N* = 100 trials under uniform sampling). On the other hand, the sets of stimuli selected by both MAP-based and MCMC-based infomax algorithms were similar. Figure 8D shows the histogram of stimulus components along one of the axes, over the first *N* = 100 trials, averaged over 100 independent runs under each stimulus-selection algorithm using the lapse-free model.

Finally, we note that incorporating lapses introduces only *k* additional parameters to sample, one per each available choice, which is independent of the dimensionality of the stimulus space.

*Nature*, 554 (7692), 368–372, https://doi.org/10.1038/nature25510.

*Advances in neural information processing systems 29* (pp. 1947–1955). Red Hook, NY: Curran Associates, Inc.

*arXiv:0809.0387*, 1–28.

*Pattern recognition and machine learning*. New York: Springer.

*The Journal of Neuroscience*, 31 (31), 11351–11361, https://doi.org/10.1523/JNEUROSCI.6689-10.2011.

*Nature Neuroscience*, 16 (7), 824–831, https://doi.org/10.1038/nn.3410.

*Neural Computation*, 22 (4), 887–905, https://doi.org/10.1162/neco.2009.02-09-959.

*Journal of Statistical Planning and Inference*, 21, 191–208, https://doi.org/10.1016/0378-3758(89)90004-9.

*Statistical Science*, 10, 273–304.

*Nature Neuroscience*, 11 (6), 693–702, https://doi.org/10.1038/nn.2123.

*Journal of the Experimental Analysis of Behavior*, 84 (3), 581–617, https://doi.org/10.1901/jeab.2005.23-05.

*Advances in neural information processing systems 30* (pp. 1395–1405). Red Hook, NY: Curran Associates, Inc.

*Journal of Vision*, 15 (9): 5, 1–20, https://doi.org/10.1167/15.9.5. [PubMed] [Article]

*Neural Computation*, 23 (9), 2242–2288, https://doi.org/10.1162/NECO_a_00167.

*Journal of Vision*, 14 (7): 9, 1–16, https://doi.org/10.1167/14.7.9. [PubMed] [Article]

*Proceedings of the thirty-first Conference on Uncertainty in Artificial Intelligence* (pp. 286–297). Arlington, VA: AUAI Press.

*Bayesian Statistics*, 5, 599–607.

*Journal of the Royal Statistical Society, Series B (Methodological)*, 57 (3), 533–546.

*Bernoulli*, 7 (2), 223–242, https://doi.org/10.2307/3318737.

*Journal of Econometrics*, 144 (1), 62–80, https://doi.org/10.1016/j.jeconom.2007.12.004.

*SIAM Review*, 23 (1), 53–60, https://doi.org/10.1137/1023004.

*Linear Algebra and Its Applications*, 103 (C), 103–118, https://doi.org/10.1016/0024-3795(88)90223-6.

*eLife*, 6: e16650, https://doi.org/10.7554/eLife.16650.

*Cognitive Science*, 41 (8), 2234–2252, https://doi.org/10.1111/cogs.12467.

*Neural Computation*, 26, 2465–2492, https://doi.org/10.1162/NECO_a_00654.

*Vision Research*, 34 (7), 885–912, https://doi.org/10.1016/0042-6989(94)90039-6.

*Journal of Vision*, 8 (16): 10, 1–19, https://doi.org/10.1167/8.16.10. [PubMed] [Article]

*Vision Research*, 39 (16), 2729–2737, https://doi.org/10.1016/S0042-6989(98)00285-5.

*Journal of Mathematical Psychology*, 50 (4), 369–389, https://doi.org/10.1016/j.jmp.2005.12.005.

*Journal of Vision*, 5 (5): 8, 478–492, https://doi.org/10.1167/5.5.8. [PubMed] [Article]

*Journal of the Experimental Analysis of Behavior*, 84 (3), 555–579, https://doi.org/10.1901/jeab.2005.110-04.

*Journal of Vision*, 10 (3): 17, 1–21, https://doi.org/10.1167/10.3.17. [PubMed] [Article]

(*d*′) in yes-no and forced-choice tasks. *Frontiers in Psychology*, 6, 1070, https://doi.org/10.3389/fpsyg.2015.01070.

*Neural Computation*, 21 (3), 619–687, https://doi.org/10.1162/neco.2008.08-07-594.

*Journal of Computational Neuroscience*, 30 (1), 181–200, https://doi.org/10.1007/s10827-010-0248-1.

*Neural Computation*, 4 (4), 590–604, https://doi.org/10.1162/neco.1992.4.4.590.

*The Journal of Chemical Physics*, 21 (6), 1087–1092, https://doi.org/10.1063/1.1699114.

*Journal of Vision*, 11 (5): 2, 1–25, https://doi.org/10.1167/11.5.2. [PubMed] [Article]

*Nature Neuroscience*, 5 (8), 812–816, https://doi.org/10.1038/nn886.

*Journal of Computational Neuroscience*, 29 (1), 107–126, https://doi.org/10.1007/s10827-009-0179-x.

*bioRxiv*, 178418, https://doi.org/10.1101/178418.

*Advances in neural information processing systems 24* (pp. 2043–2051). Red Hook, NY: Curran Associates, Inc.

*Advances in neural information processing systems 25* (pp. 2357–2365). Red Hook, NY: Curran Associates, Inc.

*Neural Computation*, 26 (8), 1519–1541.

*Neural Computation*, 23 (1), 1–45, https://doi.org/10.1162/NECO_a_00058.

*Closed loop neuroscience*. San Diego, CA: Academic Press.

*Journal of Vision*, 12 (6): 25, 1–16, https://doi.org/10.1167/12.6.25. [PubMed] [Article]

*Journal of Vision*, 13 (7): 3, 1–17, https://doi.org/10.1167/13.7.3. [PubMed] [Article]

*Annals of Applied Probability*, 7 (1), 110–120, https://doi.org/10.1214/aoap/1034625254.

*Handbook of Markov chain Monte Carlo* (pp. 93–112). Boca Raton, FL: Chapman and Hall CRC, https://doi.org/10.1201/b10905.

*Vision Research*, 122, 105–123, https://doi.org/10.1016/j.visres.2016.02.002.

*eLife*, 4, e11308, https://doi.org/10.7554/eLife.11308.

*Vision Research*, 35 (17), 2503–2522, https://doi.org/10.1016/0042-6989(95)00016-X.

*Seeing and Perceiving*, 23 (5), 483–515, https://doi.org/10.1163/187847510X532694.

*Journal of Vision*, 17 (3): 10, 1–27, https://doi.org/10.1167/17.3.10. [PubMed] [Article]

*Perception & Psychophysics*, 33 (2), 113–120, https://doi.org/10.3758/BF03202828.

*Journal of Experimental Psychology*, 99 (2), 180–185, https://doi.org/10.1037/h0034736.

*Perception & Psychophysics*, 63 (8), 1293–1313, https://doi.org/10.3758/BF03194544.

*Perception & Psychophysics*, 63 (8), 1314–1329.

*Biometrics*, 55 (2), 437–444, https://doi.org/10.1111/j.0006-341X.1999.00437.x.

In the classical (lapse-free) MNL model, the probability *p_i* governing the response *y* depends on the input only through 1-D projections of the input, the *linear predictors*. Recall that **ϕ** = **ϕ**(**x**) is the input feature vector. In the multinomial case, it is useful to consider the column vector of linear predictors for a single trial, **V** = *X* **w**, where *X* is a block diagonal matrix containing *k* blocks of **ϕ**^T. The derivatives of the log likelihood *L* with respect to **V** are

∇_**V** *L* = **y** − **p**,  ∂²*L*/∂**V**² = −(diag(**p**) − **p** **p**^T),

where diag(**p**) is the matrix with the elements of **p** on the diagonal and zeros otherwise. Transforming to derivatives with respect to **w** is easy, thanks to the linear relationship **V** = *X* **w**:

∇_**w** *L* = *X*^T(**y** − **p**),  ∂²*L*/∂**w**² = −*X*^T(diag(**p**) − **p** **p**^T)*X*.

In the lapse-free case, *L* is concave with respect to **V** (and therefore with respect to **w**). To prove the concavity of *L*, we show that the Hessian is negative semidefinite, i.e., that **z**^T(∂²*L*/∂**V**²)**z** ≤ 0 for any vector **z**.

With lapses, the choice probability becomes *p_i* = (1 − *λ*) *q_i* + *λ c_i*, where *q_i* is the lapse-free MNL probability. Differentiating with respect to **V** and contracting the resulting Hessian with **z**, the quadratic form acquires additional terms that are no longer guaranteed to be nonpositive. These extra terms scale with the lapse contribution to each *p_i*, therefore vanishing in the lapse-free limit, in which the concavity of the classical model is recovered.
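The lapse-free derivatives above are easy to verify numerically. This sketch (our own code) computes the gradient and Hessian with respect to **V** for one trial and confirms that the Hessian is negative semidefinite:

```python
import numpy as np

def mnl_grad_hess(V, y):
    """Gradient and Hessian of the MNL log likelihood w.r.t. the
    linear predictors V for a single trial (y is one-hot)."""
    p = np.exp(V - V.max())
    p /= p.sum()                              # softmax probabilities
    grad = y - p                              # dL/dV
    hess = -(np.diag(p) - np.outer(p, p))     # d2L/dV2
    return grad, hess

V = np.array([0.3, -1.2, 0.8])
y = np.array([0.0, 0.0, 1.0])
g, H = mnl_grad_hess(V, y)
eigs = np.linalg.eigvalsh(H)                  # all eigenvalues <= 0
```

The all-ones vector is a null direction of the Hessian, reflecting the shift invariance that motivated fixing **w**_k = **0**.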

At each step, the Metropolis–Hastings algorithm draws a new proposal **x′** based on the current sample value **x**_t. The choice follows the proposal density function, **x′** ∼ *Q*(**x′** | **x**_t). When the proposal density *Q* is symmetric, for example a Gaussian, the sequence of samples is a random walk. In general the width of *Q* should match the statistics of the distribution being sampled, and individual dimensions in the sampling space may behave differently in the multivariate case; finding the appropriate *Q* can be difficult.

The proposal is then either accepted or rejected; if rejected, the current sample is repeated, **x′** = **x**_t. The probability of acceptance is determined by comparing the values of *P*(**x**_t) and *P*(**x′**), where *P*(**x**) is the distribution being sampled. Because the algorithm only considers the acceptance ratio *ρ* = *f*(**x′**)/*f*(**x**_t), where *f*(**x**) can be any function proportional to the desired distribution *P*(**x**), there is no need to worry about the proper normalization of the probability distribution. If *ρ* ≥ 1, the move is always accepted; if *ρ* < 1, it is accepted with a probability *ρ*. Consequently, the samples tend to stay in the high-density regions, visiting the low-density regions only occasionally.

A good rule of thumb is to match the covariance Σ_p of the proposal distribution to the covariance matrix Σ of the sampled chain. Once a linear scaling relation Σ_p = *s_d* Σ is fixed, it has been observed that it is optimal to have *s_d* = (2.38)²/*d*, where *d* is the dimensionality of the sampling space (Gelman et al., 1996; Roberts et al., 1997). An adaptive Metropolis algorithm (Haario et al., 2001) follows this observation, where the Gaussian proposal distribution adapts continuously as the sampling progresses. That adaptive algorithm uses the same scaling rule Σ_p = *s_d* Σ but updates Σ_p at each proposal, where Σ is the covariance of the samples accumulated so far. Additionally, a small diagonal component is added for stability, as Σ_p = *s_d*(Σ + ε*I*).
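A compact version of such an adaptive random-walk sampler might look like this (a sketch under our own simplifications: a fixed non-adaptive warm-up period instead of the exact schedule of Haario et al., and an illustrative Gaussian target; not the Matlab implementation used in the paper):

```python
import numpy as np

def adaptive_metropolis(log_f, x0, n_samples=5000, warmup=50, eps=1e-6, seed=0):
    """Random-walk Metropolis with a Haario-style adaptive proposal:
    proposal covariance s_d * (cov(chain) + eps * I), s_d = 2.38^2 / d."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    s_d = 2.38 ** 2 / d
    chain = [np.asarray(x0, dtype=float)]
    log_p = log_f(chain[0])
    for t in range(1, n_samples):
        # identity proposal during warm-up, then adapt to the chain so far
        cov = np.cov(np.array(chain).T) if t > warmup else np.eye(d)
        prop = rng.multivariate_normal(chain[-1],
                                       s_d * (np.atleast_2d(cov) + eps * np.eye(d)))
        log_p_prop = log_f(prop)
        # accept with probability min(1, f(x') / f(x_t))
        if np.log(rng.random()) < log_p_prop - log_p:
            chain.append(prop)
            log_p = log_p_prop
        else:
            chain.append(chain[-1].copy())   # rejected: repeat current sample
    return np.array(chain)

# Example target: 2-D standard normal, via its log density up to a constant
samples = adaptive_metropolis(lambda x: -0.5 * x @ x, np.zeros(2))
```

Because only differences of log densities enter the acceptance test, the target needs to be known only up to a normalization constant, which is the property exploited when sampling the posterior.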

Consider the model without lapses, **θ** = **w**, for the moment, where the use of the Laplace approximation is valid. Rearranging from Equations 7 and 9, the sequential update for the posterior distribution adds to the previous log posterior the log likelihood of the *i*th observation. Because the updated MAP estimate **θ**_{t+1} is where the gradient vanishes, it can be approximated from the previous mode **θ**_t by taking the first derivative of Equation 48. The posterior covariance *C*_{t+1} is similarly approximated by taking the second derivative (recall that the response **y** is a vector, not a scalar as in a binary model).

To obtain the lower dimensional integral, the idea is to rotate the (*q*-dimensional) weight space, such that the first *k* elements of the extended weight vector **w** are coupled one-to-one to the elements of the *k*-vector **y**. Then we marginalize to integrate out the remaining *q* − *k* dimensions, effectively changing the integration variable from **w** to **y**. Finally, we use Cholesky decomposition to standardize the normal distribution, which is the posterior on **y**. The resulting integral is still multidimensional, due to the multinomial nature of our model. But once the distribution is standardized, there are a number of efficient numerical integration methods that can be applied. For example, in this work, we use the sparse-grid method (Heiss & Winschel, 2008) based on Gauss–Hermite quadrature.
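As a 1-D illustration of the underlying quadrature rule (not the sparse-grid extension used in the paper), a Gaussian expectation can be evaluated with numpy's Gauss–Hermite nodes:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gauss_expectation(g, mu, sigma, order=20):
    """E[g(y)] for y ~ N(mu, sigma^2) via Gauss-Hermite quadrature.
    The nodes/weights target integrals of the form int e^{-x^2} f(x) dx,
    so we substitute y = mu + sqrt(2) * sigma * x."""
    x, w = hermgauss(order)
    return np.sum(w * g(mu + np.sqrt(2.0) * sigma * x)) / np.sqrt(np.pi)

# Sanity check: E[y^2] = mu^2 + sigma^2 = 1 + 4 = 5 for mu = 1, sigma = 2
val = gauss_expectation(lambda y: y ** 2, mu=1.0, sigma=2.0)
```

An order-*n* rule integrates polynomials up to degree 2*n* − 1 exactly, so smooth integrands such as the standardized posterior factors converge rapidly with few nodes.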

Here **y** = *X* **w** is the vector of linear predictors. That is, we are dealing with an integral of a Gaussian over **w**, where *C* is the covariance matrix and the integrand depends on **w** only through the *k* vectors. It helps to work in a diagonalized coordinate system, so that we can separate out the relevant dimensions of **w**. We use the singular-value decomposition of the design matrix (whose left orthogonal factor reduces to the identity *I* in this case). Then *G_k* is a *k* × *k* diagonal matrix and *G_q* is a *k* × (*q* − *k*) matrix of zeros. We can now denote **w** = *Q* **w**′, such that the quadratic form in the exponent is governed by *B*, the inverse of the new covariance matrix after diagonalization. If we block-decompose this matrix, the block corresponding to **w**_q can be treated separately; once **w**_q is integrated out, we have marginalized the integral. It is useful to recall that if a variable **w** is Gaussian distributed, the linearly transformed variable **y** = *X* **w** is also Gaussian distributed. The change of variables via **y** = *G_k* **w**_k is then straightforward: the standardizing transform can be written directly in terms of the Cholesky decomposition of *B*_*, so that each dimension of the standardized variable is independently and identically distributed. The objective function to be evaluated is now expressed in terms of this standardized variable rather than the full parameter **θ**.