**Maximum likelihood difference scaling (MLDS) is a method for the estimation of perceptual scales based on the judgment of differences in stimulus appearance (Maloney & Yang, 2003). MLDS has recently also been used to estimate near-threshold discrimination performance (Devinck & Knoblauch, 2012). Using MLDS as a psychophysical method for sensitivity estimation is potentially appealing, because MLDS has been reported to need less data than forced-choice procedures, and naive observers in particular report that they prefer suprathreshold comparisons to JND-style threshold tasks. Here we compare two methods, MLDS and two-interval forced-choice (2-IFC), regarding their capability to estimate sensitivity assuming an underlying signal-detection model. We first examined the theoretical equivalence between both methods using simulations. We found that they disagreed in their estimation only when sensitivity was low, or when one of the assumptions on which MLDS is based was violated. Furthermore, we found that the confidence intervals derived from MLDS had low coverage; i.e., they were too narrow, underestimating the true variability. Subsequently we compared MLDS and 2-IFC empirically using a slant-from-texture task. The amount of agreement between sensitivity estimates from the two methods varied substantially across observers. We discuss possible reasons for the observed disagreements, most notably violations of the MLDS model assumptions. We conclude that in the present example MLDS and 2-IFC could equally be used to estimate sensitivity to differences in slant, with MLDS having the benefit of being more efficient and more pleasant, but having the disadvantage of unsatisfactory coverage.**

*s*_{i} are associated with discrete perceptual responses

*ε*, which is assumed to be Gaussian distributed with zero mean and variance *σ*^{2}. MLDS estimates the perceptual scale together with the noise associated with the judgments (Knoblauch & Maloney, 2008; Maloney & Yang, 2003).

*d*′. This transformation has been suggested by Devinck and Knoblauch (2012) to compare supra- and near-threshold judgments in the watercolor effect. A detailed description of the transformation and the MLDS model is provided in the Appendix.

^{2}). (b) They are independent. (c) The decision process is deterministic. (d) The sensory representation function is monotonically increasing. This produces only positive values of sensory response intervals so that the absolute value operation can be removed from the decision rule (Δ variable in Figure 2A and B). An MLDS decision model with the above assumptions is equivalent to a signal detection model with equal-variance and Gaussian distributed sensory representations, as depicted in Figure 2C.^{1}

*s*) = *s*^{e}, with exponent *e* = 2.0 (Figure 2A). We used an exponent greater than one so that sensitivity would increase with stimulus intensity, which is the case for slant-from-texture (Knill, 1998). The sensory representation function was used to simulate responses of a model observer for the MLDS and the 2-IFC procedure. It was assumed to be a Gaussian random variable with the mean corresponding to Ψ(*s*) and unique variance *σ*^{2} (Figure 2B through C). An example simulation is depicted in Figure 3. Thresholds were derived for a standard value of *st* = 0.6 from MLDS scales (panel A) and from psychometric functions in a 2-IFC task (panel B).

*s*_{1}, *s*_{2}, and *s*_{3}. To simulate a triad, the generative model (Figure 2A) assigns perceptual responses, Ψ_{i}, to each of the three stimuli, *s*_{i}. The simulated observer decides which of the pairs, (*s*_{1}, *s*_{2}) or (*s*_{2}, *s*_{3}), contains the bigger difference in perceived slant according to the decision model depicted in Figure 2B.
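The triad simulation described above can be illustrated with a minimal sketch. It assumes the power-law sensory function with exponent 2.0 and equal-variance Gaussian noise on each sensory representation. The function names, the example stimulus values, and the choice to draw the middle stimulus once per pair (so that four independent samples enter the decision) are illustrative assumptions, not the published simulation code.

```python
import numpy as np

def sensory_function(s, e=2.0):
    """Power-law sensory representation function, Psi(s) = s**e."""
    return np.asarray(s, dtype=float) ** e

def simulate_triad(s1, s2, s3, sigma, rng, e=2.0):
    """Return 1 if the pair (s2, s3) is judged to hold the larger difference, else 0."""
    # Assumption: four independent noisy representations, with the middle
    # stimulus s2 sampled once for each of the two pairs.
    means = sensory_function([s1, s2, s2, s3], e)
    psi = rng.normal(means, sigma)
    delta = (psi[3] - psi[2]) - (psi[1] - psi[0])
    return int(delta > 0)

# Hypothetical example: one triad judged 200 times at sigma = 0.07
rng = np.random.default_rng(1)
responses = [simulate_triad(0.2, 0.5, 0.8, sigma=0.07, rng=rng) for _ in range(200)]
proportion_second_pair = float(np.mean(responses))
print(proportion_second_pair)
```

With an accelerating sensory function the upper pair of an equally spaced triad is chosen on most trials, which is the information MLDS uses to recover the shape of the scale.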

*MLDS*, available in CRAN (Knoblauch & Maloney, 2008) and with Python routines based on the *numpy* and *scipy* libraries. A Python wrapper of the *MLDS* routines together with all subsequent analysis routines is available online (http://github.com/TUBvision/mlds).

*d*′. The details underlying the reparametrization are explained in Appendix, Difference scales and signal detection theory. In the simulation we derived sensitivity estimates for eight standard values (experiments were done with four standard values). Due to the nonlinear shape of the perceptual scale, the local slopes differed between different standard values and hence translated into different sensitivity levels along the stimulus dimension. For each standard we determined sensitivity at three performance levels (*d*′ = 0.5, 1, and 2) above and below the standard. To derive the stimulus values that corresponded to each *d*′ difference for a given standard, we interpolated between the sampled data points with a cubic spline fit (

*d*′ units, that corresponds to a particular standard stimulus (*st*) and performance level (*d*′) was read from the fitted function. The readout can be described by in which the + (–) sign next to *d*′ stands for comparison values above (below) the standard, and
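A minimal sketch of such a readout is given below, assuming a scale that is already expressed in *d*′ units and that increases monotonically. The function name `read_threshold`, the example scale, and the use of root finding to invert the spline are illustrative assumptions; the published analysis routines may proceed differently.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.optimize import brentq

def read_threshold(stimuli, scale, standard, d_prime):
    """Stimulus value whose scale value lies d_prime units from that of the standard.

    Positive d_prime reads out above the standard, negative d_prime below it.
    Returns NaN when the target falls outside the range of the fitted scale.
    """
    spline = CubicSpline(stimuli, scale)
    target = float(spline(standard)) + d_prime

    def f(s):
        return float(spline(s)) - target

    # Search only on the requested side of the standard
    if d_prime > 0:
        lo, hi = standard, stimuli[-1]
    else:
        lo, hi = stimuli[0], standard
    if f(lo) * f(hi) > 0:
        return float("nan")  # threshold cannot be obtained from this scale
    return brentq(f, lo, hi)

# Hypothetical scale in d' units: Psi(s) = s**2 divided by a noise level of 0.07
stimuli = np.linspace(0.0, 1.0, 11)
scale = stimuli ** 2 / 0.07
upper = read_threshold(stimuli, scale, standard=0.6, d_prime=1.0)
lower = read_threshold(stimuli, scale, standard=0.6, d_prime=-1.0)
print(upper, lower)
```

Returning NaN for targets outside the fitted range mirrors the cases in which a threshold cannot be calculated for a given standard and performance level.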

*F*) with the guess rate (*γ*) set to 50% chance level. The lapse rate (*λ*), slope, and position parameters of the psychometric function were estimated using Bayesian inference (Kuss, Jäkel, & Wichmann, 2005). We used the *psignifit4* implementation (Schütt, Harmeling, Macke, & Wichmann, 2016) for function fitting, estimation of confidence intervals, and analysis of goodness of fit. Each psychometric function was estimated from a total of 320 trials (4 comparison values × 80 repeats) as in the experiments.

*d*′. Assuming the equal variance Gaussian case of a signal detection model (Green & Swets, 1966), *d*′ can be converted to percentage correct and vice versa, and the threshold can be read out by where + (–) indicate comparisons above (below) the standard, *Fc*′ = {0.28, 0.52, 0.84} are the unscaled fractions correct (range between 0 and 1) that correspond to the raw fractions correct *Fc* = {0.64, 0.76, 0.92} (range between 0.5 and 1.0). These fraction correct values *Fc* correspond to the performance levels of *d*′ = 0.5, 1, and 2, respectively, in a two-alternative forced-choice task (Green & Swets, 1966).
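The correspondence between *d*′ and fraction correct can be checked numerically. The sketch below uses the standard equal-variance Gaussian relation for a two-alternative task, *Fc* = Φ(*d*′/√2). The rescaling *Fc*′ = 2 *Fc* − 1 is inferred here from the ranges and values quoted above, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def dprime_to_fc(d_prime):
    """Fraction correct in a two-alternative forced-choice task (equal-variance Gaussian)."""
    return norm.cdf(np.asarray(d_prime, dtype=float) / np.sqrt(2.0))

def fc_to_dprime(fc):
    """Inverse conversion from fraction correct to d'."""
    return np.sqrt(2.0) * norm.ppf(np.asarray(fc, dtype=float))

def unscale_fc(fc):
    """Map raw fraction correct (0.5 to 1) onto the 0 to 1 range (assumed relation)."""
    return 2.0 * np.asarray(fc, dtype=float) - 1.0

d_primes = np.array([0.5, 1.0, 2.0])
fc = dprime_to_fc(d_primes)
fc_unscaled = unscale_fc(fc)
print(np.round(fc, 2), np.round(fc_unscaled, 2))
```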

*n* = 1,000 simulations, and Figure 4A shows a summary of the results for the average of the empirically observed noise level, *σ* = 0.07 (green lines). Thresholds agreed in more than 90% of the cases, and the agreement was also high across a range of noise levels that we tested, from *σ* = 0.035 to *σ* = 0.14 (see Supplementary material), which includes all the values of sensory noise observed in the experiments.

*hard core point process*, which is a random spatial process that avoids dot superposition by applying an inhibition radius to each point. Using the R package *spatstat* (Baddeley & Turner, 2005), we generated fifteen samples of this process following specifications from previous work (Rosas et al., 2004). The textures consisted of black dots (0.4–0.6 cd/m^{2}, 12 pixels or 0.5° visual angle in diameter in the fronto-parallel plane) on a gray background area (48–52 cd/m^{2}, Figure 1).

*python* and the visualization library *pyglet*. Observers' responses were registered via the keyboard.

*s*) varied from 0° (fronto-parallel) to 70° in steps of 10°. This spacing results in *p* = 8 possible slant values and a total number of *n* = *p*!/((*p* – 3)! × 3!) = 56 unique triads.
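The triad count can be reproduced by enumerating all combinations of three slant values. The variable names in this short sketch are illustrative.

```python
from itertools import combinations
from math import comb

# Slant values from 0 to 70 degrees in steps of 10 degrees
slants = list(range(0, 80, 10))
p = len(slants)

# Every unordered selection of three distinct slants forms one triad
triads = list(combinations(slants, 3))
n_triads = len(triads)

print(p, n_triads, comb(p, 3))
```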

*s*_{1} < *s*_{2} < *s*_{3}) or descending (*s*_{1} > *s*_{2} > *s*_{3}) order, and the order was randomized across trials. Observers were asked to report which of the pairs, (*s*_{1}, *s*_{2}) or (*s*_{2}, *s*_{3}), contained the bigger perceived difference in slant. Observers viewed the stimulus configuration with no time limit for their response. They indicated their choice by pressing a keyboard button, and this triggered the next trial after a delay of one second. No feedback was given as to the correctness of the response.

*difference*, some observers reported the pair that included the most extreme slant.

*d*′ = 0.5, 1, 2, and 3 that were derived from the MLD scale (see Simulations, MLDS thresholds section). After the first session the comparison values were adjusted so as to provide good coverage of the psychometric function (Wichmann & Hill, 2001). The full experimental design contained 4 standards × 8 comparison values (four above and four below the standard) × 80 repeats resulting in 2,560 trials in total. This amount was the same as in the simulations. The presentation was randomized and the total number of trials was subdivided into 40 blocks of 64 trials each. Observers completed all trials in three to four sessions of maximum one hour duration. Experiment 2 was run on a different day from Experiment 1 and subsequent to it.

*d*′ = ±0.5, 1, 2 and for the four standard values tested (panels). Data points lying on the main diagonal indicate a quantitative agreement between thresholds. This was observed for thresholds obtained at standard slants of 37°, 53°, and 66°. For a standard slant of 26°, a correspondence between thresholds from both methods was observed for comparisons that were larger than the standard. For comparisons that were below the standard, MLDS thresholds were smaller than 2-IFC thresholds. For some combinations of performance levels and standard values, thresholds from either or both methods could not be calculated (see Simulations, Thresholds that could not be obtained section).

*d*′ = 1, 2 for comparisons below the standard slant of 26°. The reason for this discrepancy was a shallow slope in the scale reflecting low sensitivity at that particular stimulus level.

*d*′. The independence assumption would be violated when the sensory representations cannot be characterized as independent realizations of a Gaussian random variable but are instead correlated with each other. We tested the effect of these kinds of correlations in the sensory representation in simulations. Correlated sensory variables do indeed affect the threshold estimates. To illustrate the effect, we show that the magnitude of the correlation can be chosen so as to elicit a correspondence between thresholds derived from MLD scales and from 2-IFC. Figure 10 shows the thresholds for observer O8 for a simulated case in which the sensory representations are highly correlated (*ρ* = 0.9). As a consequence of this correlation, we give up the independence assumption and would have to rescale the perceptual scale by a factor of 0.6 (instead of the theoretical factor of two). In this scenario the resulting thresholds from MLDS correspond better with the thresholds from 2-IFC. Thus, an alternative transformation that accounts for a model violation can “produce” a higher agreement between the two types of thresholds. We are not aware of any method to test the assumption of independence empirically, and it is therefore not possible to evaluate which of the many possible transformations is closest to the true sensory representation.

*Journal of Statistical Software*, 12 (6), 1–42.

*Fundamentals of scaling and psychophysics*. New York: Wiley.

*Behavioral and Brain Sciences*, 12 (02), 269–270, doi:10.1017/S0140525X00048585.

*A, Optics, Image Science, and Vision*, 24 (11), 3418–3426.

*An introduction to the bootstrap*. New York: Chapman and Hall.

*Journal of the Optical Society of America. A, Optics, Image Science, and Vision*, 27 (5), 1232–1244.

*Elemente der psychophysik*[Translation:

*Elements of psychophysics*]. Leipzig, Germany: Breitkopf und Hartel.

*Psychological Science*, 22 (6), 812–820, doi:10.1177/0956797611408734.

*Psychophysics the fundamentals*(3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

*Psychological Review*, 120 (3), 472–496, doi:10.1037/a0033136.

*Signal detection theory and psychophysics*. New York: Wiley.

*Journal of the Optical Society of America. A, Optics, Image Science, and Vision*, 24 (8), 2122–2133, doi:10.1364/JOSAA.24.002122.

*Vision Research*, 128, 1–5, doi:10.1016/j.visres.2016.09.004.

*Vision Research*, 38 (11), 1683–1711, doi:10.1016/S0042-6989(97)00415-X.

*Journal of Statistical Software*, 25, 1–26, doi:10.1.1.204.8835.

*Modeling psychophysical data in R*. New York, NY: Springer.

*Behavioral and Brain Sciences*, 12, 251–320, doi:10.1017/S0140525X0004855X.

*The Journal of General Psychology*, 3 (3), 412–430, doi:10.1080/00221309.1930.9918218.

*Vision Research*, 1–12, doi:10.1016/j.visres.2015.01.023.

*Vision Research*, 44 (13), 1511–1535, doi:10.1016/j.visres.2004.01.013.

*British Journal of Mathematical and Statistical Psychology*, 50 (2), 187–203, doi:10.1111/j.2044-8317.1997.tb01140.x.

*Perception*, 32 (2), 211–233, doi:10.1068/p5012.

*Vision Research*, 122, 105–123, doi:10.1016/j.visres.2016.02.002.

*Opengl programming guide: The official guide to learning opengl, version 2*(5th edition). Upper Saddle River, NJ: Addison-Wesley.

*Psychological Review*, 64 (3), 153–181.

*Psychophysics: Introduction to its perceptual, neural, and social prospects*. New York: Wiley.

*Psychology Review*, 273–286.

*Vision Research*, 45 (12), 1501–17, doi:10.1016/j.visres.2005.01.003

*Quarterly Journal of Experimental Psychology*, 16 (1), 11–22, doi:10.1080/17470216408416341.

*Quarterly Journal of Experimental Psychology*, 16 (4), 387–391, doi: 10.1080/17470216408416400.

*Vision Research*, 46 (14), 2166–2191, doi:10.1016/j.visres.2006.01.010.

*Perception & Psychophysics*, 63 (8), 1314–1329.

*s*_{1}, *s*_{2}, *s*_{3}). The task is to decide which of the two adjoining pairs, (*s*_{1}, *s*_{2}) or (*s*_{2}, *s*_{3}), comprises the larger interval. The stimuli are assumed to produce discrete responses on a single sensory representation (Ψ_{s}), and these sensory responses are used to compute a decision variable Δ. It is further assumed that the decision variable Δ is corrupted by additive noise *ε* ∼ *N*(0, *σ*^{2}). When Δ > 0, the model evaluates the interval in pair (*s*_{2}, *s*_{3}) as larger, (*s*_{1}, *s*_{2}) otherwise. The *p* different stimuli are chosen from the stimulus dimension, giving a total number of *n* = *p*!/((*p* – 3)! × 3!) unique triads to be judged.

**Y**) from a triad experiment and the stimulus design matrix (**X**), and it estimates a set of coefficients **β** that best account for the data. Formally, this model is described by where **Y** is a vector of length *n* with entries 0 or 1, indicating the observer's response (for first vs. second pair, respectively). **X** is the design matrix of size *n* × *p*, whereby *n* is the total number of triads and *p* is the number of stimulus levels sampled as well as the number of estimated points on the perceptual scale. Each row in matrix **X** contains nonzero entries (1, −2, and 1) in the columns corresponding to the stimulus values for the presented triad values (*s*_{1}, *s*_{2}, and *s*_{3}), and zero entries in the remaining *p* – 3 columns. The coefficient vector **β** has length *p*, and it contains the scale estimates.
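The structure of the design matrix can be made concrete with a short sketch. It builds one row per triad with the entries 1, −2, and 1 in the columns of the three presented stimulus levels. The function name and the use of zero-based stimulus indices are illustrative assumptions.

```python
from itertools import combinations
import numpy as np

def design_matrix(triads, p):
    """Build the n x p MLDS design matrix from triads of stimulus indices."""
    X = np.zeros((len(triads), p))
    for row, (i, j, k) in enumerate(triads):
        # Weights of the decision variable: Psi_1 - 2 * Psi_2 + Psi_3
        X[row, [i, j, k]] = [1.0, -2.0, 1.0]
    return X

p = 8
triads = list(combinations(range(p), 3))
X = design_matrix(triads, p)
print(X.shape)
print(X[0])
```

Each row sums to zero, so the product of a row with the scale values equals the difference between the two intervals of that triad.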

*g*() is required to establish the relationship between the *linear predictors* **Xβ** and the mean response variable

**Y** is binomially distributed with *n* = 1 (also known as a Bernoulli process). We used the default link function for MLDS, that is, the inverse of the Gaussian cumulative distribution function (Φ^{–1}), as it has been shown to be robust against distribution changes and deviations from the equal variance assumption (Maloney & Yang, 2003). The coefficients^{2}

*ε*) in the decision process, whereby

*σ* in Equation A2.
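A probit model of this kind can be sketched with a direct maximum likelihood fit. This is not the *MLDS* package routine: it is a minimal illustration that minimizes the Bernoulli negative log-likelihood with `scipy.optimize.minimize`. Fixing the first coefficient at zero to anchor the scale, the simulated observer, the noise level, and all function names are assumptions made for the example.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(beta_free, X, y):
    """Bernoulli negative log-likelihood with a probit link; first coefficient fixed at 0."""
    eta = X[:, 1:] @ beta_free          # linear predictor X beta
    sign = 2.0 * y - 1.0                # +1 for y = 1, -1 for y = 0
    return -np.sum(norm.logcdf(sign * eta))

def fit_probit_scale(X, y):
    """Return the estimated scale values, anchored at 0 for the first stimulus level."""
    start = np.zeros(X.shape[1] - 1)
    result = minimize(neg_log_likelihood, start, args=(X, y), method="BFGS")
    return np.concatenate([[0.0], result.x])

# Simulated observer (hypothetical values): Psi(s) = s**2, decision noise sigma = 0.2
rng = np.random.default_rng(0)
p, repeats, sigma = 8, 20, 0.2
psi_true = np.linspace(0.0, 1.0, p) ** 2
triads = list(combinations(range(p), 3)) * repeats
X = np.zeros((len(triads), p))
for row, (i, j, k) in enumerate(triads):
    X[row, [i, j, k]] = [1.0, -2.0, 1.0]
delta = X @ psi_true + rng.normal(0.0, sigma, size=len(triads))
y = (delta > 0).astype(float)           # 1 means the second pair was judged larger

beta_hat = fit_probit_scale(X, y)
print(np.round(beta_hat, 2))
```

In this parametrization the estimated coefficients are expressed in units of the decision noise, so under these assumptions they should approximate the simulated scale divided by the noise level.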

*d′*, when the following assumptions are met. First, it is assumed that the decision process is not stochastic but deterministic. This would attribute all of the observed noise to the sensory representation, *ψ*_{s}, which is a Gaussian random variable with mean Ψ_{s}. Second, it is assumed that the noise is constant, i.e., independent of the stimulus level. Finally, it is assumed that the sensory representations are independent of each other. It follows from these assumptions that the *ψ*_{s} are independent Gaussian random variables with equal variance.^{3} Then, the noise parameters can be “carried” to the sensory representation, by rewriting the decision model (Equation A2) in this way

*σ*^{2} (Equation A2). When rewriting the model equations, the variance of each sensory representation *ψ*_{s} must be adjusted so that Equation A2 still holds. Because the decision variable Δ is computed as a linear combination of four independent Gaussian random variables, its variance is four times the variance of each individual variable

*σ* of the decision variable and not of the sensory representation directly. By knowing the above explained relationship between the variance in the sensory representation and in the decision variable, the difference scale can be adjusted so as to represent the variance in the sensory representation. The

*ψ*_{s} (Equation A5). Thus, the conversion is accomplished by multiplying the original scale by a factor of two (see also Devinck & Knoblauch, 2012). Formally, the new transformed scale maximum is two times the maximum of the original scale This new scale

*d*′ units,” i.e., an interval difference of one in the scale dimension should represent a performance of *d*′ of one, when all assumptions are met.
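The factor of two can be retraced in a few lines under the stated assumptions (deterministic decision, independent sensory representations with equal variance). The symbols σ_ψ for the standard deviation of a single sensory representation, ψ′ for the second independent sample of the middle stimulus, and the indices *a* and *b* for two arbitrary stimulus levels are notation introduced here for illustration only.

```latex
\begin{align}
\Delta &= (\psi_{3} - \psi_{2}) - (\psi_{2}' - \psi_{1}),
  \qquad \psi_{s} \sim N(\Psi_{s}, \sigma_{\psi}^{2}) \ \text{independent} \\
\operatorname{Var}(\Delta) &= 4\,\sigma_{\psi}^{2} = \sigma^{2}
  \quad\Longrightarrow\quad \sigma_{\psi} = \frac{\sigma}{2} \\
d' &= \frac{\Psi_{a} - \Psi_{b}}{\sigma_{\psi}}
  = 2\,\frac{\Psi_{a} - \Psi_{b}}{\sigma}
\end{align}
```

Because the estimated difference scale is expressed in units of the decision noise σ, doubling it expresses the same differences in units of σ_ψ, which is the unit in which *d*′ is defined.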

*Y*) for each triad, in other words, the mean probability of binary responses given the presented stimulus values in each triad. These probabilities are used to simulate a Bernoulli response in each triad, which is in turn used to estimate a new set of coefficients

*j* = 1…*p* are the *i-th* bootstrap sample, and many bootstrap samples are drawn by repeating the simulation procedure many times (*N*_{s} = 10,000), obtaining a matrix **S** of *N*_{s} × *p* entries. Confidence intervals for the *j-th* scale value

*j* = 1…*p* were obtained by taking the “bias-corrected and accelerated” (BCa; Efron & Tibshirani, 1993) percentiles corresponding to a 95% CI.

**S**. **S** can be used to obtain the variability of the threshold estimation. The same fitting and readout procedure applied to the point estimate of the scale (see Simulations, MLDS thresholds section) was applied to each bootstrap sample

*j* = 1…*p*. We fitted a spline to each scale bootstrap sample, i.e., to each row in matrix **S**, and from this scale we read out a bootstrap threshold. By repeating this for all *i* = 1…*N*_{s} bootstrap samples, a distribution of threshold bootstrap samples is calculated, and confidence intervals

^{1} Which, in turn, is analogous to Thurstone's case V of the Law of Comparative Judgment (Thurstone, 1927).

^{2} Although MLDS assumes a monotonically increasing function for the decision model, there is no restriction of monotonicity imposed for the coefficients found by the GLM solver. Thus, nonmonotonic scales from MLDS are possible outcomes (see O2 and O3 in Supplemental Figure S3).

^{3} This restricted case can be derived from the MLDS model only because: (i) the decision variable after the differencing is assumed to be Gaussian, for which the simplest case is when it is produced by underlying Gaussian distributed representations; and (ii) equal-variance is the simplest case to relate the variance of each representation with the variance of the decision variable. Other models (e.g., unequal-variance) cannot be derived as they would be underconstrained.