Abstract

We present a stochastic model of suprathreshold perceptual differences based on difference measurement. We develop a maximum likelihood difference scaling (MLDS) method for estimating its parameters and evaluate the reliability and distributional robustness of the fitting method. We also describe a method for testing whether the difference measurement model is appropriate as a description of human judgment of perceptual differences in any specific experimental context.

Introduction

Much research in visual perception involves precise measurement of thresholds. There are well-developed methods for designing experiments to measure threshold (Farell & Pelli, 1998), for analyzing the resulting data (Green & Swets, 1966/1974; Wichmann & Hill, 2001a) and setting confidence intervals on the resulting parameter estimates (Efron & Tibshirani, 1993; Wichmann & Hill, 2001b).

In this article, we develop methods for fitting a model based on difference measurement (Krantz et al., 1971, Chapter 4; see also Roberts, 1979, pp. 134–145) to human judgments of suprathreshold perceptual differences. Researchers have studied suprathreshold color differences (for example, Takasaki, 1966; Ward & Boynton, 1973; Whittle, 1992), contrast differences (McCourt & Blakeslee, 1994), and loudness differences (Schneider, 1980ab), but there have been few studies evaluating appropriate experimental methodology and fitting procedures. There is only one series of papers (Schneider, 1980ab; Schneider et al., 1974) reporting a substantial application of difference measurement to a psychophysical problem. We will describe this work in the Conclusions and compare it with our own.

We describe the difference measurement model in detail in a later section. The simplest way to explain it is to describe the kind of task that it is intended to model. Consider the upper and lower pairs of color samples shown in Figure 1. For most observers, the two color samples in each pair are readily distinguishable1, as evidenced by the appearance of a sharp border separating them. The differences within both pairs are suprathreshold but, at the same time, most observers, forced to choose the pair in which the difference in color is ‘greater’, would pick the upper pair.

Figure 1


The configuration in Figure 1 is an example of a typical trial from a difference scaling experiment. The observer is asked to examine two pairs of stimuli (‘a quadruple’) and to select the pair with the larger perceptual difference. However, there is more to the experimental design than judgment of quadruples. All four of the color samples in Figure 1 fall on a line in color space, illustrated in Figure 2. The upper pair of samples in Figure 1 are copies of the 1st and 5th color samples in Figure 2. The lower pair are copies of the 7th and 8th. Over the course of an experiment, the observer is asked to make this judgment for a large number of quadruples of color samples, all drawn from Figure 2.

Figure 2


Our goal is to assign numbers *Ψ*_{1}, *Ψ*_{2}, …, *Ψ*_{11} to the 11 stimuli in Figure 2 so that the absolute differences between scale values accurately predict the observer’s judgments. If, for example, the observer always judges the difference of the upper pair in Figure 1 to be greater than that of the lower, then, whatever the choice of the numbers *Ψ*_{1}, *Ψ*_{2}, …, *Ψ*_{11}, they should satisfy |*Ψ*_{2} − *Ψ*_{5}| > |*Ψ*_{7} − *Ψ*_{8}|. The Maximum Likelihood Difference Scaling (MLDS) method we develop takes as input the two-alternative forced-choice (2AFC) data for a few hundred quadruples and assigns the scale values that best predict the experimental data. The MLDS method is the natural, suprathreshold counterpart to standard methods for estimating threshold differences (most recently the elegant papers by Wichmann & Hill, 2001ab). In Figure 3, we plot estimates of *Ψ*_{i} for the stimuli in Figure 2 for one naive observer. The estimates were obtained using the MLDS method and are based on 330 2AFC difference scaling trials, each similar to the trial illustrated in Figure 1.

Figure 3

The resulting function is sigmoidal and roughly symmetric about the neutral color patch. The decrease in slope with increasing saturation in both the ‘green’ and the ‘red’ directions is consistent with the well-known ‘crispening’ effect (Takasaki, 1966; Whittle, 1992). The error bars in Figure 3 were computed by an application of a resampling method (Efron & Tibshirani, 1993) that we also describe here.

In Figure 4, we give a second example of an application of difference scaling: measuring the effect of a particular image compression algorithm (vector quantization, described in Gersho & Gray, 1991) on ‘image quality’. The degree of compression is controlled by an arbitrary, univariate parameter, *γ*, that ranges from 1 (‘no compression’) to 30 (‘maximum compression’). Compression reduces the size of the image by a factor of 1/*γ*. Figure 4 contains examples of three images with different degrees of compression, as shown. The reader will likely agree that the difference between *γ* = 1 (‘no compression’) and *γ* = 15 is less than the difference between *γ* = 15 and *γ* = 24. The evident aliasing artifacts in the last image make it difficult to interpret the contents of the scene portrayed.

Figure 4

Knoblauch et al. (1998) applied MLDS to derive a scale of perceived distortion in the image as a function of compression. The results of applying the MLDS procedure, based on 210 trials, are shown in Figure 5. The horizontal axis is the degree of image compression, *γ*; the vertical axis is the estimated scale value. It is evident that the image compression algorithm has little effect on the image when *γ* ≤ 15 but that, above that point, small changes in *γ* result in larger perceived differences in the image. Of course, the results of difference scaling do not tell us whether the observer prefers the uncompressed image (*γ* = 1) to the most compressed image (*γ* = 24). They do suggest that, so long as *γ* ≤ 15, the benefits of image compression come with relatively little change in perceived image quality.

Figure 5

We will not pursue these examples further. The key point we wish to emphasize here is that we can accept these results only if we can trust the experimental and computational methods employed. In the next section, we describe a standard difference scaling experiment and develop a stochastic difference scaling model that we will use to predict the probability that an observer will pick a given pair of stimuli over a second given pair. We show how to fit this model to experimental data and, in the following sections, we investigate its reliability. We show that we can use standard resampling methods to compute confidence intervals for estimated scale values, and we test whether failures of the distributional assumptions underlying the method affect the results of the fitting procedures (i.e., whether the estimation method is robust).

We chose the two examples to illustrate the wide range of problems to which we can apply difference scaling. We also note that scaling methods are controversial. It is very likely that the reader, in considering the second example above (‘image quality’), was led to doubt that there is any unidimensional perceptual variable corresponding to image quality. He or she might be less suspicious concerning the color example since there is a considerable literature on comparison of color differences.

One of the most serious failings of the scaling literature is that often there is no evident way to reject a model of proximity judgments as simply inappropriate as a model for human perception of differences (but see Hutchinson & Tversky, 1986; Gerrig, Maloney & Tversky, 1991). In a later section of the paper, we return to this point and propose methods for testing whether the difference scaling model is an adequate model for what the observer has done in a particular experiment.

A Stochastic Model of Difference Measurement

The experimenter selects *N* stimuli, *S*_{1}, …, *S*_{N}. Associated with each stimulus *S*_{i} is a real number, *φ*_{i}, and the stimuli are numbered so that *φ*_{1} < *φ*_{2} < … < *φ*_{N}. In our two examples, the values *φ*_{i} corresponded to distance from the origin in a color space and to degree of image compression, respectively. The experimenter presents an observer with *quadruples*, (*S*_{i}, *S*_{j}; *S*_{k}, *S*_{l}), and asks him to judge which pair, *S*_{i}, *S*_{j} or *S*_{k}, *S*_{l}, is more salient. It will prove convenient to replace (*S*_{i}, *S*_{j}; *S*_{k}, *S*_{l}) by the simpler notation (*i, j; k, l*).

In traditional difference measurement, the experimenter wishes to determine *scale values*, *Ψ*_{1}, …, *Ψ*_{N}, corresponding to the stimuli, *S*_{1}, …, *S*_{N}, such that, given a quadruple (*i, j; k, l*), the observer judges *S*_{i}, *S*_{j} to be further apart than *S*_{k}, *S*_{l} precisely when

|*Ψ*_{j} − *Ψ*_{i}| > |*Ψ*_{l} − *Ψ*_{k}|. (1)

We will usually order the stimuli in a quadruple so that *Ψ*_{j} > *Ψ*_{i} and *Ψ*_{l} > *Ψ*_{k}, and then we can omit the absolute value signs in Equation 1.

It is unlikely that human observers will be so reliable in judgment as to satisfy the criterion just given, particularly if the differences *Ψ*_{j} − *Ψ*_{i} are near 0. Accordingly, we develop a model that allows the observer to exhibit some stochastic variation in judgment. Let *l*_{ij} = *Ψ*_{j} − *Ψ*_{i}, the *length* of the interval *S*_{i}, *S*_{j}. Then we assume that the *decision variable*2 employed by the observer is

*D* = *l*_{ij} − *l*_{kl} + *ε*, (2)

where *ε* is a Gaussian random variable with mean zero and standard deviation *σ* > 0. Given the quadruple (*i, j; k, l*), the observer is assumed to select the pair *S*_{i}, *S*_{j} precisely when

*D* > 0. (3)

The proposed model is an equal-variance Gaussian signal detection model (Green & Swets, 1966/1974) in which the signal is the difference in the lengths of the intervals, *l*_{ij} − *l*_{kl}. When this difference is small in magnitude, negative or positive, relative to *σ*, we expect the observer, presented repeatedly with the same stimuli, to give different, apparently inconsistent judgments. The choice of a Gaussian random variable is arbitrary, and the assumption that the error term is additive and independent of the lengths of the intervals under judgment can also be questioned. In a later section, we will evaluate the distributional robustness of the estimation procedure that we describe next.
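The decision rule just described is easy to state in code. The following is a minimal sketch (our illustration, not the authors' implementation) of a single simulated 2AFC trial under the equal-variance Gaussian model; `psi` holds the scale values and indices are 0-based:

```python
import random

def simulate_trial(psi, quad, sigma, rng=None):
    """One simulated 2AFC difference scaling trial: return 1 (choose the
    first pair) when D = l_ij - l_kl + eps > 0, else 0 (second pair)."""
    rng = rng or random.Random()
    i, j, k, l = quad
    l_ij = psi[j] - psi[i]          # length of the first interval
    l_kl = psi[l] - psi[k]          # length of the second interval
    eps = rng.gauss(0.0, sigma)     # additive Gaussian judgment noise
    return 1 if l_ij - l_kl + eps > 0 else 0
```

With *σ* much smaller than the length difference, the response is nearly deterministic; as the difference shrinks relative to *σ*, responses approach chance, producing the apparently inconsistent judgments described above.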

Maximum Likelihood Difference Scaling

At first glance, it would appear that the stochastic difference scaling model just presented has *N* + 1 free parameters: *Ψ*_{1}, …, *Ψ*_{N}, together with the standard deviation of the error term, *σ*. However, any linear transformation of *Ψ*_{1}, …, *Ψ*_{N}, together with a corresponding scaling of *σ*, results in a set of parameters that predicts exactly the same performance as the original parameters. Without any loss of generality, we can set *Ψ*_{1} = 0 and *Ψ*_{N} = 1, leaving us with the *N* − 1 free parameters *Ψ*_{2}, …, *Ψ*_{N−1} and *σ*. We describe next how to estimate these parameters by the method of maximum likelihood. The fitting procedure is similar to the maximum likelihood methods used in fitting psychometric functions (Wichmann & Hill, 2001a).

Suppose that, on a particular trial, the observer sees the quadruple (*i, j; k, l*) and judges the first interval to be larger. Given any choice of the free parameters *Ψ*_{2}, …, *Ψ*_{N−1} and *σ*, we can compute the probability of this outcome. It is simply the probability that the decision variable is positive for this choice of free parameters,

P(*D* > 0) = Φ_{σ}(*l*_{ij} − *l*_{kl}),

where Φ_{σ} is the cumulative distribution function of the Gaussian random variable *ε*. If the observer instead selects the second interval, then the probability of his response is 1 − Φ_{σ}(*l*_{ij} − *l*_{kl}). If we code the choice of the first interval as *R* = 1 and the choice of the second interval as *R* = 0, then the probability of the subject’s response can be written as

P(*R* = *r*) = Φ_{σ}(*l*_{ij} − *l*_{kl})^{r} [1 − Φ_{σ}(*l*_{ij} − *l*_{kl})]^{1−r},

where *l*_{ij} − *l*_{kl} = *Ψ*_{j} − *Ψ*_{i} − *Ψ*_{l} + *Ψ*_{k} is expanded in terms of the parameters *Ψ*_{2}, …, *Ψ*_{N−1}, so that it is evident that the probability on the left-hand side depends upon the (unknown) values of the free parameters.

Of course, there is little to be learned from a single trial. We wish to compute the probability of a pattern of responses, *R*_{t}, across many trials, numbered *t* = 1, …, *T*. Let the quadruple on trial *t* be (*i*_{t}, *j*_{t}; *k*_{t}, *l*_{t}), and define the difference in lengths on trial *t* to be Δ_{t} = *Ψ*_{j_{t}} − *Ψ*_{i_{t}} − *Ψ*_{l_{t}} + *Ψ*_{k_{t}}. Then the probability of the observed pattern of responses, for any choice of the free parameters *Ψ*_{2}, …, *Ψ*_{N−1} and *σ*, is

L(*Ψ*_{2}, …, *Ψ*_{N−1}, *σ*) = ∏_{t=1}^{T} Φ_{σ}(Δ_{t})^{R_{t}} [1 − Φ_{σ}(Δ_{t})]^{1−R_{t}},

which is also the likelihood of any particular choice of the free parameters given the (known) outcome of the experiment, *R*_{1}, …, *R*_{T}. We form maximum likelihood estimates of the free parameters by finding the values that maximize this likelihood. There is no closed form for the solution, but it is very easy to compute the estimates by standard numerical optimization methods (Hanselman & Littlefield, 2001, Chapter 22; Press et al., 1992, Chapter 10).

Maximum likelihood estimates have several desirable asymptotic properties. As we base the estimates on more and more data, any bias in the estimates vanishes (Mood, Graybill & Boes, 1974). Further, as the amount of data increases, the maximum likelihood estimates have a variance that converges to the minimum possible for an unbiased estimator (Mood et al., 1974, pp. 358–371). These are desirable properties and, consequently, most modern estimation procedures are based on maximum likelihood or its close Bayesian relatives (Berger, 1985). However, it is important to assess whether the estimators are biased when used with the amounts of data collected in typical experiments. The results based on difference scaling in Figures 3 and 5 are intriguing, but it is certainly important to check whether they are contaminated by biases in estimation. Further, it is important to determine how variable the estimates are for any specified number of trials, in order to estimate confidence intervals for the parameters *Ψ*_{2}, …, *Ψ*_{N−1}.
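The likelihood just derived translates directly into a negative log-likelihood that any standard numerical optimizer can minimize. The sketch below is our own illustration, not the authors' code; the Gaussian CDF is built from the standard library's `math.erf`:

```python
import math

def norm_cdf(x, sigma):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def neg_log_likelihood(params, trials):
    """params = (Psi_2, ..., Psi_{N-1}, sigma), with Psi_1 = 0 and
    Psi_N = 1 fixed. trials is a list of ((i, j, k, l), r) with
    1-based stimulus indices and r = 1 if the pair (i, j) was chosen."""
    psi = [0.0] + list(params[:-1]) + [1.0]    # rebuild the full scale
    sigma = params[-1]
    if sigma <= 0.0:
        return float('inf')                    # keep the optimizer in bounds
    nll = 0.0
    for (i, j, k, l), r in trials:
        delta = (psi[j - 1] - psi[i - 1]) - (psi[l - 1] - psi[k - 1])
        p = norm_cdf(delta, sigma)             # P(choose the pair (i, j))
        p = min(max(p, 1e-12), 1.0 - 1e-12)    # guard against log(0)
        nll -= math.log(p) if r == 1 else math.log(1.0 - p)
    return nll
```

Passing `neg_log_likelihood` to any general-purpose minimizer (Nelder–Mead, quasi-Newton, etc.), with equally spaced starting values for the *Ψ*'s, yields the maximum likelihood estimates.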

Before turning to an examination of the ‘small sample’ bias and variability of the MLDS estimates just described, we need to discuss some of the practical aspects of designing and carrying out a difference scaling experiment.

Designing a Difference Scaling Experiment

The experimenter selects *N* stimuli, *S*_{1}, …, *S*_{N}, and prepares a list of quadruples, (*i, j; k, l*). It is convenient to avoid repetitions in the list of indices. If, for example, the observer is presented with the quadruple (1, 2; 1, 3), then he can potentially note that the ‘right’ answer is signaled by the ordering of the second stimulus in each interval; he can then base his judgment solely on an ordering of the stimuli. The total number of quadruples with distinct indices 1 ≤ *i* < *j* < *k* < *l* ≤ *N* is the binomial coefficient

C(*N*, 4) = *N*(*N* − 1)(*N* − 2)(*N* − 3)/24. (8)

Some values are tabulated for small values of *N* in Table 1. Any quadruple (*i, j; k, l*) with 1 ≤ *i* < *j* < *k* < *l* ≤ *N* may be presented to the subject in eight different ways. For example, with the color stimuli in Figure 2, we could present (*i, j*) above and (*k, l*) below, or vice versa. Wherever we choose to present (*i, j*), we can present *S*_{i} on the left and *S*_{j} on the right, or vice versa, and similarly for (*k, l*). Since the ordering of the stimuli (*i* < *j*) should be obvious to the observer, there is little point in randomizing left-right, and we do not. We do, however, randomize the locations of (*i, j*) and (*k, l*), effectively flipping a coin to decide which one goes above or below, or, for a temporal forced choice, which one goes first or second.

Table 1

Number of Stimuli | Number of Quadruples
---|---
10 | 210
12 | 495
15 | 1365
20 | 4845

These values are computed using Equation 8.

For ten stimuli, then, a first pass through all possible quadruples in random order requires 210 judgments (Table 1). Assuming that these forced-choice judgments take no more than 5 seconds each, it is possible to complete 210 judgments in about 20 minutes or less.
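Enumerating and randomizing a complete design is straightforward with standard library tools; a sketch (our illustration only, with 1-based stimulus indices as in the text):

```python
import random
from itertools import combinations

def complete_design(n, rng=None):
    """All quadruples (i, j; k, l) with distinct indices
    1 <= i < j < k < l <= n, in random trial order, with the two pairs
    randomly assigned to the top/bottom (or first/second) position."""
    rng = rng or random.Random(1)
    quads = list(combinations(range(1, n + 1), 4))
    rng.shuffle(quads)                      # random trial order
    trials = []
    for i, j, k, l in quads:
        if rng.random() < 0.5:              # coin flip: which pair goes on top
            trials.append(((i, j), (k, l)))
        else:
            trials.append(((k, l), (i, j)))
    return trials

print(len(complete_design(10)))   # → 210, matching Table 1
```

Within each pair the left-right (here: within-tuple) order is left fixed, mirroring the design decision described above.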

For 20 or more stimuli, the total number of possible trials becomes too large to contemplate. However, it is still possible to carry out difference scaling and parameter estimation using only a fraction of the possible trials. In a later section, we investigate how many trials are actually needed to establish a difference scale for *N* stimuli for values of *N* larger than 10.

Of course, we can also repeat the series of 210 trials for 10 stimuli as many times as we like. In the next section, we examine how the bias and reliability of the estimates vary with the number of trials.

Evaluating Bias and Variability

Bias and Variability as a Function of Number of Trials

The maximum likelihood fitting procedure just described is asymptotically unbiased and asymptotically minimum variance (Mood et al., 1974). However, for experiments involving realistic numbers of trials, typical experimental protocols, and particular choices of quadruples, it is important to assess the bias and standard deviation of the estimates directly. We evaluated both by Monte Carlo computation.3

For each experiment, we specified a function *Ψ*(*s*) = *s*^{γ} relating the nominal scale values to the difference scale values, a value of the standard deviation *σ* of the error term in the comparison process, and *n* levels of the nominal scale value, *S*_{1}, …, *S*_{N}. We then simulated the difference scaling observer in one or more repetitions of a complete design on *n* stimuli. The simulated data were fit with the model to obtain estimates of the scale values and of *σ*. We repeated this cycle of simulation and fitting 1000 times and computed the means and standard deviations of the estimates. The mean is an estimate of the expected value of the estimator and the standard deviation an estimate of the standard deviation of the estimator.

In Figure 6, we plot the results of such Monte Carlo replications for five different choices of the power function *Ψ*(*s*) = *s*^{γ} with *n* = 10 levels of the nominal scale value. A complete design for *n* = 10 includes 210 quadruples. We ran experiments in which the simulated observer completed one repetition of a complete design (210 trials) and also four repetitions of a complete design (840 trials). The value of *σ* was set to 0.2, a value typical of the estimated variability of our subjects in the color experiments.
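The simulation step of this cycle can be sketched as follows (an illustration under our own conventions, not the authors' code); nominal levels are taken as *s* = 1/*n*, 2/*n*, …, 1:

```python
import random
from itertools import combinations

def simulate_design(n, gamma, sigma, reps=1, rng=None):
    """Simulated responses for `reps` passes through the complete design,
    for an observer with Psi(s) = s**gamma on n equally spaced nominal
    levels (0-based stimulus indices)."""
    rng = rng or random.Random(0)
    psi = [((m + 1) / n) ** gamma for m in range(n)]   # true scale values
    data = []
    for _ in range(reps):
        for i, j, k, l in combinations(range(n), 4):
            d = (psi[j] - psi[i]) - (psi[l] - psi[k]) + rng.gauss(0.0, sigma)
            data.append(((i, j, k, l), 1 if d > 0 else 0))
    return data

print(len(simulate_design(10, 2.0, 0.2, reps=4)))   # → 840 (4 × 210 trials)
```

Fitting each simulated data set and tabulating the estimates over many repetitions gives the Monte Carlo means and standard deviations described above.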

Figure 6

The power functions *Ψ*(*s*) = *s*^{γ} are plotted as curves, and the estimates are plotted as points that, if the estimator were unbiased, should lie precisely on the corresponding curve. For 210 trials, the estimator is slightly biased in the case *γ* = 2 for nominal values near zero (the points lie slightly above the curve). For 840 trials, this bias is no longer visible. Overall, the maximum likelihood estimation procedure returns estimates that are close to unbiased, even with a modest number of trials.

The estimates of the standard deviations are plotted as error bars attached to each point. The error bars allow us to predict how close to the true difference scale values *Ψ*(*S*_{i}) the estimates can be expected to fall in any given experiment. The simulated experimental conditions closely mimic those of the difference scaling of color in the first example and indicate that the experimental protocol and fitting methods introduce no important bias, at least when the distributional assumptions are satisfied.

The Effect of Random Subsampling of the Set of Possible Trials

With 15 or 20 stimuli, the number of trials (quadruples) in a complete design is large (Equation 8). We simulated experiments with *N* = 10, 12, 15, or 20 stimuli in which observers completed only a subset of a complete design, either 10% or 50% of the quadruples, chosen at random. The mapping from nominal values to scale values was always the power function *Ψ*(*s*) = *s*^{1/2}. For example, with 20 stimuli, the number of quadruples (Table 1) is 4845. In the 10% condition, the observer would complete 484 of these quadruples, chosen at random; in the 50% condition, he or she would complete 2426. The responses on this subset of quadruples form the basis for computing the likelihood that is then maximized to obtain estimates of the difference scale values. For comparison, we also simulated experiments in which 100% of the possible quadruples were used; we refer to this condition as ‘100% subsampling’. There were a total of 12 experimental conditions (four choices of *N*, three choices of subsampling rate).

We repeated each experimental condition 200 times, computed the standard deviation of the estimates of each scale value, and averaged these standard deviations across scale values. In Figure 7, we show a log-log plot of mean standard deviation versus the number of quadruples used; that is, we plot the mean standard deviation for 10% subsampling of the 4845 possible trials for 20 stimuli against the number of trials, 484. The subsampling rates are coded by color, the numbers of stimuli by shape. The results for 10% subsampling of the 210 possible trials for 10 stimuli (21 trials) did not always converge4 and are omitted from the plot. There were no convergence failures for larger rates of subsampling or larger numbers of stimuli (and correspondingly larger numbers of possible trials).
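Random subsampling of the complete design is a one-liner with the standard library; a sketch (our illustration):

```python
import random
from itertools import combinations

def subsample_design(n, fraction, rng=None):
    """A random fraction of the complete design of quadruples over n stimuli."""
    rng = rng or random.Random(0)
    quads = list(combinations(range(1, n + 1), 4))
    return rng.sample(quads, int(fraction * len(quads)))

print(len(subsample_design(20, 0.10)))   # → 484 of the 4845 possible quadruples
```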

Figure 7


The variance of estimates decreased as the subsampling rate increased and as the total number of possible trials increased. It is evidently possible to obtain reliable estimates with as many as 20 stimuli with 10% subsampling. We have approximately the same variance for 484 trials taken as a 10% subsampling of the 4845 possible trials for 20 stimuli and for 495 trials taken as a 100% sample of the 495 possible trials for 12 stimuli (Figure 7). The logarithm of variance (the square of the standard deviation) is roughly a linear function of the logarithm of the total number of trials over the range we considered. We estimated the slope of the best-fitting line by linear regression. This line, with slope −0.73, is shown in the plot.
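Since log variance falls roughly linearly in log *M* with slope −0.73, the standard deviation of the estimates shrinks as *M*^{−0.365}; a worked example of what the fitted slope implies (our arithmetic):

```python
# With log(variance) ≈ a + b·log(M) and the fitted slope b = -0.73,
# the standard deviation of the scale estimates shrinks as M**(b/2).
b = -0.73

def sd_ratio(m1, m2):
    """Predicted ratio of estimate standard deviations, m1 -> m2 trials."""
    return (m2 / m1) ** (b / 2)

# Quadrupling the data, e.g. 210 -> 840 trials, shrinks the error bars to
# roughly 60% of their size:
print(round(sd_ratio(210, 840), 2))   # → 0.6
```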

These results indicate that it is feasible to use difference scaling with 20 or more stimuli. If *M* is the total number of trials used to estimate a difference scale by means of MLDS, the expected variance of the parameter estimates is roughly proportional to *M*^{−3/4} over the range considered (replacing 0.73 by 3/4). That is, the expected standard deviations of the parameter estimates (the error bars in Figure 6) are roughly proportional to *M*^{−3/8}.

Distributional Robustness

Effect of Changes of Distributional Family

In developing the maximum likelihood fitting method, we modeled the observer’s judgment of a quadruple (*i, j; k, l*) as comparison of a stochastic decision variable *D*(*i, j; k, l*) to a threshold. The decision variable was Gaussian, and its standard deviation did not depend on the magnitudes of the stimuli or of the differences under comparison. In this section, we examine how violations of either of these assumptions affect the estimates returned by the fitting method.

We first developed models of observers who carried out the difference scaling task by forming a decision variable based on the difference of the lengths of two intervals, perturbed by additive random error, just as before: *D* = *l*_{ij} − *l*_{kl} + *ε*. The additive error was independent and identically distributed, but it was not always Gaussian in form. We simulated observers for whom the distribution was Uniform on the interval (−*s*_{u}, *s*_{u}), Laplacian, or Cauchy in form. The Laplacian has probability density function *f*(*x*) = (1/2*s*_{l}) exp(−|*x*|/*s*_{l}), and the Cauchy has probability density function *f*(*x*) = 1/[π*s*_{c}(1 + (*x*/*s*_{c})^{2})]. All three distributions are symmetric about 0. We simulated replications of a difference scaling experiment with each of the Uniform, Cauchy, and Laplacian observers in turn, and fitted the resulting data with the Gaussian maximum likelihood method described previously. We adjusted the scale parameters *s*_{u}, *s*_{c}, *s*_{l} so that the fitted estimate of observer variability was roughly the same as that observed in our experiments with human observers. The Laplacian and Cauchy are heavy-tailed distributions that are often used in testing the distributional robustness of fitting methods based on the Gaussian distribution; they tend to produce relatively large numbers of ‘outliers’ compared to the Gaussian. If the results of the maximum likelihood fitting method change markedly when the true distribution of the observer’s error is not Gaussian, then the fitting method is not distributionally robust. Conversely, if we obtain the same estimates of the scale values *Ψ*_{i} despite changes in the error distribution, then the Gaussian assumption can be treated as a convenience, not a crucial assumption of the method.

In Figure 8, we plot the mean fitted values obtained using the MLDS procedure versus the true values of *Ψ*_{i} for the Uniform, Cauchy, and Laplacian observers. As can be seen, changes in distributional form lead to only small biases. The MLDS procedure is distributionally robust.

Figure 8
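Simulating the non-Gaussian observers requires only standard library samplers. The sketch below is our own illustration (the scale arguments play the roles of *s*_{u}, *s*_{l}, *s*_{c}); the Laplacian is drawn as the difference of two independent exponentials:

```python
import math
import random

def sample_error(dist, scale, rng):
    """Draw one additive judgment error; all four distributions are
    symmetric about zero."""
    if dist == 'gaussian':
        return rng.gauss(0.0, scale)
    if dist == 'uniform':
        return rng.uniform(-scale, scale)          # Uniform on (-s_u, s_u)
    if dist == 'laplace':
        # difference of two independent exponentials is Laplacian
        return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    if dist == 'cauchy':
        # inverse-CDF sample from a Cauchy with scale s_c
        return scale * math.tan(math.pi * (rng.random() - 0.5))
    raise ValueError(dist)

def judge(psi, quad, dist, scale, rng):
    """A simulated observer with the chosen error distribution judging
    the quadruple (i, j; k, l); 0-based indices into psi."""
    i, j, k, l = quad
    d = (psi[j] - psi[i]) - (psi[l] - psi[k]) + sample_error(dist, scale, rng)
    return 1 if d > 0 else 0
```

Data generated by `judge` with a non-Gaussian `dist` would then be fitted with the Gaussian maximum likelihood method, exactly as in the robustness test described above.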

Effect of Non-Uniform Variance

For all of the simulated observers so far, we assumed that the variability of the additive error was independent of the magnitude of the difference between the interval lengths, a homogeneity-of-variance assumption. In Figure 9, we report the results of simulating a Non-Uniform Variance observer and fitting the resulting data under the homogeneity-of-variance assumption. The variance was proportional to the absolute value of the difference *l*_{kl} − *l*_{ij}, and the constant of proportionality was chosen so that the fitted value of *σ* was roughly the fitted variability of human observers in our experiments. In Figure 9, we plot the mean fitted values obtained using the Gaussian MLE procedure versus the true values of *Ψ*_{i} for the Non-Uniform Variance observer (compare Figures 6 and 9). In summary, the Gaussian MLE procedure described above is distributionally robust over the range of conditions considered.

Figure 9

Axiomatic Issues and Validation

Interpretation of the results of any psychophysical estimation procedure depends on the assumptions made in modeling and fitting the data. In estimating a sensory threshold we assume, for example, that the threshold we are estimating is roughly stable over the period of time employed in measuring it. In considering any particular application of difference scaling, the experimenter can seek to determine whether the model used in fitting the data was appropriate to the observer and the particular experimental situation. We have just seen that the results of the fitting procedure are not much affected by failures of the Gaussian distributional assumption. However, we can consider other basic assumptions made in applying difference scaling methods as we have described.

Krantz et al. (1971) contains derivations of necessary and sufficient conditions that an observer must satisfy if we are to conclude that his judgments can be described by a difference scaling model. Two of these represent testable claims about human performance.

The first is the Ordering Property: the observer must be able to reliably order the stimuli *S*_{1}, …, *S*_{N} in agreement with the ordering of the scale values *Ψ*_{1} < *Ψ*_{2} < … < *Ψ*_{N}. Often, in difference scaling applications, this ordering property is evidently satisfied; ordering the color samples in Figure 2 is not difficult for anyone with normal color vision.

The second is the Six-Point Property, illustrated in Figure 10. There are two groups of three stimuli whose indices are *i, j, k* and *i′, j′, k′*, respectively. For convenience, let us assume that *i* < *j* < *k* and *i′* < *j′* < *k′*, as indicated in the figure; otherwise, we put no restrictions on the ordering of the six stimuli. Suppose that a non-stochastic observer considers the quadruple (*i, j; i′, j′*) and judges that *ij* < *i′j′*. On some other trial, he considers (*j, k; j′, k′*) and judges that *jk* < *j′k′*. Now, given the quadruple (*i, k; i′, k′*), there is only one possible response consistent with the difference scaling model: he must choose *ik* < *i′k′*. The reasoning behind this constraint is illustrated in the figure, and it can be demonstrated directly from the model. Since the observer has judged that *ij* < *i′j′*, we know that

*Ψ*_{j} − *Ψ*_{i} < *Ψ*_{j′} − *Ψ*_{i′}, (12)

and since he has judged that *jk* < *j′k′*, we also know that

*Ψ*_{k} − *Ψ*_{j} < *Ψ*_{k′} − *Ψ*_{j′}. (13)

Adding Equations 12 and 13, we have

*Ψ*_{k} − *Ψ*_{i} < *Ψ*_{k′} − *Ψ*_{i′}, (14)

and the observer must judge that *ik* < *i′k′*, as stated above. For the non-stochastic observer, even one violation of this six-point condition would allow us to conclude that there is no consistent assignment of scale values *Ψ*_{1}, *Ψ*_{2}, …, *Ψ*_{N} in a difference scaling model that could predict his or her judgments in a difference scaling task.

Figure 10

Of course, the non-stochastic observer is an implausible model of human behavior: human judgments in difference scaling tasks are not deterministic. In MLDS, decisions are based on a decision variable *D*(*i, j; k, l*) and, given any six points *i, j, k, i′, j′, k′*, there is a non-zero probability that the stochastic observer will violate the six-point condition. In particular, suppose that *Ψ*_{j} − *Ψ*_{i} is only slightly greater than *Ψ*_{j′} − *Ψ*_{i′}, that *Ψ*_{k} − *Ψ*_{j} is only slightly greater than *Ψ*_{k′} − *Ψ*_{j′}, and that *Ψ*_{k} − *Ψ*_{i} is only slightly greater than *Ψ*_{k′} − *Ψ*_{i′}. Then we might expect the observer to have a probability of roughly 0.5 of judging *ij* > *i′j′*, and similarly for the other two quadruples. Hence, he has an appreciable chance of judging *ij* > *i′j′*, *jk* > *j′k′*, and *ik* < *i′k′*, violating the six-point property.

In deciding whether an observer’s judgments are consistent with the six-point property and the difference scaling model, we must take into account that the stochastic observer is expected to violate the six-point condition occasionally. We need a method for deciding whether a particular pattern of violations of the six-point condition is sufficiently suspicious to warrant rejecting the difference scaling model as a model of the subject’s performance. First, we enumerate all possible sextuples *i, j, k, i′, j′, k′* and denote them *τ*_{1}, …, *τ*_{N_{6}}. Given maximum likelihood estimates of the scale values used by the human observer and of the observer’s standard deviation, *σ*, we can compute the probability that the observer would violate any given six-point condition *τ*_{a} = (*i, j, k, i′, j′, k′*) by judging *ij* > *i′j′* and *jk* > *j′k′* but *ik* < *i′k′*. This probability is denoted *p*_{a}.5

*V*, that the observer has violated the six-point condition_{a}*τ*during the course of the experiment and number of times he has satisfied it,_{a}*S*. The likelihood of this outcome, for this one choice of a six-point condition is, , and we can now compute the overall likelihood of the observed outcome by taking the products of the likelihoods of all possible six-point conditions: Of course, if the observer never judges one of the three pairs of outcomes in a sextuple, perhaps because of subsampling, then_{a}*V*=*S*= 0 and the corresponding likelihood is one.We have developed a Monte Carlo method for comparing the observed likelihood of the observer’s six-point judgments to the expected distribution of this likelihood for the ideal stochastic observer with scale values that are the same as the fitted scale values and standard deviation for the observer.

In Figure 11, we show the histogram of the logarithms of the six-point likelihoods Λ_{6} from replications of a simulated ideal stochastic observer whose parameters were the fitted parameters of Observer SHC in Figure 3. The logarithm of the actual six-point likelihood Λ_{6} for the observer is marked by a vertical line. The values in the histogram measure the extent to which a simulated observer, perfectly described by the MLDS model, could reasonably be *expected* to violate the six-point condition over the course of a particular experiment.

Figure 11

If the human observer SHC does in fact satisfy the MLDS model, then his or her Λ

_{6}should tend to fall near the center of this distribution. Had the actual value fallen below the*α*percentile of the histogram, we would instead reject the hypothesis that the difference scale model is appropriate for this observer at the*α*-level. The simulated MLDS observer has such an extreme value of Λ_{6}on fewer than*α*proportion of the replications. In that case, we judge that the pattern of six-point failures in the observer’s data is too improbable and, following the ordinary logic of hypothesis testing, we reject the hypothesis that the MLDS model is appropriate for SHC at the*α*-level.In Figure 11, for example, an estimate of the

*α*= 0.05 level is marked by a vertical black line. This estimate is simply the 5^{th}percentile of the histogrammed values. The shaded rejection region is to the left of the line. The observer’s true value does not fall within the rejection region and we do not reject.Conclusions

Difference scaling provides a rigorous means for assessing suprathreshold differences, and it is the natural complement to methods for measuring thresholds along sensory continua. By means of examples, we illustrated the range of its applicability. We presented a stochastic model of judgments in tasks involving ranking of suprathreshold perceived differences between stimuli that fall on a one-dimensional continuum. We described methods for estimating the parameters of the model from data and described how to design difference scaling experiments. We found that the estimator is nearly unbiased in realistic experimental designs and that it is distributionally robust. While the number of possible trials grows rapidly with the number of stimuli, we found that the experimenter can obtain reliable estimates with only a small fraction of all of the possible comparisons.

In many respects, MLDS is a stochastic, unidimensional counterpart of metric and non-metric multidimensional scaling methods (Shepard, 1962ab, 1974; Torgerson, 1958). In MLDS, the error model that explains observers’ inconsistencies in judging the same stimuli is made explicit and plays a central role. The use of maximum likelihood methods guarantees that MLDS is asymptotically unbiased and efficient (minimum variance). The simulations reported here suggest that MLDS exhibits little bias and narrow confidence intervals for parameters with only modest amounts of data. In that respect, it resembles psychometric function fitting methods (Wichmann & Hill, 2001ab). There are no comparable guarantees for traditional MDS methods.
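The central role of the explicit error model can be made concrete. Below is a minimal sketch of maximum likelihood fitting under a Gaussian decision model, not the authors’ implementation: the normalization Ψ_{1} = 0, Ψ_{n} = 1, the optimizer choice, and all function names are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, quads, resp):
    """Negative log-likelihood of quadruple judgments.  params holds the
    interior scale values psi_2..psi_{n-1} plus sigma; psi_1 = 0 and
    psi_n = 1 are fixed to remove the scale's arbitrary origin and unit.
    resp[t] = 1 if the observer judged interval (i,j) larger than
    interval (k,l) on trial t."""
    *interior, sigma = params
    psi = np.concatenate(([0.0], interior, [1.0]))
    i, j, k, l = quads.T
    delta = (psi[j] - psi[i]) - (psi[l] - psi[k])
    p = norm.cdf(delta / sigma)
    p = np.clip(p, 1e-10, 1 - 1e-10)   # guard against log(0)
    return -np.sum(resp * np.log(p) + (1 - resp) * np.log(1 - p))

def fit_mlds(n_stimuli, quads, resp):
    """Fit scale values and sigma by direct numerical maximization of
    the likelihood, starting from an equally spaced scale."""
    x0 = np.concatenate((np.linspace(0, 1, n_stimuli)[1:-1], [0.2]))
    bounds = [(None, None)] * (n_stimuli - 2) + [(1e-4, None)]
    fit = minimize(neg_log_likelihood, x0, args=(quads, resp),
                   bounds=bounds, method="L-BFGS-B")
    return np.concatenate(([0.0], fit.x[:-1], [1.0])), fit.x[-1]
```

Because the error model is explicit, the same likelihood immediately supports bootstrap confidence intervals and the hypothesis tests described above; a non-metric fitting procedure offers no such handle.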

We also showed how to test necessary and sufficient conditions for difference measurement in the case where observers’ estimates are perturbed by stochastic error. If both the ordinal and six-point conditions hold, then observers’ performance is consistent with a difference measurement model. If an observer’s performance is not consistent with the difference scaling model, then one of these two conditions must fail. By translating the tests of these conditions into statistical hypothesis tests, we, of course, leave open the possibility that the difference measurement model will fail to be rejected because the corresponding statistical test lacks power. However, nothing better is to be expected if we do not wish to reject the model because of minor stochastic variation in an observer’s judgment. There are analogous tests of necessary conditions for multidimensional representations (Hutchinson & Tversky, 1986; Gerrig et al., 1991) but, currently, no tests of necessary and sufficient conditions.

The methods and analyses presented in this article complement earlier work by Schneider and colleagues (Schneider, 1980ab; Schneider et al., 1974). The fitting method proposed by Schneider and colleagues is non-metric and does not explicitly model stochastic variability in observers’ responses. Consequently they cannot directly address hypothesis testing or issues of bias and efficiency. MLDS is asymptotically unbiased and efficient (minimum variance) because it estimates parameters by maximizing likelihood. There is no comparable guarantee for the method of Schneider and colleagues. Further, MLDS lends itself to hypothesis testing, most notably in evaluating the match between human performance and the axioms of difference measurement. Schneider and colleagues can argue that the number of axiom violations in a data set is small, but cannot reasonably decide whether the violations observed constitute grounds for rejecting the difference measurement model. As we noted above, for certain configurations where there are many near ties in interval lengths, we might expect a large number of six-point violations from an observer whose behavior is actually consistent with the six-point condition. MLDS allows us to determine whether the pattern of six-point violations observed is inconsistent with the six-point condition.

There are other, classical methods for assigning scale values to stimuli on a one-dimensional continuum (McIver & Carmines, 1981), of which Thurstonian scaling is perhaps the best known (Thurstone, 1927). These methods differ markedly from MLDS. They rely heavily on confusions between adjacent stimuli on a scale,^{6} while MLDS compares both large and small intervals. Scales estimated using Thurstonian scaling, for example, are known to be sensitive to the choice of the error distribution. We note that there is no reason to expect that a scale based on measurements of just-noticeable differences between stimuli could be used to predict suprathreshold perceptual differences.

The basic task in MLDS is a comparison of intervals (a quadruple). Other scaling methods use other tasks, notably the method of triads or duo-triads (Torgerson, 1958, pp. 262ff; Frijters, 1979; Ennis & Mullen, 1986). In one form of the method of triads, the observer considers three stimuli, A, B, and C and decides whether the difference between A and B is greater than, or less than, the difference between A and C. We can reinterpret this judgment as a comparison of intervals: the observer is deciding whether the interval AB is greater than, or less than, the interval AC. The method of triads, in this form, is evidently a special case of the quadruples task in which the four stimuli need not be distinct. In its earliest form (Torgerson, 1958, pp. 262ff), the observer must choose which of the three intervals, AB, AC, BC is smallest (or alternatively, largest). Again, the method of triads reduces to a comparison of intervals. Torgerson describes other methods that amount to comparison of intervals. Any method that permits the experimenter to establish an interval scale on a one-dimensional continuum of stimuli must, directly or indirectly, involve comparison of intervals.
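The reduction of triad judgments to interval comparisons can be stated as a short mapping. This is an illustrative sketch with hypothetical function names; stimulus labels are arbitrary indices into a scale.

```python
def triad_to_quadruple(a, b, c):
    """Re-express the method-of-triads judgment 'is the interval AB
    greater than the interval AC?' as the quadruple comparison
    (a, b; a, c).  The four positions need not hold distinct stimuli,
    which is why triads are a special case of quadruples."""
    return (a, b, a, c)

def smallest_interval_triad(psi, a, b, c):
    """Torgerson's earliest form of the triad task: which of the
    intervals AB, AC, BC is smallest?  Each answer is again, in effect,
    a comparison of intervals on the scale psi."""
    lengths = {'AB': abs(psi[b] - psi[a]),
               'AC': abs(psi[c] - psi[a]),
               'BC': abs(psi[c] - psi[b])}
    return min(lengths, key=lengths.get)
```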

The number of possible triads for N stimuli is evidently less than the number of possible quadruples. For ten stimuli, for example, there are 120 triads composed of three distinct stimuli and 210 quadruples composed of four distinct stimuli. It seems, at first glance, that an experiment involving triads should require fewer trials than an experiment involving quadruples. However, we showed in the ‘Evaluating Bias and Variability’ section that the number of

*actual* trials, not the number of *possible* trials, determines the standard deviations of the resulting scale estimates. The experimenter who seeks a lower standard deviation in estimating a scale on 10 stimuli is advised to repeat the 210 possible trials as many times as needed. In our simulations we show the effect of repeating the 210 possible trials up to four times (840 trials total). Conversely, we show that the experimenter who wishes to scale a large number of stimuli can use only a subset of the possible quadruples and still get reliable results. There is no reason to use one subset of possible quadruples (e.g., the ‘triadic quadruples’) rather than another simply because the former subset has fewer elements.

In all of our analyses we considered only quadruples over four distinct stimuli, with each quadruple used in a simulation chosen from the set of such quadruples with equal probability. This choice was arbitrary. It would be of interest to determine whether experiments based on some subsets of quadruples are more efficient than others. It would also be of interest to develop adaptive procedures that select the next quadruple to be presented based on the observer’s previous responses.
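The counting and subsampling arguments above can be sketched in a few lines. This is an illustrative sketch (the sampling scheme, uniform with replacement, matches the equal-probability choice described in the text; the helper name is ours).

```python
from itertools import combinations
from math import comb
import random

n = 10
# Unordered sets of distinct stimuli: C(10, 4) = 210 quadruples
# (i, j, k, l) with i < j < k < l, comparing interval (i, j) to (k, l).
quadruples = list(combinations(range(n), 4))
assert len(quadruples) == comb(n, 4)   # 210 for n = 10

def sample_trials(quadruples, n_trials, rng=None):
    """Draw n_trials quadruples uniformly with replacement, so that
    repeats of the same comparison accumulate as the trial count grows
    -- the actual number of trials, not the number of possible trials,
    is what drives the standard deviation of the scale estimates."""
    rng = rng or random.Random(0)
    return [rng.choice(quadruples) for _ in range(n_trials)]

trials = sample_trials(quadruples, 840)   # e.g., 210 quadruples x 4
```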

Scaling methods can be controversial (Gescheider, 1988). In advancing MLDS, we view it as a method for fitting a plausible model (difference measurement) to data. The model contains a precise description of how the observer arrives at each judgment including possible stochastic variation in response. The fitting methods and methods for setting confidence intervals and testing hypotheses are based on current practice in statistics. Moreover, MLDS is remarkably distributionally robust. Consequently, we suggest that the only controversy that might be relevant is whether the difference measurement model is an adequate model of human judgment of suprathreshold perceptual differences, a controversy that is best resolved by further experimentation with the method.

Acknowledgments

This research was funded in part by Grant EY08266 from the National Institutes of Health. LTM was also supported by Grant RG0109/1999-B from the Human Frontier Science Program. We thank Kenneth Knoblauch and Michael Landy for comments on an earlier draft of this article. We especially thank Kenneth Knoblauch and colleagues for permission to use the images and data in Figures 4 and 5. Commercial relationships: none.

Footnotes

^{2}We can also formulate a multiplicative model in which the observer compares ratios of ratios of magnitude rather than differences of differences and where the error is multiplicative. The resulting model, with an appropriate choice of multiplicative error distribution, is behaviorally indistinguishable from the additive-error form described here. The multiplicative model becomes the additive model if we simply take the logarithm of the decision variable.

^{6}Unlike most variants of Thurstonian scaling, which are based on analysis of a measured confusion matrix, Watson & Kreslake (2001) propose a method for finding just-noticeable differences (JNDs) within a given physical scale.

References

Berger, J. O. (1985). *Statistical decision theory and Bayesian analysis*. New York: Springer.

Efron, B., & Tibshirani, R. J. (1993). *An introduction to the bootstrap*. New York: Chapman & Hall.

Ennis, D. M., & Mullen, K. (1986). A multivariate model for discrimination methods. *Journal of Mathematical Psychology*, 30, 206–219.

Farell, B., & Pelli, D. G. (1998). Psychophysical methods, or how to measure a threshold and why. In J. G. Robson & R. H. S. Carpenter (Eds.), *A practical guide to vision research*. New York: Oxford University Press.

Frijters, J. E. R. (1979). Variations of the triangular method and the relationship of its unidimensional probabilistic models to three-alternative forced-choice signal detection theory models. *British Journal of Mathematical and Statistical Psychology*, 32, 229–241.

Gerrig, R. J., Maloney, L. T., & Tversky, A. (1991). Validating the dimensional structure of psychological spaces: Applications to personality and emotions. In D. R. Brown & J. E. K. Smith (Eds.), *Frontiers of mathematical psychology* (pp. 138–165). New York: Springer-Verlag.

Gersho, A., & Gray, R. M. (1991). *Vector quantization and signal compression*. Kluwer Academic Publishers.

Gescheider, G. A. (1988). Psychophysical scaling. *Annual Review of Psychology*, 39, 169–200.

Green, D. M., & Swets, J. A. (1966/1974). *Signal detection theory and psychophysics*. New York: Wiley. Reprinted 1974, New York: Krieger.

Hanselman, D., & Littlefield, B. (2001). *Mastering MATLAB 6: A comprehensive tutorial and reference*. Upper Saddle River, NJ: Prentice-Hall.

Hutchinson, J. W., & Tversky, A. (1986). Nearest neighbor analysis of psychological spaces. *Psychological Review*, 93, 3–22.

Knoblauch, K., Charrier, C., Cherifi, H., Yang, J. N., & Maloney, L. T. (1998). Difference scaling of image quality in compression-degraded images [Abstract]. *Perception (Supplement)*, 27, 174.

Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). *Foundations of measurement (Vol. 1): Additive and polynomial representation*. New York: Academic Press.

McCourt, M. E., & Blakeslee, B. (1994). Contrast-matching analysis of grating induction and suprathreshold contrast perception. *Journal of the Optical Society of America A*, 11(1), 14–24.

McIver, J. P., & Carmines, E. G. (1981). *Unidimensional scaling*. Beverly Hills, CA: Sage Publications.

Mood, A., Graybill, F. A., & Boes, D. C. (1974). *Introduction to the theory of statistics* (3rd ed., pp. 358–371). New York: McGraw-Hill.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). *Numerical recipes in C: The art of scientific computing* (2nd ed.). Cambridge, UK: Cambridge University Press.

Roberts, F. S. (1979). *Measurement theory with applications to decision making, utility and the social sciences*. Reading, MA: Addison-Wesley.

Schneider, B. (1980a). Individual loudness functions determined from direct comparisons of loudness intervals. *Perception & Psychophysics*, 28(6), 493–503.

Schneider, B. (1980b). A technique for the nonmetric analysis of paired comparisons of psychological intervals. *Psychometrika*, 45(3), 357–372.

Schneider, B., Parker, S., & Stein, D. (1974). The measurement of loudness using direct comparisons of sensory intervals. *Journal of Mathematical Psychology*, 11, 259–273.

Shepard, R. N. (1962a). The analysis of proximities: Multidimensional scaling with an unknown distance function. I. *Psychometrika*, 27, 125–140.

Shepard, R. N. (1962b). The analysis of proximities: Multidimensional scaling with an unknown distance function. II. *Psychometrika*, 27, 219–246.

Shepard, R. N. (1974). Representations of structure in similarity data: Problems and prospects. *Psychometrika*, 39, 373–421.

Takasaki, H. (1966). Lightness change of grays induced by change in reflectance of gray background. *Journal of the Optical Society of America*, 56, 504–509.

Thurstone, L. L. (1927). A law of comparative judgment. *Psychological Review*, 34, 273–286.

Torgerson, W. S. (1958). *Theory and methods of scaling*. New York: Wiley.

Ward, F., & Boynton, R. M. (1973). Scaling of large chromatic differences. *Vision Research*, 14, 943–949.

Watson, A. B., & Kreslake, L. (2001). Measurement of visual impairment scales for digital video. *Proceedings of the SPIE*, 4299, 79–89.

Whittle, P. (1992). Brightness, discriminability and the “Crispening Effect.” *Vision Research*, 32, 1493–1507.

Wichmann, F. A., & Hill, N. J. (2001a). The psychometric function: I. Fitting, sampling and goodness of fit. *Perception & Psychophysics*, 63, 1293–1313.

Wichmann, F. A., & Hill, N. J. (2001b). The psychometric function: II. Bootstrap-based confidence intervals and sampling. *Perception & Psychophysics*, 63, 1314–1329.

Wyszecki, G., & Stiles, W. S. (1982). *Color science: Concepts and methods, quantitative data and formulae* (2nd ed.). New York: Wiley.