H. R. Blackwell (1952) investigated the influence of different psychophysical methods and procedures on detection thresholds. He found that the temporal two-interval forced-choice method (2-IFC) combined with feedback, blocked constant stimulus presentation with few different stimulus intensities, and highly trained observers resulted in the “best” threshold estimates. This recommendation is in current practice in many psychophysical laboratories and has entered the psychophysicists' “folk wisdom” of how to run proper psychophysical experiments. However, Blackwell's recommendations explicitly require experienced observers, whereas many psychophysical studies, particularly with children or within a clinical setting, are performed with naïve observers. In a series of psychophysical experiments, we find a striking and consistent discrepancy between naïve observers' behavior and that reported for experienced observers by Blackwell: Naïve observers show the “best” threshold estimates for the spatial four-alternative forced-choice method (4-AFC) and the worst for the commonly employed temporal 2-IFC. We repeated our study with a highly experienced psychophysical observer, and he replicated Blackwell's findings exactly, thus suggesting that it is indeed the difference in psychophysical experience that causes the discrepancy between our findings and those of Blackwell. In addition, we explore the efficiency of different methods and show 4-AFC to be more than 3.5 times more efficient than 2-IFC under realistic conditions. While we have found that 4-AFC consistently gives lower thresholds than 2-IFC in detection tasks, we have found the opposite for discrimination tasks. This discrepancy suggests that there are large extrasensory influences on thresholds—sensory memory for IFC methods and spatial attention for spatial forced-choice methods—that are critical but, alas, not part of theoretical approaches to psychophysics such as signal detection theory.

*statistical* properties of different procedures (e.g., Garcia-Perez, 1998; Green, 1990; Kaernbach, 1991; Kontsevich & Tyler, 1999; Laming & Marsh, 1988; Leek, Hanna, & Marshall, 1992; Snoeren & Puts, 1997; Treutwein, 1995; Watson & Pelli, 1983) and threshold estimation once data collection is complete (e.g., Foster & Bischof, 1991; Kaernbach, 2001; Kuss, Jäkel, & Wichmann, 2005; Maloney, 1990; McKee, Klein, & Teller, 1985; Miller & Ulrich, 2001; Treutwein & Strasburger, 1999; Wichmann & Hill, 2001a, 2001b). In addition, signal detection theory (SDT; Green & Swets, 1988) offers a theoretical framework in which thresholds obtained using forced-choice methods can be converted to the equivalent single-interval thresholds and vice versa. SDT attempts to explain detection and discrimination performance in terms of sensory and decision processes: the (sensory) noise and signal-plus-noise distributions on the putative internal decision axis and the (decision) criterion adopted by the observer. We know, however, that this conception of detection and discrimination does not tell the whole story: Attention influences detection and discrimination, and at least under some circumstances, the influence manifests itself on sensory processing, for example, a sharpening of spatial frequency or orientation tuning (e.g., Itti, Koch, & Braun, 2000; Lee, Itti, Koch, & Braun, 1999). Thus, SDT's assumption of a fixed sensory front end has proved to be incorrect. This opens the possibility that different psychophysical methods require more or less attention and thus yield significantly different results despite the prediction of SDT that thresholds should be convertible using SDT formalisms. Furthermore, some methods or procedures may be easier to learn or may feel more “natural” to human observers, particularly naïve observers.
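The SDT conversion between single-interval and forced-choice thresholds mentioned above can be illustrated numerically. The sketch below uses the standard equal-variance Gaussian model (it is a generic illustration, not code from this study); the integration bounds and grid size are arbitrary choices.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()  # standard Gaussian

def dprime_from_2afc(p_correct: float) -> float:
    """Unbiased 2-AFC/2-IFC: p = Phi(d'/sqrt(2)), so d' = sqrt(2) * Phi^-1(p)."""
    return sqrt(2.0) * norm.inv_cdf(p_correct)

def pc_mafc(dprime: float, m: int, n: int = 2001) -> float:
    """Proportion correct for unbiased m-AFC under the equal-variance
    Gaussian model: P(c) = integral of phi(x - d') * Phi(x)**(m - 1) dx,
    evaluated here with a simple trapezoidal rule."""
    lo, hi = -8.0, 8.0 + dprime
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        w = h if 0 < i < n - 1 else h / 2.0  # trapezoidal end weights
        total += w * norm.pdf(x - dprime) * norm.cdf(x) ** (m - 1)
    return total
```

With d′ = 0 this reduces to chance performance (1/*m*), and for *m* = 2 it recovers p = Φ(d′/√2), which is the relation underlying the conversion between single-interval and forced-choice thresholds.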

*psychological* than statistical consequences of the different methods and procedures. The most thorough examination of different methods for visual threshold measurements was reported in a seminal paper by Blackwell (1952). Blackwell identified three criteria by which to judge psychophysical methods and procedures:

- Sensory determinacy. Methods that give lower thresholds are to be preferred as higher threshold values may indicate that the method makes observers more prone to unwanted extrasensory influences.
- Reliability. This refers to the extent to which threshold measurements vary over time under what seem to be identical experimental conditions.
- Inferred validity. This refers to the extent to which variables that are thought to be irrelevant influence threshold measurements.

- 2-AFC is to be preferred over yes–no tasks and 4-AFC.
- Forced choice should involve temporal intervals rather than spatial locations.
- Stimuli should be grouped into blocks of the same magnitude rather than being randomized; that is, a block design should be used.
- Use as few stimulus levels (e.g., signal intensities) as practicable.
- Feedback should be provided.
- Participants should have extensive experience in threshold measurements; that is, one should work with trained observers.

*efficiency* of the method of constant stimuli. Showing two alternatives at two spatial locations is almost twice as fast as showing two alternatives one after the other (efficiency, however, appears not to have been a concern for psychophysicists in the early 1950s; this is one of the very few aspects of psychophysical methods and procedures Blackwell did not explore).

*m*-AFC, with *m* taken from {2, 4, 8}. Suppose that an experimenter already has a rough idea about where the psychometric function lies and now has to choose some stimuli for presentation. Usually, experimenters will try to distribute the stimuli evenly such that they cover the whole range, but other sampling schemes are possible and more efficient (Wichmann & Hill, 2001a). The observer is then presented with *N* trials of one of these stimuli and produces a proportion *p* of correct answers. Assuming that the correct answers come from a binomial distribution with a fixed probability, the expected variance of the data (i.e., the variance of the number of correct answers) is given by *Np*(1 − *p*). In the right panel of Figure 1, we plot *p*(1 − *p*) for different stimulus values; it can be seen that for 2-AFC, there are large regions of stimulus space for which the expected variance is higher than the variance for *m*-AFC with *m* > 2. Indeed, unless the psychometric function was sampled very inefficiently using only positive stimulus values on the axes of Figure 1, the higher *m* is, the better the estimate of the psychometric function for a given number of trials. In addition, for greater *m*, the point of highest variance is shifted to the steeper part of the psychometric function, where changes in the stimulus result in greater changes in the response probability. Thus, it is worth exploring how well observers do in spatial 4- or 8-AFC, conditions not (or not satisfactorily) explored by Blackwell, and whether the increased efficiency is worth giving up Blackwell's Recommendations 1 and 2.
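The variance argument above can be sketched in a few lines. The parameter values below (Weibull scale, slope, and lapse rate) are arbitrary illustrative choices, not those of the study:

```python
import math

def psi(s: float, m: int, alpha: float = 1.0, beta: float = 3.0,
        lam: float = 0.01) -> float:
    """m-AFC psychometric function with guess rate 1/m, lapse rate lam,
    and a Weibull core F(s) = 1 - exp(-(s/alpha)**beta)."""
    g = 1.0 / m
    F = 1.0 - math.exp(-((s / alpha) ** beta))
    return g + (1.0 - g - lam) * F

# Expected per-trial binomial variance p(1 - p) over a stimulus range.
stimuli = [0.1 * i for i in range(1, 21)]
for m in (2, 4, 8):
    var = [psi(s, m) * (1.0 - psi(s, m)) for s in stimuli]
    s_max = stimuli[var.index(max(var))]
    print(f"{m}-AFC: highest expected variance at s = {s_max:.1f}")
```

For 2-AFC the variance stays close to its maximum of 0.25 over the entire sub-threshold range (where p ≈ 0.5), whereas for *m* > 2 the maximum sits on the steep part of the psychometric function, as argued in the text.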

*statistically* more efficient than the method of constant stimuli (e.g., Watson & Fitzhugh, 1990; this issue is not uncontroversial, however: see Hill, 2001, pp. 225–228). However, adaptive procedures violate Blackwell's Recommendations 3 and 4. On every trial, or on nearly every trial depending on the adaptive procedure, a new stimulus is presented to the observers, which may prevent them from learning to improve their performance for that stimulus. Furthermore, the frequent change of stimuli may make it hard for (naïve) observers to concentrate on particular features of the stimulus in question. Finally, some adaptive procedures are very sensitive to serial dependencies in the participant's responses, which mislead the procedure (Burns & Corpus, 2004; Friedman, Carterette, Nakatani, & Ahumada, 1968; Lages & Treisman, 1998; Treisman & Williams, 1984). Thus, it is not surprising that in real psychophysical experiments, the reliability of adaptive procedures is lower than that of constant-stimulus methods (Woods & Thomson, 1993). Whereas one may be willing to sacrifice reliability for speed in certain settings, reliability is usually more important for basic scientific questions. Similarly, unless one is investigating a phenomenological aspect of perception, AFC tasks are preferred to yes–no tasks, as shown by Blackwell and others (e.g., Derrington & Henning, 1981). Hence, in this article, we focus on forced-choice tasks and the method of constant stimuli, but we explore whether we can improve the efficiency of *m*-AFC by using more alternatives (*m* taken from {2, 4, 8}) without sacrificing reliability, inferred validity, and sensory determinacy (low threshold values).

*m* in an *m*-AFC paradigm may introduce new problems that possibly outweigh this advantage. For 2-AFC, response biases are usually thought to be low for trained observers, and they can be corrected for. This is not necessarily so when *m* > 2, which may pose an even more serious problem for naïve observers (Green & Swets, 1988). Any assessment of a psychophysical method should thus include an estimation of response biases, an issue that was not yet well appreciated in Blackwell's day.

*m*-AFC tasks in naïve and experienced observers. We followed Blackwell's Recommendations 3, 4, and 5, namely, a block design with as few signal intensities as practicable, combined with feedback. The number of signal levels, the amount of randomization, and whether or not feedback was provided were the variables that contributed to inferred validity in Blackwell's original study. We kept all of these at the values Blackwell found to be optimal, as we see no reason to change or re-explore them. Our main aim is to explore the consequences of giving up Recommendation 6 (experienced observers) for Recommendations 1 and 2: the number of response alternatives in forced-choice methods and whether temporal or spatial intervals should be used.

*detection* in a standard CSF measurement. Our main finding is that spatial 4-AFC is the most reliable and most efficient method. The other section is concerned with *discrimination* in a sinusoidal contrast discrimination task. Here, we concentrate on the extrasensory component that influences the results for a spatial 4-AFC task, namely, spatial attention. For discrimination, 4-AFC is still the most efficient method; however, its thresholds are no longer the lowest, but they are consistent across experimental variations.

^{2}; none of the detection targets changed the mean luminance of the display. Pixels on the monitor were carefully adjusted to be square with 0.39-mm sides. Observers sat in a dimly lit experimental cubicle at arm's length from the screen (38 cm) with their heads on a chin rest and viewed the screen binocularly. The experiment was controlled by special-purpose software using the MATLAB (MathWorks, Inc.) toolbox provided by Cambridge Research Systems. Stimuli (targets) were horizontally oriented sine wave gratings at five spatial frequencies: 0.5, 1.1, 2.1, 4.3, and 8.5 cpd, corresponding to wavelengths of 32, 16, 8, 4, and 2 pixels on the screen. All stimuli were bitmaps with a size of 99 × 99 pixels (5.9°); they were spatially vignetted using a modified Hanning window with a central circular patch of full contrast of radius 25 pixels (diameter, 3°); beyond this radius, the stimulus contrast was ramped down to zero with a cosine between radii of 25 and 50 pixels (ring of diameter, 3–5.9°). The (spatial) AFC methods presented the alternatives simultaneously on the screen. In the 8-AFC task, the possible locations on the screen were determined by the cells of a regular 3 × 3 grid; the central cell was not used. The eight possible locations for the center of the stimulus were (−50, −50), (−50, 0), (−50, 50), (0, −50), (0, 50), (50, −50), (50, 0), and (50, 50) pixels from the center of the screen. For the 4-AFC task, only the corners of this square were used. For the 2-AFC task, only the locations left and right of the center were used. This means that the stimulus appeared only at 2.9° eccentricity in the 2-AFC task, only at 4.2° in the 4-AFC task, but at both eccentricities in the 8-AFC task. A pilot study (data not shown) indicated that this difference in eccentricity did not have a large and systematic influence on detection thresholds, because the stimuli were large (5.9°/99 × 99 pixels) compared with the differences in center offsets.

The observers' responses during *m*-AFC were collected using a touch screen. The touch screen (IntelliTouch, ELO TouchSystems, with a 1,200 × 1,000 pixel resolution) was mounted as close as possible in front of the monitor using a frame built for this purpose. Pilot data (not shown) indicated that the mapping between monitor coordinates and touch screen coordinates was a simple affine transformation. The observers' responses were calibrated to a precision of a few millimeters by means of a least squares fit to 18 well-defined calibration targets displayed on the monitor prior to each experimental session. Pilot data further indicated that the variability in an observer's pointing movements to well-defined targets on the monitor had a standard deviation of 10 pixels (4 mm). For the 2-AFC and 4-AFC tasks, the response cells were 100 × 100 pixels in size, and responses could thus always be assigned to cells unambiguously. In the 8-AFC task, however, response cells were only 50 × 50 pixels in size. In our pilot study, we found very occasional misassignments for response cells of this size when observers were instructed to respond as fast as possible. In the experiments reported in this article, however, we instructed all observers to point as accurately as possible without undue time pressure; thus, we do not expect our data to be contaminated by a significant number of misassignments. Each trial for all conditions started with a fixation cross displayed at the center of the screen.
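A least squares fit of an affine touch-screen-to-monitor mapping, as described above, can be sketched as a linear regression problem. The code below is a generic illustration (the grid of calibration targets and the "true" distortion are invented), not the software used in the study:

```python
import numpy as np

def fit_affine(touch_xy: np.ndarray, monitor_xy: np.ndarray) -> np.ndarray:
    """Least-squares fit of a 2x3 affine matrix A mapping touch-screen
    coordinates to monitor coordinates: monitor ~= A @ [x, y, 1]."""
    n = touch_xy.shape[0]
    X = np.hstack([touch_xy, np.ones((n, 1))])   # n x 3 design matrix
    coeff, *_ = np.linalg.lstsq(X, monitor_xy, rcond=None)
    return coeff.T                                # 2 x 3

def apply_affine(A: np.ndarray, xy: np.ndarray) -> np.ndarray:
    """Apply a 2x3 affine matrix to an n x 2 array of points."""
    X = np.hstack([xy, np.ones((xy.shape[0], 1))])
    return X @ A.T

# Hypothetical calibration: 18 monitor targets on a grid (as in the
# text), distorted by a made-up affine map to simulate raw touch input.
targets = np.array([[100.0 * i, 100.0 * j]
                    for i in range(6) for j in range(3)])  # 18 points
A_true = np.array([[1.02, 0.01, -3.0],
                   [0.00, 0.98,  5.0]])
A_full = np.vstack([A_true, [0.0, 0.0, 1.0]])              # 3x3 homogeneous
touch = apply_affine(np.linalg.inv(A_full)[:2], targets)   # simulated touches
A_hat = fit_affine(touch, targets)                         # recovered map
```

With noiseless points the fit recovers the distortion exactly; with real pointing noise (SD ≈ 10 pixels, as reported) the least squares solution averages it out over the 18 targets.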

*m*-AFC task, the centers of the *m* different alternatives where a target could appear were marked with single pixels. One of the *m* alternatives was picked randomly and independently on every trial with equal probability. After 100 ms, a beep indicated the start of the stimulus presentation. The fixation cross and the marks for the alternatives disappeared, and 200 ms later, the target sine wave grating was presented. The temporal characteristics of the target presentation followed a modified Hanning window (100 ms fading in with a cosine ramp, 100 ms at nominal contrast, 100 ms fading out with a cosine ramp). Another beep indicated the end of the presentation; the fixation cross and marks reappeared, and observers touched the mark on the screen where they believed the target to have been.

*m*-AFC task (nominal presentation time, 300 ms); the interstimulus interval lasted 700 ms, and the beginning of both observation intervals was marked by a beep. A third beep prompted the observer to respond using a button box.

*m*-AFC, as well as the IFC variants, received auditory feedback as to whether their response had been correct. Altogether, there were 25 experimental conditions per observer: five spatial frequencies (0.5, 1.1, 2.1, 4.3, and 8.5 cpd) and five methods (2-AFC, 4-AFC, 8-AFC, 2-IFCp, and 2-IFCf). For each condition, we obtained the psychometric function relating the probability of a correct response to contrast. We measured two psychometric functions, with eight stimuli and 400 trials each, on two different days for each observer and condition; this allows us to assess the stability of the psychometric function over time. Four naïve observers (K.P., F.E., R.Z., and D.C.; two female, two male; mean age, 25 years) and one highly experienced psychophysicist who has performed at least 1 million 2-IFC trials during his distinguished career (G.B.H.) took part in this study. In total, we thus conducted nominally 5 × 5 × 5 × 800 = 100,000 detection trials. (The total was not exactly 100,000: Participant D.C. performed only 400 trials instead of 800 for 2 of his 25 conditions, and other participants did more than 800 trials for some conditions. The total number of trials that we analyzed is 107,850. If we were to include the 5 trials at the beginning of each block that were discarded for the analysis, we would have 118,635 trials altogether.)

*m*-AFC tasks and controlled eye movements with an eye tracker (Eyelink II, SR Research). This was done to check whether it was possible for the naïve observers to keep their fixation on the fixation cross during the various spatial *m*-AFC conditions. Fixation stability of R.Z. was very good: In the extremely rare cases in which R.Z. initiated a saccade, it would only start after the stimulus had already disappeared.

*s*: Ψ(*s*) = 1/*m* + (1 − 1/*m* − *λ*) *F*(*s*; *α*, *β*), with a guess rate of 1/*m* determined by the number of alternatives *m*, a lapse parameter *λ*, and a function *F* with two parameters *α* and *β* that control the threshold and the slope of the psychometric function. Here, we have always chosen *F* to be a Weibull function. Figure 2 shows the functions for one observer at one spatial frequency measured with the five different tasks.

*m*) and the maximum performance of the observer (determined by his or her lapse rate), that is, the stimulus *s* for which *F*(*s*; *α*, *β*) = 0.5. Contrast sensitivity is defined as the reciprocal of this 50% threshold.
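For the Weibull choice of *F*, this threshold has a closed form. The snippet below is a small sketch of the definition just given (not the study's fitting code):

```python
import math

def weibull_F(s: float, alpha: float, beta: float) -> float:
    """Weibull core of the psychometric function:
    F(s; alpha, beta) = 1 - exp(-(s/alpha)**beta)."""
    return 1.0 - math.exp(-((s / alpha) ** beta))

def threshold_50(alpha: float, beta: float) -> float:
    """Stimulus s with F(s; alpha, beta) = 0.5: solving
    1 - exp(-(s/alpha)**beta) = 0.5 gives s = alpha * ln(2)**(1/beta)."""
    return alpha * math.log(2.0) ** (1.0 / beta)

def contrast_sensitivity(alpha: float, beta: float) -> float:
    """Reciprocal of the 50% threshold, as defined in the text."""
    return 1.0 / threshold_50(alpha, beta)
```

Note that this threshold is defined on *F*, not on the full function Ψ, so it is independent of the number of alternatives *m* and of the lapse rate.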

*χ*^{2} statistic, but more generally appropriate for maximum-likelihood fitting in contexts other than the least squares setting. Given a fit, Monte Carlo simulations can determine the distribution of this summary statistic, against which the calculated value can be compared to judge significance.
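The Monte Carlo assessment of goodness of fit can be sketched as follows. This is a generic binomial-deviance illustration with arbitrary block sizes and model probabilities, not the study's code:

```python
import math
import random

def deviance(n_corr, n_trials, p_model):
    """Binomial deviance, 2 * (LL_saturated - LL_model), summed over
    blocks; n_corr are observed correct counts, p_model the fitted
    probabilities of a correct response."""
    D = 0.0
    for k, n, p in zip(n_corr, n_trials, p_model):
        phat = k / n
        for obs, pred in ((phat, p), (1.0 - phat, 1.0 - p)):
            if obs > 0.0:  # convention: 0 * log(0) = 0
                D += 2.0 * n * obs * math.log(obs / pred)
    return D

def deviance_distribution(n_trials, p_model, n_sim=1000, seed=0):
    """Monte Carlo distribution of the deviance under the fitted model:
    simulate binomial data from p_model and recompute the statistic."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sim):
        counts = [sum(rng.random() < p for _ in range(n))
                  for n, p in zip(n_trials, p_model)]
        sims.append(deviance(counts, n_trials, p_model))
    return sorted(sims)
```

An observed deviance far in the upper tail of the simulated distribution signals a fit worse than expected from binomial variability alone.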

*m* to lead to smaller confidence intervals. In the following, we only consider the most frequently used summary statistic: the 50% threshold (as before, by 50% threshold we refer to the stimulus intensity *s* such that *F*(*s*; *α*, *β*) = 0.5 and *not* Ψ(*s*; *α*, *β*) = 0.5). For each fit, we also calculated bootstrapped standard deviations for this quantity. However, some care has to be taken when comparing these because, for sinusoidal grating detection, the slope of the psychometric function is correlated with the threshold (for K.P., the correlation was 0.80; for F.E., 0.52; for R.Z., 0.83; for D.C., 0.64; and for G.B.H., 0.83), and therefore, higher thresholds imply larger confidence intervals. As we have found that IFC has higher thresholds (Figure 4), a direct comparison of the confidence intervals would not be fair. Instead, we calculate the ratio of the bootstrapped standard deviation to the threshold; we call this measure threshold uncertainty. We determined threshold uncertainties in the two cases already considered above: (a) the two runs from the two different days are pooled (800 trials per fit) and (b) one psychometric function is fitted for each day separately (only 400 trials per fit but twice the number of fits). Median values for both cases are given in Table 1. The median value for 4- and 8-AFC is smaller than that for 2-AFC and 2-IFC. The improvement is substantial: The median uncertainty for 4- and 8-AFC after 400 trials is already smaller than that for 2-AFC and 2-IFC after 800 trials. For the timings of our study, this means that 17 min of 4-AFC provide as much information about threshold as 56 min of 2-IFC—lower thresholds and higher reliability in less than a third of the time.

| | 2-AFC | 4-AFC | 8-AFC | 2-IFCp | 2-IFCf |
|---|---|---|---|---|---|
| 400 trials (%) | 5.1 | 3.7 | 3.5 | 5.0 | 6.1 |
| 800 trials (%) | 4.3 | 2.7 | 2.5 | 3.9 | 4.5 |
| Ratio | 1.19 | 1.37 | 1.40 | 1.28 | 1.35 |
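The threshold uncertainty measure (bootstrap standard deviation divided by the threshold) can be sketched with a simplified parametric bootstrap. The fixed slope, grid-search fit, and stimulus levels below are simplifications for illustration, not the study's fitting procedure:

```python
import math
import random
import statistics

BETA = 3.0  # slope fixed for this sketch; the study fits it as well

def psi(s, m, alpha, lam=0.01):
    """m-AFC psychometric function with a Weibull core."""
    g = 1.0 / m
    return g + (1.0 - g - lam) * (1.0 - math.exp(-((s / alpha) ** BETA)))

def fit_alpha(stimuli, n_corr, n_trials, m, grid):
    """Crude maximum-likelihood fit of the Weibull scale over a grid."""
    def nll(a):
        total = 0.0
        for s, k, n in zip(stimuli, n_corr, n_trials):
            p = min(max(psi(s, m, a), 1e-9), 1.0 - 1e-9)
            total -= k * math.log(p) + (n - k) * math.log(1.0 - p)
        return total
    return min(grid, key=nll)

def threshold_uncertainty(stimuli, n_corr, n_trials, m,
                          n_boot=200, seed=0):
    """Bootstrap SD of the 50% threshold divided by the threshold."""
    grid = [0.005 + 0.001 * i for i in range(96)]  # alpha candidates
    rng = random.Random(seed)
    to_thr = lambda a: a * math.log(2.0) ** (1.0 / BETA)  # F = 0.5
    a_hat = fit_alpha(stimuli, n_corr, n_trials, m, grid)
    boot = []
    for _ in range(n_boot):
        sim = [sum(rng.random() < psi(s, m, a_hat) for _ in range(n))
               for s, n in zip(stimuli, n_trials)]
        boot.append(to_thr(fit_alpha(stimuli, sim, n_trials, m, grid)))
    return statistics.pstdev(boot) / to_thr(a_hat)
```

Because the measure is a ratio, it is comparable across methods even though higher thresholds come with wider confidence intervals, which is exactly the point made in the text.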

*N* = {50, 100, 150, …, 800} trials, to which we fitted psychometric functions and obtained confidence intervals as described above. The subsamples were taken randomly in proportion from each of the 16 blocks; for example, for a subsample of 100 trials, 12 of the blocks would be represented by 6 trials and 4 blocks by 7 trials. Thus, for each complete resampling run, we obtain 25 threshold uncertainties, one for each observer and spatial frequency. We repeated this procedure several times (the number of repetitions differed with *N*, starting with 50 repetitions for a subsample of 50 trials and going down to 1 repetition for 800 trials), resulting in a distribution of threshold uncertainties for each *N*. Figure 6 shows the medians and the 25–75% quantiles of the threshold uncertainty distributions thus obtained. The solid lines in Figure 6 plot the expected decrease in confidence interval widths based on the intervals for *N* = 800, scaled as ∼1/√*N*.
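The proportional subsampling scheme described above can be sketched as follows (the block contents here are generic placeholders):

```python
import random

def proportional_subsample(blocks, n_sub, seed=0):
    """Draw n_sub trials spread as evenly as possible over the blocks.
    With b blocks, each contributes floor(n_sub / b) trials, and the
    remainder r is covered by one extra trial in r randomly chosen
    blocks; e.g., 100 trials from 16 blocks -> 12 blocks give 6 trials
    and 4 blocks give 7, as in the example in the text."""
    rng = random.Random(seed)
    b = len(blocks)
    base, extra = divmod(n_sub, b)
    counts = [base + 1] * extra + [base] * (b - extra)
    rng.shuffle(counts)                    # which blocks give the extra trial
    return [t for blk, c in zip(blocks, counts)
            for t in rng.sample(blk, c)]   # sample without replacement per block
```

Sampling within each block (rather than from the pooled trials) preserves the balanced representation of blocks that the text describes.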

*m*-AFC case where *m* is greater than 2. The reason for this is that the mathematics required for the generalization to the *m*-alternative case is “rather clumsy” (Luce, 1963) and the numerous assumptions necessary have much less empirical support than those required for yes–no or 2-AFC methods. Luce's choice model, on the other hand, is much simpler and, in most cases, a viable alternative to SDT. For yes–no, Luce's choice model leads to ROC curves that are very similar to those of the Gaussian equal-variance signal detection model. The few studies that compared the signal detection model with Luce's choice model have found that the signal detection model fits the data slightly better but that Luce's choice model is in any case a very good approximation (Luce, 1963, 1977; Treisman & Faulkner, 1985). However, for our purposes, Luce's choice model has the advantage that it is straightforward to generalize to *m*-AFC (Luce, 1963) and that its bias term is easy to interpret. Hence, we will use Luce's choice model to separate sensitivity from response bias, be it temporal or position bias. In this model, the probability of responding with alternative *i* given that the stimulus *s* is presented at alternative *j* is given by

P(*i* | *s*, *j*) = *b*_{i} *η*_{s,i,j} / Σ_{k} *b*_{k} *η*_{s,k,j}.

The *b*_{i} can be interpreted as bias terms. If their sum is normalized to 1, they give the a priori probability of the participant responding with a certain alternative, irrespective of performance level. The *η*_{s,i,j} model the sensitivity of the participant to stimulus *s*. If the sum of all *η*_{s,k,j} over *k* is normalized to 1, we can interpret them as the response probabilities of an unbiased observer. It is usually assumed that the probability that an unbiased observer correctly detects the stimulus does not depend on the alternative *j* at which it is presented; that is, the sensitivity is the same for all alternatives. If it is further assumed that, for an unbiased observer, the errors are spread evenly among all wrong alternatives, one parameter *η*_{s} is enough to model the sensitivity of the participant. In this case, the *η*_{s,i,j} are chosen to be *η*_{s,j,j} = *η*_{s} and *η*_{s,i,j} = (1 − *η*_{s})/(*m* − 1) for *i* ≠ *j*. The model can be fitted by maximizing the likelihood of the data, optimizing over the *m* response bias terms and the sensitivity term for each block. This is what we have done for all participants and all our methods. One example for observer D.C. is shown in Figure 7.
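A minimal sketch of this fitting procedure for the 2-alternative case is given below. The study fits the full *m*-alternative model with a proper optimizer; the grid-search resolution and trial format here are arbitrary illustrative choices:

```python
import math
import random

def luce_prob(i, j, eta, biases):
    """P(respond i | stimulus at alternative j) under Luce's choice
    model with a single sensitivity parameter: eta_jj = eta and
    eta_ij = (1 - eta)/(m - 1) for i != j."""
    m = len(biases)
    def e(a, b):
        return eta if a == b else (1.0 - eta) / (m - 1)
    return biases[i] * e(i, j) / sum(biases[k] * e(k, j) for k in range(m))

def fit_luce_2afc(trials, steps=40):
    """Maximum-likelihood grid search over eta and the bias b_0
    (with b_1 = 1 - b_0); trials is a list of
    (stimulus_location, response) pairs with locations 0 and 1."""
    best_eta, best_b0, best_ll = 0.5, 0.5, -math.inf
    for ei in range(1, steps):
        eta = ei / steps
        for bi in range(1, steps):
            b = (bi / steps, 1.0 - bi / steps)
            ll = sum(math.log(luce_prob(r, j, eta, b)) for j, r in trials)
            if ll > best_ll:
                best_eta, best_b0, best_ll = eta, b[0], ll
    return best_eta, best_b0
```

For an unbiased observer (b₀ = 0.5), the recovered *η* is simply the probability of a correct response; a b₀ far from 0.5 reveals a position (or, for IFC, interval) bias irrespective of performance level, which is exactly the separation the model is used for here.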

*detection* tasks and does not generalize to the arguably more common *discrimination* tasks. Thus, we also explored the influence of 2-IFC, 2-AFC, and 4-AFC on a contrast discrimination task (we are grateful to an anonymous reviewer for suggesting this to us).

^{2}, and the refresh rate was 140 Hz. The experimental procedures with all parameters were identical to those used for detection; the only change was in the stimuli.