The workhorse behavioral paradigm in many areas of human and animal research (e.g., perception, memory, decision making) is the two-alternative identification task. On each trial of an experiment, the subject is presented with a stimulus drawn from one of two categories, a and b, and is required to identify the category. The category identified by the subject (the subject's response) can be represented with capital letters A and B. Thus, on each trial there are four possibilities: The subject responds B when the category is b (B|b), the subject responds B when the category is a (B|a), the subject responds A when the category is a (A|a), or the subject responds A when the category is b (A|b). We note that the common two-alternative forced-choice task can be regarded as a special case where category a is a pair of stimuli in one spatial or temporal order and category b is a pair of stimuli in the opposite order.
Signal-detection theory (SDT) is the standard theoretical framework for interpreting the data measured in the two-alternative identification task (e.g., Green & Swets, 1974; Wickens, 2001). The theory assumes that the end result of all the processing in the organism, prior to the behavioral response, can be represented with a single, probabilistic decision variable that is described by one normal distribution if the stimulus is from category b and another normal distribution if the stimulus is from category a. The behavioral response is assumed to be generated by comparing the value of the decision variable on the trial to a criterion placed along the decision-variable axis. The subject makes one response if the value falls below the criterion and makes the other response if it falls above the criterion. In Bayesian statistical models, this decision variable would typically correspond to the log likelihood ratio: the log of the ratio of the probability of the modeled pattern of neural activity given category b to the probability given category a. However, it is important to point out that basic SDT makes no assumption about how the decision variable is computed. Its purpose is only to provide a principled interpretation of the four possible outcomes of trials in the identification task. Indeed, basic SDT can provide a principled interpretation even if there is no location within the nervous system where a single decision variable is represented.
Within the SDT framework, the probabilities of the four possible stimulus–response outcomes are determined by two parameters: the number \(d^{\prime}\) of standard deviations separating the two distributions, and the decision criterion \(\gamma\). Thus, without loss of generality, the two normal distributions can be represented as normal distributions with standard deviations of 1.0 and means at \(-d^{\prime}/2\) and \(d^{\prime}/2\) (Figure 1). In SDT, the parameter \(d^{\prime}\) represents the intrinsic discriminability of the two categories, independent of the particular decision criterion adopted by the subject.
A typical use of SDT is to estimate the subject's discriminability and decision criterion from the proportions of trials falling into the four possible stimulus–response outcomes already mentioned. Because the two proportions for each category must sum to 1.0, estimates of discriminability and criterion are obtained from just two proportions, for example, p(B|b) and p(B|a). In the special case of a yes–no detection task these two proportions are the proportions of hits and false alarms (Green & Swets, 1974). These two proportions correspond to the areas under the two normal distributions above the decision criterion.
The estimates of the discriminability and the criterion are obtained by solving the following pair of equations:
\begin{equation}\tag{1}p\left( {B\left| b \right.} \right) = \Phi \left( {{{d^{\prime} } \over 2} - \gamma } \right)\end{equation}
\begin{equation}\tag{2}p\left( {B\left| a \right.} \right) = \Phi \left( { - {{d^{\prime} } \over 2} - \gamma } \right),\!\end{equation}
where \(\Phi \left( \cdot \right)\) is the standard normal integral function (i.e., the cumulative distribution function of the standard normal). Two subjects could differ in their intrinsic ability to identify the categories (the values of \(d^{\prime}\)), in their decision biases (the values of \(\gamma\)), or in both. SDT provides a principled framework for measuring these quantities, allowing better comparison of performance across subjects or between subjects and models.
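As a worked illustration, Equations 1 and 2 can be inverted in closed form. Applying \(\Phi^{-1}\) to both sides gives \(d^{\prime} = \Phi^{-1}[p(B|b)] - \Phi^{-1}[p(B|a)]\) and \(\gamma = -\{\Phi^{-1}[p(B|b)] + \Phi^{-1}[p(B|a)]\}/2\). The following is a minimal Python sketch of that inversion. The function name `sdt_estimates` and the example proportions are illustrative assumptions rather than anything specified in the text, and proportions of exactly 0 or 1 are not handled.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def sdt_estimates(p_B_given_b, p_B_given_a):
    """Solve Equations 1 and 2 for (d_prime, gamma).

    Both proportions must lie strictly between 0 and 1.
    """
    z_b = _STD_NORMAL.inv_cdf(p_B_given_b)  # equals d'/2 - gamma
    z_a = _STD_NORMAL.inv_cdf(p_B_given_a)  # equals -d'/2 - gamma
    d_prime = z_b - z_a
    gamma = -(z_b + z_a) / 2.0
    return d_prime, gamma


# Hypothetical proportions, chosen only for illustration.
d_prime, gamma = sdt_estimates(0.80, 0.30)
print(f"d' = {d_prime:.3f}, gamma = {gamma:.3f}")
```

Two subjects with the same `d_prime` but different `gamma` values would, under this parameterization, differ only in decision bias.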
The main goal here is to describe an extension of the standard SDT framework that allows measurement of two additional quantities (two decision-variable correlations) that provide increased power for comparing subjects and testing models. This is a straightforward extension related to existing signal-detection analyses, including the estimation of classification images (Ahumada & Lovell, 1971; Ahumada, 1996; for a review, see Murray, 2011), estimation of the ratio of external to internal noise (Spiegel & Green, 1981; Burgess & Colborne, 1988), and estimation of choice probabilities in neurophysiology experiments (Britten, Newsome, Shadlen, Celebrini, & Movshon, 1996; Haefner, Gerwinn, Macke, & Bethge, 2013; Pitkow, Liu, Angelaki, DeAngelis, & Pouget, 2015; Michelson, Pillow, & Seidemann, 2017; Seidemann & Geisler, in press).
Importantly, the extension proposed here does not require any special or new experimental design features; it can be applied to data from almost any two-alternative identification experiment in which the stimuli within each of the two categories are not physically identical on each trial. The only requirement is that the specific category and stimulus presented on each trial be known. There are many published studies where the experimenters saved the specific stimuli, or the information necessary to generate the stimuli (e.g., random-number seeds); however, there are also many studies where that is not the case. Another goal of this report is to emphasize the value of saving the stimuli, or the information for regenerating them, and the value of making the specific stimuli and responses publicly available.
The extended SDT framework is illustrated in Figure 2. We assume that a subject and an arbitrary model (or two subjects, or the same subject on two occasions) are performing the same task with the same stimuli. Thus, we can represent the subject's normalized decision variable along the horizontal axis and the model's (or the second subject's) normalized decision variable along the vertical axis. The subject will have some arbitrary discriminability \({d^{\prime} _s}\) and some arbitrary decision criterion \({\gamma _s}\). Similarly, the model will have some arbitrary discriminability \({d^{\prime} _m}\) and some arbitrary decision criterion \({\gamma _m}\). As usual, Equations 1 and 2 can be used to estimate the discriminability and decision criterion of the subject and model from their respective response proportions \({p_s}\left( {B\left| b \right.} \right)\), \({p_s}\left( {B\left| a \right.} \right)\) and \({p_m}\left( {B\left| b \right.} \right)\), \({p_m}\left( {B\left| a \right.} \right)\).
An accurate model would have the same discriminability as the subject under the same conditions. However, even if a model has the same discriminability as the subject, the model and subject may vary in how correlated their responses are on a trial-by-trial basis. In the SDT framework, this is represented by the correlation between the decision variables. The better the model (or the more similar the two subjects), the higher is the correlation between the values of their decision variables (i.e., the more accurate the trial-by-trial predictions). These decision-variable correlations (DVCs) need not be the same for the two categories of stimuli. Thus, by measuring DVCs in addition to discriminabilities and decision criteria, we should have more power to discriminate between models, estimate model parameters, and characterize subjects based on individual differences.
The joint distribution of the two decision variables is given by one bivariate normal distribution for category a and another bivariate normal distribution for category b (represented by the ellipses in Figure 2). All the standard deviations are 1.0, because the decision variables are normalized. The mean for category a is at \(\left( { - {{d^{\prime} }_s}/2, - {{d^{\prime} }_m}/2} \right)\) and for category b is at \(\left( {{{d^{\prime} }_s}/2,{{d^{\prime} }_m}/2} \right)\). The DVC for category a is \({\rho _a}\) and for category b is \({\rho _b}\). Formally, the two distributions are given by
\begin{equation}\tag{3}bv{n_b}\left( {{z_s},{z_m}} \right) = \phi \left( {{z_s} - {{{{d^{\prime} }_s}} \over 2},{z_m} - {{{{d^{\prime} }_m}} \over 2};{\rho _b}} \right)\end{equation}
\begin{equation}\tag{4}bv{n_a}\left( {{z_s},{z_m}} \right) = \phi \left( {{z_s} + {{{{d^{\prime} }_s}} \over 2},{z_m} + {{{{d^{\prime} }_m}} \over 2};{\rho _a}} \right),\!\end{equation}
where \(\phi \left( {x,y;\rho } \right)\) is the standard bivariate normal density function
\begin{equation}\tag{5}\phi \left( {x,y;\rho } \right) = {1 \over {2\pi \sqrt {1 - {\rho ^2}} }}\exp \left[ { - {{{x^2} - 2\rho xy + {y^2}} \over {2\left( {1 - {\rho ^2}} \right)}}} \right].\end{equation}
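Equations 3–5 translate almost directly into code. The sketch below is a plain-Python transcription; the function names and the `category` argument are illustrative assumptions introduced here, not notation from the text.

```python
import math


def std_bvn_density(x, y, rho):
    """Equation 5: standard bivariate normal density with correlation rho."""
    one_minus_r2 = 1.0 - rho * rho
    quad = (x * x - 2.0 * rho * x * y + y * y) / (2.0 * one_minus_r2)
    return math.exp(-quad) / (2.0 * math.pi * math.sqrt(one_minus_r2))


def bvn_category(z_s, z_m, d_s, d_m, rho, category):
    """Equations 3 and 4: joint density of the normalized decision variables.

    category is "b" (means at +d'/2) or "a" (means at -d'/2);
    rho is the DVC for that category.
    """
    if category not in ("a", "b"):
        raise ValueError("category must be 'a' or 'b'")
    sign = 1.0 if category == "b" else -1.0
    return std_bvn_density(z_s - sign * d_s / 2.0,
                           z_m - sign * d_m / 2.0,
                           rho)
```

Evaluating `bvn_category` at the category mean returns the peak density, \(1/(2\pi\sqrt{1-\rho^2})\), for either category.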
Once the discriminability and criterion values have been estimated from Equations 1 and 2, the DVCs can be estimated from four proportions that are directly available from the subject and model responses to the stimuli in each category. For category a they are the proportion of times the subject and the model both responded A, p(AA|a); the proportion of times the subject responded A and the model responded B, p(AB|a); the proportion of times the subject responded B and the model responded A, p(BA|a); and the proportion of times the subject and the model both responded B, p(BB|a). Similarly, for category b, the proportions are p(AA|b), p(AB|b), p(BA|b), and p(BB|b). These proportions correspond to the volumes under the bivariate normal distribution within the four quadrants defined by the two decision criteria. As long as no more than one of the four proportions is zero, it is straightforward to obtain the maximum-likelihood estimate of the DVC using Equations 3–5 (see Appendix). This estimate of the DVC is Pearson's tetrachoric correlation coefficient (Pearson, 1900; Stuart & Ord, 1991) applied in the SDT framework.
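The estimation step described above can be sketched in a few lines of Python. This is one possible implementation under the assumptions of Equations 3–5, and it is not necessarily the procedure given in the Appendix: the quadrant volumes are computed by one-dimensional numerical integration (Simpson's rule), and the likelihood is maximized by a golden-section search with the correlation restricted to [-0.99, 0.99]. All function names, the integration settings, and the example counts are assumptions made for illustration.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def upper_quadrant_prob(h, k, rho, n_steps=400):
    """P(X > h, Y > k) for a standard bivariate normal with correlation rho.

    Integrates phi(x) * P(Y > k | X = x) over x >= h with Simpson's rule
    (n_steps must be even); the upper limit is truncated where phi(x) ~ 0.
    """
    s = math.sqrt(1.0 - rho * rho)
    lo, hi = h, max(h, 0.0) + 8.0
    step = (hi - lo) / n_steps

    def integrand(x):
        return _STD.pdf(x) * (1.0 - _STD.cdf((k - rho * x) / s))

    total = integrand(lo) + integrand(hi)
    for i in range(1, n_steps):
        total += (4.0 if i % 2 else 2.0) * integrand(lo + i * step)
    return total * step / 3.0


def quadrant_probs(d_s, gamma_s, d_m, gamma_m, rho, category):
    """Probabilities of the (subject, model) response pairs for one category."""
    sign = 1.0 if category == "b" else -1.0
    h = gamma_s - sign * d_s / 2.0  # subject criterion relative to the mean
    k = gamma_m - sign * d_m / 2.0  # model criterion relative to the mean
    p_bb = upper_quadrant_prob(h, k, rho)
    p_ba = (1.0 - _STD.cdf(h)) - p_bb  # subject B, model A
    p_ab = (1.0 - _STD.cdf(k)) - p_bb  # subject A, model B
    p_aa = 1.0 - p_bb - p_ba - p_ab
    return {"AA": p_aa, "AB": p_ab, "BA": p_ba, "BB": p_bb}


def estimate_dvc(counts, d_s, gamma_s, d_m, gamma_m, category, tol=1e-4):
    """Maximum-likelihood rho for one category via golden-section search."""

    def log_lik(rho):
        probs = quadrant_probs(d_s, gamma_s, d_m, gamma_m, rho, category)
        return sum(n * math.log(max(probs[key], 1e-12))  # guard log(0)
                   for key, n in counts.items())

    g = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = -0.99, 0.99
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    f_c, f_d = log_lik(c), log_lik(d)
    while hi - lo > tol:
        if f_c > f_d:
            hi, d, f_d = d, c, f_c
            c = hi - g * (hi - lo)
            f_c = log_lik(c)
        else:
            lo, c, f_c = c, d, f_d
            d = lo + g * (hi - lo)
            f_d = log_lik(d)
    return (lo + hi) / 2.0


# Hypothetical counts for category b (keys are subject response, model response).
counts_b = {"AA": 200, "AB": 100, "BA": 100, "BB": 200}
rho_b = estimate_dvc(counts_b, d_s=1.0, gamma_s=0.5, d_m=1.0, gamma_m=0.5,
                     category="b")
print(f"estimated DVC for category b: {rho_b:.2f}")
```

The golden-section search relies on the log likelihood having a single maximum in the correlation, which holds here because the quadrant volume increases monotonically with the correlation and the multinomial log likelihood is concave in that volume.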
Maximum-likelihood estimation of the DVC from these four proportions is appropriate when only the categorical responses (A or B) are available, as when estimating DVCs between two subjects (or within the same subject on different occasions). However, for many (but not all) models, a decision variable is explicitly computed for each stimulus presentation. In this case, more reliable maximum-likelihood estimates can be obtained by directly using the values of the model's decision variable (see Appendix). We emphasize, nonetheless, that DVCs can be measured even for a model that does not produce an explicit decision variable (e.g., some neural-network models).
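One way to use the model's decision-variable values directly is sketched below. It follows from the bivariate normal assumption of Equations 3–5 and is not necessarily identical to the procedure in the Appendix. Given the model's normalized value \(z_m\) on a trial from a category with means \(\mu_s\) and \(\mu_m\), the subject's decision variable is conditionally normal with mean \(\mu_s + \rho(z_m - \mu_m)\) and variance \(1 - \rho^2\), so the probability that the subject responds B is \(\Phi\{[\mu_s + \rho(z_m - \mu_m) - \gamma_s]/\sqrt{1 - \rho^2}\}\). The code assumes the model's decision variable has already been normalized to unit standard deviation with means at \(\pm d^{\prime}_m/2\). The function names, the grid search over the correlation, and the simulation helper are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()


def prob_subject_B(z_m, rho, d_s, gamma_s, d_m, category):
    """P(subject responds B | model's normalized decision variable z_m)."""
    sign = 1.0 if category == "b" else -1.0
    mu_s, mu_m = sign * d_s / 2.0, sign * d_m / 2.0
    cond_mean = mu_s + rho * (z_m - mu_m)
    cond_sd = math.sqrt(1.0 - rho * rho)
    return _STD.cdf((cond_mean - gamma_s) / cond_sd)


def log_likelihood(rho, z_m_values, said_B, d_s, gamma_s, d_m, category):
    """Log likelihood of the subject's responses for one category."""
    total = 0.0
    for z_m, b in zip(z_m_values, said_B):
        p = prob_subject_B(z_m, rho, d_s, gamma_s, d_m, category)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p if b else 1.0 - p)
    return total


def estimate_dvc_continuous(z_m_values, said_B, d_s, gamma_s, d_m, category):
    """Grid-search maximum-likelihood rho over -0.99, -0.98, ..., 0.99."""
    grid = [i / 100.0 for i in range(-99, 100)]
    return max(grid, key=lambda r: log_likelihood(
        r, z_m_values, said_B, d_s, gamma_s, d_m, category))


def simulate_trials(n, rho, d_s, gamma_s, d_m, category, seed=0):
    """Hypothetical helper: draw correlated decision variables for testing."""
    rng = random.Random(seed)
    sign = 1.0 if category == "b" else -1.0
    mu_s, mu_m = sign * d_s / 2.0, sign * d_m / 2.0
    z_m_values, said_B = [], []
    for _ in range(n):
        z_m = rng.gauss(mu_m, 1.0)
        noise = math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        z_s = mu_s + rho * (z_m - mu_m) + noise
        z_m_values.append(z_m)
        said_B.append(z_s > gamma_s)
    return z_m_values, said_B
```

A quick check of the sketch is to call `simulate_trials` with a known correlation and confirm that `estimate_dvc_continuous` recovers a value close to it.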
The distributions of the decision variables for the subject and model (or another subject) are the marginal distributions of the bivariate distribution, and hence the DVCs do not necessarily depend on the discriminabilities and decision criteria estimated with basic SDT. The ellipses in Figure 2 represent a moderate positive correlation. If the decision variables were uncorrelated, then the ellipses would become circles. If the decision variables were negatively correlated, the ellipses would be rotated 90°. In principle, all of these correlations are possible for the same discriminabilities and decision criteria. For example, if a model and subject (or two subjects) are using different cues to perform a task, then it is possible that a stimulus that is easy for the model will be hard for the subject, and vice versa.
An important implication of this fact is that a given DVC can be consistent with many different values of standard measures of the behavioral correlation between the sequences of responses of the model and subject, or between two subjects. For example, Figure 3 shows how four different behavioral-correlation measures (defined in the Appendix) vary with the subject's decision criterion \({\gamma _s}\) and discriminability \({d^{\prime} _s}\), while the decision-variable correlation \({\rho _b}\) for category b is held fixed at 0.5 (yellow curves). The simplest correlation measure is the fraction of trials in which the subject and model are in agreement (blue curves). Perhaps the best-known and most common correlation measure for binary data is the phi correlation (also introduced by Pearson; Cramer, 1946, p. 282), which is equivalent to Matthews's (1975) correlation coefficient (orange curves). Another popular correlation measure is Cohen's (1960) kappa coefficient (gray curves). Finally, a common measure in the neurophysiology literature is choice probability (Britten et al., 1996; green curves). In all cases, the behavioral correlations vary substantially while the decision-variable correlation is held fixed. (We note that all the curves in Figure 3A are flipped about the vertical axis at 0 if the measures are computed for category a rather than category b.) In the extended SDT framework, the fundamental quantities are the discriminabilities, decision criteria, and DVCs. The overall accuracy levels and behavioral correlations depend in a rather complex way on all six of these fundamental quantities.
In other words, the extended SDT framework plays a role in interpreting trial-by-trial correlations that is analogous to the role the standard SDT framework plays in interpreting accuracy. Standard SDT recognizes the fact that percent correct is not the best measure of accuracy because percent correct is dependent on the value of the decision criterion (every point on a given receiver operating characteristic curve corresponds to a different percent-correct value). Extended SDT recognizes the fact that previous trial-by-trial measures are not the best measures of trial-by-trial correlation because they are dependent on the values of the decision criterion and the discriminability.
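The dependence of behavioral correlations on the decision criterion can be illustrated numerically. The sketch below holds the category-b DVC fixed at 0.5 and computes three of the behavioral measures (fraction agreement, the phi coefficient, and Cohen's kappa) from the model quadrant probabilities as the subject's criterion is varied. The measures use their standard textbook definitions, which may differ in detail from those in the Appendix; choice probability is omitted. The function names, the numerical-integration settings, and the parameter values \(d^{\prime}_s = d^{\prime}_m = 1\) and \(\gamma_m = 0.5\) are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def upper_quadrant_prob(h, k, rho, n_steps=400):
    """P(X > h, Y > k), standard bivariate normal, by Simpson's rule."""
    s = math.sqrt(1.0 - rho * rho)
    lo, hi = h, max(h, 0.0) + 8.0
    step = (hi - lo) / n_steps

    def integrand(x):
        return _STD.pdf(x) * (1.0 - _STD.cdf((k - rho * x) / s))

    total = integrand(lo) + integrand(hi)
    for i in range(1, n_steps):
        total += (4.0 if i % 2 else 2.0) * integrand(lo + i * step)
    return total * step / 3.0


def behavioral_measures(d_s, gamma_s, d_m, gamma_m, rho, category):
    """Agreement, phi, and kappa implied by the extended SDT parameters."""
    sign = 1.0 if category == "b" else -1.0
    h = gamma_s - sign * d_s / 2.0
    k = gamma_m - sign * d_m / 2.0
    p_s_B = 1.0 - _STD.cdf(h)  # subject's marginal P(B)
    p_m_B = 1.0 - _STD.cdf(k)  # model's marginal P(B)
    p_bb = upper_quadrant_prob(h, k, rho)
    p_ba, p_ab = p_s_B - p_bb, p_m_B - p_bb
    p_aa = 1.0 - p_bb - p_ba - p_ab

    agreement = p_aa + p_bb
    phi = (p_bb * p_aa - p_ab * p_ba) / math.sqrt(
        p_s_B * (1.0 - p_s_B) * p_m_B * (1.0 - p_m_B))
    chance = p_s_B * p_m_B + (1.0 - p_s_B) * (1.0 - p_m_B)
    kappa = (agreement - chance) / (1.0 - chance)
    return {"agreement": agreement, "phi": phi, "kappa": kappa}


# Sweep the subject's criterion with the DVC held fixed at 0.5 (category b).
for gamma_s in (-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0):
    m = behavioral_measures(1.0, gamma_s, 1.0, 0.5, 0.5, "b")
    print(f"gamma_s={gamma_s:+.1f}  agreement={m['agreement']:.3f}  "
          f"phi={m['phi']:.3f}  kappa={m['kappa']:.3f}")
```

In this sketch all three measures change with `gamma_s` even though `rho` never does, which is the point made in the text: the behavioral correlations confound the DVC with the criteria and discriminabilities.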