**Predicting human performance in perceptual categorization tasks in which category membership is determined by similarity has been historically difficult. This article proposes a novel biologically motivated difficulty measure that can be generalized across stimulus types and category structures. The new measure is compared to 12 previously proposed measures on four extensive data sets that each included multiple conditions that varied in difficulty. The studies were highly diverse and included experiments with both continuous- and binary-valued stimulus dimensions, a variety of different stimulus types, and both linearly and nonlinearly separable categories. Across these four applications, the new measure was the most successful at predicting the observed rank ordering of conditions by difficulty, and it was also the most accurate at predicting the numerical values of the mean error rates in each condition.**

Let *x*_{iK} denote the *i*th exemplar in category *C*_{K}. Then on trials when *x*_{iK} is presented, activation in sensory unit *j* equals the Gaussian function of Equation 1 of the distance between *x*_{iK} and the stimulus preferred by unit *j*, where *γ* captures how tightly sensory units are tuned. Thus, after learning, the strength of the synapse between sensory unit *j* and striatal unit *K* is proportional to the summed similarities of the stimulus preferred by unit *j* to the exemplars of category *K*. Since synaptic strength drives striatal activation, the probability of responding *K* on a trial when a stimulus is presented increases with the summed similarity of that stimulus to the exemplars of category *K*. To compute this same sum for category *K*, exemplar theory assumes that the subject activates the memory representation of every previously seen exemplar from category *K*, computes the similarity of the presented stimulus to each of these memory representations, and then sums all these similarities. Thus, exemplar theory predicts that as a subject gains experience at a specific classification task, more and more computation is required on each trial (because there are more terms in the sum). In contrast, the SPC assumes that the sums are encoded in the cortical-striatal synaptic strengths as a result of a reinforcement-learning process. Thus, the SPC assumes that no memory representations are retrieved during the categorization process.
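The summed-similarity computation shared by the SPC and exemplar theory is easy to sketch. The specific Gaussian form used below, exp(-d²/γ), is an assumption for illustration (the article's Equation 1 is not reproduced in this excerpt), and the function names are hypothetical:

```python
import math

def gaussian_similarity(x, y, gamma):
    """Assumed Gaussian similarity (cf. Equation 1): similarity falls off
    with squared Euclidean distance, and gamma controls how tightly
    sensory units are tuned (smaller gamma = tighter tuning)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / gamma)

def summed_similarity(stimulus, exemplars, gamma):
    """Summed similarity of the presented stimulus to every stored exemplar
    of one category -- the quantity that exemplar theory computes at
    retrieval and that the SPC encodes in cortical-striatal synapses."""
    return sum(gaussian_similarity(stimulus, e, gamma) for e in exemplars)
```

Note that the number of terms in the sum grows with the number of stored exemplars, which is exactly why exemplar theory (but not the SPC) implies more computation per trial as training progresses.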

*S*_{B} is between-categories similarity and *S*_{W} is within-category similarity. Computing the SDM therefore requires explicit definitions of *S*_{B} and *S*_{W}. In particular, *S*_{B} should equal the similarity of every category exemplar to all exemplars in every contrasting category. In that definition, *R* is the number of contrasting categories, *n*_{K} is the number of exemplars in category *K*, *n*_{L} is the number of exemplars in each contrasting category *L*, and, as in Equation 1, *A*(*x*_{iK}, *x*_{jL}) is activation in the sensory unit that is maximally excited by stimulus *x*_{jL}. Similarly, *S*_{W} should equal the similarity of every exemplar to all exemplars in the same category.

The SDM includes a single free parameter, *γ*, which is a measure of how tightly tuned the subject's sensory system is to changes in the stimulus. Technically, *γ* could differ across stimulus dimensions, but in practice such differences would have to be extreme for the SDM to change its predicted ordering of tasks by difficulty. Thus, a single value of *γ* will suffice in almost all applications. Furthermore, the numerical value of *γ* could be estimated from separate sensory discrimination data. As we will see, however, the ordinal predictions of the SDM as to which of two (or more) conditions is most difficult typically do not change when *γ* changes, so the actual numerical value of *γ* chosen does not appear to be critical. In the empirical applications considered in this article, we compute the SDM by averaging across a wide range of *γ* values.

Many machine-learning difficulty measures increase with *D*_{W} and decrease with *D*_{B}, where *D*_{W} and *D*_{B} are some measures of the within-category and between-categories dissimilarities, respectively (e.g., Fukunaga, 2013). Most commonly, dissimilarity is defined as some increasing function of distance. Of these measures, perhaps the most similar to the SDM is the ratio of intra- to extraclass nearest-neighbor measure, often referred to as the *N*_{2} measure (Lorena, Garcia, Lehmann, Souto, & Ho, 2018). The *N*_{2} difficulty measure takes the form of Equation 6, with within-category dissimilarity defined by each exemplar's distance to its nearest neighbor from the same category. As a result, the *N*_{2} difficulty measure predicts that categories in which the exemplars are more widely distributed are more difficult to learn than categories in which they are tightly clustered. Analogously, the *N*_{2} measure defines between-categories separation from each exemplar's distance to its nearest neighbor in the contrasting category, so the *N*_{2} measure predicts that classification difficulty decreases with between-categories separation.

The SDM differs from the *N*_{2} difficulty measure in two important ways. First, the SDM depends on all category exemplars, whereas *N*_{2} assumes that only the nearest neighbors affect difficulty. Leading theories of human category learning assume that classification decisions depend on all previously seen category exemplars—not just the nearest neighbors (e.g., Estes, 1986; Medin & Schaffer, 1978; Nosofsky, 1986). Second, *N*_{2} depends on distance, whereas the SDM depends on a nonlinear transformation of distance—namely, similarity. Considerable independent evidence suggests that human classification and generalization are determined primarily by similarity, rather than by distance (e.g., Shepard, 1987). This difference between the SDM and *N*_{2} changes the impact that stimulus spacing has on predicted difficulty. The Gaussian similarity function described in Equation 1 has an inflection point at an intermediate distance. The SDM therefore predicts that increasing distances for intermediately spaced stimuli will have a greater impact on difficulty than increasing the separation for either nearby or distant stimuli by the same amount. In contrast, defining difficulty in terms of distance rather than similarity (e.g., as in the *N*_{2} measure) predicts that all changes of a fixed distance should have equal effects on classification difficulty.
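The distance-based *N*_{2} computation described above can be sketched as follows. This follows the verbal description only (the normalization used in the package of Lorena et al., 2018, may differ in details), and the function names are hypothetical:

```python
import math

def n2(points, labels):
    """Ratio of summed intra-class to summed extra-class nearest-neighbor
    distances: larger values predict a harder classification task.
    Assumes at least two exemplars per category."""
    intra = extra = 0.0
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if j != i and l == lab]
        diff = [math.dist(p, q) for q, l in zip(points, labels) if l != lab]
        intra += min(same)   # distance to nearest same-category neighbor
        extra += min(diff)   # distance to nearest contrasting-category neighbor
    return intra / extra
```

For two tightly clustered, well-separated categories the ratio is small (easy); spreading the exemplars or moving the categories closer together raises it.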

Lorena et al. (2018) grouped the machine-learning difficulty measures into several families, including neighborhood measures (such as *N*_{2}), network measures, dimensionality measures, and class-balance measures.

Alfonso-Reese et al. (2002) defined category separation (*C*_{sep}) from the category means and covariance matrices; for a two-category condition with categories A and B, the definition simplifies when the two covariance matrices are equal (e.g., as in the experiments by Ashby & Maddox, 1992).

The hyperspheres measure (*T*_{1}; called T1 by Lorena et al., 2018) is constructed by first centering a hypersphere on each stimulus and setting the radius equal to the distance between that stimulus and the nearest stimulus from the contrasting category. All hyperspheres that are completely contained in another hypersphere are then removed, and the measure is simply the fraction of hyperspheres that remain.
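The construction just described translates almost line-for-line into code. This sketch follows the verbal description only (the implementation in the package of Lorena et al. may differ in details) and uses the standard geometric containment test, d(cᵢ, cⱼ) + rᵢ ≤ rⱼ:

```python
import math

def t1(points, labels):
    """Fraction of hyperspheres that survive the T1 construction: each
    sphere is centered on a stimulus, with radius equal to the distance
    to the nearest contrasting-category stimulus; spheres completely
    contained in another sphere are removed."""
    radii = [min(math.dist(p, q) for q, l in zip(points, labels) if l != lab)
             for p, lab in zip(points, labels)]
    keep = 0
    for i, (p, r) in enumerate(zip(points, radii)):
        contained = any(i != j and math.dist(p, q) + r <= rj
                        for j, (q, rj) in enumerate(zip(points, radii)))
        keep += not contained
    return keep / len(points)
```

When every stimulus needs its own sphere (no sphere swallows another), the measure is 1; heavily overlapping spheres drive it down.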

Note that for some measures, larger values predict better performance (e.g., *C*_{sep}, density, ClsCoef). For measures in this latter group, we generated a predicted rank ordering by inverting the order of the measure. So, for example, the condition with the smallest *C*_{sep} was ranked as most difficult, and that with the largest *C*_{sep}, least difficult.

The SDM was computed separately for each value of *γ*. This process was repeated for all values of *γ* ranging from 5 to 50 in five-step intervals (i.e., 5, 10, 15, …, 45, 50), and the final difficulty score was the average of the scores for all values of *γ*. In practice, the value of *γ* can be found by fitting previous results using the same stimuli, but here we are interested in the a priori difficulty predictions of the SDM, rather than its ability to account for difficulty post hoc by adjusting the value of *γ*. The machine-learning measures were computed using the R package provided by Lorena et al. (2018).
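The γ-averaging procedure can be sketched as follows. Only the averaging over γ = 5, 10, …, 50 follows the text directly; the ratio form S_B/S_W and the Gaussian similarity exp(-d²/γ) are illustrative assumptions, not the article's exact equations:

```python
import math
from itertools import combinations

def sdm(points, labels, gamma):
    """Assumed form of the SDM: mean between-category similarity (S_B)
    divided by mean within-category similarity (S_W), using an assumed
    Gaussian similarity exp(-d^2 / gamma)."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        s = math.exp(-math.dist(p, q) ** 2 / gamma)
        (within if lp == lq else between).append(s)
    return (sum(between) / len(between)) / (sum(within) / len(within))

def sdm_averaged(points, labels, gammas=range(5, 51, 5)):
    # final difficulty score = mean SDM across gamma = 5, 10, ..., 50
    return sum(sdm(points, labels, g) for g in gammas) / len(gammas)
```

Under these assumptions, well-separated tight clusters yield a low score (easy), interleaved categories a score near 1 (hard), and the score grows with γ, consistent with the article's claim that difficulty increases with γ.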

Alfonso-Reese et al. (2002) used three difficulty measures, including *C*_{sep}, to rank order human performance on the five different classification tasks described in Figure 2. A fourth measure was also included (orientation of the optimal bound), but because it failed to make any differential predictions for the majority of the category structures Alfonso-Reese et al. analyzed, it was excluded from comparison here. In all tasks, the stimuli were bar graphs that displayed the numerical values of the blood pressure, white-blood-cell count, and serum potassium level of a hypothetical patient. The subject's task was to use these three values to diagnose the patient with either disease A or disease B.

Note that the SDM, *N*_{2}, *T*_{1}, and density measures performed best and that the first three of those measures all made identical ordinal predictions—mispredicting only one pair of conditions (Conditions 3 and 5).

An obvious question is whether these results depend on the specific value of *γ*. To investigate this question, we examined how the ordinal predictions of the SDM change as a function of *γ*. The results appear in Figure 3, which plots the predicted value of the SDM in each condition across a wide range of different *γ* values. The rank ordering in Table 1 was computed from the mean SDM of each of these curves. Note that none of the curves cross, which means that the ordinal predictions of the SDM are invariant across different values of *γ*. We performed similar analyses for each of the other empirical applications considered in the following, and in every case, none of the curves crossed. Thus, at least for the empirical applications considered in this article, the ordinal predictions of the SDM do not depend on the specific numerical value chosen for *γ*.

The results are shown in Table 3.^{1} The observed accuracies and difficulties were based on performance during the last 300 trials. Note that accuracy was highest in experiment 3, second highest in experiment 1, and lowest in experiment 2, so the observed difficulty ordering was E2 > E1 > E3. This same ordering held for both stimulus types, so in these experiments at least, difficulty depended on category structure but not on the type of stimuli that were used.

This ordering was correctly predicted by the SDM, *C*_{sep}, density, and Hubs measures. The three measures that outperformed the SDM for the Shepard et al. (1961) categories (VOR, CFE, and FBP) and two of the three measures that performed as well as the SDM on the Alfonso-Reese et al. (2002) categories (*T*_{1} and *N*_{2}) all failed to properly rank order the experiments.

Denote the difficulty score that a measure assigns to a condition by *D* and the observed average error rate of human learners by *E*. Then the various measures all predict that *E* = *f*(*D*) (Equation 14), where *f* is some strictly increasing (and therefore order-preserving) function. However, none of the measures specify the form of *f*. This is why we focused on predicted rank orderings (because the predicted rank ordering is the same for any increasing function *f*). We will use the same strategy here, but in addition we will compare the ability of the most successful measures to predict the observed value of average error rate in all conditions and experiments, under the assumption that *f* is linear. However, it is important to note that in general, there is no reason to expect *f* to be linear.
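When *f* is only assumed to be increasing, all that can be compared is rank order (Spearman's *r*); when *f* is assumed linear, the comparison reduces to ordinary correlation (Pearson's *r*). A dependency-free sketch of both statistics (the tie-free ranking is a simplifying assumption):

```python
def pearson_r(x, y):
    """Pearson correlation: sensitive to the linearity of f."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_r(x, y):
    """Spearman correlation: Pearson applied to ranks, so it is identical
    for any strictly increasing f (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rk, i in enumerate(order):
            r[i] = rk
        return r
    return pearson_r(ranks(x), ranks(y))
```

For example, difficulty scores related to error rates by a monotone but nonlinear *f* give a Spearman's *r* of exactly 1 while the Pearson's *r* falls below 1.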

The SDM performed best, with a Spearman's *r* of 0.93 and a Pearson's *r*^{2} of 0.87. The nearest-neighbor classifier (eNN) is second best, followed by the hyperspheres (*T*_{1}), *N*_{2}, and density measures. Thus, despite the complications already described, the SDM accounts for an impressive 87% of the variance in the mean error rates across these studies.

The high *r*^{2} for the SDM suggests that the function *f* from Equation 14 is fairly linear in these applications.

The fit improves if the SDM is instead computed from the single value *γ* = 10. In this case the measure accounts for 91% of the variance. This could presumably be increased even further by using values of *γ* tailored to each stimulus type (because each stimulus has a different visual representation, the neural tuning curves will differ across stimulus types, and therefore *γ* should also differ). Even so, there are two different reasons that we chose to base the *r*^{2} in Table 6 on the mean SDM value across a wide range of *γ* values. First, none of the other measures includes a free parameter, so to keep the comparisons fair, neither should the SDM. Second, the goal of this article is to develop a difficulty measure that makes accurate a priori predictions of difficulty.

We compared the SDM to previously proposed difficulty measures, including *C*_{sep}, as well as eight previously used measures from the machine-learning literature (VOR, CFE, FBP, eNN, *T*_{1}, density, ClsCoef, and Hubs). All of these measures were compared on four extensive data sets that each included multiple conditions that varied in difficulty. The studies were highly diverse and included experiments with both continuous- and binary-valued stimulus dimensions, a variety of different stimulus types, and both linearly and nonlinearly separable categories. Across these four applications, the SDM was the most successful measure at predicting the observed rank ordering of conditions by difficulty, with an average Spearman's *r* of 0.87, and it was also the most accurate measure of the six tested at predicting the numerical values of the mean error rates in each condition (accounting for 87% of the variance in error rates across all conditions).

Binary-valued stimulus dimensions are rare in natural categories.^{2} Thus, the Shepard et al. conditions are not representative of real-world categorization tasks. More research on how people learn the Shepard et al. categories is clearly needed. In any case, our hypothesis is that the SDM will accurately predict the difficulty of any categories learned procedurally.

The SDM includes a parameter that describes a property of the human observer (namely, *γ*), whereas the other measures do not. This is because the SDM was constructed to predict difficulty for human learners, whereas all other measures are meant to predict difficulty for an optimal classifier (i.e., an ideal observer). The optimal classifier operates noise free, whereas even the best human learner must deal with perceptual noise. The *γ* parameter measures that noise (e.g., note from Figure 3 that difficulty increases with *γ*).

As shown earlier, the ordinal predictions of the SDM are largely invariant across values of *γ* (e.g., see Figure 3). Even so, the inclusion of *γ* in the measure allows the SDM to make some unique predictions relative to the other measures. For example, adding a noise mask to the stimulus display should increase the number of visual neurons that respond and therefore increase *γ*. Thus, the SDM predicts that adding a noise mask increases difficulty. Similarly, the SDM predicts that uniformly contracting the entire stimulus space will also increase difficulty. In contrast, none of the other measures predict that either of these manipulations will have any effect on difficulty, because adding a mask or uniformly contracting the space should not affect the performance of the optimal classifier.
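The contraction prediction can be checked numerically under the same illustrative assumptions used earlier (a Gaussian similarity exp(-d²/γ) and an S_B/S_W ratio, both assumptions rather than the article's exact equations): shrinking all stimulus coordinates toward the origin raises the between- to within-category similarity ratio.

```python
import math
from itertools import combinations

def sdm_ratio(points, labels, gamma=10.0):
    # assumed SDM: mean between-category similarity / mean within-category
    # similarity, with assumed Gaussian similarity exp(-d^2 / gamma)
    within, between = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        s = math.exp(-math.dist(p, q) ** 2 / gamma)
        (within if lp == lq else between).append(s)
    return (sum(between) / len(between)) / (sum(within) / len(within))

def contract(points, factor):
    # uniformly shrink the stimulus space toward the origin
    return [tuple(factor * v for v in p) for p in points]

points = [(0, 0), (0, 2), (6, 0), (6, 2)]
labels = [0, 0, 1, 1]
# halving all coordinates should increase predicted difficulty
assert sdm_ratio(contract(points, 0.5), labels) > sdm_ratio(points, labels)
```

A distance-based measure such as *N*_{2} is scale-free in this sense: contracting all distances by the same factor leaves its ratio unchanged, which is why only the SDM predicts an effect here.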

Future work could also allow independent empirical estimates of *γ* to be added.

Alfonso-Reese, L. A., Ashby, F. G., & Brainard, D. H. (2002). What makes a categorization task difficult? *Perception & Psychophysics*, 64 (4), 570–583.

Ashby, F. G., Alfonso-Reese, L. A., Turken, A. U., & Waldron, E. M. (1998). A neuropsychological theory of multiple systems in perceptual category learning. *Psychological Review*, 105 (3), 442–481.

Ashby, F. G., & Maddox, W. T. (1992). Complex decision rules in categorization: Contrasting novice and experienced performance. *Journal of Experimental Psychology: Human Perception and Performance*, 18 (1), 50–71.

*Psychological Review*, 124 (4), 472–482.

*Handbook of categorization in cognitive science*(2nd ed.; pp. 157–188). San Diego, CA: Elsevier.

*Psychonomic Bulletin & Review*, 6 (3), 363–378.

*Communications in Statistics-Theory and Methods*, 19 (1), 221–278.

*Psychonomic Bulletin & Review*, 22, 1598–1613.

*Proceedings of the 38th Annual Conference of the Cognitive Science Society*(pp. 69–74). Seattle, WA: Cognitive Science Society.

*Perception & Psychophysics*, 68 (6), 1013–1026.

*Psychonomic Bulletin & Review*, 18 (1), 96–102.

Estes, W. K. (1986). Array models for category learning. *Cognitive Psychology*, 18 (4), 500–549.

*Nature*, 407 (6804), 630–633.

*Cognition*, 93 (3), 199–224.

Fukunaga, K. (2013). *Introduction to statistical pattern recognition* (2nd ed.). Cambridge, MA: Elsevier.

*Psychological Review*, 99 (1), 22–44.

*Association for the Advancement of Artificial Intelligence 1998 Proceedings*(pp. 671–676). Palo Alto, CA: Association for the Advancement of Artificial Intelligence.

*Sleep*, 32 (11), 1439–1448.

*Cognitive, Affective, & Behavioral Neuroscience*, 14 (2), 769–781.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. *Psychological Review*, 85 (3), 207–238.

*Journal of Experimental Psychology: Learning, Memory, and Cognition*, 10 (1), 104–114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. *Journal of Experimental Psychology: General*, 115 (1), 39–57.

*Memory & Cognition*, 22 (3), 352–369.

*Journal of Experimental Psychology*, 77 (3, part 1), 353–363.

*Nature Human Behaviour*, 2 (7), 500–506.

*Memory & Cognition*, 2 (3), 549–553.

Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. *Science*, 237 (4820), 1317–1323.

Shepard, R. N., Hovland, C. I., & Jenkins, H. M. (1961). Learning and memorization of classifications. *Psychological Monographs: General and Applied*, 75 (13), 1–42.

*Journal of Experimental Psychology: General*, 133 (3), 398–414.

*Psychonomic Bulletin & Review*, 8 (1), 168–176.

^{1}Experiments 1 and 2 included a third condition in which the stimuli were two connected line segments that varied across trials in length. However, Ashby and Maddox (1992) did not include those stimuli in their experiment 3, and so those conditions are not considered here. Even so, the difficulty ordering for the excluded conditions was the same as for the other conditions, so the only effect of including the line-segment data would be to slightly change the percent correct listed for experiment 1 in Table 3.

^{2}One commonly cited counterexample is that animals either have wings or they do not. However, this binary categorization is the result of a decision. Perceptually, there is enormous variability in the structures that might be labeled wings. For example, consider the differences among eagles, penguins, and seahorses.