Vision researchers are interested in mapping complex physical stimuli to perceptual dimensions. Such a mapping can be constructed using multidimensional psychophysical scaling or ordinal embedding methods. Both methods infer coordinates that agree as much as possible with the observer’s judgments so that perceived similarity corresponds with distance in the inferred space. However, a fundamental problem of all methods that construct scalings in multiple dimensions is that the inferred representation can only reflect perception if the scale has the correct dimensionality. Here we propose a statistical procedure to overcome this limitation. The critical elements of our procedure are (i) measuring the scale’s quality by the number of correctly predicted triplets and (ii) performing a statistical test to assess whether adding another dimension to the scale improves triplet accuracy significantly. We validate our procedure through extensive simulations. In addition, we study the properties and limitations of our procedure using “real” data from various behavioral datasets collected in psychophysical experiments. We conclude that our procedure can reliably identify (a lower bound on) the number of perceptual dimensions for a given dataset.

*multidimensional scaling* (MDS) (Shepard, 1962; Kruskal, 1964a, 1964b), which was recently accompanied by *ordinal embedding* methods from machine learning (Roads & Mozer, 2019; Haghiri, Wichmann, & von Luxburg, 2020). In contrast with other scaling approaches, such as the popular maximum-likelihood difference scaling (MLDS) (Knoblauch & Maloney, 2012), MDS and ordinal embedding can estimate multiple perceived dimensions. An approach related to MLDS but for multiple dimensions, maximum-likelihood conjoint measurement (Ho et al., 2008), tries to obtain interpretable dimensions through additional assumptions (e.g., monotonicity and independence of perceived dimensions; Radonjić, Cottaris, & Brainard, 2019), whereas MDS and ordinal embedding are more exploratory in trying to find the scale that best fits the data.

*stress* term, measuring the agreement between the distances in the scale and dissimilarity ratings collected for (all) stimulus pairs in a psychophysical experiment.

The *triplet* (or triad) task is reasonably common (Wills, Agarwal, Kriegman, & Belongie, 2009; Devinck & Knoblauch, 2012; Bonnardel et al., 2016; Lagunas et al., 2019; Haghiri et al., 2020; Toscani et al., 2020). In the triplet task, observers are presented with three different stimuli, usually simultaneously, of which one is called the *anchor*. The observer chooses one of the other two stimuli perceived as most similar (or dissimilar) to the anchor, resulting in the (*anchor, near, far*)—a triplet of stimulus indices. The triplet task comes in different flavours depending on the instructions, for example, “Which is the odd one out?” (Hebart et al., 2020); or the opposite question “Which appears most central?” (Kleindessner & von Luxburg, 2017). Sometimes observers are presented with more stimuli and are asked to make multiple decisions, as, for example, in Roads and Mozer (2019) and Roads and Love (2021). From an ordinal embedding perspective, these differences are irrelevant—the responses can be mapped to triplets (interested readers can find details about the mappings in the Supplementary Material A).
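As a minimal illustration of such a mapping (the exact rules are given in the Supplementary Material A), an odd-one-out response can be converted to (*anchor, near, far*) triplets. The following function is a hypothetical sketch, not the authors' implementation:

```python
def odd_one_out_to_triplets(stimuli, odd):
    """Map an odd-one-out choice among three stimuli to
    (anchor, near, far) triplets of stimulus indices.

    Choosing `odd` implies that the two remaining stimuli are
    perceived as more similar to each other than either is to
    the odd one out."""
    a, b = [s for s in stimuli if s != odd]
    # Each remaining stimulus serves once as the anchor; the other
    # one is "near" and the odd one out is "far".
    return [(a, b, odd), (b, a, odd)]
```

For example, `odd_one_out_to_triplets((3, 5, 9), odd=9)` yields the two triplets `(3, 5, 9)` and `(5, 3, 9)`.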

*agreement* between triplets (*anchor, near, far*) and the corresponding distances in terms of Equation 1. Similarly, we call triplets where the inequality does not apply *disagreeing triplets*.

*stress*-function on a set of triplets \(T=\lbrace (i_1, j_1, k_1), \dots , (i_m, j_m, k_m)\rbrace\), where \(i, j,\hbox{ and } k\) are stimulus indices whose order corresponds with the trial response *anchor*, *near*, and *far*. The numerical properties of this stress-function (e.g., smoothness) are more desirable for optimization than the agreement count (Equation 1). Different ordinal embedding algorithms mainly differ in their stress-function (Agarwal et al., 2007; van der Maaten & Weinberger, 2012; Terada & Luxburg, 2014; Jain et al., 2016). The algorithm that we use in this work is called *soft ordinal embedding* (Terada & Luxburg, 2014) and has been shown to result in very accurate reconstructions in a large-scale benchmarking study (Vankadara, Haghiri, Lohaus, Wahab, & von Luxburg, 2020). Soft ordinal embedding's stress function (Equation 2) is “soft” in the sense that disagreeing triplets are penalized according to the size of their squared error. Trivial solutions with all-zero coordinates are prevented by enforcing a minimal distance difference (“\(\dots {}+1\)”). Once a coordinate triplet agrees by this minimal distance, it does not increase the stress (“\(\max [0,\dots ]\)”).
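A minimal sketch of such a stress term, assuming Euclidean distances and the margin of 1 described above (a squared hinge on disagreeing triplets; for the exact form of Equation 2, see Terada & Luxburg, 2014):

```python
import numpy as np

def soft_ordinal_stress(X, triplets):
    """Soft-ordinal-embedding-style stress (sketch).

    X        : (n_stimuli, n_dims) coordinate matrix
    triplets : iterable of (anchor, near, far) index tuples

    A triplet contributes no stress once the far distance exceeds
    the near distance by the margin of 1 (the "max[0, ...]" part);
    otherwise its squared hinge error is added."""
    stress = 0.0
    for a, n, f in triplets:
        d_near = np.linalg.norm(X[a] - X[n])
        d_far = np.linalg.norm(X[a] - X[f])
        stress += max(0.0, d_near + 1.0 - d_far) ** 2
    return stress
```

A triplet that agrees by more than the margin contributes zero, so all-zero coordinates are no longer a trivial minimum.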

*in general*. A more direct measure of fit is the proportion of triplets that agree in terms of Equation 1 with the scale \({\boldsymbol\psi }\), called the *triplet accuracy* (Equation 3).
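Triplet accuracy is straightforward to compute from a candidate scale; a sketch assuming Euclidean distances and a coordinate matrix `X` standing in for \({\boldsymbol\psi }\):

```python
import numpy as np

def triplet_accuracy(X, triplets):
    """Proportion of (anchor, near, far) triplets whose ordering
    agrees with the Euclidean distances in the scale X."""
    agree = sum(
        np.linalg.norm(X[a] - X[n]) < np.linalg.norm(X[a] - X[f])
        for a, n, f in triplets
    )
    return agree / len(triplets)
```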

*training triplet accuracy* (train accuracy) and *test triplet accuracy* (test accuracy). In contrast with the training accuracy, the test accuracy helps to distinguish reasonable from *noisy* responses. Noise in the sensory system, lapses, or other human imperfections, summarized as *judgment noise*, might cause erroneous triplets that disagree with the majority of responses. However, the test triplets likely contain different erroneous triplets; this results in a lower test accuracy that is a better estimate of the general or “true” fit than the (spuriously high) training accuracy.

*\(k\)-fold cross-validation* (compare Hastie, Tibshirani, & Friedman, 2009) splits the dataset into \(k\) equal-sized parts and estimates \(k\) scales to calculate \(k\) accuracies. In every iteration, \(k-1\) parts are used for training, while the remaining part is used for testing the accuracy. To sample more than \(k\) accuracies, *\(r\)-repeated cross-validation* repeats the \(k\) folds on \(r\) shuffled versions of the dataset. The mean of the resulting \(k\cdot r\) accuracies approximates the test accuracy. Please note that these cross-validated scales are used only for accuracy estimation—the “final” scale for visualizing the observer’s perception is estimated from all triplets.
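The cross-validation scheme can be sketched as follows; `fit_and_score` is a hypothetical stand-in for estimating a scale on the training triplets and evaluating its triplet accuracy on the held-out triplets:

```python
import numpy as np

def repeated_kfold_accuracies(triplets, k, r, fit_and_score, seed=0):
    """r-repeated k-fold cross-validation over a triplet list (sketch).

    fit_and_score(train, test) is assumed to estimate a scale from
    the training triplets and return its accuracy on the test
    triplets; returns the resulting k * r accuracy samples."""
    rng = np.random.default_rng(seed)
    triplets = np.asarray(triplets)
    accuracies = []
    for _ in range(r):                        # r shuffled repetitions
        idx = rng.permutation(len(triplets))
        folds = np.array_split(idx, k)        # k near-equal parts
        for i in range(k):                    # fold i is the test set
            test = triplets[folds[i]]
            train = triplets[np.concatenate(folds[:i] + folds[i + 1:])]
            accuracies.append(fit_and_score(train, test))
    return accuracies
```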

*underfitting*. For example, red is perceived as more similar to yellow than to blue, but in the one-dimensional fit, red lies farther from yellow than from blue. The three-dimensional scale might perfectly represent all triplets, including all the erroneous ones (zero training error)—the scale is *overfitting*. The erroneous triplets, caused by judgment noise, disagree with most triplets and thus with the most accurate scale.

*ordinal capacity* metric that differs between metric spaces, for example, one-dimensional Euclidean, two-dimensional Euclidean, and four-dimensional hyperbolic space. Calculating this ordinal capacity requires sorting the distances between \(n\) stimuli, which in the limit requires up to \(n\) times more triplets than estimating a (low-dimensional) scale (\(\mathcal {O} (n^2 \log n )\) instead of \(\mathcal {O} (d n \log n )\) with \(d\ll n\); Haghiri et al., 2020). In an exemplary case with a two-dimensional scale of 40 stimuli, one would require approximately 20 times more triplets to determine the ordinal capacity than to estimate the scale. This additional experimental effort—just for determining the dimensionality—is unacceptable.

*t* test whose test statistic is modified for use with cross-validation.

The *t* test assumes normally distributed and independent samples. Although accuracy samples are binomial and thus approximately normally distributed (see Dietterich, 1998, for the argument and the Supplementary Material B for simulations), their independence is violated by the data overlap between cross-validation folds. For *t* tests with \(k\)-fold cross-validated datasets, Nadeau and Bengio (2003) proposed the correction factor \(\frac{1}{k - 1}\) in the test statistic:

*t* test: The \(p\)-value is the upper-tail probability of a Student \(t\) distribution with \(rk-1\) degrees of freedom at the \(t\)-value calculated with Equation 4.
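A sketch of this corrected test, assuming Equation 4 takes the common Nadeau–Bengio form \(t = \bar{d} \big/ \sqrt{(\frac{1}{kr} + \frac{1}{k-1})\, s_d^2}\) for the \(kr\) paired accuracy differences \(d\):

```python
import numpy as np
from scipy import stats

def corrected_ttest(acc_high, acc_low, k):
    """One-sided paired t test for an accuracy gain with the
    Nadeau-Bengio variance correction 1/(k - 1) for k-fold
    cross-validated samples (a sketch, not Equation 4 verbatim).

    acc_high, acc_low : arrays of r * k cross-validated accuracies
    k                 : number of folds"""
    diff = np.asarray(acc_high) - np.asarray(acc_low)
    n = diff.size                                # n = r * k samples
    t = diff.mean() / np.sqrt((1 / n + 1 / (k - 1)) * diff.var(ddof=1))
    p = stats.t.sf(t, df=n - 1)                  # one-sided p-value
    return t, p
```

The correction inflates the variance estimate to compensate for the dependence between folds, making the test more conservative than a naive paired \(t\) test.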

*Holm–Bonferroni method* (Holm's step-down procedure; Holm, 1979), to correct the significance threshold \(\alpha\) of the neighboring-dimension tests. The Holm–Bonferroni method is more powerful than the basic Bonferroni correction (fewer false-negative test results) and hardly more complicated, but otherwise shares Bonferroni's benefits, such as a lack of distributional and dependence assumptions. Whereas the Bonferroni correction divides the significance threshold \(\alpha\) by the number of tests \(m\) (\(\alpha _{\rm corrected} = \frac{\alpha }{m}\)), the Holm–Bonferroni correction considers the ascending ranking \(r\) of all tests' *p*-values: \(\alpha _r = \frac{\alpha }{m - r + 1}\). The strictest threshold \(\frac{\alpha }{m}\) is used only for the smallest \(p\)-value, so that fewer “gain in accuracy” tests are erroneously rejected with the Holm–Bonferroni method, although the increase in accuracy usually decreases with increasing dimension.
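The step-down rule can be sketched in a few lines; the helper name is hypothetical:

```python
def holm_significant(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: return a list of booleans marking
    which hypotheses are rejected (i.e., significant).

    The smallest p-value is compared against alpha/m, the next
    against alpha/(m - 1), and so on; the step-down stops at the
    first non-significant p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):          # rank 0 .. m-1
        if p_values[i] < alpha / (m - rank):  # alpha/(m - r + 1), r = rank + 1
            significant[i] = True
        else:
            break                             # all larger p-values fail, too
    return significant
```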

- (a) Estimate scales for \(d=1\) to \(m + 1\):
- (1) Estimate and cross-validate \(k\) psychophysical scales in \(d\) dimensions with soft ordinal embedding, repeat \(r\) times on shuffled triplets.
- (2) Collect triplet accuracies \({\bf acc}_d\in \mathbb {R}^{rk}\) from the \(r\)-repeated \(k\)-fold cross-validation.

- (b) Test scales pairwise for \(d=1\) to \(m\):
- (3) Calculate the \(p_d\) value of an accuracy gain \(H_d: {\bf acc}_{d+1}-{\bf acc}_d \gt 0\) as the upper-tail probability of the Student \(t\) distribution (\(df=kr-1\)) at the \(t\)-value of Equation 4.

- (c) Combine the tests for \(d=1\) to \(m\):
- (4) Accept \(H_d\) if \(p_d \lt \frac{\alpha }{m - R(p_d) + 1}\), \(R\) is the rank of \(p_d\).
- (5) If \(H_d\) is rejected, return “\(d\) dimensions.”

- (d) If no \(H_d\) has been rejected, return “at least \(m+1\) dimensions.”
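Taken together, steps (a) to (d) can be sketched as follows; `embed_and_crossvalidate` is a hypothetical stand-in for steps (1) and (2), and the corrected \(t\) statistic is assumed to take the common Nadeau–Bengio form:

```python
import numpy as np
from scipy import stats

def estimate_dimension(triplets, m, k, r, alpha, embed_and_crossvalidate):
    """Dimension-selection procedure (sketch of steps a-d).

    embed_and_crossvalidate(triplets, d, k, r) is assumed to return
    the r * k cross-validated triplet accuracies of a d-dimensional
    scale (steps 1-2)."""
    # (a) accuracies for d = 1 .. m + 1
    acc = {d: embed_and_crossvalidate(triplets, d, k, r)
           for d in range(1, m + 2)}

    # (b) one-sided corrected t test for each neighboring pair
    p = {}
    for d in range(1, m + 1):
        diff = np.asarray(acc[d + 1]) - np.asarray(acc[d])
        n = diff.size                              # n = r * k
        t = diff.mean() / np.sqrt((1 / n + 1 / (k - 1)) * diff.var(ddof=1))
        p[d] = stats.t.sf(t, df=n - 1)

    # (c) Holm-Bonferroni threshold with ascending rank R(p_d)
    ranks = {d: rank + 1 for rank, d in enumerate(sorted(p, key=p.get))}
    for d in range(1, m + 1):
        if not p[d] < alpha / (m - ranks[d] + 1):
            return d              # (5) first rejected gain: d dimensions
    return m + 1                  # (d) at least m + 1 dimensions
```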

*normal scales* in the following. In addition to these normal scales, results for two ground-truth scales inspired by actual psychophysical scales, namely, a circle-like hue scale (Ekman, 1954) and a helix-like pitch scale (Shepard, 1965), are available in the Supplementary Material E.

*noise ceiling*, the best possible generalization accuracy considering the fraction of triplets that became incompatible through the (simulated) judgment noise.

*p*-values in Figure 5). We expect significant increases until the dimensionality equals the ground-truth scale's dimensionality. Figure 6 depicts how many of our procedure repetitions violate this expectation. The left plot shows that almost no incorrect accuracy gain was detected, which is expected from the conservative multiple-testing correction and the robust decrease of test accuracy beyond the ground-truth dimensionality in the previous plots. The dark colors in the right plot indicate that the statistical test rejected the accuracy-gain hypothesis for some scaling dimensions lower than the ground truth; here, the procedure underestimated the dimensionality. These underestimates occur more frequently for high noise settings (Figure 6, right) and for small datasets and high ground-truth dimensionality (see the Supplementary Material F). Overall, the repeated runs of our method confirm the large influence of the noise magnitude on scaling accuracy and dimensionality underestimation.

*simulated trials*. However, the intended application of our procedure is dimension estimation on *behavioral trials* from psychophysical experiments. Thus, we also investigated behavioral datasets. In contrast with simulations, behavioral data have no ground truth, only more or less evidence about the “correct” dimensionality.

*reach, grain*, and *coherence* of the Eidolon factory (Koenderink et al., 2017). Triplets of 100 such distorted stimuli of the same landscape photograph were generated by Haghiri, Rubisch, Geirhos, Wichmann, and von Luxburg (2019) for their laboratory experiment. Their observers were asked 6,000 random triplet questions and responded to almost all of them (first observer, 6,000 responses; second, 5,996; third, 5,999).

*super-space* (Carroll & Chang, 1970), composed of the multiple observers' (cognitive) decision criteria.

*Artificial intelligence and statistics* (pp. 11–18). AISTATS, San Juan, Puerto Rico.

*Journal of Vision,* 17(1), 37, doi:10.1167/17.1.37.

*Biometrical Journal,* 36(1), 1–15, doi:10.1002/bimj.4710360102.

*Pubblicazioni del R. Istituto Superiore di Scienze Economiche e Commerciali di Firenze,* 8, 3–62.

*Journal of the Optical Society of America A,* 33(3), A30, doi:10.1364/JOSAA.33.000A30.

*Modern multidimensional scaling: Theory and applications*. New York: Springer Science & Business Media, doi:10.1007/0-387-28981-X.

*Journal of the Optical Society of America A,* 31(4), A385–A393, doi:10.1364/JOSAA.31.00A385.

*Advances in knowledge discovery and data mining* (Vol. 3056, pp. 3–12). Berlin, Heidelberg: Springer.

*Journal of the American Statistical Association,* 84(407), 792–796.

*Information Sciences,* 328, 26–41, doi:10.1016/j.ins.2015.08.029.

*Psychometrika,* 35(3), 283–319, doi:10.1007/BF02310791.

*IEEE Transactions on Visualization and Computer Graphics,* 20(12), 1933–1942, doi:10.1109/TVCG.2014.2346978.

*Journal of Vision,* 12(3), 19, doi:10.1167/12.3.19.

*Neural Computation,* 10(7), 1895–1923, doi:10.1162/089976698300017197.

*Journal of Psychology,* 38(2), 467–474, doi:10.1080/00223980.1954.9712953.

*Annual Review of Vision Science,* 3(1), 365–388, doi:10.1146/annurev-vision-102016-061429.

*Practical methods of optimization* (2nd ed.). Hoboken, NJ: Wiley-Interscience.

*Annual Review of Psychology,* 39(1), 169–200.

*Psychophysics: The fundamentals* (pp. 183–206). Abingdon-on-Thames, UK: Taylor & Francis, doi:10.4324/9780203774458.

*arXiv:1905.07234 [cs, stat]*.

*Journal of Vision,* 20(9), 14, doi:10.1167/jov.20.9.14.

*The elements of statistical learning: Data mining, inference, and prediction* (2nd ed.). New York: Springer-Verlag, doi:10.1007/978-0-387-84858-7.

*Nature Human Behaviour,* 4(11), 1173–1185, doi:10.1038/s41562-020-00951-3.

*Psychological Science,* 19(2), 196–204.

*Scandinavian Journal of Statistics,* 6(2), 65–70.

*Advances in Neural Information Processing Systems,* 29, https://papers.nips.cc/paper/2016/hash/4e0d67e54ad6626e957d15b08ae128a6-Abstract.html.

*Journal of Vision,* 11(9), 4, doi:10.1167/11.9.4.

*Proceedings of The 27th Conference on Learning Theory* (pp. 40–67). PMLR.

*Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research,* 38, 471–479. Available from https://proceedings.mlr.press/v38/kleindessner15.html.

*Journal of Machine Learning Research,* 18(58), 1–52.

*Modeling psychophysical data in R*. New York: Springer.

*Journal of Vision,* 17(2), 7, doi:10.1167/17.2.7.

*Psychometrika,* 29(1), 1–27, doi:10.1007/BF02289565.

*Psychometrika,* 29(2), 115–129, doi:10.1007/BF02289694.

*ACM Transactions on Graphics,* 38(4), 1–12, doi:10.1145/3306346.3323036.

*Perception & Psychophysics,* 68(1), 76–83, doi:10.3758/BF03193657.

*Trends in Cognitive Sciences,* 25(2), 94–96, doi:10.1016/j.tics.2020.12.003.

*Journal of Vision,* 11(9), 16, doi:10.1167/11.9.16.

*Current Biology,* 22(20), 1909–1913, doi:10.1016/j.cub.2012.08.009.

*Machine Learning,* 52(3), 239–281, doi:10.1023/A:1024068626366.

*PLoS Computational Biology,* 15(4), e1006950.

*2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)* (pp. 3546–3556), doi:10.1109/cvpr46437.2021.00355.

*Behavior Research Methods,* 51(5), 2180–2193, doi:10.3758/s13428-019-01285-3.

*Journal of the Optical Society of America A,* 22(5), 801–809, doi:10.1364/JOSAA.22.000801.

*Vision Research,* 44(13), 1511–1535, doi:10.1016/j.visres.2004.01.013.

*Journal of Vision,* 7(6), 3, doi:10.1167/7.6.3.

*Journal of Vision,* 17(2), 6, doi:10.1167/17.2.6.

*Psychometrika,* 27(2), 125–140, doi:10.1007/BF02289630.

*Stimulus generalization* (pp. 94–110). Stanford, CA: Stanford University Press.

*Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research,* 32(2), 847–855. Available from https://proceedings.mlr.press/v32/terada14.html.

*ACM Transactions on Applied Perception,* 17(2), 6:1–6:26, doi:10.1145/3380741.

*Dimensionality of the perceptual space of achromatic surface colors*. München: Hut.

*2012 IEEE International Workshop on Machine Learning for Signal Processing*, doi:10.1109/mlsp.2012.6349720.

*Stevens’ handbook of experimental psychology and cognitive neuroscience* (pp. 1–42). Hoboken, NJ: John Wiley & Sons, Inc.

*ACM Transactions on Graphics,* 28(4), 1–15, doi:10.1145/1559755.1559760.