Abstract
Recently, similarity judgement tasks have been employed to estimate the perceived similarity of natural images (Hebart, Zheng, Pereira, & Baker, 2020). Such tasks typically take the form of triplet questions in which participants are presented with a reference image and two additional images and are asked to indicate which of the two is more similar to the reference. Alternatively, participants can be presented with three images and asked to indicate the odd one out. Although the two designs are mathematically similar, they might affect participants' decision criteria, the agreement among observers, or the consistency of single observers; these possibilities have not yet been assessed. To address this, we presented four observers with triplets from three image sets designed to juxtapose different perceptual and conceptual features. Using a soft ordinal embedding algorithm, a machine-learning variant of multidimensional scaling, we represented the images in a two-dimensional space such that the Euclidean distances between images reflected observers' choices. Agreement between observers was assessed through a leave-one-out procedure in which embeddings based on three observers served to predict the choices of the remaining fourth observer. Consistency was calculated as the proportion of identical choices in a repeat session. Here we show that design choices in similarity judgement tasks can indeed affect results. The odd-one-out design resulted in greater embedding accuracy, higher agreement among observers, and higher consistency within observers. Hence, an individual observer's choices could be better predicted in the odd-one-out than in the triplet design. However, predicting individual responses was only possible for image sets for which participants could report a predominant relationship. Otherwise, predictability dropped to near chance level.
Our results suggest that seemingly innocuous experimental variations—standard triplet versus odd-one-out—can have a strong influence on the resulting perceptual spaces. Furthermore, we note severe limitations regarding the predictive power of models relying on pooled observer data.
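
As an illustration of the measures described above, the following minimal sketch shows how choices could be read off a two-dimensional embedding and how consistency could be computed as a proportion of identical choices. It is not the implementation used in the study: the function names, the toy coordinates, and the rule that the odd one out is the image left over once the closest pair is removed are all assumptions made for this example, and the embedding itself (the output of the soft ordinal embedding step) is taken as given.

```python
import math

def closer_to_reference(embedding, ref, a, b):
    """Standard triplet design: return whichever of a or b lies closer to ref.

    `embedding` maps image labels to 2-D coordinates (hypothetical format).
    """
    d_a = math.dist(embedding[ref], embedding[a])
    d_b = math.dist(embedding[ref], embedding[b])
    return a if d_a <= d_b else b

def odd_one_out(embedding, i, j, k):
    """Odd-one-out design: return the image left over after removing the
    closest pair (an assumed decision rule, used here for illustration)."""
    remaining_pair = {i: (j, k), j: (i, k), k: (i, j)}
    return min(
        remaining_pair,
        key=lambda odd: math.dist(
            embedding[remaining_pair[odd][0]], embedding[remaining_pair[odd][1]]
        ),
    )

def proportion_identical(choices_a, choices_b):
    """Proportion of trials on which two choice sequences agree.

    Applied to one observer's two sessions this gives consistency; applied to
    embedding predictions and a held-out observer's choices it gives agreement.
    """
    if len(choices_a) != len(choices_b) or not choices_a:
        raise ValueError("choice sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(choices_a, choices_b))
    return matches / len(choices_a)

# Toy embedding with made-up coordinates: A and B are close, C is far away.
toy_embedding = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 5.0)}

triplet_choice = closer_to_reference(toy_embedding, "A", "B", "C")
odd_choice = odd_one_out(toy_embedding, "A", "B", "C")
consistency = proportion_identical(["B", "C", "A", "A"], ["B", "C", "B", "A"])
```

In a leave-one-out setting, the embedding would be fitted to the pooled choices of three observers, and `proportion_identical` would then compare the predictions from that embedding with the actual choices of the fourth observer.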