Abstract
How do people align concepts learned from different modalities, such as visual and linguistic inputs? To address this question, we examined the representations of emojis, which are pictograms commonly used in linguistic contexts that also convey distinctive, engaging visual characteristics. The representational similarity structure of emojis was measured using an odd-one-out paradigm. In Experiment 1, human similarity judgments were collected for 48 emojis drawn from a wide range of emoji categories (faces, animals, objects, signs, etc.). We compared human similarity judgments with predictions from three types of models: a language model (fastText) trained for word prediction in sentences, a vision model (Visual Auto-Encoder) trained to reconstruct input images, and a multimodal neural network (CLIP) that learns visual concepts under language supervision. We found that CLIP correlated most strongly with human similarity judgments (rho = .38), followed by fastText (rho = .36) and the Visual Auto-Encoder (rho = .17). When controlling for linguistic semantics from fastText, CLIP maintained a significant semipartial correlation with human judgments (sr = .34). CLIP's superior performance was not due simply to combining multimodal inputs, since concatenating fastText and Visual Auto-Encoder embeddings yielded a lower correlation (rho = .17). In Experiment 2, we used the 50 most frequently used emojis, which mostly consist of faces with different expressions and hand gestures. All three models showed correlations with human similarity judgments: CLIP (rho = .68), followed by fastText (rho = .52) and the Visual Auto-Encoder (rho = .46). These results suggest that models trained with aligned visual and linguistic inputs in a multimodal way best capture human conceptual representations of visual symbols such as emojis. However, these general-purpose models remain inadequate for capturing fine-grained social attributes of emojis.
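As an illustration of the comparison described above, the following is a minimal sketch (not the authors' code) of how model embeddings can be compared with human similarity judgments: pairwise cosine dissimilarities are derived from the embeddings, correlated with human dissimilarities via Spearman's rho, and a semipartial correlation is approximated by residualizing one model predictor (e.g., CLIP) on another (e.g., fastText). The variable names (`clip_emb`, `fasttext_emb`, `human_vec`) are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def model_rdm(embeddings: np.ndarray) -> np.ndarray:
    """Condensed vector of pairwise cosine dissimilarities (one value per emoji pair)."""
    return pdist(embeddings, metric="cosine")

def spearman_to_human(model_vec: np.ndarray, human_vec: np.ndarray) -> float:
    """Spearman rank correlation between model and human dissimilarities."""
    rho, _ = spearmanr(model_vec, human_vec)
    return rho

def semipartial_spearman(target_vec: np.ndarray,
                         human_vec: np.ndarray,
                         control_vec: np.ndarray) -> float:
    """Semipartial correlation: residualize the target predictor (e.g., CLIP)
    on the control predictor (e.g., fastText), then correlate the residuals
    with human dissimilarities."""
    X = np.column_stack([np.ones_like(control_vec), control_vec])
    beta, *_ = np.linalg.lstsq(X, target_vec, rcond=None)
    residuals = target_vec - X @ beta
    rho, _ = spearmanr(residuals, human_vec)
    return rho

# Hypothetical usage: embeddings are (n_emojis, dim) arrays; human_vec is the
# condensed human dissimilarity vector derived from the odd-one-out judgments.
# clip_vec = model_rdm(clip_emb)
# fasttext_vec = model_rdm(fasttext_emb)
# print(spearman_to_human(clip_vec, human_vec))
# print(semipartial_spearman(clip_vec, human_vec, fasttext_vec))
```

This sketch assumes cosine distance as the embedding dissimilarity measure and a simple linear residualization for the semipartial correlation; the paper's actual analysis pipeline may differ in these details.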