Abstract
This study investigates the capabilities of GPT-4, an advanced language model with integrated vision capabilities, in interpreting complex mental states using the Reading the Mind in the Eyes Test (RMET). The RMET involves identifying subtle emotional and mental states from photographs of the region immediately around the human eyes. It comprises 36 photographs, each accompanied by four descriptors of the person's mental state. As in human studies, we prompted GPT-4 (API model: gpt-4-vision-preview) to "Choose which word best describes what the person in the picture is thinking or feeling. You may feel that more than one word is applicable, but please choose just one word, the word which you consider to be most suitable. Your 4 choices are: …" We conducted five iterations of the RMET. GPT-4 answered an average of 25.4 of 36 items correctly (SD = 0.89), aligning closely with the typical general-population human performance range (~25–26 items correct). Notably, inverting the images led to a 30% decrease in performance, smaller than the 50% decrease seen in humans, revealing a reliance on global, holistic processing. Block scrambling the images (into a 2 x 5 grid), which preserves eye-sized features but renders the images nearly unrecognizable to human observers, had almost no impact on GPT-4's performance (24 items correct). This surprising finding suggests that GPT-4's analysis of visual information may prioritize local features (eye gaze, eyebrow characteristics, etc.) over more global aspects of the image. These results provide insight into the model's visual processing mechanisms, indicating an interplay of feature-specific and holistic image analysis. Overall, the findings show that GPT-4 demonstrates a significant level of competence in recognizing a range of mental states, indicating its potential in applications requiring sophisticated emotional and cognitive understanding.
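For readers unfamiliar with the prompting setup, the sketch below illustrates how a single RMET item might be submitted to the gpt-4-vision-preview model via the OpenAI Python client. It is a minimal sketch, not the authors' actual script; the image path, helper function name, and the four example descriptor words are hypothetical placeholders, while the prompt text is quoted from the abstract.

```python
# Minimal sketch (assumed setup, not the study's exact code): send one RMET item
# to the gpt-4-vision-preview model with the forced-choice prompt from the abstract.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_rmet_item(image_path: str, choices: list[str]) -> str:
    """Submit one eyes-region photograph plus four descriptor words; return the reply."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "Choose which word best describes what the person in the picture is "
        "thinking or feeling. You may feel that more than one word is applicable, "
        "but please choose just one word, the word which you consider to be most "
        "suitable. Your 4 choices are: " + ", ".join(choices)
    )

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=10,  # a single descriptor word is expected
    )
    return response.choices[0].message.content.strip()


# Hypothetical usage (placeholder file name and descriptors, not an actual RMET item):
# answer = ask_rmet_item("item_01.jpg", ["serious", "ashamed", "alarmed", "bewildered"])
```

Scoring each of the five iterations would then amount to comparing such single-word replies against the RMET answer key.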