Abstract
Remarkably, human brains have the ability to accurately perceive and process the real-world size of objects, despite vast differences in distance and perspective. While previous studies have delved into this phenomenon, distinguishing this ability from other visual perceptions, like depth, has been challenging. Using the THINGS EEG2 dataset with high time-resolution human brain recordings and more ecologically valid naturalistic stimuli, our study uses an innovative approach to disentangle neural representations of object real-world size from visual size and perceived real-world depth in a way that was not previously possible. Leveraging this state-of-the-art dataset, our EEG representational similarity results reveal a pure representation of object real-world size in human brains. We report a representational timeline of visual object processing: pixel-wise differences appeared first, then real-world depth and retinal size, and finally, real-world size. Additionally, we input both these naturalistic images and object-only images without natural background into artificial neural networks. Consistent with the human EEG findings, we also successfully disentangled representation of object real-world size from visual size and real-world depth in all three types of artificial neural networks (visual-only ResNet, visual-language CLIP, and language-only Word2Vec). Moreover, our multi-modal representational comparison framework across human EEG and artificial neural networks reveals real-world size as a stable and higher-level dimension in object space incorporating both visual and semantic information. Our research provides a detailed and clear characterization of the object processing process, which offers further advances and insights into our understanding of object space and the construction of more brain-like visual models.