Abstract
Visual image reconstruction aims to recover arbitrary stimulus/perceived images from brain activity. To achieve reconstruction over diverse images, especially with limited training data, it is crucial that the model leverages a compositional representation that spans the image space, with each feature effectively mapped to brain activity. In light of these considerations, we critically assessed recently reported photorealistic reconstructions based on text-to-image diffusion models applied to a large-scale fMRI/stimulus dataset (Natural Scenes Dataset, NSD). We found a notable decrease in the reconstruction performance of these models on a different dataset (Deeprecon) specifically designed to prevent category overlap between the training and test sets. UMAP visualization of the target features (CLIP text/semantic features) for NSD images revealed strikingly limited diversity: only ~40 distinct semantic clusters, which overlapped between the training and test sets. Further, CLIP feature decoders trained on NSD had considerable difficulty predicting novel semantic clusters not present in the training set. Simulations likewise showed that decoders could not predict new clusters when the training set was restricted to a small number of clusters. Clustered training samples appear to restrict the feature dimensions that can be predicted from brain activity. Conversely, when the training set was diversified to ensure a broader distribution across the feature dimensions, the decoders generalized better beyond the trained clusters. Nonetheless, text/semantic features alone are insufficient for a complete mapping to the visual space, even if they are perfectly predicted from brain activity.
Building on these observations, we argue that the recent photorealistic reconstructions may be predominantly a blend of classification into trained semantic categories and the generation of convincing yet inauthentic images (hallucinations) through text-to-image diffusion. To avoid such spurious reconstructions, we offer guidelines for developing generalizable methods and conducting reliable evaluations.
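
The simulation result summarized above (decoders trained on a few clusters fail on novel clusters, whereas decoders trained on diverse samples generalize) can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the simulation used in this work: the linear encoding model, the ridge decoder, and every size and noise level (`N_FEATURES`, `N_VOXELS`, `NOISE_SD`, `SPREAD`, the cluster counts) are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# All values below are illustrative assumptions, not taken from the paper.
N_FEATURES = 50   # dimensionality of the toy "semantic" feature space
N_VOXELS = 100    # number of simulated voxels
NOISE_SD = 3.0    # standard deviation of voxel noise
SPREAD = 0.05     # within-cluster standard deviation of features

# Hypothetical ground-truth linear encoding from features to voxels.
W_ENC = rng.standard_normal((N_FEATURES, N_VOXELS))


def simulate_brain(features):
    """Generate noisy voxel responses from feature vectors."""
    noise = NOISE_SD * rng.standard_normal((features.shape[0], N_VOXELS))
    return features @ W_ENC + noise


def sample_clusters(centers, n_per_cluster):
    """Draw tightly clustered feature vectors around each center."""
    repeated = np.repeat(centers, n_per_cluster, axis=0)
    return repeated + SPREAD * rng.standard_normal(repeated.shape)


def fit_ridge(X, Y, alpha=100.0):
    """Closed-form ridge regression (no intercept) from voxels to features."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


def profile_correlation(Y_true, Y_pred):
    """Mean per-sample Pearson correlation between true and predicted features."""
    yt = Y_true - Y_true.mean(axis=1, keepdims=True)
    yp = Y_pred - Y_pred.mean(axis=1, keepdims=True)
    num = (yt * yp).sum(axis=1)
    den = np.linalg.norm(yt, axis=1) * np.linalg.norm(yp, axis=1)
    return float(np.mean(num / den))


# Training sets of equal size: 5 tight clusters vs. broadly distributed samples.
train_centers = rng.standard_normal((5, N_FEATURES))
Y_clustered = sample_clusters(train_centers, 200)
Y_diverse = rng.standard_normal((1000, N_FEATURES))

dec_clustered = fit_ridge(simulate_brain(Y_clustered), Y_clustered)
dec_diverse = fit_ridge(simulate_brain(Y_diverse), Y_diverse)

# Test sets: new samples from the trained clusters, and samples from novel clusters.
novel_centers = rng.standard_normal((20, N_FEATURES))
Y_seen = sample_clusters(train_centers, 40)
Y_novel = sample_clusters(novel_centers, 10)
X_seen = simulate_brain(Y_seen)
X_novel = simulate_brain(Y_novel)

score_seen = profile_correlation(Y_seen, X_seen @ dec_clustered)
score_novel_clustered = profile_correlation(Y_novel, X_novel @ dec_clustered)
score_novel_diverse = profile_correlation(Y_novel, X_novel @ dec_diverse)

print(f"clustered decoder, trained clusters: {score_seen:.2f}")
print(f"clustered decoder, novel clusters:   {score_novel_clustered:.2f}")
print(f"diverse decoder, novel clusters:     {score_novel_diverse:.2f}")
```

In this toy setting the clustered decoder performs well only on the clusters it was trained on, because its predictions are largely confined to the subspace spanned by the five training centers. The decoder trained on broadly distributed features recovers novel clusters far better, even though both decoders see the same number of training samples.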