Abstract
Artificial Neural Networks (ANNs) have been proposed as computational models of the primate ventral stream because their performance on tasks such as image classification rivals or exceeds human baselines. However, useful models should not only predict data well but also offer insight into the systems they represent, which remains a challenge for ANNs. Here we investigate a specific method that has been proposed to shed light on the representations learned by ANNs: Feature Visualizations (FVs), that is, synthetic images specifically designed to excite individual units ("neurons") of the target network. In theory, these images should visualize the features a unit is sensitive to, much like receptive fields in neurophysiology. We conduct a psychophysical experiment to establish an upper bound on the interpretability afforded by FVs: participants must match five sets of exemplars (natural images that highly activate certain units) to five sets of FVs of the same units, a task that would be trivial if FVs were informative. Extending earlier work that has cast doubt on the utility of this method, we show that (1) even human experts perform hardly better than chance when trying to match a unit's FVs to its exemplars, and (2) matching exemplars to each other is much easier, even when only a single exemplar is shown per set. Presumably, this difficulty is caused not by so-called polysemantic units (neurons that code for multiple unrelated features and may mix them in their visualizations) but by the unnatural visual appearance of the FVs themselves. We also investigate the effect of visualizing units from different layers and find that the interpretability of FVs declines in later layers, contrary to what one might expect given that later layers are thought to represent more semantic concepts. These findings highlight the need for better interpretability techniques if ANNs are ever to become useful models of human vision.
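
To make the FV idea concrete, the sketch below shows activation maximization in its simplest form: starting from random noise, an image is optimized by gradient ascent to increase the mean response of one unit (channel) of an intermediate layer. This is only an illustrative approximation, not the exact pipeline used in the paper; the choice of PyTorch, a pretrained GoogLeNet, the layer inception4c, and channel 13 are assumptions for the example, and production FV methods add transformation robustness and decorrelated image parameterizations that are omitted here.

```python
# Minimal activation-maximization sketch (illustrative; assumes PyTorch + torchvision).
import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.googlenet(weights="DEFAULT").to(device).eval()

# Capture activations of an intermediate layer via a forward hook.
acts = {}
layer = model.inception4c  # hypothetical layer choice for this example
layer.register_forward_hook(lambda m, i, o: acts.__setitem__("out", o))

channel = 13  # hypothetical unit (channel) index to visualize

# Optimize the image itself, starting from small random noise.
img = (0.01 * torch.randn(1, 3, 224, 224, device=device)).requires_grad_(True)
optimizer = torch.optim.Adam([img], lr=0.05)

for step in range(256):
    optimizer.zero_grad()
    model(torch.sigmoid(img))                    # squash pixels to [0, 1] before the forward pass
    activation = acts["out"][0, channel].mean()  # mean spatial response of the chosen unit
    loss = -activation                           # minimize the negative = gradient ascent
    loss.backward()
    optimizer.step()

fv = torch.sigmoid(img).detach().cpu()  # the resulting feature visualization
```

The exemplars used as the comparison condition in the experiment are, by contrast, simply the natural dataset images that produce the highest activations for the same unit, so no optimization is involved in producing them.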