Abstract
Scenes are composed not only of discrete objects with defined shapes but also of complex visual “stuff” in the form of amorphous textures and patterns (e.g., grass, bricks, smoke). Many behaviorally relevant properties of scenes can be quickly recognized based solely on the stuff they contain (e.g., the hotness of fire, the hardness of concrete, the fragility of glass). Though much work has explored how the human visual system represents individual objects, less is known about how we process the amorphous stuff that makes up most of the visual environment. Furthermore, it remains an open question which classes of computational models can account for the human ability to rapidly detect a rich set of high-level properties from a brief glance at a patch of visual stuff. To address these questions, we developed a dataset of 500 high-quality images spanning 50 categories of textures encountered in real-world environments, and we collected annotations of readily identifiable qualitative properties of these images (e.g., material properties, haptic properties, semantic attributes). In preliminary investigations, we asked whether computational models trained for object recognition also yield representations that are useful for predicting the qualitative properties of visual stuff. Our findings show that while many perceptual and haptic properties (e.g., bumpiness) were predicted equally well by supervised and untrained convolutional neural networks, high-level semantic attributes (e.g., naturalness) were predicted much better by supervised models. Nonetheless, all models failed to reach the noise ceiling for the majority of the property annotations, demonstrating that further work is needed to account for the richness of human stuff perception. Our dataset will provide a critical benchmark for computational models and will be useful for follow-up studies that seek to understand how humans recognize the many qualitative properties of the stuff in their visual surroundings.
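
As an illustration of the kind of analysis summarized above (a minimal sketch, not the authors' released code), the example below compares how well penultimate-layer features from an ImageNet-supervised versus an untrained ResNet-50 predict human ratings of a single property using cross-validated ridge regression. The choice of network, the file paths (textures/*.jpg, ratings/bumpiness_mean.npy), and the regression setup are assumptions introduced for illustration only.

```python
import glob

import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from torchvision import models, transforms

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(model, image_paths):
    """Return penultimate-layer (2048-d) ResNet-50 features for each image."""
    model.eval()
    with torch.no_grad():
        feats = [
            model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0).numpy()
            for p in image_paths
        ]
    return np.stack(feats)

# One ImageNet-supervised network and one with random (untrained) weights.
supervised = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
untrained = models.resnet50(weights=None)
for net in (supervised, untrained):
    net.fc = torch.nn.Identity()  # expose features instead of class logits

# Hypothetical inputs: the texture images and mean human ratings for one
# property (e.g., bumpiness), aligned image-by-image.
image_paths = sorted(glob.glob("textures/*.jpg"))    # placeholder path
ratings = np.load("ratings/bumpiness_mean.npy")      # placeholder file, shape (n_images,)

for name, net in [("supervised", supervised), ("untrained", untrained)]:
    X = extract_features(net, image_paths)
    # Cross-validated ridge regression from features to ratings; the resulting
    # R^2 can then be compared against a noise ceiling estimated from
    # inter-rater reliability (not shown here).
    scores = cross_val_score(RidgeCV(alphas=np.logspace(-3, 3, 13)),
                             X, ratings, cv=5, scoring="r2")
    print(f"{name}: mean cross-validated R^2 = {scores.mean():.3f}")
```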