Abstract
Our ability to interpret scenes builds upon segmenting images into same-texture regions, which usually correspond to the same physical ‘stuff’. How do we do this so immediately on arbitrary new images with novel textures, under varying lighting and geometry, with a visual system whose resolution degrades rapidly away from fixation? We conduct the first direct measurement of human ability to identify whether two 1°×1° grayscale texture patches (widely sampled to prevent learning of specific textures) are the same ‘stuff’, when presented at the same location or at two different locations, spanning the fovea and three eccentricities. We also develop the first models for such general discrimination, using two image-computable approaches that can incorporate biological properties such as eye optics and the ganglion-cell sampling resolution at the stimulus location. In the first model, we define important texture features, including the luminance histogram, power spectrum, and edge properties. We then measure and model the statistical distribution of these features across textures, from which we build a Bayesian ideal observer for same-different discrimination. With a single fixed decision boundary on only two features, this observer discriminates arbitrary texture patches with 93% accuracy and aligns qualitatively with human performance. The second class of models comprises convolutional neural networks that can discover new features; these also achieve good discrimination. Biological vision does not need numerous labelled samples to learn texture discrimination. Instead, it is likely self-supervised: primitive vision plausibly used coarse features to segment images, and evolution then bootstrapped these labels to learn texture discrimination. We implemented this by training the Bayesian texture discriminator on natural image patches segmented by proximity and color, and the same decision boundary emerged as with explicit texture labels.
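To make the logic of the first model concrete, the sketch below is a minimal, hypothetical illustration of a Bayesian same-different observer operating on two features. It is not the paper's actual feature set or fitted distributions: the two toy features (mean luminance and a crude contrast proxy), the Gaussian model of feature differences, and the synthetic textures are all illustrative assumptions, standing in for the measured feature statistics described in the abstract.

```python
# Minimal sketch (assumed, not the paper's pipeline): a Gaussian likelihood-ratio
# observer over differences in two toy texture features between a pair of patches.
import numpy as np
from scipy.stats import multivariate_normal


def features(patch):
    """Two illustrative texture features: mean luminance and a crude contrast proxy."""
    lum = patch.mean()
    contrast = np.abs(np.diff(patch, axis=1)).mean()
    return np.array([lum, contrast])


def feature_difference(patch_a, patch_b):
    """Absolute difference of the two features between a pair of patches."""
    return np.abs(features(patch_a) - features(patch_b))


class SameDifferentObserver:
    """Bayesian same/different decision via a likelihood ratio on feature differences."""

    def fit(self, same_pairs, diff_pairs):
        d_same = np.array([feature_difference(a, b) for a, b in same_pairs])
        d_diff = np.array([feature_difference(a, b) for a, b in diff_pairs])
        # Gaussian models of the feature-difference distributions under each hypothesis
        # (small ridge added for numerical stability).
        self.p_same = multivariate_normal(d_same.mean(0), np.cov(d_same.T) + 1e-6 * np.eye(2))
        self.p_diff = multivariate_normal(d_diff.mean(0), np.cov(d_diff.T) + 1e-6 * np.eye(2))
        return self

    def decide(self, patch_a, patch_b):
        """A single fixed boundary: choose the hypothesis with the higher log-likelihood."""
        d = feature_difference(patch_a, patch_b)
        return "same" if self.p_same.logpdf(d) > self.p_diff.logpdf(d) else "different"


# Toy usage with synthetic "textures" (noise with texture-specific mean and contrast).
rng = np.random.default_rng(0)

def synth_texture(params, size=32):
    mean, scale = params
    return mean + scale * rng.standard_normal((size, size))

params = rng.uniform(0.2, 0.8, (200, 2))
same_pairs = [(synth_texture(p), synth_texture(p)) for p in params]
diff_pairs = [(synth_texture(p), synth_texture(q)) for p, q in zip(params, params[::-1])]
observer = SameDifferentObserver().fit(same_pairs, diff_pairs)
```

The key design point mirrored here is that the decision reduces to one fixed boundary in a two-dimensional feature-difference space; in the paper that boundary, learned once, generalizes to arbitrary texture pairs, whereas this sketch only demonstrates the same-different likelihood-ratio construction on toy data.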