Abstract
Machine learning has found many applications in research on higher visual functions. However, few studies have applied machine learning algorithms to understand early visual functions. We applied a deep convolutional neural network (dCNN) to analyze human responses to basic image statistics. The stimuli were band-pass filtered random dot textures whose pixel luminance distribution was modulated away from uniform by a linear combination of Legendre polynomials of orders i and j, where i was 2 or 3 and j ranged from 1 to 8, excluding i. For each polynomial pair there were 30 modulation depths, from 0 (uniform distribution) to 1 (some luminance level had zero probability). The Gaussian spatial frequency bands had peaks that ranged from 2 to 32 cyc/deg and a half-octave space constant. Each of six observers classified 7500–22500 textures by contrast, skewness, glossiness, naturalness or aesthetic preference. The psychophysical results served as ground truth for a VGG16 dCNN pretrained on ImageNet. The decisive layer was identified by removing the convolution layers one by one and finding the point at which the validation accuracy of the network, with a retrained output layer, dropped below 80%. The decisive layer for contrast discrimination contained filters whose profiles consisted of repeated geometric patterns, suggesting a general texture processing mechanism. Glossiness, naturalness, and aesthetic preference shared the same decisive layer, which comprised filters whose profiles looked like parts of objects. The spatial frequency tuning function, assessed by the validation accuracy with one spatial frequency band left out of the training set, was low-pass for all properties except contrast, which showed an inverted-W shape with peaks at 4 and 16 cyc/deg. Our results suggest possible properties of the visual mechanisms used to sense texture qualities, and our analysis also shows that dCNNs can be a useful tool for early vision research.
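The luminance modulation described above can be illustrated with a minimal Python sketch. It is not the authors' stimulus code. The function names, the 256 luminance levels, the equal weights on the two polynomials, and the scaling rule (depth 1 drives the least probable level exactly to zero) are all assumptions chosen to match the verbal description. Spatial band-pass filtering is not included.

```python
def legendre(n, x):
    """Evaluate the Legendre polynomial P_n(x) using Bonnet's recurrence."""
    if n == 0:
        return 1.0
    prev, curr = 1.0, x
    for k in range(1, n):
        # (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
        prev, curr = curr, ((2 * k + 1) * x * curr - k * prev) / (k + 1)
    return curr


def modulated_distribution(i, j, depth, levels=256, w_i=1.0, w_j=1.0):
    """Luminance probabilities modulated by w_i * P_i + w_j * P_j.

    Assumptions (not from the abstract): luminance levels are mapped to
    bin centres on [-1, 1], the weights are equal by default, and the
    modulation is scaled so that depth = 1 gives the least probable
    level a probability of exactly zero.
    """
    xs = [-1.0 + 2.0 * (k + 0.5) / levels for k in range(levels)]
    combo = [w_i * legendre(i, x) + w_j * legendre(j, x) for x in xs]
    scale = -min(combo)  # positive: the combination dips below zero
    probs = [1.0 + depth * c / scale for c in combo]
    total = sum(probs)  # renormalise on the discrete grid
    return [p / total for p in probs]


# Example: orders 2 and 3 at zero and at full modulation depth.
uniform = modulated_distribution(2, 3, depth=0.0)
full = modulated_distribution(2, 3, depth=1.0)
```

Pixel luminances for a texture could then be drawn from such a distribution, for example with `random.choices(range(256), weights=full, k=n_pixels)`, before band-pass filtering.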
Acknowledgement: MOST (Taiwan) 105-2420-H-002-006-MY3
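The layer-removal procedure in the abstract can be sketched as a simple search loop. This is only an illustration under stated assumptions: layers are removed from the deepest one downward, the "decisive" layer is taken to be the one whose removal first pushes validation accuracy below the 80% criterion, and `accuracy_without` is a hypothetical callback standing in for retraining the output layer and validating. The accuracy values in the example are arbitrary placeholders, not results from the study.

```python
def find_decisive_layer(layer_names, accuracy_without, threshold=0.80):
    """Return the first layer whose removal drops accuracy below threshold.

    layer_names: convolution layers in removal order (assumed deepest first).
    accuracy_without: hypothetical callback; given the tuple of layers
        removed so far, it is expected to retrain the output layer and
        return validation accuracy in [0, 1].
    """
    removed = []
    for name in layer_names:
        removed.append(name)
        if accuracy_without(tuple(removed)) < threshold:
            return name
    return None  # accuracy never fell below the criterion


# Toy example with made-up accuracies indexed by number of removed layers.
toy_accuracy = {1: 0.93, 2: 0.91, 3: 0.88, 4: 0.74, 5: 0.60}
layers = ["block5_conv3", "block5_conv2", "block5_conv1",
          "block4_conv3", "block4_conv2"]
decisive = find_decisive_layer(
    layers, lambda removed: toy_accuracy[len(removed)]
)
```

In a real analysis the callback would rebuild the truncated VGG16, retrain the classifier on the psychophysical labels, and return accuracy on held-out textures.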