Abstract
Limitations of human letter recognition indicate a bottleneck in how visual information is combined to recognize a letter. Signal Detection Theory shows that the ideal observer for a signal in white noise performs template matching, and its performance depends solely on the signal-to-noise ratio (SNR), independent of signal complexity. Surprisingly, human threshold SNR is proportional to letter complexity (Pelli et al. 2006), suggesting that only a limited number of features can be combined for identification. To better understand this limitation of human observers, we trained an artificial neural network to identify letters, hoping to discover which network design characteristics would make its threshold depend on complexity. We used a convolutional neural network (ConvNet), a popular multi-layer neural network architecture for object and letter recognition (LeCun et al. 1998). We created multiple sets of images, each consisting of a letter added to a Gaussian white-noise background, varying the noise level across sets. We used seven fonts spanning a tenfold range of complexity. For each font, we trained the network to identify letters on all noise levels together, then tested accuracy at each noise level to determine the threshold SNR required for 64% correct identification of new test images. With extensive resources, the ConvNet has a much lower threshold than humans and exhibits only a weak dependence on font complexity (a log-log slope of 0.5). With restricted resources (two convolutional layers containing 6 and 12 convolutional filters, respectively, followed by 60 fully-connected units), the ConvNet's threshold rises to human levels (0.10 RMS error in log SNR threshold across the 7 fonts), with a log-log slope of 1 (i.e. threshold proportional to complexity). Thus we find that a ConvNet with restricted resources closely matches human thresholds for letter identification in seven fonts spanning a tenfold range of complexity.
Meeting abstract presented at VSS 2014
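
As a rough illustration of the setup described in the abstract, the sketch below (in PyTorch; not the authors' code) shows the restricted-resource ConvNet, the noisy-letter stimulus generation, and the interpolation of a 64%-correct threshold. The input size (64x64), kernel sizes, pooling, SNR definition, and interpolation method are assumptions; the abstract specifies only the 6 and 12 convolutional filters, the 60 fully-connected units, and the 64%-correct criterion.

```python
# Minimal sketch of the "restricted resources" ConvNet and the threshold procedure.
# Assumptions (not stated in the abstract): 64x64 grayscale input, 5x5 kernels,
# 2x2 max pooling, signal-energy-based SNR, and linear interpolation in log SNR.
import numpy as np
import torch
import torch.nn as nn

class RestrictedConvNet(nn.Module):
    def __init__(self, n_classes=26):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # 6 convolutional filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(6, 12, kernel_size=5, padding=2),  # 12 convolutional filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(12 * 16 * 16, 60),                  # 60 fully-connected units
            nn.ReLU(),
            nn.Linear(60, n_classes),                     # one output per letter
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def noisy_letter(letter_image, noise_rms):
    """Add Gaussian white noise to a letter image; return stimulus and an assumed SNR."""
    noise = np.random.normal(0.0, noise_rms, letter_image.shape)
    signal_energy = np.sum(letter_image ** 2)
    snr = signal_energy / (noise_rms ** 2 * letter_image.size)  # assumed SNR definition
    return letter_image + noise, snr

def threshold_snr(snrs, accuracies, criterion=0.64):
    """Interpolate (in log SNR) the SNR yielding the 64%-correct criterion."""
    log_snr = np.log10(np.asarray(snrs, dtype=float))
    acc = np.asarray(accuracies, dtype=float)
    order = np.argsort(acc)                               # np.interp needs increasing x
    return 10 ** np.interp(criterion, acc[order], log_snr[order])
```

In this sketch, training one such network per font on stimuli pooled across all noise levels, then evaluating accuracy at each noise level and calling threshold_snr, mirrors the procedure the abstract describes; the "extensive resources" condition would correspond to the same pipeline with a larger network.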