Abstract
Seeking a better understanding of how we recognize letters, we compared the letter recognition performance of human subjects with that of a biologically plausible neural network (Fukushima’s Neocognitron). This type of neural network, inspired by the architecture of the visual system, has been successful in OCR and natural scene classification, but so far has not been compared directly to human letter recognition. We were particularly interested in the errors made when letters were degraded, such as in the presence of noise.
First, we confirmed that the network can recognize lower-case letters, which had not been shown before. Trained on just ten presentations of each of the 26 letters in Times font, it was robust to letter rotation (±45°: 62% correct), spatial warping (±50% of character size in both dimensions: 75% correct), and spatial translation.
Next, following the analyses of Solomon and Pelli (1994) and Chung et al. (2002), we evaluated the model's "letter channels" using stimulus filtering and filtered-noise masking. Unlike the lowpass ideal observer described by Solomon and Pelli (1994), the model has a bandpass channel very similar in shape to that of human observers, centered around 2-3 cycles/letter.
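As an illustration of what stimulus filtering in units of cycles/letter involves, the following is a minimal sketch, not the filtering procedure used in the study. It assumes an idealized sharp-edged radial filter; the function name `bandpass_cycles_per_letter` and its parameters (`letter_size_px`, `low_cpl`, `high_cpl`) are hypothetical.

```python
import numpy as np

def bandpass_cycles_per_letter(image, letter_size_px, low_cpl, high_cpl):
    """Keep only spatial frequencies within [low_cpl, high_cpl] cycles/letter.

    Assumption: an ideal (sharp-edged) radial filter; a real experiment
    would more likely use smooth filter edges (e.g. raised cosine).
    """
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    # fftfreq returns cycles/pixel; scaling by letter size gives cycles/letter.
    fy = np.fft.fftfreq(rows)[:, None] * letter_size_px
    fx = np.fft.fftfreq(cols)[None, :] * letter_size_px
    radius = np.hypot(fx, fy)
    mask = (radius >= low_cpl) & (radius <= high_cpl)
    # Zero out every Fourier component outside the passband.
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))
```

With a 32-pixel letter, for example, a passband of 2 to 4 cycles/letter would keep a grating at 3 cycles/letter and remove one at 8 cycles/letter.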
Finally, we compared confusion matrices from new experiments, classic published results, and model predictions. After removing response bias using the Luce choice model, we examined correlations between the remaining letter-similarity matrices, which capture typical confusions between letters. Correlations between the simulation and new experimental data (subjects recognizing letters in noise) were r=0.62-0.7, slightly lower than the agreement between observers (r=0.8-0.9). When compared with Bouma's (1971) matrix, which used a Courier font, the model trained on Times had a low correlation (r=0.32), while the Courier-trained model reached r=0.64.
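The bias-removal and comparison steps can be sketched as follows. This is an illustrative example under stated assumptions, not the analysis code used in the study. It assumes the standard biased choice model, P(response j | stimulus i) = η_ij β_j / Σ_k η_ik β_k, with symmetric similarities η_ij and η_ii = 1, under which η_ij = sqrt(p_ij p_ji / (p_ii p_jj)) is independent of the bias terms β. The function names are hypothetical.

```python
import numpy as np

def luce_similarity(confusions):
    """Bias-free similarity matrix under the Luce biased choice model.

    confusions[i, j] is the count (or proportion) of response j to stimulus i.
    Assumption: every letter is identified correctly at least once, so the
    diagonal is nonzero.
    """
    p = np.asarray(confusions, dtype=float)
    p = p / p.sum(axis=1, keepdims=True)   # rows become P(response | stimulus)
    diag = np.diag(p)
    # The bias terms cancel in the ratio p_ij * p_ji / (p_ii * p_jj).
    return np.sqrt((p * p.T) / np.outer(diag, diag))

def offdiag_correlation(sim_a, sim_b):
    """Pearson correlation over the off-diagonal (upper-triangle) entries."""
    sim_a = np.asarray(sim_a, dtype=float)
    sim_b = np.asarray(sim_b, dtype=float)
    upper = np.triu_indices_from(sim_a, k=1)
    return np.corrcoef(sim_a[upper], sim_b[upper])[0, 1]
```

Restricting the correlation to off-diagonal entries keeps the comparison focused on which letters are confused with which, rather than on overall accuracy.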
We believe the ability of this model to capture the particular letter confusions of humans makes it a promising testbed for probing intermediate-level object recognition.
Meeting abstract presented at VSS 2012