Abstract
We compare word recognition by deep neural networks (DNNs) and humans, asking whether increased pooling in a network can model crowding in human vision. We focus our experiments on the Convolutional Recurrent Neural Network (CRNN) [Shi et al. 2016], a popular model for word recognition, and study its efficiency and crowding. To measure efficiency, we assess the network's performance in recognizing random 4-letter words in a monospace font at various contrast levels on a white-noise background. We find that the network is less efficient than human observers: its efficiency is roughly one tenth of the 3\% that humans attain [Pelli et al., 2003].
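The abstract does not spell out how efficiency is computed. The following is a minimal sketch, assuming the common definition of efficiency as the ratio of the ideal observer's threshold energy to the tested observer's threshold energy, with energy proportional to squared contrast for a fixed stimulus. The function name and the derived contrast ratio are illustrative assumptions, not the authors' code; only the 3\% human efficiency and the "roughly one tenth" factor come from the text.

```python
import math


def efficiency(ideal_threshold_contrast, observer_threshold_contrast):
    """Efficiency as a ratio of threshold energies (assumed definition).

    For a fixed stimulus, energy scales with contrast squared, so the
    energy ratio reduces to the squared ratio of threshold contrasts.
    """
    if ideal_threshold_contrast <= 0 or observer_threshold_contrast <= 0:
        raise ValueError("threshold contrasts must be positive")
    return (ideal_threshold_contrast / observer_threshold_contrast) ** 2


# Values stated in the abstract: humans reach about 3% efficiency,
# and the network reaches roughly one tenth of that.
HUMAN_EFFICIENCY = 0.03
NETWORK_EFFICIENCY = HUMAN_EFFICIENCY / 10

# Under the assumed definition, a tenfold efficiency gap corresponds to
# a threshold contrast that is sqrt(10) times higher for the network.
contrast_ratio = math.sqrt(HUMAN_EFFICIENCY / NETWORK_EFFICIENCY)
```

Under this assumption, a tenfold difference in efficiency means the network needs a threshold contrast about three times higher than a human observer does at the same noise level.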
Letter crowding in human vision results in a minimum threshold spacing, independent of letter size. Crowding is usually explained as inappropriately large pooling for the task at hand. We studied how the network's size and spacing thresholds are affected by changing its pooling from 2 to 32. Each network with modified pooling was trained as specified by the original authors. We measured word recognition accuracy as a function of letter size and spacing. For humans tested at any given eccentricity, there are two regimes, one limited by crowding and one limited by acuity [Song et al. 2014]. In the crowding regime, the threshold size is inversely proportional to the spacing ratio. In the acuity regime, the threshold size is independent of the spacing ratio. In the network, our manipulation revealed only one regime for all pooling values: a slope of -0.3 in a log-log plot of threshold size vs. spacing ratio, unlike the human data, which have slopes of -1 (crowding limited) and 0 (acuity limited). Based on these results, we believe that there are important limitations in how well this network models human reading.
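The two human regimes can be illustrated with a small sketch. It assumes that spacing equals letter size times spacing ratio, so that a fixed critical spacing gives a threshold size of critical spacing divided by spacing ratio (slope -1 on log-log axes), while a fixed acuity limit gives a constant threshold size (slope 0). The parameter values, function names, and least-squares fit are hypothetical and chosen only for illustration; they are not the measured human or network data.

```python
import math


def threshold_size(spacing_ratio, acuity_size, critical_spacing):
    """Threshold letter size under an assumed two-limit model.

    Letters must be at least acuity_size, and their center-to-center
    spacing (size * spacing_ratio) must be at least critical_spacing.
    """
    return max(acuity_size, critical_spacing / spacing_ratio)


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx


# Hypothetical limits in degrees of visual angle (illustration only).
ACUITY_SIZE = 0.1
CRITICAL_SPACING = 0.3

# Spacing ratios below 3 are crowding limited with these parameters;
# ratios above 3 are acuity limited.
crowded_ratios = [1.1, 1.5, 2.0, 2.5]
acuity_ratios = [4.0, 6.0, 8.0]

crowding_slope = loglog_slope(
    crowded_ratios,
    [threshold_size(r, ACUITY_SIZE, CRITICAL_SPACING) for r in crowded_ratios],
)
acuity_slope = loglog_slope(
    acuity_ratios,
    [threshold_size(r, ACUITY_SIZE, CRITICAL_SPACING) for r in acuity_ratios],
)
```

With these assumed parameters, the fitted slopes come out close to -1 and 0 respectively. A single slope of -0.3 across all spacing ratios, as reported for the network, matches neither regime.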