It has been claimed that the human letter identification system cannot be construed as, or even informed by, a linear amplifier model. We have observed that a perceptron network trained by the delta rule is analogous to a linear amplifier, and we have demonstrated that it can indeed capture several important aspects of human letter identification as revealed by the Bubbles method. Consequently, a first contribution of this article is to show that the LAM-based analysis initially proposed by Murray and Gold (2004a) has much more explanatory power, and bears much more directly on what the Bubbles method achieves, than has previously been acknowledged (Gosselin & Schyns, 2004). This parallels the recent finding reported in Murray (2012) that standard Bubbles images are very similar to the theoretical Bubbles images obtained from a linear model.
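For concreteness, the following is a minimal sketch of the kind of perceptron at issue: a single layer of linear output units trained with the delta (Widrow-Hoff) rule on raw pixel inputs. The array shapes, learning rate, and function names are illustrative, not those of our simulations.

```python
import numpy as np

def train_delta_rule(X, T, epochs=200, lr=0.01):
    """Single-layer linear perceptron trained with the delta rule.

    X : (n_samples, n_pixels) raw pixel images, one letter per row.
    T : (n_samples, n_letters) one-hot targets, one unit per letter.
    Returns W : (n_pixels, n_letters), i.e., one linear "template"
    per letter, as in a linear amplifier model.
    """
    W = np.zeros((X.shape[1], T.shape[1]))
    for _ in range(epochs):
        Y = X @ W                 # linear activations, no hidden layer
        W += lr * X.T @ (T - Y)   # delta rule: follow the output error
    return W

def identify(W, x):
    """Return the index of the letter whose template best matches x."""
    return int(np.argmax(x @ W))
```

Because the mapping from pixels to outputs is a single matrix product, the trained weights can be read directly as letter templates and compared with the diagnostic regions revealed by Bubbles.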
It is worth recalling here that the perceptron model we have used can only compute a direct mapping from raw input pixels to output units. It has no ability to rotate, scale, or reframe the input, no notion of symmetries, no spatial frequency filters with which to decompose and analyze the input, no simple or complex units that would detect oriented bars or edges, and no hidden layers that could perform any other kind of sophisticated computation. These are severe limitations, and indeed it has been known for decades that such networks without a hidden layer can only solve linearly separable discrimination tasks. The finding that this outrageously simple model can nevertheless accommodate so much of the Bubbles data on letter identification is therefore unsettling. At the very least, it demonstrates that the task of letter identification as defined in the experiments of Fiset et al. (2008) would be linearly separable if the target letters were presented under clear viewing conditions. This fact is likely to influence the type of strategies and features used by subjects: Experiments in which stimulus size, location, or shape varies randomly within a session may require other processing strategies and diagnostic features from the subject, which would presumably not be well captured by a linear perceptron model operating on raw pixel inputs, even one equipped with a convolution operator. As it happens, Watson and Ahumada (2008) recently introduced a set of template-matching models to predict visual acuity from aberrated retinal images of letter stimuli under conditions of noise and location shifts. The models all used realistic preprocessing steps, such as optical and neural transfer functions, and differed only in their template-matching procedure. They found that although an ideal observer model best captured human performance, all models performed at a high level of accuracy, including a linear template matcher using a cross-correlation matching operator. This establishes that linear template matchers equipped with more sophisticated operators and preprocessing steps can emulate human behavior under realistic conditions.
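A linear template matcher of the kind Watson and Ahumada compared can be sketched in a few lines. The version below is schematic only: it assumes precomputed letter templates and omits their optical and neural preprocessing stages.

```python
import numpy as np
from scipy.signal import correlate2d

def cross_correlation_match(image, templates):
    """Classify a noisy, possibly shifted letter image by picking the
    template whose peak cross-correlation with the image is highest.

    image     : (H, W) stimulus array.
    templates : list of (h, w) letter templates.

    The matching operator is still linear, but taking the maximum
    over positions tolerates location shifts, unlike a fixed dot
    product against a template anchored at one position.
    """
    scores = [correlate2d(image, t, mode="same").max() for t in templates]
    return int(np.argmax(scores))
```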
We have also shown that the perceptron with the delta rule can outperform an ideal observer model in accounting for the types of features used by humans during letter identification. A case in point is that only the perceptron model places the most emphasis on “termination” features, as humans do. It is still unclear why this should be so, considering the above-mentioned limitations of the network, in particular the lack of any edge feature detectors or of any preprocessing of the input (we note that these specific limitations are also shared by the ideal observer). Part of the reason for the model's behavior has to do with the exact placement of letter stimuli during training and the fact that the delta learning rule gives more importance to those features that are unique to a letter. Clearly, because not all letters are of the same width, if they are centered on the input layer then the terminations of wide letters like A, M, or W will be unique and selected by the delta rule. Although this argument does not go all the way to explaining the prevalence of terminations, some of which are emphasized despite overlapping significantly across letters (for instance, the terminations in I, J, K, and L), it can actually explain why vertical features matter more than horizontal ones for the model. Indeed, horizontal features have more overlap in the training set, because they tend always to fall at the top, middle, or bottom of Arial letters and because letter inputs are adjusted for height. Hence, and contrary to the experimental data, these features will be less diagnostic for the model than vertical features, which exhibit much more variability in location.
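The point about unique features can be illustrated with a deliberately tiny example. In the toy problem below (entirely hypothetical, not one of our stimuli), two four-pixel "letters" share three pixels and differ in one; the delta rule ends up loading most of the discriminative weight onto the unique pixel, just as it favors the non-overlapping terminations of wide letters.

```python
import numpy as np

# Two toy "letters" sharing pixels 0-2; pixel 3 is unique to letter B,
# standing in for a termination that only a wide letter reaches.
X = np.array([[1., 1., 1., 0.],    # letter A
              [1., 1., 1., 1.]])   # letter B
T = np.eye(2)                      # one-hot targets

W = np.zeros((4, 2))
for _ in range(500):
    W += 0.05 * X.T @ (T - X @ W)  # delta rule

print(np.round(W, 2))
# Each shared pixel 0-2 converges to small weights (about [0.33, 0.]),
# whereas the unique pixel 3 converges to roughly [-1., 1.]: it alone
# carries the discrimination between the two output units.
```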
It should be noted that both the perceptron and the ideal observer model used by Fiset et al. (2008) point to a very local type of explanation for the diagnostic features used by human subjects in Bubbles experiments. In the ideal observer model, the nature and relative importance of letter features are entirely determined by the experimental stimuli and by the distribution of bubbles during test trials. In the perceptron model, these same features are determined by letter templates that the delta rule discovers from the experimental stimuli themselves. In neither case is knowledge acquired from other letter exemplars or under other viewing conditions taken into account, which again greatly restricts the type of explanations that these models can provide. In fact, the success of the perceptron model at mimicking human behavior actually suggests that much, though clearly not all (e.g., horizontal vs. vertical features), of what is discovered through a Bubbles experiment is independent of the details of the subjects' histories and of their previous expertise with visual letters. Indeed, in our simulations this rich previous history is not taken into account, and learning is simplified to a process of repeated exposure to standardized letter targets, which are exactly those used in the Bubbles experiments.
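For readers less familiar with the technique, the diagnostic maps discussed in this section are obtained from Bubbles trials in essentially the following way; this is a schematic rendering of the standard analysis introduced by Gosselin and Schyns, with array names of our own choosing.

```python
import numpy as np

def classification_image(masks, correct):
    """Diagnostic map from a Bubbles experiment.

    masks   : (n_trials, H, W) bubble apertures applied on each trial.
    correct : (n_trials,) boolean accuracy of each response.

    Pixels that were visible more often on correct than on incorrect
    trials receive positive values and count as diagnostic, whether the
    observer is a human subject, an ideal observer, or a perceptron.
    """
    correct = np.asarray(correct, dtype=bool)
    return masks[correct].mean(axis=0) - masks[~correct].mean(axis=0)
```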
We would argue that none of the limitations we have discussed for the LAM and the perceptron, including their linearity, are insurmountable. The LAM itself essentially only makes a statement about the existence of letter templates and of a linear mechanism for comparing them to inputs, but it says nothing about the sophistication of the letter templates, which could be arbitrarily complex or high-dimensional. In our perceptron implementation of the LAM, the low sophistication of the letter templates directly reflects the simplistic nature of the input code, but there are a number of other codes that could advantageously replace these crude pixel inputs, from pyramidal kernels (Bosch, Zisserman, & Munoz, 2007) to SIFT features (scale-invariant feature transform; Lowe, 2004) or shape contexts (Mori, Belongie, & Malik, 2005), to name a few.
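As a sketch of what such an upgrade involves, the function below replaces raw pixels with a crude oriented-edge energy code before training. It is a stand-in for the richer codes cited above (it is not SIFT or shape contexts), intended only to show that the perceptron itself need not change.

```python
import numpy as np

def oriented_energy_code(img, n_orient=4):
    """Re-encode an image as n_orient oriented edge-energy channels.

    The gradient magnitude at each pixel is routed to the channel
    matching its local orientation, yielding a higher-dimensional
    input vector for a linear classifier, in place of raw pixels.
    """
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # orientation in [0, pi)
    k = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)
    channels = [np.where(k == i, mag, 0.0) for i in range(n_orient)]
    return np.concatenate([c.ravel() for c in channels])
```

The delta-rule training loop sketched earlier applies unchanged to such vectors; only the input representation differs.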
Some of these codes attempt to emulate the properties of primary visual areas (Pinto, Barhomi, Cox, & DiCarlo, 2011), while others attempt to integrate natural image statistics. In fact, taking such a step would recast letter identification within the more mainstream research effort that is generic object recognition. We have argued that some limitations of the model arise because it uses a single supervised process, and indeed, for most computer vision scientists and many neuroscientists of vision, it is useful to distinguish between two stages when modeling visual processes. The first stage essentially performs an analysis of the visual input into universal features; it is often unsupervised and can use deep networks or any of the above-mentioned codes (e.g., Serre, Oliva, & Poggio, 2007). The later stage is one of feature selection specific to the task, and it uses supervised classifiers that are most commonly of the generalized linear type known as support vector machines (see, for instance, Pinto, Barhomi, Cox, & DiCarlo, 2008). Following the procedure outlined in recent instantiations of state-of-the-art visual object recognition models like HMAX (hierarchical model and X; Serre et al., 2007), the LAM/perceptron's input code could be upgraded to the product of an unsupervised learning process that turns images into a high-dimensional vector of responses from detectors of frequently occurring feature combinations, as determined by the statistics of a training set of natural images. A supervised linear classifier operating on such a code could possibly explain why, for instance, horizontal features are preferred by human subjects over vertical ones. Last but not least, mapping a set of highly entangled patterns into a high-dimensional feature space through a nonlinear function is a demonstrated way to obtain linearly separable patterns (Cover, 1965; see also DiCarlo & Cox, 2007, for a discussion in the context of human vision), as the toy example below illustrates.
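The classic demonstration of this last point is the XOR problem, which no single-layer perceptron can solve in the raw input space. The sketch below assumes a hand-picked nonlinear expansion (a single product term) rather than a learned one, but the principle is the same.

```python
import numpy as np

# XOR: not linearly separable on the raw 2-D inputs.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([0., 1., 1., 0.])

def expand(X):
    """Fixed nonlinear expansion to a higher-dimensional space
    (adding the product x1*x2 and a bias term), after which the
    two classes become linearly separable, as per Cover (1965)."""
    return np.column_stack([X, X[:, 0] * X[:, 1], np.ones(len(X))])

Phi = expand(X)
w = np.zeros(Phi.shape[1])
for _ in range(2000):
    w += 0.1 * Phi.T @ (t - Phi @ w)   # same delta rule, richer code

print((Phi @ w > 0.5).astype(int))     # -> [0 1 1 0]
```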