Object recognition is one of the most complex tasks the visual system faces. Images of objects undergo severe transformations due to variations in location, size, orientation, and illumination. This presents a challenge to any recognition algorithm, whether biological or computational, as recognition of the object should be invariant to image variations that reflect viewing conditions, while staying sensitive to those image properties that reflect differences between objects. Human object recognition is exquisitely robust in this respect. Consider the font and size changes one faces in reading. For instance, the same character rendered in two different fonts and sizes is readily identifiable as the same letter even though the corresponding physical stimuli differ substantially.
Invariance to size has been demonstrated for various aspects of human object recognition. For example, training with different-size exemplars provided similar benefits in an object naming task as training with same-size stimuli (Furmanski & Engel, 2000). Another study of object naming found that the magnitude of a priming effect did not depend on whether the sizes of prime and test stimuli matched (Biederman & Cooper, 1992). Efficiency of letter identification and reading rate are only weakly affected by changes in letter size (Legge, Pelli, Rubin, & Schleske, 1985; Parish & Sperling, 1991; Pelli, Burns, Farell, & Moore-Page, 2006).
The fact that the visual performance of human observers is found to be scale invariant can be interpreted as indicating that the underlying recognition processes must also be scale invariant. However, recent evidence suggests otherwise. Majaj, Pelli, Kurshan, and Palomares (2002) have shown that the critical band of spatial frequencies for recognizing letters changes with letter size. Large letters are recognized with their details (higher frequency components) whereas small letters are recognized with their large strokes (lower frequency components), a finding that has been replicated by others (Chung, Legge, & Tjan, 2002; Oruc & Landy, 2009).
If object-recognition processes are inherently scale dependent, why do we not notice this in our everyday visual experience? The hybrid images of Oliva, Torralba, and Schyns (2006) provide an example where the scale-dependent nature of object perception is evident. Hybrid images are composites of multiple visual objects, each occupying a separate frequency band. Scale-invariant processing would mean that the percept of hybrid images would be the same regardless of image size. In reality, at different sizes different components of the hybrid image dominate the overall percept. For example, in Figure 1 most viewers report seeing Botticelli's Venus in the large image at the top, and the iconic “Love” image by Robert Indiana in the small image at the bottom. Actually, these two images are identical, simply printed at different sizes, which can be verified by standing 3–5 m away from the page or screen and observing that Love replaces Venus in the larger top image at this far viewing distance. Designing hybrid images that work requires knowledge of preferred, or critical, frequency bands for the component visual objects at various sizes and provides phenomenological evidence of scale dependence in our object recognition system. Such demonstrations are valuable for illustration purposes but do not provide proof of scale dependence. The most convincing evidence instead comes from systematic psychophysical experimentation (Chung et al., 2002; Chung & Tjan, 2009; Majaj et al., 2002; Oruc & Landy, 2009).
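The construction of a hybrid stimulus described above can be illustrated with a minimal one-dimensional sketch. It is a toy under stated assumptions, not the procedure of Oliva et al.: the signals are made-up 1-D lists rather than images, a circular box filter stands in for the smooth low-pass and high-pass filters one would normally use, and all function names (`box_blur`, `high_pass`, `make_hybrid`) are hypothetical. Viewing from afar is approximated by blurring the hybrid once more.

```python
def box_blur(signal, width=4):
    """Circular moving average: a crude low-pass filter (assumed stand-in for a Gaussian blur)."""
    n = len(signal)
    return [
        sum(signal[(i + k - width // 2) % n] for k in range(width)) / width
        for i in range(n)
    ]


def high_pass(signal, width=4):
    """Keep only what the low-pass filter removes (signal minus its blurred copy)."""
    blurred = box_blur(signal, width)
    return [s - b for s, b in zip(signal, blurred)]


def make_hybrid(coarse_source, fine_source, width=4):
    """Sum the low-frequency band of one source and the high-frequency band of another."""
    low = box_blur(coarse_source, width)
    high = high_pass(fine_source, width)
    return [l + h for l, h in zip(low, high)]


# Made-up 1-D "objects": one coarse step pattern and one fine alternating pattern.
coarse = [1.0] * 8 + [0.0] * 8
fine = [1.0, -1.0] * 8

hybrid = make_hybrid(coarse, fine)

# Near view: the fine pattern rides on top of the blurred coarse pattern.
near_residual = [h - l for h, l in zip(hybrid, box_blur(coarse))]

# Far view (approximated by a further blur): the fine pattern averages out,
# leaving only a blurred version of the coarse pattern.
far_view = box_blur(hybrid)
```

In this toy case the fine component survives intact at full resolution and vanishes after the additional blur, so which source dominates depends on the effective scale at which the composite is sampled.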
It was initially assumed that scale dependence reflected constraints on visual contrast sensitivity in human observers (Chung et al., 2002; Oliva et al., 2006; Oruc, 2003; Oruc, Landy, & Pelli, 2006). This account is based on the shape of the contrast sensitivity function (the CSF-based account), i.e., the fact that human sensitivity for very low and high spatial frequencies is limited compared to middle frequencies. Changes in object size can thus render some frequency components of an object hard to detect. For example, higher frequency components of a letter may be easily discerned when the letter is large, but when the letter is small these components will be located in even higher spatial frequencies as far as the retinal image is concerned, frequencies to which humans are far less sensitive. The pattern of scale dependence in human observers is in qualitative agreement with the predictions of the CSF-based account, in so far as the changes in preferred frequencies are in the expected direction (Chung et al., 2002; Oruc, 2003; Oruc et al., 2006). However, this account has recently been challenged. For example, one can make contrast sensitivity relatively equivalent across all spatial frequencies, i.e., obtain a considerably flatter CSF, by the addition of external white noise. Insensitivity to a particular spatial frequency is often modeled through the presence of higher internal noise in the neural processing of that stimulus (Ahumada & Watson, 1985). In other words, for higher and lower spatial frequencies, internal noise exceeds that for the middle frequencies. Consequently, the addition of high-power external white noise to the stimuli swamps the internal noise and the relatively minor differences in the internal noise become negligible. As a result, thresholds for all frequencies are raised considerably, and the characteristic shape of the CSF is rendered relatively flat. If the CSF-based account is correct, addition of external white noise should eliminate scale dependence, but it does not (Oruc & Landy, 2009), a finding that demonstrates that scale dependence has a deeper origin than low-level constraints on contrast sensitivity.
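The noise-flattening argument can be made concrete with a small numerical sketch. It assumes a simple additive model in which threshold energy is proportional to the sum of external noise and a frequency-dependent internal noise; the U-shaped internal-noise function, its parameters, and the noise levels are hypothetical values chosen for illustration only, not estimates from any of the studies cited.

```python
import math


def internal_noise(freq_cpd, peak_cpd=4.0, floor=1.0, curvature=2.0):
    """Hypothetical internal noise: lowest at peak_cpd, rising with log-frequency distance."""
    octaves_from_peak = math.log2(freq_cpd / peak_cpd)
    return floor * (1.0 + curvature * octaves_from_peak ** 2)


def threshold_energy(freq_cpd, external_noise, k=1.0):
    """Assumed additive model: threshold is proportional to external plus internal noise."""
    return k * (external_noise + internal_noise(freq_cpd))


def flatness_ratio(freqs, external_noise):
    """Ratio of the highest to the lowest threshold; 1.0 means a perfectly flat curve."""
    thresholds = [threshold_energy(f, external_noise) for f in freqs]
    return max(thresholds) / min(thresholds)


FREQS_CPD = [0.5, 1, 2, 4, 8, 16]  # illustrative spatial frequencies (cycles/deg)

ratio_no_noise = flatness_ratio(FREQS_CPD, external_noise=0.0)
ratio_high_noise = flatness_ratio(FREQS_CPD, external_noise=1000.0)
```

With no external noise the thresholds in this toy model differ widely across frequency; with strong external noise every threshold rises, but the ratio between the worst and best frequency approaches 1, which is the sense in which the CSF becomes relatively flat.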
Many of these observations on scale dependence derive from experiments using letters as stimuli. One important question is whether such results generalize to all other objects, or if there are some fundamental differences between certain types of objects that may affect the results. Letters form an interesting class of stimuli. Although letters and written text may have been designed and in time tailored to broadly suit basic human visual capabilities, they remain an artificial class with which humans develop an arbitrary expertise that is culturally determined. Whether one develops an expertise for English or Korean symbols is an accident of birth or education. It is thus unlikely that the human brain has hard-wired neural machinery for recognizing a specific writing system, as literacy is a relatively recent phenomenon and human scripts differ in form significantly from one culture to another. One could question whether the lack of scale invariance found with letters reflects the arbitrary and artificial nature of written text. If so, it would be of interest to examine scale dependence with stimuli that have greater universality and longer evolutionary significance for humans: faces, in particular, may constitute such a class.
We examined the degree of scale dependence for five sets of stimuli that we could characterize in terms of two basic factors: experience and evolutionary significance (Figure 2). (1) Letters constitute a stimulus set with which literate subjects have a high degree of experience, but which as stimuli have low evolutionary significance, for the reasons stated above. (2) Mirror-image letters we considered to represent an intermediate level of experience, in so far as subjects have far less exposure to such letters but can still easily recognize them as transformed letters, together with low evolutionary significance. (3) Novel shapes are stimuli with low experience and low evolutionary significance. (4) Upright faces are a stimulus set with high evolutionary significance and a high degree of experience, as all humans are raised with significant daily exposure to upright faces. (5) Inverted faces represent high evolutionary significance and a low/intermediate degree of experience, in so far as faces tend to be encountered far more frequently in the upright orientation. We used critical band masking (Solomon & Pelli, 1994) to estimate the spatial frequencies that are predominantly used in recognizing these stimuli at various sizes and determined how these critical frequencies change with size.
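One way to summarize how critical frequencies change with size is sketched below. This is an illustrative assumption, not the analysis reported here: peak frequencies are converted from retinal units (cycles per degree) to object-based units (cycles per object) and the slope of log peak frequency against log size is computed, so that a slope near 0 would correspond to scale invariance and a positive slope to scale dependence. The sizes and peak frequencies are made-up numbers, and the function names are hypothetical.

```python
import math


def cycles_per_object(freq_cpd, size_deg):
    """Convert retinal frequency (cycles/deg) to object-based frequency (cycles/object)."""
    return freq_cpd * size_deg


def log_log_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    denominator = sum((a - mean_x) ** 2 for a in lx)
    return numerator / denominator


sizes_deg = [1.0, 2.0, 4.0, 8.0]  # made-up stimulus sizes

# Made-up scale-invariant observer: always uses 3 cycles/object, so the
# retinal peak frequency falls in proportion to size.
invariant_peaks_cpd = [3.0 / s for s in sizes_deg]

# Made-up scale-dependent observer: cycles/object grows with the square root of size.
dependent_peaks_cpd = [3.0 * s ** 0.5 / s for s in sizes_deg]

invariant_slope = log_log_slope(
    sizes_deg, [cycles_per_object(f, s) for f, s in zip(invariant_peaks_cpd, sizes_deg)]
)
dependent_slope = log_log_slope(
    sizes_deg, [cycles_per_object(f, s) for f, s in zip(dependent_peaks_cpd, sizes_deg)]
)
```

Expressing the peak in cycles per object makes the comparison across stimulus classes straightforward, since a scale-invariant mechanism would use the same object-based band at every size.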
If scale invariance requires evolutionary time scales to be built into specialized neural mechanisms, then evidence for scale invariance should only be found for the face stimuli. Furthermore, if experience has an impact, then the degree of scale dependence should be less for those stimulus classes with which humans have more experience and familiarity.