Color vision has been studied traditionally using simple laboratory stimuli such as patches, Gabors, and gratings. Only recently have images of natural scenes been used to study color vision (e.g., Brainard, Rutherford, & Kraft,
1997; Fine & MacLeod,
2001; Johnson & Baker,
2004; Párraga, Troscianko, & Tolhurst,
2002; Ruderman & Bialek,
1994; Webster & Mollon,
1997; Yoonessi & Kingdom,
2008). Although the complexity of natural scenes makes the data obtained from them sometimes difficult to interpret, they offer a unique opportunity to study how the structural properties of the natural visual environment influence color perception.
In a recent study we measured human sensitivity to a range of color transformations applied uniformly across images of natural scenes (Yoonessi & Kingdom,
2008). The transformations were rotations and translations applied to the color space that defined all pixel values. We found that for all types of transformation, sensitivity was higher for the raw images of natural scenes compared to their phase-scrambled counterparts. Control experiments ruled out that the differential sensitivity was due to the familiarity of the colors in the raw scenes, suggesting instead that it was due to the raw scene's unique spatial structure. Raw natural scenes typically consist of patterns of edges separated by uniform regions, and we conjectured that these were the critical features.
One of the issues raised by Yoonessi and Kingdom's (
2008) study concerns the locus of the mechanisms responsible for the higher raw-scene sensitivity. Assuming that the detection of uniform color changes is a relatively low-level visual process, one that could in principle be mediated by monocular as well as binocular neurons, a legitimate question is whether the higher sensitivity is mediated by monocular or binocular mechanisms. The principal aim of the present study is to answer this question. Knowing the locus of the differential sensitivity could be useful in determining the nonlinearities that are responsible.
The retinal images in the two eyes are very similar under normal viewing conditions, the slight differences between them arising from retinal disparity. Normally the between-eye differences due to disparity are not perceived as differences or anomalies—instead they are exploited by stereopsis to provide an impression of a unitary three-dimensional world. However, artificial differences between stimuli presented to the two eyes can result in departures from unitary vision, of which rivalry is the most commonly studied form (Blake,
2001). Recently, Malkoc and Kingdom (
2004) described a new measure of non-unitary binocular perception: the ‘Dichoptic Difference Threshold,’ or DDT. The DDT is the minimum detectable difference between two dichoptically superimposed stimuli. The DDT is a performance-based rather than appearance-based measure. If the difference between a dichoptic pair is gradually increased from zero, a point is reached where the stimulus takes on a slightly lustrous appearance, and it is this that enables it to be distinguished from dichoptically identical stimuli.
Figure 1 illustrates the effect. If one free-fuses the two pairs of stimuli, the bottom, dichoptically different stimulus should appear lustrous. Because the lustrous appearance occurs at much smaller between-eye differences than are required to elicit rivalry, DDTs are much lower than thresholds for binocular rivalry (Malkoc & Kingdom,
2004).
DDTs offer a simple means to determine whether the positive effects of natural scene structure on sensitivity to color transformations are mediated by mechanisms at an early stage within monocular channels, or after the signals from the two eyes are combined. If they are mediated within monocular channels, we would expect to find a similar pattern of results for DDTs as for discriminands in plain view, i.e., lower thresholds for raw compared to phase-scrambled scenes. On the other hand, if they are mediated by channels after the point of binocular combination, then we would not expect a difference between raw and phase-scrambled DDTs. The present study will test between these two alternatives.
What are the color transformations that we have employed? They are luminance and/or chromatic changes applied uniformly across the image. The transformations are implemented by rotating or translating the three-dimensional color space defining the colors and luminances of every pixel in the image. Sample transformations applied to an image are shown in
Figure 2. The color space employed here is a modified version of the MacLeod–Boynton color space (MacLeod & Boynton,
1979) designed by Ruderman, Cronin, and Chiao (
1998). The axes in the color space represent the responses of the three postreceptoral channels: the luminance-sensitive channel that sums the outputs of the L (long-wavelength-sensitive) and M (medium-wavelength-sensitive) cones; a chromatically sensitive channel that differences the outputs of the L and M cones and is known as the ‘L–M’ channel; and a chromatically sensitive channel that differences the sum of the L and M cone responses from the S (short-wavelength-sensitive) cone response and is known as the ‘S–(L+M)’ channel. Because these channels are often (though strictly speaking incorrectly) referred to as the ‘luminance,’ ‘red–green,’ and ‘blue–yellow’ channels, we will employ this terminology. In one form of MacLeod–Boynton color space the three postreceptoral channel axes are formed by appropriate combinations of cone contrast, where cone contrast is defined as ΔL/Lb, ΔM/Mb, and ΔS/Sb. The denominator in each cone contrast term is the cone response to the background, which is assumed to determine the state of cone adaptation. While this is a reasonable assumption for briefly presented stimuli such as gratings or low-contrast patches, it is arguably inappropriate for natural scenes, which tend to be of high contrast and for which cone adaptation is likely determined locally rather than across the scene as a whole (Brown & Masland,
2001; Ledda, Santos, & Chalmers,
2004; Shapley & Hawken,
2002; Wallach,
1948). A particularly undesirable consequence of using conventional cone contrast to represent the three postreceptoral channel layers of natural scenes is that the red–green layer spuriously picks up pure-luminance shadows (Olmos & Kingdom,
2004b; Párraga et al.,
2002). The use of logarithmic-based cone contrasts is one way to avoid this problem (Olmos & Kingdom,
2004b; Ruderman et al.,
1998).
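As an illustration of a logarithmic-based decomposition, the lαβ space of Ruderman et al. (1998) can be sketched as follows. This is a minimal sketch assuming the image is already expressed as LMS cone responses; the function name and the clipping of zero values are our own additions, not the authors' exact pipeline.

```python
import numpy as np

def lms_to_lab(lms):
    """Decompose an (H, W, 3) LMS image into luminance (l), blue-yellow
    (alpha), and red-green (beta) layers via log cone responses.

    The normalizing constants follow the l-alpha-beta formulation of
    Ruderman, Cronin, and Chiao (1998); this is a sketch, not the exact
    implementation used in the experiments.
    """
    log_lms = np.log(np.maximum(lms, 1e-6))      # clip to avoid log(0)
    L, M, S = log_lms[..., 0], log_lms[..., 1], log_lms[..., 2]
    lum   = (L + M + S) / np.sqrt(3)             # achromatic layer
    alpha = (L + M - 2 * S) / np.sqrt(6)         # blue-yellow layer
    beta  = (L - M) / np.sqrt(2)                 # red-green layer
    return np.stack([lum, alpha, beta], axis=-1)
```

Because the decomposition works on log responses, a pure-luminance change (multiplying L, M, and S by a common factor) shifts only the luminance layer, which is why shadows do not spuriously appear in the red–green layer.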
For the experiments described below we used 50 images of natural scenes—termed ‘raw’—and their phase-scrambled versions. Each image was decomposed into three layers based on the modeled responses of the luminance, red–green, and blue–yellow postreceptoral channels. Each layer was then transformed by translation and rotation. Thresholds for detecting the transformations were measured under four conditions:
- raw scenes, with the discriminand pair placed side by side and viewed monocularly;
- phase-scrambled scenes, with the discriminand pair placed side by side and viewed monocularly;
- raw scenes, with the discriminand pair superimposed dichoptically;
- phase-scrambled scenes, with the discriminand pair superimposed dichoptically.
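The translation and rotation transformations applied to the layers can be sketched as follows, under the assumption that an image is stored as an (H, W, 3) array of (luminance, red–green, blue–yellow) layer values; the function names and channel ordering are illustrative, not the exact code used in the experiments.

```python
import numpy as np

def rotate_chromatic_plane(layers, angle_deg):
    """Rotate every pixel's (red-green, blue-yellow) coordinates by
    angle_deg, leaving the luminance layer untouched. `layers` is an
    (H, W, 3) array ordered (luminance, red-green, blue-yellow)."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    out = layers.copy()
    out[..., 1:3] = layers[..., 1:3] @ rot.T   # 2x2 rotation per pixel
    return out

def translate_layer(layers, channel, delta):
    """Add a constant `delta` to one channel: a uniform translation of
    the color space along that axis."""
    out = layers.copy()
    out[..., channel] += delta
    return out
```

Because the same rotation or translation is applied to every pixel, the transformation is uniform across the image, as required by the experiments.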
In order to compare the various types of luminance/chromatic transformations, we have measured thresholds defined in terms of a simple and intuitively appealing metric of image distance: the Euclidean distance, or L2 norm. E can be calculated using the following formula:

E = √( Σ_{i=1}^{3} Σ_{n=1}^{N} (p_ni − q_ni)² )

where p_ni and q_ni are the intensities of the corresponding pixels in the two images, with i being the image layer (i = 1:3), n the pixel (i.e., with unique x, y coordinates), and N the number of pixels per image. Euclidean distance has the important property that it defines a straightforward measure of the distance between two images, one that gives the same answer irrespective of the orthonormal basis used to represent the images, e.g., pixels, Fourier, Haar, etc. (Horn & Johnson,
1985). Euclidean distance has been previously employed to compare sensitivities to a variety of transformations applied to natural scenes (Kingdom, Field, & Olmos,
2007). It is important to state at the outset, however, that we are not arguing that Euclidean distance is the proper perceptual metric. Rather, we argue that E is a relatively neutral metric, providing a useful measure for comparing the relative sensitivities to the different types of chromatic/luminance transformations that we have used.
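A minimal sketch of the Euclidean distance E between two three-layer images (the function name is ours):

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean (L2) distance between two (H, W, 3) layer images:
    E = sqrt of the sum, over layers i and pixels n, of (p_ni - q_ni)^2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))
```

The basis-independence property follows from Parseval's theorem: computing the same quantity on, say, orthonormally scaled Fourier coefficients of the two images yields the same value of E.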