One of the fundamental unanswered questions in visual science regards how the visual system attains a high degree of invariance (e.g., position invariance, size invariance, etc.) while maintaining high selectivity. Although a variety of theories have been proposed, most are distinguished by the degree to which information is maintained or discarded. To test whether information is maintained or discarded, we have compared the ability of the human visual system to detect a variety of wide-field changes to natural images. The changes range from simple affine transforms and intensity changes common to our visual experience to random changes as represented by the addition of white noise. When sensitivity was measured in terms of the Euclidean distance (*L*_{2} norm) between image pairs, we found that observers were an order of magnitude less sensitive to the geometric transformations than to added noise. A control experiment ruled out that the sensitivity difference was caused by the statistical properties of the image difference created by this transformation. We argue that the remarkable difference in sensitivity relates to the processes used by the visual system to build invariant relationships and leads to the unusual result that observers are least sensitive to those transformations most commonly experienced in the natural world.

*geometric* and *photometric*. The former refers to changes in the positions of image pixels, while the latter refers to changes in the intensive and/or spectral content of image pixels. The geometric transformations in Figure 1 are affine transformations on the two-plane (Watt, 2000). Many of these are quite common in our visual experience. Image “translation” occurs every time we move our eyes, and an image “contraction” every time we move away from an object. Others, such as “stretch” and “shear,” constitute *distortions* that may be less common but occur as the observer moves through an environment. The photometric transformations in Figure 1 are of two classes. Uniform photometric transformations impose the same change to all pixel values: “flatten,” “brighten,” and “divide” are the examples. Random photometric or “noise” transformations are random perturbations applied either independently to every pixel, as in the “Gaussian noise” example, or independently at different image scales, as in the “fractal noise” example. Of the uniform photometric transformations, “divide” is probably the most commonly experienced as it occurs every time there is a reduction in the ambient light level, as when going from day to night. With the exception of significant levels of photon noise seen under low light conditions, the transformations that involve added noise would normally never occur in our visual experience, and therefore also constitute distortions.

*E,* or *L*_{2} norm. If the images are tri-plane, RGB colored images, as in Figure 1, *E* can be calculated using the following formula:

*p*_{ni} and *q*_{ni} are the intensities of the corresponding pixels in the two images, with *i* the image plane (*i* = 1:3 ∣ *R, G, B*), *n* the pixel (i.e., with unique *x, y* coordinate), and *N* the number of pixels per image. Euclidean distance has the important property that it defines a straightforward measure of the distance between two images and provides the same answer irrespective of the orthonormal basis used to represent the images, e.g., pixels, Fourier, Haar, etc. (Horn & Johnson, 1990). We are certainly not arguing that the Euclidean distance is the proper *perceptual* metric. Rather, we argue that *E* is a relatively neutral metric, providing a useful measure for comparing the relative sensitivities to the different types of image transformation shown in Figure 1. It is widely believed that simple visual discrimination tasks are mediated by filters in the early stages of the visual cortex, for example primate area V1, that are tuned to various orientations and spatial frequencies (DeValois & DeValois, 1991). Under the most simplistic model where we assume that the visual system calculates the differences between images from the differences between the magnitudes of *m* linear, orthonormal filter responses, the Euclidean distance calculated from the filter responses produces similar answers to that calculated from pixel intensities. We should also emphasize that Euclidean distance is a somewhat unusual metric for describing affine transforms. In a Euclidean pixel space, most affine transforms represent a curved trajectory through the space. Although a monotonic increase in the affine transformation (e.g., a shift to the left) will typically result in a monotonic increase in the Euclidean distance, it is not a simple linear relationship. Therefore, although Euclidean distance is a valid metric of physical distance between two images and is easily calculated, we do not expect it to be an accurate perceptual metric. Indeed, it is the failure of this physical metric which is the core of this study.
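As a rough illustration of the distance described above, the following Python sketch computes *E* between two RGB images and checks that an orthonormal change of basis leaves it unchanged. The formula itself is not reproduced in this copy of the text, so the root-mean-square normalization by 3*N* is an assumption, as are the function name `euclidean_distance`, the random test images, and the use of a random orthonormal matrix as a stand-in for a Fourier or Haar basis.

```python
import numpy as np

def euclidean_distance(p, q, normalize=True):
    """Euclidean (L2) distance between two equally shaped images.

    With normalize=True the summed squared differences are divided by
    the number of values (3N for an RGB image) before the square root.
    That normalization is an assumption of this sketch.
    """
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    total = np.sum(diff ** 2)
    if normalize:
        total /= diff.size
    return float(np.sqrt(total))

rng = np.random.default_rng(0)
base = rng.uniform(0, 255, size=(8, 8, 3))        # hypothetical tiny RGB image
noisy = base + rng.normal(0, 5, size=base.shape)  # same image plus Gaussian noise

# Random orthonormal basis obtained from a QR decomposition.
basis, _ = np.linalg.qr(rng.normal(size=(base.size, base.size)))

e_pixels = euclidean_distance(base, noisy)
e_basis = euclidean_distance(basis.T @ base.ravel(), basis.T @ noisy.ravel())
```

Under these assumptions `e_pixels` and `e_basis` agree to within floating-point error, which is the basis-independence property the text attributes to the Euclidean distance.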

*same* natural scene (different scenes on each trial) and were required to indicate which of the pair conformed to a particular transformation. In the second experiment (Figure 2b), subjects were presented on each trial with two images of *different* scenes (different scene-pairs on each trial) and were required to indicate which of the pair conformed to a particular transformation. For the second experiment, only those transformations that could be considered distortions are applicable, and therefore we only tested “stretch,” “shear,” and “added noise.” Importantly, the distortion class of image transformation is the only class uniquely applicable to natural scenes, since knowledge of what is “normal” in a scene is a prerequisite. For the second experiment, we still measured the magnitude of the transformation in terms of *E,* even though the baseline image was not presented with its transformed version on the same trial.

^{2}. Intensity resolution was 24 bits (256 levels for each *R, G,* and *B* image). Each image was circular with a diameter of 300 pixels subtending 11 deg at the viewing distance of 100 cm. The stimulus edges were softened using a 0.55 × 0.55 deg Gaussian filter with a standard deviation of 2 deg. Each stimulus was presented for a total of 500 ms with a temporal ramp of 100 ms at stimulus onset and offset.

*x, y* are the original and *x*′, *y*′ the transformed image pixel coordinates. For the four classes of geometric transformation, the matrix coefficients were

*s*_{1} and *s*_{2}, *t*_{1} and *t*_{2}, and *h*_{1} and *h*_{2} are the transformation levels, with subscripts 1 and 2 for the *x* (horizontal) and *y* (vertical) coordinates. *θ* is orientation in degrees. For the scale transformation, *s*_{1} and *s*_{2} were covaried. For the stretch horizontal transformation, *s*_{2} was set to zero while *s*_{1} was varied, and similarly for the horizontal and vertical versions of the translation and shear transformations.
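The four classes of geometric transformation can be sketched as 3 × 3 matrices acting on homogeneous pixel coordinates. The matrix coefficients are not reproduced in this copy of the text, so the forms below are the standard textbook ones and should be read as assumptions. In particular, the scale matrix uses 1 + *s* on the diagonal, chosen because the text sets *s*_{2} to zero for a purely horizontal stretch, which only leaves the vertical axis untouched under that convention. All function names are hypothetical.

```python
import numpy as np

def scale(s1, s2):
    """Scale or stretch; s1 = s2 = 0 is the identity (assumed convention)."""
    return np.array([[1 + s1, 0, 0], [0, 1 + s2, 0], [0, 0, 1]], dtype=float)

def translate(t1, t2):
    """Shift by t1 horizontally and t2 vertically."""
    return np.array([[1, 0, t1], [0, 1, t2], [0, 0, 1]], dtype=float)

def shear(h1, h2):
    """Shear x by h1 * y and y by h2 * x."""
    return np.array([[1, h1, 0], [h2, 1, 0], [0, 0, 1]], dtype=float)

def rotate(theta_deg):
    """Counter-clockwise rotation about the origin, angle in degrees."""
    t = np.deg2rad(theta_deg)
    return np.array([[np.cos(t), -np.sin(t), 0],
                     [np.sin(t), np.cos(t), 0],
                     [0, 0, 1]])

def transform_point(matrix, x, y):
    """Map original coordinates (x, y) to transformed coordinates (x', y')."""
    xp, yp, _ = matrix @ np.array([x, y, 1.0])
    return xp, yp

# Horizontal stretch only: s2 = 0 while s1 varies.
stretched = transform_point(scale(0.5, 0.0), 2.0, 2.0)
```

With these conventions, a horizontal-only version of translation or shear is obtained the same way, by setting the second parameter to zero.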

*x*'s and *y*'s. We employed a bi-cubic interpolation method, in which the new pixel value was the weighted average of the four neighboring pixel values. Although the range of transformations was tailored to each subject to ensure an average performance of about 75% correct, the total range across subjects for the different geometric transformations was 0.2–52% of image width/height for scaling (specifically contraction); 0.1–45 deg for rotation; 0.001–1.3 aspect ratio for shear; and 0.1–5.7% of image height/width (corresponding to 0.011–0.63 deg) for translation. The 6 levels of each transformation were spaced logarithmically.
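The logarithmic spacing of levels and the conversion from percent of image size to visual angle can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the 11 deg image size is the diameter quoted for the stimuli, and the full across-subject translation range is used as the example even though each subject received an individually tailored range.

```python
def log_spaced_levels(lowest, highest, count=6):
    """Return `count` levels spaced evenly on a logarithmic axis."""
    ratio = (highest / lowest) ** (1.0 / (count - 1))
    return [lowest * ratio ** i for i in range(count)]

def percent_to_degrees(percent, image_size_deg=11.0):
    """Convert a shift given as % of image size to degrees of visual angle."""
    return percent / 100.0 * image_size_deg

# Example only: the total translation range reported across subjects.
levels_percent = log_spaced_levels(0.1, 5.7)
levels_deg = [percent_to_degrees(p) for p in levels_percent]
```

The end points come out at 0.011 deg and about 0.63 deg, matching the correspondence stated in the text, and each level is a constant multiple of the one before it.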

*flatten* or reduce the contrast of the image, we decreased the range of pixel intensities (0–255) in each *R, G,* and *B* plane according to the formula

*I*(*x, y*) is the original, *I*′(*x, y*) the transformed image plane, and *M* the average value of the original image plane. *k* determined the degree of flattening. The *k* values spanned 0.05–0.45. To *brighten* the image, all pixel RGB values were incremented by a specified amount, ranging from 1 to 38. To *divide* the image, all pixel values were divided by an amount ranging from 1.01 to 1.31. As with the geometric transformations, the 6 levels of each transformation were spaced logarithmically.
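The three uniform photometric transformations are simple enough to sketch directly. The flattening formula is not reproduced in this copy of the text; the version below, which pulls each value toward the plane mean *M* by a fraction *k*, is one plausible reading of the symbol definitions and is an assumption. The function names are hypothetical, and no clipping to the 0–255 range is applied because the text does not say how out-of-range values were handled.

```python
import numpy as np

def flatten(plane, k):
    """Reduce contrast about the plane mean M.

    Assumed form: I'(x, y) = M + (1 - k) * (I(x, y) - M).
    """
    plane = np.asarray(plane, dtype=float)
    mean = plane.mean()
    return mean + (1.0 - k) * (plane - mean)

def brighten(plane, amount):
    """Increment every pixel value by the same amount."""
    return np.asarray(plane, dtype=float) + amount

def divide(plane, divisor):
    """Divide every pixel value by the same amount."""
    return np.asarray(plane, dtype=float) / divisor

plane = np.array([[0.0, 100.0], [200.0, 100.0]])  # hypothetical 2 x 2 image plane
flat = flatten(plane, 0.45)
bright = brighten(plane, 38)
dim = divide(plane, 1.31)
```

Under the assumed formula, flattening leaves the mean of the plane unchanged and scales its standard deviation by 1 − *k*.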

*added Gaussian noise* condition, each *R, G,* and *B* pixel value (0–255) was perturbed by an amount randomly drawn from a Gaussian probability distribution with mean zero and standard deviation *σ* equal to *k,* where *k* ranged from 1 to 15. *Multiplicative noise* was achieved by setting *σ* proportional to the pixel value, i.e., *σ* = *k* · *I*(*x, y*), where *k* ranged from 0.0002 to 0.0019. The *added fractal noise* images were generated by adding to each *RGB* image plane a fractal noise mask (Simoncelli, 2003) whose power spectral density fell with spatial frequency *f* according to 1/*f*^{n}, with *n* set to 3 and the image variance normalized to 1. The choice of exponent *n* = 3 may seem odd because natural scenes have an average exponent of 2 (Field, 1987). However, we measured the spectra of our actual test images and found they had an average exponent of 3.2, and so took 3 rather than 2 for our fractal noise. The steeper-than-normal power spectra of our images are likely caused by the fact that they contained a more than average number of close-ups of objects. The different levels of fractal noise were achieved by multiplying the noise mask by a constant *k* that varied in logarithmic intervals from 1 to 12.
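The three noise conditions can be sketched as follows. This is not the authors' code: the fractal mask here is made by spectrally shaping white Gaussian noise with NumPy's FFT, which is one common way to obtain a 1/*f*^{n} power spectrum and may differ from the cited implementation. Function names, the mask size, and the random seed are all assumptions.

```python
import numpy as np

def gaussian_noise(plane, k, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma = k."""
    return plane + rng.normal(0.0, k, size=plane.shape)

def multiplicative_noise(plane, k, rng):
    """Add zero-mean Gaussian noise with sigma = k * I(x, y)."""
    return plane + rng.normal(0.0, 1.0, size=plane.shape) * (k * plane)

def fractal_noise_mask(size, n, rng):
    """Noise mask whose power spectrum falls as 1/f**n, with unit variance."""
    fx = np.fft.fftfreq(size)
    f = np.hypot(fx[None, :], fx[:, None])   # radial spatial frequency
    f[0, 0] = 1.0                            # placeholder to avoid dividing by zero
    amplitude = f ** (-n / 2.0)              # power 1/f^n means amplitude 1/f^(n/2)
    amplitude[0, 0] = 0.0                    # drop the DC term so the mean is zero
    spectrum = np.fft.fft2(rng.normal(size=(size, size)))
    mask = np.real(np.fft.ifft2(spectrum * amplitude))
    return mask / mask.std()                 # normalize variance to 1

rng = np.random.default_rng(0)
plane = np.full((200, 200), 128.0)           # hypothetical uniform image plane
noisy = gaussian_noise(plane, 15, rng)
mask = fractal_noise_mask(64, 3, rng)
fractal = plane[:64, :64] + 12 * mask        # highest fractal noise level, k = 12
```

Because the mask has unit variance, the multiplier *k* is directly the standard deviation of the added fractal noise in pixel-value units.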

*E* (between the original and transformed image) was recorded along with the response “correct” or “incorrect.” Although there were 6 discrete levels for each transformation, the computed values of *E* for each level of a given transformation varied according to the image. In order to fit psychometric functions, the *E*s were divided into 6 “bins” for each transformation. The first bin was set to have a minimum of zero, while the last, sixth bin was set to have a maximum equal to the maximum *E* for that transformation. The first bin “divider” was determined iteratively to be the value such that when the remaining bin dividers were logarithmically spaced, the between-bin variance in the number of trials was minimized. This method ensured that the trials were distributed as evenly as possible between bins under the constraint that all except the first bin were logarithmically spaced (because the first bin began at zero). After the *E*s were binned, the mean log *E,* proportion correct, and number of trials were calculated for each bin. The psychometric functions relating proportion correct to log *E* were fitted using the logistic function: 0.5 + 0.5 · exp[(log *E* − *a*)/*b*]/{1 + exp[(log *E* − *a*)/*b*]}, where *a* is the threshold at the 75% correct level and *b* is the slope. The fitting procedure used a weighting function given by the reciprocal of the binomial standard deviation *σ*_{i} = √[*p*_{i}(1 − *p*_{i})/*N*_{i}], where *p*_{i} and *N*_{i} are the proportion correct and number of trials for the *i*th log *E* level.
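A minimal sketch of the weighted logistic fit is given below. It takes the binomial standard deviation in its usual form, √[*p*(1 − *p*)/*N*], and applies its reciprocal as a weight on the residuals of a least-squares fit. How the weights entered the authors' fitting routine is not stated, so that choice is an assumption, as are the coarse grid search, the clipping of *p* to avoid a zero standard deviation, and the synthetic data.

```python
import numpy as np

def psychometric(log_e, a, b):
    """Proportion correct: 0.5 + 0.5 * exp(z) / (1 + exp(z)), z = (log E - a) / b."""
    z = (np.asarray(log_e, dtype=float) - a) / b
    return 0.5 + 0.5 / (1.0 + np.exp(-z))

def binomial_sd(p, n):
    """Binomial standard deviation of a proportion, sqrt(p (1 - p) / N)."""
    p = np.clip(p, 0.01, 0.99)   # sketch-only guard against a zero SD at p = 1
    return np.sqrt(p * (1.0 - p) / n)

def fit_psychometric(log_e, p_correct, n_trials, a_grid, b_grid):
    """Grid search for the (a, b) minimizing the weighted squared error."""
    weights = 1.0 / binomial_sd(p_correct, n_trials)
    best = None
    for a in a_grid:
        for b in b_grid:
            resid = weights * (p_correct - psychometric(log_e, a, b))
            err = float(np.sum(resid ** 2))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]

# Synthetic, noise-free data for six bins (hypothetical values).
log_e = np.linspace(-0.5, 1.0, 6)
n_trials = np.full(6, 50)
p_correct = psychometric(log_e, 0.3, 0.2)

a_hat, b_hat = fit_psychometric(log_e, p_correct, n_trials,
                                np.linspace(0.0, 0.6, 61),
                                np.linspace(0.05, 0.5, 46))
threshold_e = 10 ** a_hat        # threshold E, assuming base-10 logarithms
```

At log *E* = *a* the function evaluates to 0.75, which is why *a* can be read off directly as the 75% correct threshold.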

*E*. The threshold was calculated as the value of log *E* giving 75% correct (see Methods for details). Threshold *E*s (note: not log *E*s) for all transformations are shown in Figure 4a for Experiment 1 and Figure 4b for Experiment 2.

*E,*what the eye sees best is added noise. How much more sensitive our subjects are to added noise can be gleaned from a comparison of the Gaussian noise condition, which had the lowest thresholds, with the average of the geometric transformations, which had the highest thresholds.

*I*_{B} and transform it, say by rotation, to image *I*_{T}. Call the difference between these two images *I*_{D} = *I*_{T} − *I*_{B}. Any difference between two images (even a difference caused by an affine transformation) can be described in terms of this difference image. In the third control experiment, we compare the thresholds for detecting the increment versus the decrement of this difference image. That is, we compare the thresholds for

*I*_{T} is the incremental and *I*_{C} the control, decremental image, defined as

*I*_{T} shows a 1 deg rotation, whereas its counterpart image *I*_{C} appears to be edge-sharpened and is clearly much easier to detect. If indeed easier to detect, this suggests that the structure of the difference image is not in itself the main factor producing the relatively high thresholds for the geometric transformations.
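The logic of the control can be made concrete with a short sketch. The defining equation for *I*_{C} is not reproduced in this copy of the text, so the code assumes the natural reading of “decrement of the difference image,” namely *I*_{C} = *I*_{B} − *I*_{D}. The function name, the random baseline image, and the one-pixel shift standing in for a geometric transformation are all hypothetical.

```python
import numpy as np

def control_image(baseline, transformed):
    """Return I_C = I_B - I_D, where I_D = I_T - I_B (assumed definition)."""
    difference = transformed - baseline
    return baseline - difference

rng = np.random.default_rng(1)
i_b = rng.uniform(100, 150, size=(16, 16))   # hypothetical baseline image
i_t = np.roll(i_b, 1, axis=1)                # one-pixel shift as a stand-in transform
i_c = control_image(i_b, i_t)

dist_t = np.linalg.norm(i_t - i_b)           # distance of I_T from the baseline
dist_c = np.linalg.norm(i_c - i_b)           # distance of I_C from the baseline
```

Under this definition the two images differ from the baseline by the same difference image with opposite sign, so `dist_t` and `dist_c` are equal. Any difference in detectability therefore cannot be attributed to the magnitude or statistics of the difference image alone.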

*I*_{C} images, it was necessary to reduce the contrasts of the images by a factor of 3 to prevent pixel values going outside the 0–255 range (underflow/overflow) (see Figure 7). Pilot studies confirmed the impression obtained in Figure 7 that we are much more sensitive to the *I*_{C} transformation, and in order to obtain meaningful psychometric functions, we had to make the task more difficult by increasing the viewing distance to 2.8 m. We tested three geometric transformations: rotation, translation, and shear. Two of the authors (AO and FK) served as subjects.

*I*_{C} transformations are about 4 times lower than their conventional geometric transformation counterparts (note again the logarithmic spacing on the *y*-axis). This is conclusive evidence that the relatively high thresholds for the geometric transformations are not caused by the statistics of the difference between the baseline and transformed images.

*E* is not an accurate perceptual measure of image difference. However, as a physical measure, it can be used to compare sensitivities across all types of transformation. It is precisely the fact that threshold *E*s are so different for the geometric and noise transformations that we can appreciate that *E* is an inadequate predictor of perceived image difference.

*E* for a given unit of transformation (e.g., for a 5 deg rotation), though this fact in itself does not preclude the possibility that *E* could predict performance better than the unit of transformation itself. Thus, it is conceivable that a different set of images would produce different threshold *E*s. However, given that we sampled a large number of images with a variety of different types of natural scene, we are confident that had we used a completely different image set, the pattern of results would nevertheless be the same.

*consistent* magnitude and/or phase changes in local wavelet coefficients. It would be interesting to see how well the similarity metric predicts the results of the present study. Wang and Simoncelli point out that small scaling and rotation of images can be locally approximated by translation. This may be the reason why our stretch and shear conditions, which unlike the other geometric transformations distort the images in a less common way, nevertheless produce comparable thresholds. In other words, the process involved in perceptual invariance may be computed over relatively small regions of the image.

they occur whenever we move our bodies or eyes. The uniform photometric transformations would normally arise from physical changes in the scene itself and are thus less frequently experienced. Another possibility is that the visual system prefers not to discard information about uniform photometric changes because they provide important information about the illuminant, which some recent studies have suggested is encoded for the purpose of color constancy (Golz & MacLeod, 2002; Maloney, 2002; Smithson, 2005; Zaidi, 2001).