A number of studies have demonstrated that people often integrate information from multiple perceptual cues in a statistically optimal manner when judging properties of surfaces in a scene. For example, subjects typically weight the information provided by each cue to a degree that is inversely proportional to the variance of the distribution of a scene property given the cue's value. We wanted to determine whether subjects similarly use information about the reliabilities of arbitrary low-level visual features when making image-based discriminations, as in visual texture discrimination. To investigate this question, we developed a modified classification image technique and conducted two experiments that explored subjects' discrimination strategies using it. We created a basis set consisting of 20 low-level features and generated stimuli by linearly combining the basis vectors. Subjects were trained to discriminate between two prototype signals corrupted with Gaussian feature noise. When we analyzed subjects' classification images over time, we found that they modified their decision strategies in a manner consistent with *optimal feature integration,* giving greater weight to reliable features and less weight to unreliable features. We conclude that optimal integration is not a characteristic specific to conventional visual cues or to judgments involving three-dimensional scene properties. Rather, just as researchers have previously demonstrated that people are sensitive to the reliabilities of conventionally defined cues when judging the depth or slant of a surface, we demonstrate that they are likewise sensitive to the reliabilities of arbitrary low-level features when making image-based discriminations.

*θ̂*_{1}, …, *θ̂*_{n} from a set of class-conditionally independent cues (i.e., cues *c*_{1}, …, *c*_{n} that are conditionally independent given the scene parameter of interest, so that *P*(*c*_{1}, …, *c*_{n} ∣ *θ*) = Π_{i}^{n} *P*(*c*_{i} ∣ *θ*)) consists of taking a weighted average of the individual cue estimates, *θ̂* = Σ_{i} *ω*_{i}*θ̂*_{i} (where *ω*_{i} ∝ 1/*σ*_{i}^{2}). Researchers have found that, across a variety of perceptual tasks, human observers seem to base their perceptual judgments on just such a strategy. While most of these cue integration studies have focused on strategies used by observers in stationary environments, several (Atkins et al., 2001; Ernst, Banks, & Bülthoff, 2000; Jacobs & Fine, 1999) have investigated how observers change their cue integration strategies after receiving training in virtual environments in which a perceptual cue to a scene variable is artificially manipulated to be less informative with respect to that variable. In one of these studies, Ernst et al. (2000) manipulated either the texture- or disparity-specified slant of a visually presented surface to indicate a slant value that was uncorrelated with the haptically defined orientation of the surface. The authors found that after receiving training in this environment, subjects' perceptions of slant changed such that, in a qualitatively similar fashion to the ideal observer, they gave less weight to the slant estimate of the now less reliable visual cue.
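As a concrete illustration, inverse-variance cue weighting can be sketched in a few lines of Python. The cue estimates and variances below are hypothetical numbers chosen for the example, not data from any study.

```python
# Inverse-variance ("optimal") cue combination: each cue's estimate is weighted
# in proportion to its reliability, i.e. the inverse of its noise variance.
def combine_cues(estimates, variances):
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    weights = [w / total for w in weights]                # normalize to sum to 1
    combined = sum(w * e for w, e in zip(weights, estimates))
    # The combined estimate is more reliable than any single cue:
    combined_var = 1.0 / sum(1.0 / v for v in variances)
    return combined, combined_var

# Hypothetical slant estimates: a reliable cue (variance 1) and a noisy cue (variance 4).
est, var = combine_cues([30.0, 36.0], [1.0, 4.0])   # est = 31.2, var = 0.8
```

Note that the combined variance (0.8) is below the variance of even the better cue, which is the statistical benefit of integration.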

*image-based* (i.e., rather than 3D scene-parameter-based) discriminations. To characterize the learning obtained in such tasks using the ideal observer cue combination framework described above, we must first deal with several conceptual and methodological issues. The first of these issues concerns the seemingly disparate nature of 3D cue combination tasks on the one hand and simple image-based discrimination tasks on the other. Consider, for example, the slant discrimination task described in the previous paragraph. In this case, the slant of the surface is defined visually by two conventional and well-understood cues to surface slant: texture foreshortening and binocular disparity. In a texture discrimination task, however, subjects are not trying to determine the value of some surface parameter such as slant. Instead, they must determine to which of two arbitrarily defined categories a presented texture belongs. What are the cues in this task? Of course, these textures will differ along some set of image features, and the subject can identify and use these features as “cues” to the texture category. But do such features function as cues in the same sense as texture foreshortening and binocular disparity? The current study was designed to address this question. We were interested in determining whether the optimal integration of cues described in cue combination studies such as that of Ernst et al. (2000) is a special property of the limited set of conventionally defined visual cues (e.g., texture compression and disparity gradient cues for slant) or whether people are likewise sensitive to, and capable of exploiting, the relative reliabilities of arbitrarily defined cues, such as the low-level features involved in image-based discriminations.

*optimal feature combination* (i.e., in a manner analogous to optimal cue combination). In both experiments, subjects viewed and classified stimuli consisting of noise-corrupted images. The stimuli used in each experiment were generated within a 20-dimensional feature space whose noise covariance structure was varied across conditions. In Experiment 1, subjects were trained to discriminate between two stimuli corrupted with white Gaussian feature noise, and their classification images were calculated over time. When we examined these classification images, we found that, with practice, they approached that of the ideal observer. In addition, this improvement in their classification images correlated highly with their increase in performance efficiency, accounting for most of the variance in their performance. In Experiment 2, the variance of the corrupting noise was made anisotropic, such that some features were noisier, and thus less reliable in determining the stimulus class, than others. In the first half of the experiment, half of the features were made reliable and the other half unreliable. In the second half of the experiment, this relationship was reversed, so that the features that had heretofore been reliable were now unreliable, and vice versa. When we examined the classification images calculated for each subject over time, we found that subjects modified their decision strategies in a manner consistent with optimal feature combination, giving higher weights to reliable features and lower weights to unreliable features. The results of Experiment 1 suggest that subjects' learning in these texture discrimination tasks consists primarily of improvements in the optimality of their discriminant functions, while the results of Experiment 2 suggest that in learning these discriminant functions, subjects are able to exploit information about the reliabilities of individual features.

*classification image,* used by human observers performing a binary perceptual discrimination task. To discover this template *T* for an individual observer, the researcher adds random pixel noise **ε**^{(t)} ∼ *N*(0, **I**) to the signal **s**^{(t)} ∈ {**s**_{0}, **s**_{1}} presented on each trial *t*. The researcher can then calculate the observer's classification image by simply correlating the noise added on each trial with the classification *r*^{(t)} ∈ {−1, 1} indicated by the observer. These classification images reveal the stimulus components used by observers in making perceptual discriminations. Over the past decade, this classification image technique has proven quite useful; researchers have used this technique (or variants thereof) to determine the templates used by observers in a variety of different tasks (e.g., Abbey & Eckstein, 2002; Ahumada, 1996; Levi & Klein, 2002; Lu & Liu, 2006), to compare these observer classification images to those calculated for an ideal observer (optimal templates), and to investigate how these classification images change with learning (e.g., Beard & Ahumada, 1999; Gold, Sekuler, & Bennett, 2004). Despite these successes, the method does suffer from some shortcomings.
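The noise–response correlation at the heart of the technique is simple enough to simulate directly. In the sketch below the "observer" is hypothetical: it applies a fixed linear template to the noise and responds with the sign of the result (the signal term is omitted for brevity, so only the noise drives the responses).

```python
import random

# Simulate the classification-image estimate: correlate the noise added on each
# trial with the observer's binary response r ∈ {-1, +1}.
random.seed(0)
n_pix, n_trials = 16, 20000
template = [1.0 if i < n_pix // 2 else -1.0 for i in range(n_pix)]  # observer's true template

cls_image = [0.0] * n_pix
for _ in range(n_trials):
    noise = [random.gauss(0.0, 1.0) for _ in range(n_pix)]            # ε(t) ~ N(0, I)
    r = 1 if sum(t * e for t, e in zip(template, noise)) > 0 else -1  # observer's response
    for i in range(n_pix):
        cls_image[i] += r * noise[i] / n_trials                       # noise-response correlation

# cls_image now approximates the observer's template (up to a scale factor).
```

Even in this idealized setting, thousands of trials are needed before the per-pixel estimates reliably recover the template's sign, which previews the efficiency problem discussed next.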

^{2} regression coefficients plus a bias term). Consequently, researchers require thousands of trials to obtain a reasonable classification image for a single observer, and the correlation of the resulting images with the optimal templates is generally quite low due to the poor sampling of the stimulus space and the concomitant paucity of data points (Gold et al., 2004). Several researchers have attempted to remedy this problem and to boost the significance of such comparisons by restricting the final analysis to select portions of the classification image (e.g., Gold et al., 2004), by averaging across regions of the image (e.g., Abbey & Eckstein, 2002; Abbey, Eckstein, & Bochud, 1999), or by using a combination of these methods (e.g., Chauvin, Worsley, Schyns, Arguin, & Gosselin, 2005). Such measures work by effectively reducing the dimensionality of the stimulus space: instead of calculating regression coefficients for each pixel, researchers calculate a much smaller number of coefficients for various linear combinations of pixels. Essentially, these researchers add the signal-corrupting noise in pixel space but perform their analyses in terms of a lower-dimensional basis space.

*a priori*.

^{1}In addition to its simplicity, this approach has several advantages over traditional methods. First, by specifying the bases in advance, we can limit the added noise **ε** to the subspace spanned by these bases, ensuring that (1) the noise is white and densely sampled in this subspace, and (2) only features within the spanned subspace contribute to the observer's decisions (i.e., because all stimulus variance is contained within this subspace). Second, because we specify the bases in advance, we can select these bases in an intelligent way, representing only those features that observers are likely to find useful in making discriminations, such as those features that contain information relevant to the task (i.e., features that vary across the stimulus classes).^{2} Finally, this approach makes it possible to manipulate the variance of the noise added to different features and thus to vary the reliabilities of these features. This allows us to investigate how observers combine information from different features using methods similar to those that have been used in studying perceptual cue combination.

**g**^{(t)} represent the stimulus presented on trial *t*. Ahumada's technique generates these stimuli as

**g**^{(t)} = **s**^{(t)} + **ε**^{(t)},    (1)

where **s**^{(t)} and **ε**^{(t)} are defined as above. If we explicitly represent the use of pixels as bases using the matrix **P**, whose columns consist of the *n*-dimensional set of standard bases, we can rewrite Equation 1 in a more general form as

**g**^{(t)} = **P**(**s**^{(t)} + **ε**^{(t)}).    (2)

In this form, **P** is equivalent to the identity matrix **I**_{n}. It should be clear, however, that by applying the appropriate linear transformation *T*: **P** → **B** to the stimuli **s**^{(t)}, we can exchange **P** for an arbitrary basis set **B** to generate stimulus images in the space spanned by **B**. This is represented by our generative model

**g**^{(t)} = **B**(*μ*^{(t)} + **η**^{(t)}) + **k**,    (3)

where *μ*^{(t)} ∈ {*μ*_{A}, *μ*_{B}} represents a prototype stimulus **s** expressed in terms of the basis set, and **η**^{(t)} ∼ *N*(0, **I**) represents Gaussian noise added in the basis space. (Note that when **B** = **P**, Equation 3 is equivalent to Equation 2, with *μ*^{(t)} = **s**^{(t)}, **k** = 0, and **η**^{(t)} and **ε**^{(t)} distributed identically.) The only new term is the constant vector **k**, which is important here because it provides additional flexibility in choosing the bases that make up **B**.^{3} In particular, this constant term allows us to represent constant (noiseless) features in pixel space that do not exist in the space spanned by **B**. Figures 1 and 2 illustrate this generative model for a pair of example stimuli. Here the task requires classifying a presented stimulus as an instance of stimulus A (square) or stimulus B (circle). All of the information relevant to this discrimination lies in the difference image (the rightmost image in Figure 1). The image shown to the left of this difference image (third from left) represents the part of the stimulus that remains constant across stimulus classes. Representing this part of the stimulus as **k** allows us to focus on selecting bases **B** that can adequately represent the difference image. Figure 2 shows example stimuli **g** generated for this task using the models described in Equation 2 (top of Figure 2) and Equation 3 (bottom of Figure 2).
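The generative model g = B(μ + η) + k is straightforward to implement. The following sketch uses made-up dimensions (4 basis features rendered into 8 "pixels"); the matrix `B`, prototype `mu_A`, and constant `k` are arbitrary placeholders, not the actual bases used in the experiments.

```python
import random

# Sketch of the generative model: g = B(mu + eta) + k, with feature-space noise eta.
random.seed(1)
m, n = 4, 8
B = [[random.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]   # n x m basis matrix
mu_A = [1.0, -1.0, 1.0, -1.0]                                        # prototype in feature space
k = [0.5] * n                                                        # constant (noiseless) component

def generate_stimulus(mu, sigma=1.0):
    eta = [random.gauss(0.0, sigma) for _ in range(m)]                # eta ~ N(0, sigma^2 I)
    return [sum(B[i][j] * (mu[j] + eta[j]) for j in range(m)) + k[i]  # B(mu + eta) + k
            for i in range(n)]

g = generate_stimulus(mu_A)   # one noise-corrupted stimulus in "pixel" space
```

Because the noise is added to the basis coefficients rather than to the pixels, all stimulus variability stays inside the subspace spanned by the columns of `B`, as described above.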

*P*(*C*_{i}) and likelihood functions *P*(**x** ∣ *C*_{i}) for both stimulus classes *C*_{i}, *i* ∈ {A, B}. Using Bayes' rule, the probability that an image **x** belongs to class A is

*P*(*C*_{A} ∣ **x**) = 1 / (1 + *e*^{−f(**x**)}), where *f*(**x**) = ln[*P*(**x** ∣ *C*_{A})*P*(*C*_{A}) / *P*(**x** ∣ *C*_{B})*P*(*C*_{B})].    (6)

It can be shown that the log odds *f*(**x**) in Equation 6 is linear in **x**. The stimuli presented on each trial are drawn from a multivariate Gaussian representing one of the two signal categories. Therefore, we can express the likelihood terms in Equation 6 as

*P*(**x** ∣ *C*_{i}) = (2*π*)^{−m/2} ∣**Σ**∣^{−1/2} exp[−½(**x** − *μ*_{i})^{T}**Σ**^{−1}(**x** − *μ*_{i})],    (7)

where *μ*_{i}, *i* ∈ {A, B}, is the mean (prototype) for class *i*, **Σ** is the common covariance matrix for both classes, and *m* is the dimensionality of the stimulus space. Plugging these likelihoods into Equation 6 yields an expression showing that *f*(**x**) is indeed linear in **x**:

*f*(**x**) = **w**^{T}**x** + *b*, with **w** = **Σ**^{−1}(*μ*_{A} − *μ*_{B}).    (10)

Note that in the special case of white noise (**Σ** = *σ*^{2}**I**) the optimal template is proportional to the difference between the signal category prototypes. Note also the similarity of Equation 10 to the result *w*_{i} ∝ 1/*σ*_{i}^{2} from optimal cue combination. We exploit this relationship in the design of Experiment 2.
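The resulting decision rule is compact in code. The sketch below assumes equal priors and symmetric prototypes (μ_B = −μ_A, as in our experiments), so the bias term vanishes; the prototype values and variances are hypothetical.

```python
# Optimal linear template for two Gaussian classes sharing a diagonal covariance:
# w = Sigma^{-1} (mu_A - mu_B); classify by the sign of w . x.
mu_A = [0.5, 0.5, 0.5]
mu_B = [-0.5, -0.5, -0.5]           # mu_B = -mu_A, so the bias term is zero
variances = [1.0, 4.0, 0.25]        # diagonal of Sigma: unequal feature reliabilities

w = [(a - b) / v for a, b, v in zip(mu_A, mu_B, variances)]   # Sigma^{-1}(mu_A - mu_B)

def classify(x):
    return 'A' if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 'B'
```

Note how the weight on the noisy second feature is suppressed while the weight on the reliable third feature is amplified, which is exactly the inverse-variance behavior exploited in Experiment 2.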

^{2} matched the mean luminance of the images. All of the stimuli were constructed as linear combinations of the set of basis “features” illustrated in Figure 3.

*i,* image *i* was modified via gradient descent to be maximally smooth and to be orthogonal to images 1 through (*i* − 1). These orthogonality constraints interacted with the smoothness constraint to produce images that were localized in spatial frequency content, such that the first bases produced by our method contained low frequencies and subsequently added bases contained increasingly higher frequencies. We randomly selected 20 of the 50 images to form the basis set that we used to construct our stimuli.^{4}

*μ*_{B} = −*μ*_{A}). To obtain images of the prototypes, these vectors were multiplied by the matrix representing the 20 basis features, and a constant image was added, consisting of the mean luminance plus an arbitrary image constructed in the null space of the basis set (the addition of this arbitrary image prevented the prototypes from appearing simply as contrast-reversed versions of the same image). Finally, the prototypes were upsampled to yield 256 × 256 pixel images. We created only one set of prototypes, and all subjects saw the same set (Figure 4).

*η*^{(t)}. The noise masks, like the prototypes, were generated as a linear combination of the basis features. However, for the noise masks, the linear coefficients were sampled from a multivariate Gaussian distribution *η*^{(t)} ∼ *N*(0, *σ*^{2}**I**). Values that deviated more than 2*σ* from the mean were resampled. The RMS contrast of the signal and the noise mask were held constant at 5.0% and 7.5%, respectively.
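The coefficient sampling just described, with resampling of draws beyond 2σ, can be sketched as follows (the σ value and feature count are placeholders for illustration):

```python
import random

# Draw a noise coefficient from N(0, sigma^2), resampling any value that
# deviates more than 2*sigma from the mean (simple rejection sampling).
def truncated_gauss(sigma, rng):
    while True:
        v = rng.gauss(0.0, sigma)
        if abs(v) <= 2.0 * sigma:
            return v

rng = random.Random(2)
coeffs = [truncated_gauss(1.5, rng) for _ in range(20)]   # one mask's 20 feature coefficients
```

Rejection sampling is the simplest way to realize this truncation; only about 4.6% of Gaussian draws fall outside ±2σ, so the loop rarely repeats.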

Subject | *r*(df) | *p*
---|---|---
WHS | r(10) = 0.7657 | <0.005
RAW | r(10) = 0.8518 | <0.001
BVR | r(10) = 0.8126 | <0.005
SKL | r(10) = 0.3745 | >0.05

- Can subjects learn to discriminate texture stimuli generated in our basis space?
- How well do improvements in discrimination performance correlate with the optimality of an observer's classification image?
- How efficient is our method? That is, how many trials are required to estimate a subject's classification image?

*d*′ in each session with the total number of trials completed at the end of that session.

**w**_{obs}^{T}**w**_{ideal}/(∣∣**w**_{obs}∣∣ ∣∣**w**_{ideal}∣∣)) between the subject's classification image **w**_{obs} and that of the ideal observer **w**_{ideal} across time. Normalized cross-correlation is often used to represent the degree of “fit” between two templates (e.g., Gold et al., 2004; Murray, 2002). The “fit” in this case is indicative of the optimality of the template used by a particular subject, and we thus refer to the square of the normalized cross-correlation as the subject's *template efficiency* (Figure 6, dashed curve). We also calculated subjects' discrimination efficiencies [*d*′_{obs}/*d*′_{ideal}]^{2} (Geisler, 2003) for each session to compare the performances of subjects to that of the ideal observer. Finally, we correlated each subject's discrimination and template efficiencies across sessions to measure how improvements in discrimination performance correlate with improvements in the optimality of the subject's classification image. The resulting correlation coefficients and significance statistics appear at the top of the plots in Figure 6. The correlations are quite strong, indicating that increases in subjects' discrimination efficiencies are well explained by the observed improvements in their templates. This finding corroborates a qualitatively similar finding by Gold et al. (2004).
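The two efficiency measures are simple to compute. In this sketch the function names and input vectors are our own illustrative choices; template efficiency is the squared normalized cross-correlation, and discrimination efficiency is the squared d′ ratio.

```python
import math

def template_efficiency(w_obs, w_ideal):
    """Squared normalized cross-correlation between observer and ideal templates."""
    dot = sum(a * b for a, b in zip(w_obs, w_ideal))
    norm = math.sqrt(sum(a * a for a in w_obs)) * math.sqrt(sum(b * b for b in w_ideal))
    return (dot / norm) ** 2

def discrimination_efficiency(dprime_obs, dprime_ideal):
    """[d'_obs / d'_ideal]^2 (Geisler, 2003)."""
    return (dprime_obs / dprime_ideal) ** 2

te = template_efficiency([1.0, 0.0], [1.0, 1.0])   # templates 45 degrees apart: 0.5
de = discrimination_efficiency(1.2, 2.0)            # observer at 36% of ideal efficiency
```

Both quantities lie in [0, 1] when the observer's template points in roughly the ideal direction, which is what makes the session-by-session correlation between them interpretable.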

^{5}can be accounted for by improvements in their classification images, so that changes in subjects' discrimination strategies over time can largely be characterized by calculating their classification images. Together, these characteristics indicate that our method is suitable for determining how observers change their discrimination strategies as a perceptual task is modified.

*optimal feature combination* (i.e., in a manner analogous to optimal cue combination). We investigated this question by manipulating the reliabilities of different features with respect to discrimination judgments like those made by subjects in Experiment 1. Changes made to the relative reliabilities of different features result in corresponding changes to the optimal decision template. By calculating the classification images used by subjects across such manipulations, we can determine whether observers are sensitive to the reliabilities of individual features and modify their templates accordingly. The idea, illustrated in Figures 7B and 7C, is to change the optimal template across two phases of the experiment by modifying only the variance structure of the noise. If observers use information about feature variance in performing discrimination tasks, then we should observe a change in their classification images between the first and the second phases of the experiment. After the transition, observers' templates should move away from that predicted by the optimal template for the first set of reliable versus unreliable features, and toward that predicted by the optimal template for the second set. We expected that subjects would take feature reliabilities into account when making discriminations, resulting in classification images that give greater weight to reliable features and lower weight to unreliable features.

**Σ** was not the identity matrix. Observers performed 24 sessions of 300 trials each over 6 days.

*b*_{i} by manipulating its variance *σ*_{i}^{2} in the noise covariance matrix **Σ**. Equation 10 establishes the relationship between the noise covariance and the optimal template. Exploiting the facts that **Σ** is a diagonal matrix and that *μ*_{B} = −*μ*_{A}, we can express the individual elements of **w** as

*w*_{i} = 2*μ*_{A,i}/*σ*_{i}^{2},

where *σ*_{i}^{2} represents the *i*th diagonal element of **Σ**. Note that this is similar to the result obtained for optimal weighting of independent cues in the literature on cue combination (e.g., Landy, Maloney, Johnston, & Young, 1995; Yuille & Bülthoff, 1996). The difference here is that instead of simply weighting each feature in proportion to its reliability (i.e., inverse variance), there is an added dependency on the class means, such that observers must weight each feature in proportion to its mean-difference-weighted reliability. In the current study, we removed this dependency by choosing the elements of *μ*_{A} such that their magnitudes are all equal (i.e., ∣*μ*_{A,i}∣ = ∣*μ*_{A,j}∣ ∀ *i*, *j* ≤ *m*), so that the weights composing the optimal template are indeed inversely proportional to the variances of their associated features.^{6} Figure 7 illustrates this dependency for a simple stimulus space consisting of two feature dimensions *x*_{1} and *x*_{2}.
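The reliability swap and its effect on the optimal template can be sketched numerically. This toy example uses 4 features rather than the experiment's 20, with placeholder values for the prototype and noise standard deviations.

```python
# With mu_B = -mu_A and diagonal Sigma, the optimal weights are w_i = 2*mu_A[i]/sigma_i^2.
mu_A = [0.5, 0.5, 0.5, 0.5]               # equal-magnitude prototype elements
sigmas_half1 = [1.0, 1.0, 5.0, 5.0]       # first half: features 0, 1 reliable
sigmas_half2 = [5.0, 5.0, 1.0, 1.0]       # second half: reliabilities swapped

def optimal_weights(mu, sigmas):
    return [2.0 * m / (s * s) for m, s in zip(mu, sigmas)]

w1 = optimal_weights(mu_A, sigmas_half1)  # reliable features dominate: [1.0, 1.0, 0.04, 0.04]
w2 = optimal_weights(mu_A, sigmas_half2)  # after the swap: [0.04, 0.04, 1.0, 1.0]
```

The 25-fold change in each weight across the swap is what makes the predicted template change large enough to detect in subjects' classification images.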

*σ*_{unreliable} = 5 while *σ*_{reliable} = 1). In the second half of the experiment, the roles of these two sets of features were swapped such that the reliable features were made unreliable and the unreliable features were made reliable. Importantly, the sets of reliable and unreliable features were chosen randomly for each subject, so that the pair of covariance matrices for the first (**Σ**_{1}) and second (**Σ**_{2}) halves of the experiment were unique to a subject.

**Σ**_{1} and **Σ**_{2}, respectively. Thus, we defined two optimal templates for each subject: one for the generative model used in sessions 1–12 (**w**_{ideal1}, appropriate for **Σ**_{1}) and one for the generative model used in sessions 13–24 (**w**_{ideal2}, appropriate for **Σ**_{2}). Figure 8 plots the normalized cross-correlation between the calculated classification image **w**_{obs} and the templates **w**_{ideal1} (solid lines) and **w**_{ideal2} (dashed lines) for each of the four subjects as a function of the number of trials. Figure 9 displays the visible change in the classification images used by subjects between the first and second halves of the experiment.

**w**_{1} when the noise covariance structure was defined by **Σ**_{1}, and modifying their templates to more closely match **w**_{2} during the second half of the experiment, when the covariance structure was defined by **Σ**_{2}. To quantify these results, we compared the average difference between the template fits, wfit_{2} − wfit_{1} (where wfit_{i} represents the normalized cross-correlation between template **w**_{i} and a subject's classification image), across the first and second halves of the experiment using a *t*-test. These differences are plotted in Figure 10, and the corresponding significance statistics are displayed in Table 2.

Subject | *t*(df) | *p*
---|---|---
DLG | t(5) = −5.7661 | <0.005
JDG | t(5) = −3.3911 | <0.05
MSB | t(5) = −13.3369 | <0.0001
MKW | t(5) = −27.4861 | <0.00001
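The comparison of template-fit differences across experiment halves can be sketched with a paired t-test. The wfit2 − wfit1 values below are invented purely to illustrate the computation; they are not subjects' data.

```python
import math

# Per-day averages of wfit2 - wfit1 for the two halves of a hypothetical experiment.
half1 = [-0.30, -0.25, -0.35, -0.28, -0.32, -0.27]   # days 1-6 (fits favor w1)
half2 = [0.20, 0.15, 0.25, 0.18, 0.22, 0.17]          # days 7-12 (fits favor w2)

def paired_t(a, b):
    """Paired-samples t statistic and degrees of freedom (n - 1)."""
    n = len(a)
    diffs = [y - x for x, y in zip(a, b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # unbiased sample variance
    return mean / math.sqrt(var / n), n - 1

t, df = paired_t(half1, half2)   # a large |t| indicates the fits shifted between halves
```

With six per-day averages, this yields df = 5, matching the degrees of freedom reported in Table 2.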

*optimal feature combination* (i.e., in a manner analogous to optimal cue combination). In both experiments, subjects viewed and classified stimuli consisting of noise-corrupted images. The stimuli used in each experiment were generated within a 20-dimensional feature space whose noise covariance structure varied across conditions. In Experiment 1, subjects were trained to discriminate between two stimuli corrupted with white Gaussian feature noise, and their classification images were calculated over time. Examination of these classification images reveals that, with practice, subjects' decision templates approached that of the ideal observer. Moreover, this improvement in their classification images correlated highly with their increase in performance efficiency, accounting for between 65% and 80% of the variance in their performance. Consistent with the findings of Gold et al. (2004), these results suggest that the learning demonstrated in these perceptual discrimination tasks consists primarily of observers improving their discriminant functions to more closely match the optimal discriminant function.

*normative* approach to modeling what observers learn through practice with a perceptual discrimination task. This approach focuses on the structure of the task that an observer must solve, on the relevant information available to the observer, and on the fundamental limits that these factors place on the observer's performance. In contrast to process-level models of perceptual learning (e.g., Bejjanki, Ma, Beck, & Pouget, 2007; Lu & Dosher, 1999; Otto, Herzog, Fahle, & Zhaoping, 2006; Petrov, Dosher, & Lu, 2005; Teich & Qian, 2003; Zhaoping, Herzog, & Dayan, 2003), the normative approach used here is largely agnostic with respect to either physiological or algorithmic implementation details (Marr, 1982). Our results demonstrate that people can learn to use information about the covariance structure of a set of arbitrary low-level visual features. We leave the question of how this learning is implemented in the brain as a problem for future work.

^{1}Several researchers (e.g., Olman & Kersten, 2004; Li, Levi, & Klein, 2004) have previously introduced lower-dimensional methods for calculating classification images (or classification objects). Note however that the approaches used in these papers differ from the approach used in the current paper in that they obtain this reduction in dimensionality by assuming that observers have direct access to geometric scene configurations rather than to the photometric input (e.g., pixel intensities) that subjects actually observe. In Li et al. (2004), the authors implicitly assume that observers have direct access to an array whose entries represent the positions of the elements making up a Vernier stimulus and that they make decisions based on this vector of positions rather than on the pattern of luminances within the image. Similarly, Olman and Kersten (2004) assume that observers have direct access to variables describing the geometry of the scene (e.g., foot spread, tail length, tail angle, neck length). In these two studies, the stimuli are defined directly in terms of scene variables—though subjects in fact observe these variables through images—and the resulting classification images are linear in the geometrical object space, but not in image space. These approaches may be more useful than image-based approaches for investigating how observers make discriminations in tasks involving representations of three-dimensional scenes (as in Olman & Kersten, 2004) when researchers have an adequate understanding of the internal representations used by observers.

^{2}Simoncelli, Paninski, Pillow, and Schwartz (2004) provide an extended discussion regarding the importance of stimulus selection in the white noise characterization of a signal processing system. Though they are concerned in particular with characterizing the response properties of neurons, their points apply equally well to the challenges involved in characterizing the responses of human observers in a binary discrimination task. Olman and Kersten (2004) provide a related discussion that proposes extending noise characterization techniques to deal with more abstract (i.e., non-photometric) stimulus representations.

^{3}The constant **k** is used to represent any constant component of the image. In fact, because luminance values cannot be negative, traditional approaches to classification images implicitly include a **k** in the form of a mean luminance image (e.g., a vector of identical positive pixel luminance values).

^{4}Contrast sensitivity functions were not measured directly for each subject. Instead, for the sake of expediency, we used the model of human contrast sensitivity proposed by Mannos and Sakrison (1974), which describes the sensitivity of a human observer, generically, as *A*(*f*) = 2.6(0.0192 + 0.114*f*)*e*^{−(0.114f)^{1.1}}.
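For reference, the Mannos and Sakrison (1974) model is a one-line function; the frequency range scanned below is an arbitrary choice for the example.

```python
import math

# Mannos & Sakrison (1974) contrast sensitivity model, f in cycles per degree:
# A(f) = 2.6 (0.0192 + 0.114 f) exp(-(0.114 f)^1.1)
def csf(f):
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))

peak = max(range(1, 60), key=csf)   # sensitivity peaks near 8 cycles/degree
```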

^{5}These estimates of explained variance are obtained using the correlation between the normalized cross-correlations (**w**_{obs}^{T}**w**_{ideal}/∣∣**w**_{obs}∣∣ ∣∣**w**_{ideal}∣∣) and the sensitivity ratio *d*′_{obs}/*d*′_{ideal}. Unlike in Figure 6, these values were not squared. Squaring the sensitivity measure is necessary for an information-theoretic interpretation of efficiency, but removes information about some of the correlation between observers' template fits and sensitivities (e.g., classification images that point in the wrong direction yield sensitivities below zero). The *r*^{2} values resulting from this correlation are: 0.80 (BVR), 0.76 (RAW), 0.66 (WHS), and 0.81 (SKL).