One major challenge in the sensory sciences is to identify the stimulus features on which sensory systems base their computations and which are predictive of a behavioral decision: such features are a prerequisite for computational models of perception. We describe a technique (decision images) for extracting predictive stimulus features using logistic regression. A decision image not only defines a region of interest within a stimulus but is a quantitative template which defines a direction in stimulus space. Decision images thus enable the development of predictive models, as well as the generation of optimized stimuli for subsequent psychophysical investigations. Here we describe our method and apply it to data from a human face classification experiment. We show that decision images are able to predict human responses not only in terms of overall percent correct but also in terms of the probabilities with which individual faces are (mis-)classified by individual observers. We show that the most predictive dimension for gender categorization is aligned neither with the axis defined by the two class means nor with the first principal component of all faces—two hypotheses frequently entertained in the literature. Our method can be applied to a wide range of binary classification tasks in vision or other psychophysical contexts.

*reverse correlation* approach exist; for an overview see Wu, David, and Gallant (2006). In psychophysics, noise-based reverse correlation techniques are known as *classification image* methods (Ahumada & Lovell, 1971; Beard & Ahumada, 1998; Eckstein & Ahumada, 2002; Knoblauch & Maloney, 2008; Ross & Cohen, 2009). In cases where classification images adequately predict behavior, they can also give insights into the neural mechanisms involved in a psychophysical task (Nienborg & Cumming, 2009). To stress the similarity of this feature identification approach to reverse correlation in neurophysiology, some authors also refer to this approach as *perceptive field* estimation (Neri & Levi, 2006). In addition to classification image techniques using (white) noise, a variant of reverse correlation termed the *bubbles technique* has recently gained some popularity in the vision sciences (Dupuis-Roy, Fortin, Fiset, & Gosselin, 2009; Gosselin & Schyns, 2001; Murray & Gold, 2004; Schyns, Gosselin, & Smith, 2009).

*clearly visible* features analyzed led us astray. Finally, while the bubbles technique localizes important regions in a stimulus, it does not yield a predictive model which can be used to predict human responses to novel stimuli.

*decision images* (Kienzle, Franz, Schölkopf, & Wichmann, 2009; Wichmann, Graf, Simoncelli, & Schölkopf, 2005; see Yovel, Franz, Stilz, & Schnitzler, 2008, for the auditory domain). The central idea is to train a statistical classification algorithm on the responses provided by human subjects in a psychophysical task (here: perceived gender) instead of on ground truth (real gender). In this way, we obtain decision images which resemble the internal decision space of individual human subjects, rather than a decision space that optimally separates the physical stimulus classes, as is done when training learning machines on problems in engineering (Gray et al., 1995; Moghaddam & Yang, 2002). Clearly, if human observers perform above chance level, the extracted decision boundaries will be correlated with the (physically) optimal decision boundaries. We therefore perform additional analyses to show that the extracted decision models do not merely reflect the optimal features, but rather observer-specific decision boundaries.

Each of the *n* stimuli *s*_{i}—typically an image in visual psychophysics, but equally a sound in auditory psychophysics—is presented multiple times (*m* presentations). At each presentation *j*, the observer responds with a binary decision *y*_{ij} ∈ {1, −1}, depending on whether they perceive the stimulus as belonging to class −1 or 1—female or male in the case of the experiments reported herein. We assume decisions to be stochastic, i.e. that observers respond with probability *P*(*Y*_{i} = 1 ∣ *s*_{i}) =: *p*(*s*_{i}). Statistically speaking, we assume each trial of the experiment to be an independent Bernoulli trial, an assumption that is widespread in psychophysics (e.g. Wichmann & Hill, 2001a). Here we estimate the decision probabilities by using the empirical estimate p̂(*s*_{i}) = (1/*m*) #{*j*: *y*_{ij} = 1}, but our method can also be applied in situations where the p̂(*s*_{i})'s are obtained in a different manner, e.g. by explicit uncertainty estimates provided by the observers. The use of *m* multiple presentations allows for the quantitative assessment of the resulting model, i.e. how much of the variability in the human responses is accounted for (Murray, Bennett, & Sekuler, 2002).

Given a stimulus *s*_{i}, we want to obtain a prediction of the corresponding *p*(*s*_{i}). This task is analogous to fitting psychometric functions (Wichmann & Hill, 2001a), with the difference that here *s* is high-dimensional rather than univariate. We assume that the relationship between the stimulus and the decision probability can be represented by the combination of a linear filter *ω* and a static nonlinearity *f*, i.e. by a model of the form

*p*(*s*) = *f*(*ω*^{⊺}*s* + *β*).  (1)

Here, *ω* is a vector of the same dimensionality as *s*. (If the stimulus *s* is an image of arbitrary dimension *K* × *Z*, as in visual psychophysics, we simply re-shape *s* to be a vector of dimension *KZ* × 1.) The scalar *β* is a bias term which can be used to shift the decision boundary *p*(*s*) = 0.5 closer to one class or the other. To calculate the predicted classification probability, we take an inner product between the stimulus *s*_{i} and *ω*, add *β*, and apply a static nonlinearity *f* to the result. In our psychophysical application we refer to *ω* as the *decision image*; in neurophysiology it would be referred to as the receptive field. In psychophysics, the nonlinearity *f* is a psychometric function which maps the real-valued filter outputs *z*_{i} = *ω*^{⊺}*s*_{i} + *β* to probabilities in the range [0, 1]. In neurophysiology, *f* could be used to model nonlinear spike-generation mechanisms. All stimuli *s* which lie in the hyperplane given by the equation *ω*^{⊺}*s* + *β* = *z* for some *z* will have the same decision probability *p*(*s*) = *f*(*z*). In particular, the decision boundary *p*(*s*) = 0.5 is the hyperplane *ω*^{⊺}*s* + *β* = *f*^{−1}(0.5). The predicted probability *p*(*s*_{i}) is thus a function of the (signed) distance of *s*_{i} to this decision hyperplane: Moving stimuli perpendicular to the decision boundary will increase (or decrease) the associated decision probabilities, whereas moving in parallel will keep them constant.
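The linear–nonlinear cascade is compact enough to state in a few lines of code. The following numpy sketch is our own illustration (variable names and toy numbers are assumptions, not the authors' implementation): the predicted probability is a static nonlinearity applied to the signed distance *ω*^{⊺}*s* + *β*.

```python
import numpy as np

def predict_prob(s, omega, beta, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Linear-nonlinear decision model: p(s) = f(omega^T s + beta).

    s     : stimulus vector, or a matrix of stimuli (one per row)
    omega : decision image (same dimensionality as one stimulus)
    beta  : scalar bias shifting the decision boundary
    f     : static nonlinearity mapping filter outputs to [0, 1]
    """
    z = np.atleast_2d(s) @ omega + beta  # signed distance to the hyperplane
    return f(z)

# Two stimuli at equal distance from, but on opposite sides of, the boundary
omega = np.array([1.0, -0.5])
p_hi = predict_prob(np.array([2.0, 0.0]), omega, beta=0.0)   # z = +2
p_lo = predict_prob(np.array([-2.0, 0.0]), omega, beta=0.0)  # z = -2
```

Moving a stimulus along *ω* changes the prediction; moving it within the hyperplane leaves it untouched, exactly as described above.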

We need to find *ω*—the weight vector or *decision image*—and the function *f* such that the predicted probabilities *p*(*s*_{i}) are as close as possible to the empirical probabilities p̂(*s*_{i}), a task analogous to psychometric function fitting. As optimizing both the decision image and the nonlinearity *f* together would be a non-convex optimization problem, we separate the two stages in a way that is analogous to techniques for estimating receptive fields in neurophysiology (Sharpee, Miller, & Stryker, 2008): First, we estimate the decision image *ω* under the assumption that the nonlinearity is the logistic function, and then optimize the nonlinearity in a second step.

Finding the direction *ω* along which the two classes vary most can be regarded as a classification problem, as in Wichmann et al. (2005): Linear algorithms, such as support vector machines, can be used to find the direction of maximal separation between the two sets of stimuli, here female and male faces. Unlike in Wichmann et al. (2005), however, we present each stimulus *m* times, and thus we have not only the observer's binary decision for each stimulus, but also an estimate of the decision probability p̂(*s*_{i}) over the *m* repetitions. Hence, we can interpret the estimation step as a regression from *s*_{i} onto p̂_{i}: we move from binary classification as in Wichmann et al. (2005) to probability estimation. One standard algorithm for this task is penalized logistic regression, which finds *ω* by minimizing the objective function

−∑_{i} [p̂(*s*_{i}) log *f*(*ω*^{⊺}*s*_{i} + *β*) + (1 − p̂(*s*_{i})) log(1 − *f*(*ω*^{⊺}*s*_{i} + *β*))] + *η*∣∣*ω*∣∣^{2},  (2)

where *f* is the logistic function *l*(*t*) = (1 + exp(−*t*))^{−1} and *t* is the signed distance to the separating hyperplane. Although we do not restrict *f* to be of this form, psychometric functions can often be well approximated by such a shape. Importantly, the cost function of logistic regression is convex. Therefore, the optimization problem has a unique solution, a single, global optimum which can be found efficiently. The loss function in Equation 2 is linear in the probabilities p̂(*s*_{i}); consequently, *classification with repeated stimuli* and *regression onto probabilities* are mathematically equivalent.
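Since the objective of Equation 2 is convex, any generic solver will find its global optimum. As an illustration (ours, on assumed toy data; plain gradient descent stands in for whatever solver the authors used), the following sketch regresses stimuli onto empirical probabilities with a ridge penalty:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_decision_image(S, p_hat, eta, n_steps=5000, lr=0.1):
    """Penalized logistic regression onto empirical probabilities p_hat.

    Minimizes the (convex) cross-entropy between f(S @ omega + beta) and
    p_hat plus the ridge penalty eta * ||omega||^2, via plain gradient
    descent (the cross-entropy term is averaged over stimuli here).
    """
    n, d = S.shape
    omega, beta = np.zeros(d), 0.0
    for _ in range(n_steps):
        err = logistic(S @ omega + beta) - p_hat  # dLoss/dz per stimulus
        omega -= lr * (S.T @ err / n + 2 * eta * omega)
        beta -= lr * err.mean()
    return omega, beta

# Toy data: empirical probabilities generated from a known filter
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
p_hat = logistic(S @ w_true)
omega, beta = fit_decision_image(S, p_hat, eta=1e-3)
```

On this noiseless toy problem the recovered filter points in the direction of the generating one, up to the small shrinkage introduced by the penalty.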

The regularization term *η*∣∣*ω*∣∣^{2} is used to avoid overfitting, which is important in high-dimensional regression problems (Hofmann, Schölkopf, & Smola, 2008). In a Bayesian setting, maximization of the penalized log-likelihood can be interpreted as finding the maximum-a-posteriori (MAP) estimate for *ω*: In this case, the regularization term is the log of the prior distribution over parameters *ω*, i.e. a prior of the form exp(−∣∣*ω*∣∣^{2}*η*) is assumed. In any case, the choice of the regularizer should be guided by prior knowledge about the problem. Often, however, explicit domain knowledge is unavailable, or unreliable, and then the regularization term is used to enforce smoothness (Knoblauch & Maloney, 2008) or sparseness (Gerwinn, Macke, Seeger, & Bethge, 2008; Mineault, Barthelme, & Pack, 2009) of the filters *ω*. Note that in this case the regularization term acts very much like the penalty terms in model selection (Jäkel, Schölkopf, & Wichmann, 2007). Applying our decision image estimation technique to gender classification of faces, we simply chose the standard regularization ∣∣*ω*∣∣^{2} = ∑_{i} *ω*_{i}^{2}. It is possible that more informed choices of the regularization term may lead to even better results (Knoblauch & Maloney, 2008). The regularization parameter *η*, which determines the trade-off between the goodness-of-fit of the data and the penalty for overly complex models, was chosen by ten-fold cross-validation. This means that we repeatedly fitted the model on 90% of the data, and took those parameters which, on average, led to the best performance on the remaining 10% of the data.
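The cross-validation loop itself is model-agnostic. A minimal sketch (our illustration on synthetic data; closed-form ridge regression stands in as the fitting routine so the example stays short, but the penalized logistic regression of Equation 2 can be plugged in unchanged):

```python
import numpy as np

def cross_validate_eta(S, y, etas, fit, loss, n_folds=10, seed=0):
    """Choose the regularization parameter by k-fold cross-validation:
    fit on (k-1)/k of the stimuli, score on the held-out fold, average."""
    idx = np.random.default_rng(seed).permutation(S.shape[0])
    folds = np.array_split(idx, n_folds)
    scores = []
    for eta in etas:
        fold_losses = []
        for k in range(n_folds):
            train = np.hstack(folds[:k] + folds[k + 1:])
            model = fit(S[train], y[train], eta)
            fold_losses.append(loss(model, S[folds[k]], y[folds[k]]))
        scores.append(np.mean(fold_losses))
    return etas[int(np.argmin(scores))], scores

# Stand-in model: ridge regression with a closed-form solution
ridge_fit = lambda S, y, eta: np.linalg.solve(
    S.T @ S + eta * np.eye(S.shape[1]), S.T @ y)
mse = lambda w, S, y: float(np.mean((S @ w - y) ** 2))

rng = np.random.default_rng(1)
S = rng.normal(size=(100, 20))
y = S @ rng.normal(size=20) + 0.5 * rng.normal(size=100)
etas = [1e-4, 1e-2, 1.0, 100.0]
best_eta, scores = cross_validate_eta(S, y, etas, ridge_fit, mse)
```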

The filter *ω* which minimizes the function in Equation 2 (where the summation is over the entire training set) will always lie in the subspace spanned by the training stimuli *s*_{i}—the so-called representer theorem (Hofmann et al., 2008). Thus, rather than working in the full space of *K* × *Z*-dimensional stimuli, we can perform an (orthogonal) change of variables and work in an *n*-dimensional space (if there are *n* stimuli): We apply a PCA to the full set of stimuli, but keep all *n* principal components, and thus still work in the space spanned by all *n* stimuli. We use PCA to increase computational efficiency by dimensionality reduction (*n* = 428; *K* × *Z* = 65536), but not for it to have any influence on the performance of the algorithms (Abdi et al., 1995). (An orthogonal change of variables does not affect any linear algorithm, as it leaves the inner products between any two vectors unchanged.)

Because the stimuli *s*_{i} are high-dimensional compared to the number of datapoints (*K* × *Z* ≫ *n*), it is always possible to perfectly fit the model to the data. Consequently, mere goodness of fit cannot be used as a performance measure. Instead, we directly tested how well the model can *predict* data the model was not trained on. To obtain a realistic estimate of the generalization performance of the model, we use cross-validation: The model is fitted to a subset of the data, and then the performance is evaluated on a different and disjoint subset. We used leave-one-out estimates: To predict the output of the model for the stimulus *s*_{i}, which we call *z*_{i}, we trained it on all stimuli *but s*_{i}. All values of *z*_{i} in this article, including those in the figures, are cross-validated predictions. Thus, we have two levels of cross-validation: We use 10-fold cross-validation to *find* the regularization parameter *η*, and we use leave-one-out estimates to *report* predictions (and thus evaluate how well our decision images really predict human behavior) (Bishop, 2006).
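Schematically, the leave-one-out loop looks as follows (our sketch with synthetic data; ridge regression stands in for the actual penalized logistic regression to keep it short):

```python
import numpy as np

def loo_predictions(S, y, fit, predict):
    """Leave-one-out: the prediction for stimulus i comes from a model
    trained on all stimuli *but* i, so it is a genuine generalization test."""
    n = S.shape[0]
    z = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        model = fit(S[mask], y[mask])
        z[i] = predict(model, S[i])
    return z

# Stand-in model: ridge regression (assumed here purely for brevity)
fit = lambda S, y: np.linalg.solve(S.T @ S + 0.1 * np.eye(S.shape[1]), S.T @ y)
predict = lambda w, s: float(s @ w)

rng = np.random.default_rng(3)
S = rng.normal(size=(40, 4))
y = S @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=40)
z_loo = loo_predictions(S, y, fit, predict)
```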

As described above, we first estimate *ω* under the assumption that the nonlinearity is the logistic function *l*. In practice, however, it often leads to superior performance to re-adjust the nonlinearity after identification of the filters: Having estimated *ω*, we can calculate the scalar products *z*_{i} = *ω*^{⊺}*s*_{i} and determine *f* by fitting a function to the set of points (*z*_{1}, p̂_{1}) … (*z*_{n}, p̂_{n}). Assuming each subject's response to be the outcome of a Bernoulli trial, we find *f* by maximizing the log-likelihood

*m* ∑_{i} [p̂_{i} log *f*(*z*_{i}) + (1 − p̂_{i}) log(1 − *f*(*z*_{i}))].

The *asymptotic error probabilities* are fitted simultaneously (Wichmann & Hill, 2001a). Therefore, the fit of the pointwise nonlinearity involves a total of 4 free parameters (two for the function proper, two for the asymptotes), as well as selection of the best-fitting function type. Simultaneous fitting of both the filter *ω* and the asymptotes did not lead to any gain in prediction performance, most likely because such an optimization is non-convex: the theoretical advantage of estimating the decision image *ω* and the asymptotes together is more than offset by the disadvantage of non-convex optimization.
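A minimal version of this second stage—fitting a psychometric function with asymptotes to the points (*z*_{i}, p̂_{i}) by maximum likelihood—might look as follows. This is our illustration on synthetic data; a coarse grid search stands in for a proper optimizer, and the particular parameterization (scale, offset, two asymptotes around a logistic core) is an assumption:

```python
import numpy as np
from itertools import product

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_nonlinearity(z, p_hat, m=10):
    """Fit f(z) = gamma + (1 - gamma - lam) * logistic(a*z + b) by
    maximizing the Bernoulli log-likelihood on a coarse parameter grid.
    gamma / lam are the lower / upper asymptotic error probabilities."""
    eps = 1e-9
    best, best_ll = None, -np.inf
    for a, b, gamma, lam in product(
            np.linspace(0.2, 3.0, 15), np.linspace(-1.0, 1.0, 9),
            np.linspace(0.0, 0.1, 5), np.linspace(0.0, 0.1, 5)):
        p = gamma + (1 - gamma - lam) * logistic(a * z + b)
        ll = m * np.sum(p_hat * np.log(p + eps)
                        + (1 - p_hat) * np.log(1 - p + eps))
        if ll > best_ll:
            best, best_ll = (a, b, gamma, lam), ll
    return best

# Synthetic filter outputs from an observer with known asymptotes
rng = np.random.default_rng(4)
z = rng.normal(size=300)
p_true = 0.05 + 0.9 * logistic(1.5 * z)
p_hat = rng.binomial(10, p_true) / 10.0
a, b, gamma, lam = fit_nonlinearity(z, p_hat)
```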


The simplest decision model is the *prototype* classifier, a very popular model for face perception (Burton, Jenkins, Hancock, & White, 2005; Leopold, Bondar, & Giese, 2006; Loffler, Yourganov, Wilkinson, & Wilson, 2005). The prototype classifier can be cast in our linear–nonlinear cascade framework. Its filter *ω* is simply given by *ω*_{prot} = ∑_{ij} *y*_{ij} *s*_{i}, i.e. (up to normalization) the difference of the two class means.
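In code, the prototype filter is essentially one line—the difference of the two class means (our illustration; labels and stimuli are synthetic stand-ins):

```python
import numpy as np

def prototype_filter(S, y):
    """Prototype classifier: omega is the difference of the class means
    (equivalently, the sum of y_i * s_i for balanced classes, up to scale)."""
    return S[y == 1].mean(axis=0) - S[y == -1].mean(axis=0)

# Two synthetic classes separated along the first dimension
rng = np.random.default_rng(5)
mu_f, mu_m = np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])
S = np.vstack([mu_f + 0.3 * rng.normal(size=(50, 3)),
               mu_m + 0.3 * rng.normal(size=(50, 3))])
y = np.hstack([np.ones(50), -np.ones(50)])
omega_prot = prototype_filter(S, y)
```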

A refinement of the prototype classifier multiplies *ω* by the inverse of the within-class covariance *C*_{within}, yielding the filter *ω*_{lda} = *C*_{within}^{−1}*ω*_{prot}. This algorithm (Lu et al., 2003) is known as (Fisher's) linear discriminant analysis in statistics (LDA in the following). In psychology, it has been noted several times that it is important to take the variances (and co-variances) into account when learning a category (Fried & Holyoak, 1984; Reed, 1972); this effectively results in categorization models which are related to LDA. LDA can be shown to be an optimal classifier, provided that the stimuli in each of the two classes are Gaussian, have the same within-class covariance, and the means and covariances are known exactly. In practice, however, we do not know the exact means and covariance, and we have to estimate both from the data. Misestimation of the covariance can result in worse classification performance than the simple prototype classifier. On small data sets it is thus advisable to use a regularized covariance estimate *C*_{reg} = *C*_{within} + *η***1**, where **1** is an identity matrix. We optimize the regularization parameter *η* by cross-validation, as we did for logistic regression.
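A sketch of the regularized LDA filter (our illustration on synthetic, anisotropic data; the shrinkage form *C*_{within} + *η***1** follows the text):

```python
import numpy as np

def lda_filter(S, y, eta):
    """Regularized LDA: whiten the prototype direction with the (shrunken)
    within-class covariance, omega = (C_within + eta*I)^{-1} omega_prot."""
    omega_prot = S[y == 1].mean(axis=0) - S[y == -1].mean(axis=0)
    resid = np.vstack([S[y == c] - S[y == c].mean(axis=0) for c in (1, -1)])
    C_within = resid.T @ resid / len(resid)
    return np.linalg.solve(C_within + eta * np.eye(S.shape[1]), omega_prot)

# Anisotropic within-class noise: the second dimension is far noisier
rng = np.random.default_rng(6)
noise = rng.normal(size=(200, 2)) * np.array([0.2, 2.0])
S = np.vstack([np.array([1.0, 1.0]) + noise[:100],
               np.array([-1.0, -1.0]) + noise[100:]])
y = np.hstack([np.ones(100), -np.ones(100)])
omega_lda = lda_filter(S, y, eta=1e-3)
```

With anisotropic noise, LDA rotates the filter away from the noisy dimension—precisely the effect the covariance term is meant to capture.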

Alternatively, *ω* can be found by minimizing a different convex cost function in place of the logistic-regression cost function from Equation 2. For example, the support vector machine (SVM) algorithm finds a decision boundary by minimizing the so-called hinge loss (Hofmann et al., 2008). The SVM is a general-purpose machine learning algorithm which has been applied successfully in a variety of domains, and has also been used previously for the estimation of decision images (Wichmann et al., 2005). We modified the original SVM algorithm (which performs binary classification) in order to also take into account the estimated decision probabilities p̂(*s*_{i}) (see 17 for derivation and discussion of our modified algorithm). For each of these algorithms, we estimated the filter *ω* and then fitted the nonlinearity to the residuals as described above.
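For completeness, here is a sketch of the plain binary-classification SVM via subgradient descent on the hinge loss (our illustration with toy data; the probability-weighted modification described in the text is not reproduced here):

```python
import numpy as np

def linear_svm(S, y, C=1.0, n_steps=2000, lr=0.01):
    """Linear SVM: minimize 0.5*||w||^2 + (C/n) * sum_i hinge(y_i*(w.s_i+b))
    by subgradient descent; hinge(t) = max(0, 1 - t)."""
    n, d = S.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        margins = y * (S @ w + b)
        active = margins < 1                        # points inside the margin
        w -= lr * (w - C * (y[active] @ S[active]) / n)
        b -= lr * (-C * y[active].sum() / n)
    return w, b

# Two well-separated synthetic classes
rng = np.random.default_rng(7)
S = np.vstack([rng.normal(size=(60, 2)) + 2.0,
               rng.normal(size=(60, 2)) - 2.0])
y = np.hstack([np.ones(60), -np.ones(60)])
w, b = linear_svm(S, y)
acc = np.mean(np.sign(S @ w + b) == y)
```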

^{2} 173 of these 428 were obtained by morphing pairs of randomly selected faces of the same gender, using methods described in Blanz and Vetter (1999). The morphs were linear in texture coordinates, but not in pixel space.

cd/m^{2} for conditions 1 to 3 and 181 cd/m^{2} for condition 4. The temporal envelope of stimulus presentation was a modified Hanning window (a raised cosine function with rise and fall times of 200 ms and a plateau time of 800 ms). Faces were presented in random order and subjects were instructed to respond quickly; the median response time was 647 ms after stimulus onset (standard deviation 190 ms), i.e. on the vast majority of trials subjects responded whilst the stimuli were still on screen. No feedback as to whether the response was correct was provided. At the viewing distance of 60 cm the stimuli (nominally) subtended 9.5 degrees of visual angle. Seven observers with normal or corrected-to-normal vision, naive to the purpose of the experiment, acted as experimental subjects; they were paid for their participation. Each subject categorized each of the 428 faces per condition 10 times, for a total of 4280 trials per subject and condition, i.e. 17,120 trials per subject in total and a grand total of 119,840 trials reported in this study. Stimuli were presented in blocks of (about) 500 stimuli from each condition, and the order of blocks and stimuli within each block was randomized.

*SEM* 1.80%).

*SEM* 2.9%), on a balanced stimulus set. Consistent with previous studies (Troje & Bülthoff, 1996, and references therein), observers were slightly better at discriminating the faces shown slightly from the side (condition 2, pc = 84%, *SEM* 2.0%) than those presented in frontal view (condition 1, pc = 81%, *SEM* 2.3%). Each of the observers performed better on condition 2 than on condition 1 (binomial test, 7 out of 7, *p* < 0.01). The stimuli in conditions 2 and 3 were identical except that the faces in condition 3 were rendered with their original texture, whereas an uninformative texture was used in condition 2. Not surprisingly, the texture cue helped gender-discrimination performance, raising the average to pc = 90% (*SEM* 1.2%) in this condition, with all observers performing better than in condition 2 (7 out of 7, *p* < 0.01). Condition 4 consisted of faces which were presented slightly from the side, had a neutral texture, and in which the shadows caused by the spotlight illumination removed some potential cues. This condition proved to be of intermediate difficulty, with an average performance level between those of conditions 2 and 3 (pc = 85%, *SEM* 1.7%).

*SEM* 0.007). The hypothesis of independence could thus be rejected (G-test (Kullback, 1997), *p* < 0.01 for each observer). Second, observers had a tendency to make errors on the same stimuli, indicating that some faces were more difficult to categorize than others. We correlated the response probabilities between observers for female and male faces separately. The mean rank correlation between the error probabilities of any two observers was 0.27 for the female and 0.57 for the male faces. The correlations were significant (*p* < 0.05) on the male faces for each pair of observers in each condition, and on 95% of pairs for the female faces (158/168 pairs). Thus, it is clear that some faces are more difficult to categorize than others, and that there is substantial agreement amongst observers about which of the stimuli are difficult. However, we show below that there are also significant differences between observers, and that the decision image technique is sensitive enough to identify them.

To model the relationship between stimuli *s* and decision probabilities *p*(*s*), we fitted a (penalized) logistic regression to the psychophysical data of each observer and condition. We first describe the data of observer CGF for condition 3 fitted with the logistic-regression model (Logreg), and compare it to the simple prototype model (Prot).

We first asked how well each model predicts *which* of the stimuli the observer perceived as female. Logreg predicted the class (i.e. gender) chosen by the observer in 93% of cases; Prot, on the other hand, was correct only 83.0% of the time. Figure 2A depicts the psychometric function of observer CGF for condition 3 as predicted by the logistic regression model. The empirically estimated gender assignment probability p̂(*s*) for each face (y-axis) is plotted against the distance of that face from the decision boundary (filled yellow circles; *N* = 428). The black line is the decision probability *p*(*s*) predicted by the model. To help judge the quality of the fit—given that N is large and that many data points overlap—the gray histograms show the distribution of classification responses. The top histogram consists of all the stimuli which are predominantly classified by the observer as female.

*Q* = 12.0% of points fall outside two standard deviations—not perfect, but arguably reasonable (88% versus 95% falling within two standard deviations). In addition, we quantify for each face its “deviation” from the model prediction by the deviance residual (Wichmann & Hill, 2001a). For a perfect model the average deviance

*p*-value for rejecting the null hypothesis of no difference. In this case, the median *p*-value was 0.02. Thus, in most cases, the null hypothesis of no difference could be rejected at a level of 5%, but not at 1%. Again, this indicates that the logistic regression decision image accounts for a large part, but not all, of CGF's gender discrimination probabilities.
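The deviance residuals used here can be computed directly from the empirical and predicted probabilities. A self-contained sketch (our illustration on synthetic data; since the simulated "observer" is exactly described by the model, the average squared residual should scatter around 1):

```python
import numpy as np

def deviance_residuals(p_hat, p_model, m):
    """Signed deviance residuals for binomial data with m repeats per
    stimulus; a mean squared residual near 1 indicates a calibrated model."""
    eps = 1e-12
    p_hat = np.clip(p_hat, eps, 1 - eps)
    p_model = np.clip(p_model, eps, 1 - eps)
    dev = 2 * m * (p_hat * np.log(p_hat / p_model)
                   + (1 - p_hat) * np.log((1 - p_hat) / (1 - p_model)))
    return np.sign(p_hat - p_model) * np.sqrt(np.maximum(dev, 0.0))

# A perfectly specified synthetic observer: residuals scatter around 0
rng = np.random.default_rng(9)
m = 10
p_true = rng.uniform(0.1, 0.9, size=500)
p_emp = rng.binomial(m, p_true) / m
r = deviance_residuals(p_emp, p_true, m)
avg_dev = np.mean(r ** 2)
```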

For the prototype model, *Q* = 36.0% of the data points are more than two standard deviations away from the model prediction. Figures 3B and 3C show the histogram and cumulative distributions of residuals for the prototype model. The differences between the simulated and real data are evident; the median *p*-value of the Kolmogorov–Smirnov test was *p* = 6.1 × 10^{−10}, and less than 0.01 for each simulation. Thus, while the overall percentage correct of this model was seemingly high—the Prot model predicts the observer's gender assignment 83.0% of the time—evaluating its performance on a stimulus-by-stimulus basis shows that the prototype classifier predicts the *decision probabilities* of the observer only poorly.

A high overall percentage correct does *not* imply that the model is a good description of the underlying psychophysical process. To further test and quantify the difference between the different ways to obtain the decision image, i.e. logistic regression versus mean-of-class (prototype), we calculated rank correlations between the measured responses p̂(*s*_{i}) and the distances to the decision boundary in the model, *z*(*s*_{i})—this statistic is independent of the fitted (monotonic) nonlinearity. Due to the binomial scatter, the median expected rank correlation even of a perfect model would not be 1.0, but 0.915. (We obtained 0.915 from Monte Carlo simulations of a synthetic observer which is perfectly described by the model; see above.) The rank correlation for Logreg was 0.88. In stark contrast, the rank correlation for Prot was only 0.74 (simulated: 0.92).

A good model should capture the *perceived* gender of each face (by each observer), and not just the *true* gender of each face. This can be assessed by calculating correlations conditioned on the stimulus class, here gender. Correlations between human responses and the model computed using only stimuli within one class indicate that the model is indeed capturing the observer's decisions, and not just re-discovering the class structure of the stimulus set. For a perfectly described observer—i.e. an observer for which we know the decision features *exactly*—the median correlations were 0.75 on the female faces and 0.58 on the male faces; the overall conditional correlation, taken as the mean of the two, was 0.68. For the Logreg model the rank correlations between decisions and model predictions were 0.56 on the female faces and 0.45 on the male faces, giving a conditional correlation of 0.51 (*p* < 10^{−10} in both cases). In contrast, the conditional correlation for Prot was only 0.24 (female: 0.27, male: 0.22). This provides further evidence that the prototype model is inadequate for modeling the observer's decisions.
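The class-conditional correlation makes this point concrete: a "model" that encodes only class membership can score a high overall rank correlation yet near zero within each class. A numpy sketch with synthetic data (our illustration; Spearman correlation is implemented without tie handling for brevity):

```python
import numpy as np

def rank_corr(a, b):
    """Spearman rank correlation (no tie correction; adequate for a sketch)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def conditional_rank_corr(p_obs, z_model, labels):
    """Mean of the within-class rank correlations: a model that merely
    rediscovers the class structure scores near 0 here."""
    return float(np.mean([rank_corr(p_obs[labels == c], z_model[labels == c])
                          for c in np.unique(labels)]))

# A degenerate 'model' knowing only the class label of each stimulus
rng = np.random.default_rng(8)
labels = np.repeat([1, -1], 100)
p_obs = np.clip(0.5 + 0.3 * labels + 0.15 * rng.normal(size=200), 0, 1)
z_class_only = labels.astype(float)
overall = rank_corr(p_obs, z_class_only)
conditional = conditional_rank_corr(p_obs, z_class_only, labels)
```

The overall correlation is high because the classes are well separated, while the conditional correlation collapses—the same dissociation reported for the prototype classifier above.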

Across subjects and conditions, the mean average deviance of the logistic regression model was 1.935 (*SEM* 0.0578, median 1.898), and was smaller than 2.39 for each subject and condition. For single subjects, the values (averaged across conditions) ranged from 1.80 to 2.2. Performance was slightly better on conditions 3 and 1 (1.85 and 1.87) than on 2 and 4 (1.99, 2.02). For the prototype classifier, the mean average deviance was 4.38 (*SEM* 0.30, median 3.99), and the average deviance was larger than 3.25 in each experiment (see also Figure 6, top left).

The mean rank correlation was 0.86 (*SEM* 0.0035, median 0.86) for the logistic regression model, very close to the corresponding value for the simulated observers (average correlation 0.91). For the prototype classifier, the correlation was 0.72 on average (simulated: 0.90). Additionally, we evaluated the ability of the classifiers to predict the observers' decisions (and not merely the class structure of the stimuli) by calculating conditional correlations: The average conditional correlation for the logistic regression model was 0.46 (*SEM* 0.021, median 0.46, simulated: 0.65), whereas for the prototype classifier it was only 0.20 (*SEM* 0.0064, median 0.20, simulated: 0.70). Thus, while the logistic regression model gets reasonably close to “optimal” (as estimated from the simulated data assuming an exact model fit), this is not the case for the prototype classifier. Figure 6 (first column) shows these comparisons for each model, subject, and condition.

For each stimulus *s*_{i}, we estimated the average probability of its being classified as female, p̂_{pool}(*s*_{i}), where the average is now across all experiments. Similarly, we calculated the predicted probability *p*_{pool}(*s*_{i}) by averaging the predictions from the different experiments. Figures 4A and 5A show scatter plots of the decision probabilities plotted against the model predictions. For the logistic regression model (Figure 4), the model predictions and the decision probabilities of the observers are highly correlated (*c* = 0.94), whereas for the prototype classifier (Figure 5) the correlation is 0.80. The conditional correlation is 0.76 for Logreg (higher than for the typical single subject, as the noise can average out across observers), but only 0.30 for Prot. Thus, while the performance of the prototype classifier can be attributed mainly to its ability to reproduce the class structure of the stimuli, the logistic regression model is highly correlated with the decisions of the human observers even within each class.

| Model | Average Deviance | Percent Correct | Rank Corr | Corr(♀) | Corr(♂) |
|---|---|---|---|---|---|
| Logreg | 1.935 ± 0.058 | 0.922 ± 0.008 | 0.863 ± 0.003 | 0.577 ± 0.031 | 0.352 ± 0.045 |
| Prot | 4.382 ± 0.297 | 0.842 ± 0.013 | 0.722 ± 0.015 | 0.279 ± 0.013 | 0.125 ± 0.011 |
| LDA | 2.236 ± 0.114 | 0.913 ± 0.010 | 0.847 ± 0.006 | 0.520 ± 0.033 | 0.312 ± 0.040 |
| SVM | 2.383 ± 0.118 | 0.910 ± 0.011 | 0.832 ± 0.007 | 0.465 ± 0.039 | 0.264 ± 0.031 |

The mean average deviance for LDA was 2.24 (*SEM* 0.11, median 2.18). Thus, taking into account the covariance structure of each class can considerably improve the ability of a means-of-class classifier to model our psychophysical data. However, LDA was still markedly worse than the logistic regression model: For each of the 28 experiments (7 observers, 4 stimulus types), its average deviance was higher than that of the logistic regression model (*p* < 10^{−9}). In addition, the mean deviance was also significantly higher (paired-sample t-test, *p* = 0.0003). Qualitatively, the same holds true for our other performance measures, as can be seen from Figure 6, second column.

The SVM had a mean average deviance of 2.38 (*SEM* 0.12, median 2.29), and was outperformed by both the logistic regression (see Figure 6, right column) and the LDA model: It had a lower average deviance than LDA in only 3 out of 28 experiments (*p* = 1.4 × 10^{−5}). In addition, both its rank correlation and its class-conditional correlation with the observer's decisions were lower in 27 out of 28 comparisons (*p* < 10^{−6}). Thus, while the LDA and the SVM performed much better than the simple prototype classifier, they did not reach the level of accuracy of the decision image model estimated by logistic regression. This was also the case when we excluded the morphed faces: The logistic regression model performed better than linear discriminant analysis (26 out of 28 conditions and observers), which in turn performed better than the SVM (28 out of 28) and the prototype classifier (average deviance 4.87). For modeling gender categorization of human faces, logistic regression thus appears to be the method of choice for estimating the decision image (at least superior to LDA, SVM, and the simple prototype).

*c* = 1. When we excluded the morphed faces from the analysis, the correlation was 0.9. For very fast decisions, the average deviance residual is as expected from a perfect model, around 1.0, whereas for slow decisions it is substantially bigger. We repeated the same analysis for each subject individually, and found a significant correlation (at *p* < 0.01) for 6 out of the 7 subjects, with a median correlation of 0.94. The correlation was also significant for each of the four conditions separately, and was at least 0.96 in each case.

One possible confound is the decision probability p̂(*s*): If a face is difficult to categorize, subjects might spend more time thinking about it; and if the algorithms also have trouble categorizing such faces, then the correlation between reaction times and deviance residuals would merely be a consequence of the common correlation with the factor “stimulus difficulty.” To rule out this possibility, we computed a partial correlation between the reaction times and deviance residuals, conditioning on the decision probabilities: We binned the decision probabilities into 5 bins, and normalized the reaction times within each bin by subtracting their mean and dividing by their standard deviation. This processing step did not eliminate the correlation between reaction-time percentile and deviance, which was still 0.95, as can be seen from Figure 8. The conditional correlation was also significant for each of the four conditions (minimum 0.81) and for 6 out of 7 observers (median 0.84). Similarly, for 21 out of 28 observers and conditions considered separately, reaction times and deviance residuals were significantly correlated (14 out of 28 when conditioning on response probability).

By adding *λω* (*λ* > 0) to any stimulus *s*_{0}, we can create a stimulus *s*_{0} + *λω* with *p*(*s*_{0} + *λω*) > *p*(*s*_{0}). For simplicity, we take *s*_{0} to be the mean face in each condition.

We can increase the discriminability of *s*_{0} by adding or subtracting equal amounts of *ω* to/from it: *p*(*s*_{0} + *λω*) > *p*(*s*_{0} − *λω*). Moreover, the probabilities vary fastest when we move perpendicular to the decision boundary: For any other vector *ξ* with ∣∣*ω*∣∣ = ∣∣*ξ*∣∣ we have that *p*(*s*_{0} − *λω*) ≤ *p*(*s*_{0} − *λξ*) < *p*(*s*_{0} + *λξ*) ≤ *p*(*s*_{0} + *λω*). The reason is that *ω*^{⊺}*ξ* < *ω*^{⊺}*ω* = ∣∣*ω*∣∣^{2} for any other vector *ξ* with ∣∣*ξ*∣∣ = ∣∣*ω*∣∣. As the absolute “length” ∣∣*ω*_{i}∣∣ is arbitrary, we normalized the decision faces to have norm ∣∣*ω*∣∣ = 1.0. In other words, adding some multiple of *ω* to a stimulus should lead to stimuli which are more discriminable than when any other vector of the same length is added to *s*_{0}. If *ω* really captures the direction in feature space along which male–female perception varies the most, then even small changes of *λ* should lead to measurable changes in the decision probability *p*(*s*_{0} + *λω*). On the other hand, the model predicts that adding any vector *γ* which is orthogonal to *ω* should not result in a stimulus which looks more female or male than *s*_{0}: *p*(*s*_{0} ± *λγ*) = *f*(*ω*^{⊺}(*s*_{0} ± *λγ*)) = *f*(*ω*^{⊺}*s*_{0}) = *p*(*s*_{0}), if *ω*^{⊺}*γ* = 0.
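The inequalities above can be checked numerically. This is a minimal sketch with an assumed logistic link *f* and an arbitrary stimulus dimensionality, not the experimental stimuli:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 256                                  # stimulus dimensionality (hypothetical)
w = rng.standard_normal(d)
w /= np.linalg.norm(w)                   # normalize ||w|| = 1, as in the text
s0 = rng.standard_normal(d)              # stand-in for the mean face

f = lambda t: 1.0 / (1.0 + np.exp(-t))   # assumed logistic link
p = lambda s: f(w @ s)                   # decision probability p(s) = f(w's)

lam = 0.5
# gamma: a random direction projected to be orthogonal to w, rescaled
# so that ||gamma|| = ||lam * w||
g = rng.standard_normal(d)
g -= (w @ g) * w
g *= lam / np.linalg.norm(g)

p_plus, p_minus, p_orth = p(s0 + lam * w), p(s0 - lam * w), p(s0 + g)
```

As predicted, `p_plus > p(s0) > p_minus`, while moving by `g` orthogonally to `w` leaves the decision probability unchanged up to floating-point error.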

On each trial, subjects indicated which of the two faces *s*_{0} + *λω* and *s*_{0} − *λω* they perceived as looking more female. If the values of *λ* which lead to visible gender differences for the filter *ω* are smaller than those for an alternative filter *ξ*, we can conclude that the model *ω* captures the decision boundary better than *ξ*. For each choice of *ω*, we tested 10 different values of *λ* in order to map out psychometric functions and thus test the sensitivity of the subjects to changes along *ω*. The pairs of faces *s*_{±}(*λω*) were generated using the decision images *ω* obtained from the first experiment. We used the *ω*s corresponding to the weight vectors of logistic regression, the prototype classifier and the support vector machine for conditions 1, 2 and 3. In addition, we used a slightly low-pass filtered version of the Logreg filter *ω*_{2}. Using a prior which explicitly enforces smooth decision boundaries would make this smoothing step unnecessary (see Discussion) (Knoblauch & Maloney, 2008; Ross & Cohen, 2009).

The model also predicts that *p*(*s*_{0} ± *λγ*) = *p*(*s*_{0}) for all values of *λ*. We therefore generated two additional filters *γ* for each subject which were chosen to be orthogonal to the filter of logistic regression, i.e. such that *ω*^{⊺}*γ* = 0. We constrained *γ* to be a linear combination of the eigenvectors of the data set of all faces, where the weights were chosen to be of similar magnitude as those of the decision images. This procedure was used with the aim of making the statistics of the non-directions as similar to the decision images as possible. However, as this constraint does not uniquely determine *γ*, we picked a direction at random for each subject.

Each trial was defined by the combination of a filter (one of the decision images *ω* or one of the orthogonal directions *γ*) and one of ten different values of *λ*. Combinations were presented intermixed and in random order, and each combination was presented 225 times. This resulted in a total of 11250 trials for each subject.

The two faces *s*_{0} + *λω* and *s*_{0} − *λω* were presented next to each other in the same experimental setup as described above. We randomized whether the face on the left side corresponded to *s*_{0} + *λω* or *s*_{0} − *λω*. Each pair was presented using a modified Hanning window with a rise time of 300 ms, a plateau of 300 ms and a fall time of 300 ms. The inter-stimulus interval was 200 ms. On each trial, the subjects indicated whether the left or the right face looked more female by pressing a button on a touchpad. Subjects were asked to respond quickly and were not given feedback on their performance.
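The temporal presentation window can be sketched as below; the raised-cosine rise/fall shape and the 1 ms sampling grid are assumptions for illustration, with only the 300/300/300 ms timing taken from the text:

```python
import numpy as np

def modified_hanning(rise=300, plateau=300, fall=300):
    """Temporal envelope, one sample per ms: raised-cosine rise (0 -> 1),
    flat plateau at 1, raised-cosine fall (1 -> 0)."""
    up = 0.5 * (1 - np.cos(np.pi * np.arange(rise) / rise))
    flat = np.ones(plateau)
    down = 0.5 * (1 + np.cos(np.pi * np.arange(1, fall + 1) / fall))
    return np.concatenate([up, flat, down])

env = modified_hanning()   # 300 ms rise + 300 ms plateau + 300 ms fall = 900 samples
```

Multiplying each frame of the stimulus pair by `env` yields the smooth onset and offset described above.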

The filter *ω* of each model is a vector with exactly the same dimensionality as the stimuli (faces). Therefore, it can be visualized as an image (see Figure 9), and this is why we refer to these filters as *decision images*, or “decision faces” in the particular experimental context considered here. Both the decision image of logistic regression and that of the prototype classifier place power in the eye region, indicating that this region is important for gender categorization of human faces. This finding is consistent with previous studies using bubbles (Dupuis-Roy et al., 2009), classification images (Mangini & Biederman, 2004; Sekuler et al., 2004), or analysis of photographs and gender ratings (Russell, 2009). However, while both models place emphasis on the eye region, they have vastly different prediction performance: Mere localization of important regions is not sufficient to predict human behavior; the exact filter shapes matter as well.

Figure 10 shows faces generated by adding or subtracting decision images from the mean face *s*_{0}, as described above. Subtracting *ω* leads to a male-looking face (left column), whereas adding *ω* results in a female-looking face (right column). This is true both for the Logreg filter (first row) and for the Prot filter (second row), but the effect is stronger for the Logreg filter. In contrast, moving along a direction within the decision boundary does not result in faces which look either more male or more female (last row).

For each filter *ω*, we fitted psychometric functions to the responses of the subjects as a function of the weight *λ*, which determines ‘how much’ of the decision image is added to a neutral face. We quantified the performance of each model by finding the value of *λ* at which subjects were 90% correct. Averaged across observers and conditions, the filter *ω* corresponding to the decision image of logistic regression performed best, reaching the performance criterion for values of *λ* less than 0.1 (see Figure 11D). The low-pass filtered version of the logistic regression filter slightly outperformed the original Logreg filter (mean *λ* at 90% correct: 0.081 vs. 0.0947), an indication that our regularization procedure is not yet optimal (see Discussion). Both of these decision images consistently outperformed the SVM (mean *λ* = 0.11) and the prototype classifier (mean *λ* = 0.14).
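A fitting procedure of this kind can be sketched as follows. The one-parameter tanh-shaped psychometric function, the simulated observer, and the grid-search maximum-likelihood fit are illustrative assumptions, not the authors' analysis:

```python
import numpy as np

rng = np.random.default_rng(2)

lams = np.linspace(0.01, 0.3, 10)           # 10 lambda levels, as in the text
true_sigma = 0.08                           # simulated observer's sensitivity
pc_true = 0.5 + 0.5 * np.tanh(lams / true_sigma)   # 2AFC percent correct
n_rep = 225                                 # presentations per level
k = rng.binomial(n_rep, pc_true)            # simulated correct counts

def fit_sigma(lams, k, n):
    """Grid-search maximum-likelihood fit of pc(lam) = 0.5 + 0.5*tanh(lam/sigma)."""
    sigmas = np.linspace(0.01, 0.5, 1000)
    best, best_ll = sigmas[0], -np.inf
    for s in sigmas:
        p = np.clip(0.5 + 0.5 * np.tanh(lams / s), 1e-9, 1 - 1e-9)
        ll = np.sum(k * np.log(p) + (n - k) * np.log(1 - p))
        if ll > best_ll:
            best, best_ll = s, ll
    return best

sigma_hat = fit_sigma(lams, k, n_rep)
lam_90 = sigma_hat * np.arctanh(0.8)        # pc = 0.9  =>  tanh(lam/sigma) = 0.8
```

Inverting the fitted function at 90% correct yields the `lam_90` criterion used to compare filters.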

This difference was significant (*p* = 0.002). In 7 out of 9 experiments, smoothing of the decision image did lead to a performance increase (*p* = 0.089). As mentioned previously, the absolute scaling of *λ* is arbitrary; only the relative differences are meaningful. Therefore, for each classifier and condition, we calculated the percentage correct attained at the value of *λ* at which the best algorithm achieved 90% correct. On average, when logistic regression (low-pass) was at 90%, Logreg was at 84% ± 2.3, the SVM at 78% ± 2.6 and Prot at only 74% ± 3.0.

Moving along a direction *γ* which was parallel to the decision boundary did not result in any differences in perceived gender; in fact, the psychometric function was flat across the whole range of *λ*s considered and never reached 90% correct. Thus, at least for the particular directions that we tested, we can conclude that moving along the decision boundary does not result in stimuli which appear more male or more female (see Figures 10A–10D), in accordance with our predictions.

The entries *ω*_{i} of the decision image depend on the covariance structure of the stimuli. For simplicity, we discuss this dependence for Fisher's linear discriminant: In this case, the decision boundary is given by *ω* = *C*^{−1}(*μ*_{+} − *μ*_{−}), where *C* is the within-class covariance and *μ*_{±} are the means of the two (perceived) classes. For every pixel *k*, the difference between the two classes is inversely weighted by the variance within each class. An entry *ω*_{k} will be large if the pixel *s*_{(k)} either varies a lot between classes or varies little within the classes. In terms of spatial frequency, the effect of pre-multiplying by *C*^{−1} can be interpreted as a damping of those spatial frequencies which are dominant in the stimulus. For faces, as for most natural stimuli, low spatial frequencies carry most of the power; this can explain why the decision images of the Prot classifier (which ignores the covariance structure) have much more power in the low spatial frequencies than those of classifiers which do take the covariance structure into account.
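The Fisher-discriminant decision image can be sketched on synthetic data; the dimensionality, class means, and covariance are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

d, n = 20, 500
delta = np.zeros(d)
delta[0] = 1.5                                # classes differ along "pixel" 0
A = 0.2 * rng.standard_normal((d, d))
C = A @ A.T + np.eye(d)                       # within-class covariance
L = np.linalg.cholesky(C)
X_pos = delta + rng.standard_normal((n, d)) @ L.T    # class +
X_neg = -delta + rng.standard_normal((n, d)) @ L.T   # class -

# decision image: w = C^{-1} (mu+ - mu-), estimated from the samples
C_hat = 0.5 * (np.cov(X_pos.T) + np.cov(X_neg.T))
w = np.linalg.solve(C_hat, X_pos.mean(0) - X_neg.mean(0))
w /= np.linalg.norm(w)                        # normalize ||w|| = 1, as in the text

m = 0.5 * (X_pos.mean(0) + X_neg.mean(0))     # overall mean
acc = 0.5 * (((X_pos - m) @ w > 0).mean() + ((X_neg - m) @ w < 0).mean())
```

Pre-multiplying the mean difference by the inverse covariance is exactly the re-weighting by within-class variability discussed above.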

Perturbing a stimulus by a multiple *λ* of the decision image *ω* only makes sense for small values of *λ*, i.e. for local perturbations: Suppose that for some pixel *k*, the variance *C*_{kk} is small, and therefore *ω*_{k} is large. This means that even small changes in this pixel value will lead to noticeable changes in its perceived class. However, the fact that *C*_{kk} is small also implies that, for most stimuli, the pixel value *s*_{(k)} will be small. Thus, if *λ* is chosen too large, the corresponding pixel entry of *s* + *λω* will be large, likely larger than it would be in any real face, creating artifacts and unnatural-looking stimuli. In our experiments we easily found values of *λ* which appeared distinctly male or female, and at the same time still looked very ‘face-like’. However, larger values of *λ* resulted in noticeable artifacts.

The fact that the prototype classifier (which ignores the covariances *C*_{kk}) performs so poorly implies that human observers must implicitly take the covariance structure of faces into account when deciding on the gender of a face. A related (but not equivalent) idea has been formulated in studies (e.g. O'Toole et al., 1993) applying principal component analysis (PCA) to face images and arguing that the first principal component discriminates between male and female faces. In our experiments, all faces were normalized to have the same mean luminance and standard deviation of luminance values. Consequently, the first principal component was not a good predictor of gender: For example, on the stimuli of condition 1, it resulted in a percentage correct of only 56.9%. This suggests that the separation of the two classes by the first principal component is a consequence of an overall brightness difference between male and female faces (Russell, 2009), and therefore only works for un-normalized stimuli. Given that the absolute light level, and hence the light intensity of faces, changes under real-life conditions, reliance on the first principal component for gender discrimination is clearly a sub-optimal strategy. The performance of observers in our experiments with normalized stimuli also shows that absolute luminance is not necessary for gender classification.
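The brightness confound can be illustrated with synthetic data; the “faces” here are random vectors with a class-dependent brightness offset, an assumption for illustration. The first principal component separates the un-normalized classes, but the separation collapses once each image is luminance-normalized:

```python
import numpy as np

rng = np.random.default_rng(4)

d, n = 100, 400
base = rng.standard_normal((2 * n, d))
offset = np.r_[np.full(n, 1.0), np.full(n, -1.0)]   # class-dependent brightness
X = base + offset[:, None]
y = np.r_[np.ones(n), -np.ones(n)]

def first_pc_acc(X, y):
    """Accuracy of classifying by the sign of the projection onto the first PC."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pred = np.sign(Xc @ Vt[0])
    return max((pred == y).mean(), (pred == -y).mean())   # sign of PC is arbitrary

acc_raw = first_pc_acc(X, y)

# normalize each "face" to zero mean and unit standard deviation
Xn = (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)
acc_norm = first_pc_acc(Xn, y)
```

After normalization the first principal component performs near chance, mirroring the 56.9% result reported above.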

We discussed the dependence of the entries *ω*_{i} on the covariance structure of the stimulus for the case of linear discriminant analysis, as its decision boundary is determined solely by the covariance and the means of the two classes. However, the conclusions also apply qualitatively to algorithms such as logistic regression or SVMs. These algorithms, unlike LDA, are not motivated by specific assumptions about the statistical structure of the stimuli and are therefore likely to work better for stimuli for which simple distributions such as Gaussians are inappropriate: This is the case for many classes of naturalistic stimuli, e.g. natural images.

*λ* (see above). On the other hand, interpretation of the decision image might be harder, as the identified features have to be “projected out” of the morph space into pixel space to be interpretable (cf. Kienzle et al., 2009). Alternatively, one could parameterize the stimuli not by their pixel values, but by a set of basis functions of the kind thought to be implemented in early stages of the visual pathway. For example, one could project the stimuli onto a filter bank of oriented filters and apply the algorithm to the filter responses.

The filter *ω*_{opt} is a maximum a posteriori (MAP) solution, i.e. the most likely filter given the data and the prior. Consequently, *p*(*s*) = *f*(*ω*_{opt}^{⊺}*s*) is the “most likely” value of the decision probability for stimulus *s*. The MAP predictor is a very popular estimator in statistics and is often even referred to as “Bayes-optimal.” However, the MAP estimator is only optimal under very particular conditions (namely under a 0/1 loss; Bishop, 2006), and a better predictor is the posterior mean probability *p*(*s*) = ∫_{ω} *f*(*ω*^{⊺}*s*) *π*(*ω*) *dω*. Here, not a single filter is used for prediction, but rather an integral over all possible filters, weighted by their relative posterior probabilities *π*(*ω*). In such a fully Bayesian model, the decision boundary is still linear, and the direction along which the probabilities vary fastest is still normal to the decision boundary. However, in general, the iso-probability contours (i.e. the lines along which *p*(*s*) is constant) are no longer hyper-planes parallel to the decision boundary (Bishop, 2006). Rather, *p*(*s*) depends both on the distance of *s* from the decision boundary and on the distance of *s* to the mean of the prior: The closer *s* is to the prior mean, the faster the probabilities change when moving away from the boundary. It will be interesting to see in the future whether such fully Bayesian models can be differentiated from our MAP model on the basis of experimental data.
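The difference between MAP and posterior-mean prediction can be sketched by Monte Carlo integration; the Gaussian posterior over filters is an assumed stand-in, not the posterior from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)

d = 50
w_opt = rng.standard_normal(d)
w_opt /= np.linalg.norm(w_opt)               # stand-in for the MAP filter
post_std = 0.3                               # assumed posterior spread around w_opt

f = lambda t: 1.0 / (1.0 + np.exp(-t))       # logistic link

def p_map(s):
    """Plug-in MAP prediction f(w_opt' s)."""
    return f(w_opt @ s)

def p_post_mean(s, n_samples=20000):
    """Posterior-mean prediction: average f(w' s) over filters sampled
    from the (assumed Gaussian) posterior."""
    W = w_opt + post_std * rng.standard_normal((n_samples, d))
    return f(W @ s).mean()

s = 3.0 * w_opt                              # a stimulus far from the boundary
pm = p_post_mean(s)
```

Averaging over the posterior shrinks extreme probabilities toward 0.5 relative to the plug-in MAP prediction, which is one way the two models could be told apart experimentally.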

The decision function of a support vector machine has the form *f*(*x*) = ∑_{i} *y*_{i}*α*_{i}*k*(*x*_{i}, *x*) + *β*, where the *kernel function k* specifies the dot product of any two data points in feature space. Intuitively, the *kernel function* captures the similarity between any two points.

The function *f* is found by minimizing the cost function *C*∑_{i} ∣1 − *y*_{i}*f*(*x*_{i})∣_{+} + ½∣∣*ω*∣∣^{2}, where ∣1 − *y*_{i}*f*(*x*_{i})∣_{+} = max(0, 1 − *y*_{i}*f*(*x*_{i})) is the hinge loss. This optimization problem can be rewritten as a dual problem over weights *α*_{i} on the data samples: maximize ∑_{i} *α*_{i} − ½∑_{i,j} *α*_{i}*α*_{j}*y*_{i}*y*_{j}*k*(*x*_{i}, *x*_{j}) subject to 0 ≤ *α*_{i} ≤ *C* and ∑_{i} *α*_{i}*y*_{i} = 0. This implies that the number of parameters in the optimization is independent of the dimensionality of the feature space, which makes it possible to work in high- (or even infinite-) dimensional feature spaces. However, it also implies that the computational requirements of the algorithm scale with the size of the data set. Thus, although in principle repeated presentations of a stimulus can be handled by simply inserting each presentation as a separate data point, this procedure can quickly result in very large data sets which become computationally inconvenient or even infeasible. Furthermore, we also want to be able to handle situations in which the decision probabilities *p*_{i} are arbitrary continuous quantities in [0, 1], not estimated over multiple stimulus presentations.
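The primal cost above can also be minimized directly for a small linear problem. This subgradient-descent sketch on arbitrary toy data is a stand-in for the dual solver, included only to make the cost function concrete:

```python
import numpy as np

rng = np.random.default_rng(6)

n, d = 200, 2
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] + 0.3 * X[:, 1])          # toy linearly separable labels
C = 1.0

def hinge(w, b):
    """Per-sample hinge losses |1 - y_i f(x_i)|_+ for a linear f."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def cost(w, b):
    """SVM primal cost: C * sum of hinge losses + 0.5 ||w||^2."""
    return C * hinge(w, b).sum() + 0.5 * (w @ w)

w, b = np.zeros(d), 0.0
for t in range(2000):
    viol = hinge(w, b) > 0                    # margin-violating points
    grad_w = w - C * (y[viol, None] * X[viol]).sum(0)
    grad_b = -C * y[viol].sum()
    lr = 1.0 / (1.0 + t)                      # diminishing step size
    w -= lr * grad_w / n
    b -= lr * grad_b / n

acc = (np.sign(X @ w + b) == y).mean()
```

Minimizing in the primal like this scales with the feature dimension rather than the data set size, which is the trade-off the dual formulation reverses.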

We therefore assume that each stimulus *x*_{i} not only has a class label *y*_{i}, but also a corresponding confidence or decision probability *p*_{i}.

The probability *p*_{i} determines our confidence as to whether the class label *y*_{i} is correct. For example, if *p*_{i} = 0.5, we know that *x*_{i} is an entirely ambiguous stimulus. Furthermore, the case *p*_{i} = *p*, *y*_{i} = −1 is equivalent to *p*_{i} = 1 − *p* and *y*_{i} = 1. To resolve this ambiguity, for every stimulus with labels (*y*_{i}, *p*_{i}), we add a second stimulus with *p*_{i+n} = 1 − *p*_{i} and *y*_{i+n} = −*y*_{i}. Then, the loss function can be generalized to *C*∑_{i} *p*_{i}∣1 − *y*_{i}*f*(*x*_{i})∣_{+} + ½∣∣*ω*∣∣^{2}, where the sum now runs over all 2*n* data points and the hinge loss of each data point is weighted by its confidence *p*_{i}. The second part of the loss function, which regularizes the shape of the decision function, is unchanged. This is equivalent to the dual problem above, except that there are now 2*n* rather than *n* weights *α*, and that the upper bound on each *α*_{i} is *Cp*_{i} rather than *C*. If any *p*_{i} is zero, we know that *α*_{i} = 0, i.e. the corresponding stimulus can be eliminated from the optimization problem, and the sum runs over fewer than 2*n* terms.
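The doubling construction and the confidence-weighted loss can be written down directly; the toy data and variable names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

n, d = 50, 5
X = rng.standard_normal((n, d))
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
p = rng.uniform(0.0, 1.0, n)                  # decision probabilities in [0, 1]

# doubled data set: every (x, y, p) gains a mirror copy (x, -y, 1 - p)
X2 = np.vstack([X, X])
y2 = np.concatenate([y, -y])
p2 = np.concatenate([p, 1.0 - p])

def weighted_hinge_cost(w, C=1.0):
    """Confidence-weighted SVM cost: C * sum_i p_i |1 - y_i w'x_i|_+ + 0.5 ||w||^2."""
    margins = np.maximum(0.0, 1.0 - y2 * (X2 @ w))
    return C * (p2 * margins).sum() + 0.5 * (w @ w)

w = rng.standard_normal(d)
```

A copy with *p*_{i} = 0 contributes nothing to the sum, matching the observation above that such stimuli drop out of the optimization, while a stimulus with *p*_{i} = 0.5 pulls on both sides of the boundary with equal weight.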

Note that this approach does not attempt to predict stimuli with *p*_{i} = 0.5, but rather aims to place them close to the decision boundary. Even stimuli which are classified very inconsistently can thus provide important information about the position of the decision boundary. A similar idea has been studied in the field of machine learning under the name ‘Universum SVM’ (Weston, Collobert, Sinz, Bottou, & Vapnik, 2006), which uses a ‘Universum’ of ‘non-examples’ in addition to the labeled data points. While the derivation and motivation of the USVM are very different from our approach, it is mathematically equivalent to an SVM with confidences *p*_{i} which are either 0, 1 or exactly 0.5: The ‘Universum’ consists of those data points which have completely ambiguous class labels.