Abstract

This paper describes a new model for human visual classification that enables the recovery of image features that explain performance on different visual classification tasks. Unlike some common methods, this algorithm does not explain performance with a single linear classifier operating on raw image pixels. Instead, it models classification as the result of combining the output of multiple feature detectors. This approach extracts more information about human visual classification than has been previously possible with other methods and provides a foundation for further exploration.

Introduction

The classification image algorithm (Ahumada, 2002) is one of the most successful tools for determining the information observers use to make visual classifications. In a typical classification image experiment, participants are presented with numerous noise-corrupted images from two categories and are asked to classify each one. The noise ensures that the image samples cover a large volume of the stimulus space. The classification image algorithm finds the linear classifier that best partitions the classified images. The linear classifier is defined by a set of weights which, when displayed visually, is called a "classification image." Analyzing the classification image reveals the extent to which each image region is correlated with an observer's classifications.

The classification process, however, may be more structured than the classification image approach suggests. For example, it may be that classification is the result of combining the detection of parts or features, rather than applying a single linear template. Consider the simple example of an observer attempting to distinguish the letter "P" from "Q" in the presence of visual noise. Given an image to classify, the observer may determine the presence of a "P" not by applying a single linear classifier, but by applying a set of independent feature detectors and combining their output. Detecting a vertical line or an upper right-facing curve would favor a "P" response, while discerning a circle or a low diagonal line would favor a "Q" response. Although a classification image would indicate the importance of all the features' component pixels to the observer's responses, it would not indicate their division into four separately detected features.

Recent evidence suggests that human observers utilize multi-feature strategies in some image classification tasks. For example, Pelli, Farell, and Moore (2003) have convincingly demonstrated that humans recognize noisy word images by parts even though better performance can be achieved by integrating information from the entire word. Similarly, Gold, Cohen, and Shiffrin (2006) verified that participants employed feature-based classification strategies for some simple image classes.

Cohen, Shiffrin, Gold, Ross, and Ross (2007) used a Gaussian Mixture Model (GMM) to recover the multiple features that may be used in a classification experiment. A GMM is a technique that clusters data into a fixed number of groups, each of which is modeled by a multivariate Gaussian distribution. If an observer employs a non-linear, multi-feature strategy, as in the "P" and "Q" example, the GMM algorithm can, under a reasonable set of assumptions, associate a cluster with each feature, providing more information about the observer's visual processing than the classification image approach.

Despite its success, the GMM has several limitations:

- The GMM describes the data, but does not provide an explicit model of the classification process.
- The classification image technique is not well described as a special case of the GMM: it is not equivalent to a GMM with a single feature.
- Using the GMM requires experimenters to collect participants' confidence ratings on each classification decision and then discard all but the high-confidence trials. This procedure wastes data and, because traditional classification image experiments do not measure confidence, impedes applying the algorithm to previously collected data.
- The recovered features are biased by the pixel values of the uncorrupted images composing each class, preventing a clear distinction between the structure of the experiment and the participants' internal visual mechanisms.

This paper describes and applies GRIFT (GRaphical model for Inferring Feature Templates), a model of human image classification that addresses all of these limitations. GRIFT models the image classification process using a *Bayesian network*, a probabilistic model that represents the causal relationships between variables (Pearl, 1988). GRIFT posits that human image classification is a non-linear process that results from combining the outputs of several independently computed feature detectors. Just as with the GMM, GRIFT can be applied to classification data to recover the features used to discriminate between two classes. Unlike the GMM, GRIFT provides an explicit classification model, can be used as a replacement for classification images in the single-feature case, avoids stimulus bias, and can be used on experimental data that lack confidence ratings. Ross and Cohen (2008) first presented GRIFT in the proceedings of the Neural Information Processing Systems conference. The goal of this paper is to provide a more detailed explanation of the GRIFT algorithm, to present more extensive experimental and simulation results, to extend the GRIFT model to ratings data, and to demonstrate the model's applicability to detection experiments.

The remainder of this paper describes the GRIFT model and the algorithm for fitting it to experimental data. We then demonstrate the efficacy of the model on simulated data sets and on data sets previously analyzed in Cohen et al. (2007), Gold et al. (2006), and Gold, Murray, Bennett, and Sekuler (2000). Then a series of newer experiments are described along with their GRIFT results. The paper concludes with a discussion of proposed extensions to GRIFT and possible future experiments.

The GRIFT model

To clarify the presentation, it is important to define the terms "pixel," "image," and "feature." A pixel is a square region of a stimulus that has a homogeneous brightness value. In an experiment, each stimulus pixel may be displayed using one or more computer monitor pixels depending on the spatial scaling applied to the stimulus; all references to "pixels" will refer to units of the stimulus, rather than those of the display hardware. An image is a collection of grayscale pixels arranged in a two-dimensional matrix. Images can be indexed by row and column. The two-dimensional location of particular pixels, however, is often unimportant. In this work, it is sometimes simpler to use a single pixel index. For example, the 20th pixel in a stimulus will be referred to as *S*_{20}.

The definition of the final term, "feature," is undoubtedly the most controversial. Researchers have defined feature in many different ways, but in this paper a feature is a pixel pattern detected by a linear classifier. A linear classifier is a set of weights assigned to each image pixel along with a threshold value. If the weighted sum of an image's pixels is less than the threshold, the feature is considered present in the image; otherwise the feature is absent. Linear classifiers can define many useful types of features, including those based on the contrast between different image regions. Because the human visual system is contrast-sensitive (Palmer, 1999) and contrast features can indicate the presence or absence of specific pixel patterns, such linear features are particularly informative. Extensions of GRIFT to different types of features are discussed below. Under this definition of feature, the classification image model, which models classification as the result of a single linear classifier, is a single-feature model, while GRIFT is a multi-feature model.
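As a minimal sketch of this definition, a feature is "present" when the weighted pixel sum falls below the threshold. The weights and threshold below are invented for illustration; they are not taken from the paper.

```python
def feature_present(pixels, weights, threshold):
    """Deterministic linear classifier: the feature is present when the
    weighted sum of pixel values is less than the threshold."""
    weighted_sum = sum(w * p for w, p in zip(weights, pixels))
    return weighted_sum < threshold

# Illustrative contrast feature on a 4-pixel image: positive weights on the
# left region, negative weights on the right, so the detector responds to
# left-versus-right contrast (all values are made up for this sketch).
weights = [1.0, 1.0, -1.0, -1.0]
print(feature_present([0.2, 0.1, 0.9, 0.8], weights, 0.0))  # True (dark left)
print(feature_present([0.9, 0.8, 0.2, 0.1], weights, 0.0))  # False (bright left)
```

Because the weights contrast two regions, uniformly brightening the whole image changes the weighted sum very little; this is the sense in which such features are contrast-based rather than brightness-based.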

GRIFT represents the classification process as a *Bayesian network* (Pearl, 1988), which is a type of *graphical model*. Physicists, statisticians, and computer scientists developed graphical models to describe the probabilistic interactions between variables. Graphical models are fundamental to modern research in artificial intelligence (e.g., Bishop, 2006) and computer vision (e.g., Forsyth & Ponce, 2003) and are playing a larger role in psychology (e.g., Rehder, 2003). The adjective "graphical" refers to the fact that these models are commonly represented by a graph in which nodes indicate variables and edges connecting the nodes indicate the influence of one variable on another. Bishop (2006) contains full details on all the types of graphical models discussed in this paper.

A Bayesian network (typically shortened to "Bayes net") is a causal graphical model. It represents each variable as a node, and connects them with directed edges (lines with arrows at one end) that point from a cause to an effect. The diagram in Figure 1 represents the causal relationships between the variables of the GRIFT model. The GRIFT model describes each classification, *C*, as the result of a stimulus, *S*, which is processed by a set of *N* feature detectors, *F* = {*F*_{1}, *F*_{2}, …, *F*_{N}}, that are instantiated as linear classifiers. Because there is an arrow from the stimulus *S* to each *F*_{i}, *S* is a *parent* of each *F*_{i}, and *S* directly influences the probability distribution of their values. Similarly, each *F*_{i} directly influences *C*. No other variables directly influence each other. In particular, *S* only influences *C* through the *F*_{i}s, and the *F*_{i}s do not directly influence each other. Variables that do not directly influence each other are called *conditionally independent*. Bayes nets efficiently model the interaction of many variables by encoding assumptions about their conditional independence.

Figure 1

Because an image can be represented as a vector of pixel values, each image is a point in a high-dimensional space. The linear classifier associated with each feature describes a boundary in this image space that separates images that have that feature from images that do not. That is, each feature detector is sensitive to a particular type of input; for example, feature detector *F*_{1} might respond best to the vertical line of a "P", while feature detector *F*_{2} might respond best to the circle in a "Q". The *F*_{i} variables are binary and indicate whether a feature has been detected (*F*_{i} = 1) or not (*F*_{i} = 0). Whereas a typical linear classifier is deterministic, the feature detectors in GRIFT are probabilistic. An image that falls near the boundary will have an approximately 50% chance of activating the feature detector, an image far to one side of the boundary will produce nearly a 100% probability of activation, and an image far to the other side will produce nearly a 0% probability of activation. Therefore, a clear, low-noise, high-contrast image of a "P" will almost certainly cause *F*_{1} = 1, but a noisy, low-contrast image of a "Q" might only have a small chance of causing *F*_{2} = 1.

The feature activations influence the classification, *C*, of a stimulus. In most of this paper, we assume two response classes, but an extension to ratings is described below and other generalizations are possible. The presence of some features increases the probability of choosing Class 1 (*C* = 1) and the presence of others increases the probability of choosing Class 2 (*C* = 2). Returning to the "P" and "Q" example, activation of the vertical line feature would increase the probability of responding "P". Activation of a circle feature would increase the probability of responding "Q".

Mathematically, the Bayes net representation is useful because it is an efficient representation of the joint probability distribution of all the model variables. For a Bayes net with variables *V* = {*V*_{1}, *V*_{2}, …, *V*_{M}}, the joint probability distribution is

$P(V) = \prod_{i=1}^{M} P(V_i \mid \mathrm{parents}(V_i)),$

(1)

where *P*(*V*_{i} ∣ parents(*V*_{i})) is the conditional probability distribution of *V*_{i} given its parents in the graph. This factorization generally provides a much more efficient representation than would be possible if the joint distribution were represented without any assumptions about causality or conditional independence. Therefore, in the GRIFT model,

$P(S, F, C) = P(S)\, P(C \mid F) \prod_{i=1}^{N} P(F_i \mid S).$

(2)
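Equation 2 can be made concrete with a toy instantiation: the joint probability is obtained by multiplying the component distributions together. All of the probability tables below are invented for illustration.

```python
# Toy instantiation of P(S, F, C) = P(S) * P(C|F) * prod_i P(F_i|S)
# with one binary stimulus variable and N = 2 binary feature detectors.
# Every number here is made up for this illustration.
p_S = {0: 0.5, 1: 0.5}
p_F_given_S = [  # p_F_given_S[i][s] = P(F_i = 1 | S = s)
    {0: 0.2, 1: 0.9},
    {0: 0.7, 1: 0.1},
]
p_C_given_F = {  # P(C = 1 | F_1, F_2); P(C = 2 | F) is the complement
    (0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.5,
}

def joint(s, f, c):
    """Multiply the factors of Equation 2 for one setting of (S, F, C)."""
    pf = 1.0
    for i, fi in enumerate(f):
        p1 = p_F_given_S[i][s]
        pf *= p1 if fi == 1 else 1.0 - p1
    pc = p_C_given_F[f] if c == 1 else 1.0 - p_C_given_F[f]
    return p_S[s] * pf * pc

# A valid factorization sums to 1 over all settings of the variables.
total = sum(joint(s, (f1, f2), c)
            for s in (0, 1) for f1 in (0, 1) for f2 in (0, 1) for c in (1, 2))
print(round(total, 10))  # 1.0
```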

Now that the structure of the model is established, the next task is to specify its component conditional probability distributions: *P*(*S*), *P*(*C* ∣ *F*), and *P*(*F*_{i} ∣ *S*). The distribution of the stimuli, *P*(*S*), is under the control of the experimenter. Fitting the GRIFT model to experimental data relies only on the assumption that, across trials, the stimuli sampled from this distribution are independent and identically distributed.

The conditional distribution of each feature detector's value, *P*(*F*_{i} ∣ *S*), is modeled as a logistic regression function on the pixel values of *S*. Because the feature detectors are assumed to be linear classifiers, each feature distribution is governed by two parameters, a weight vector *ω*_{i} and a threshold −*β*_{i}, such that

$P(F_i = 1 \mid S, \omega_i, \beta_i) = \left(1 + \exp\left(\beta_i + \sum_{j=1}^{|S|} \omega_{ij} S_j\right)\right)^{-1}$

(3)

$P(F_i = 0 \mid S, \omega_i, \beta_i) = 1 - P(F_i = 1 \mid S, \omega_i, \beta_i),$

(4)

where ∣*S*∣ is the number of pixels in a stimulus and *ω*_{ij} is the *j*th element of vector *ω*_{i}. The logistic regression function satisfies the probabilistic classification properties outlined above: stimuli near the boundary are classified less deterministically than those far from the boundary. In image pixel space, the weights define the orientation of the boundary that determines the presence or absence of the feature. They also determine the degree of probabilistic behavior the detector will exhibit; weights with larger absolute values lead to more deterministic output. The weights and threshold jointly determine the probability that a feature is detected for a particular image. There is a 50% probability that a feature will be detected when $\sum_{j=1}^{|S|} \omega_{ij} S_j = -\beta_i$. When $\sum_{j=1}^{|S|} \omega_{ij} S_j > -\beta_i$, i.e., the image is on the "absent" side of the linear boundary, there is a less than 50% chance that the feature will be detected. Likewise, when $\sum_{j=1}^{|S|} \omega_{ij} S_j < -\beta_i$, i.e., the image is on the "present" side of the linear boundary, there is a greater than 50% chance that the feature will be detected.

The conditional distribution of *C* is represented by a logistic regression function on the feature outputs. Therefore, the conditional distribution of *C* is determined by a weight vector *λ* and a threshold −*γ* that determine the impact of feature activations on the probability of a particular response, such that

$P(C = 1 \mid F, \lambda, \gamma) = \left(1 + \exp\left(\gamma + \sum_{i=1}^{N} \lambda_i F_i\right)\right)^{-1}$

(5)

$P(C = 2 \mid F, \lambda, \gamma) = 1 - P(C = 1 \mid F, \lambda, \gamma).$

(6)

Detecting a feature with negative *λ*_{i} increases the probability that the observer will respond "Class 1," and detecting a feature with positive *λ*_{i} increases the probability that the observer will respond "Class 2." Note that *γ* serves the same role as *β*_{i} in Equation 3.

Equations 5 and 6 can be generalized to represent a conditional probability distribution over ratings rather than classifications. Instead of allowing only two responses based on the feature detector values, ratings allow an observer to respond with an integer between 1 and *R*, where 1 indicates "definitely class 1," *R* indicates "definitely class 2," and the values in between indicate intermediate degrees of belief. Because they are ordered, rating probabilities can be represented with ordinal logistic regression (Agresti, 2002). The probability of responding with a rating less than or equal to *r*, where 1 ≤ *r* ≤ *R* − 1, is given by

$P(C \leq r \mid F, \lambda, \gamma) = \left(1 + \exp\left(\gamma_r + \sum_{i=1}^{N} \lambda_i F_i\right)\right)^{-1},$

(7)

where *γ* is a vector with *R* − 1 elements for which *γ*_{r} ≤ *γ*_{r−1}.^{1} The probability of rating *R* is therefore

$P(C = R \mid F, \lambda, \gamma) = 1 - P(C \leq R - 1 \mid F, \lambda, \gamma),$

(8)

the probability of rating 1 is

$P(C = 1 \mid F, \lambda, \gamma) = P(C \leq 1 \mid F, \lambda, \gamma),$

(9)

and, for any other rating,

$P(C = r \mid F, \lambda, \gamma) = P(C \leq r \mid F, \lambda, \gamma) - P(C \leq r - 1 \mid F, \lambda, \gamma).$

(10)
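Equations 3 and 7-10 can be transcribed directly into code. All parameter values below are invented; the sketch checks only structural properties of the model (the feature probability is a logistic function of the weighted pixel sum, and the *R* rating probabilities sum to 1).

```python
import math

def p_feature(S, omega, beta):
    """Eq. 3: P(F_i = 1 | S, omega_i, beta_i) for a logistic feature detector."""
    z = beta + sum(w * s for w, s in zip(omega, S))
    return 1.0 / (1.0 + math.exp(z))

def rating_probs(F, lam, gamma):
    """Eqs. 7-10: distribution over ratings 1..R given feature values F.
    gamma is a non-increasing vector of R - 1 thresholds."""
    R = len(gamma) + 1
    z = sum(l * f for l, f in zip(lam, F))
    cdf = [1.0 / (1.0 + math.exp(g + z)) for g in gamma]  # P(C <= r), Eq. 7
    probs = [cdf[0]]                                      # Eq. 9
    probs += [cdf[r] - cdf[r - 1] for r in range(1, R - 1)]  # Eq. 10
    probs.append(1.0 - cdf[-1])                           # Eq. 8
    return probs

# Invented parameters: a 4-pixel stimulus, N = 2 features, R = 6 ratings.
S = [0.3, -0.2, 0.5, 0.1]
omega = [[1.0, 1.0, -1.0, -1.0], [-0.5, 0.5, 0.5, -0.5]]
beta = [0.1, -0.3]
F = [1 if p_feature(S, w, b) > 0.5 else 0 for w, b in zip(omega, beta)]
probs = rating_probs(F, [2.0, -2.0], [2.0, 1.0, 0.0, -1.0, -2.0])
print(round(sum(probs), 10))  # 1.0
```

Because the thresholds in `gamma` are non-increasing, the cumulative probabilities in `cdf` increase with the rating, so every entry of `probs` is non-negative; with `len(gamma) == 1` the function reduces to the binary case of Equations 5 and 6.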

If *R* = 2, these equations correspond exactly to the binary classification probabilities described in Equations 5 and 6.

Figure 2 shows the full GRIFT model, including the parameters. To avoid clutter, the figure uses *plate notation*, in which duplicated model structures are drawn once and enclosed in a box. The "N" in the lower right corner indicates that all the variables in the box and their connections to other variables are duplicated once for each of the features in the model. Note that in Figure 2 the parameters are represented as parents of the previously described GRIFT variables. In accordance with the techniques of Bayesian statistics (Gelman, Carlin, Stern, & Rubin, 2004), the parameters are themselves treated as random variables.

Figure 2

Given data from an observer, i.e., a set of trials, each represented by a stimulus, *S*, and a response, *C*, the goal is to find the GRIFT parameter values that best account for the data. This parameter search is computationally complex. Even for small images and few features, this model has many parameters: each *ω*_{i} is a vector with as many dimensions as there are pixels in *S*, and each feature also contributes a *β*_{i} and a *λ*_{i} parameter. The fact that the *F*_{i} variables are hidden, i.e., not directly measurable, also substantially increases the challenge of fitting this model to data. The primary advantage of the Bayesian approach is that it provides a principled way to place constraints on the parameters that make model fitting practical given a reasonable amount of data.

In Bayesian models, constraints on parameters are represented by prior probability distributions. The priors represent our beliefs about the parameters before any experimental evidence is gathered. After data are gathered, knowledge about each parameter is represented by a posterior distribution, which is conditioned on all the observed data. The posterior describes the combined influence of the prior assumptions and the model likelihood given the observed data. As more data are gathered, the influence of the priors decreases.

The prior on each *λ*_{i} parameter reflects the assumption that each feature should have a significant impact on the classification, but no single feature should make the classification deterministic. In particular, the prior is a mixture of two normal distributions with means at ±2,

$P(\lambda_i) = \frac{1}{2\sqrt{2\pi}}\left(\exp\left(-\frac{(\lambda_i + 2)^2}{2}\right) + \exp\left(-\frac{(\lambda_i - 2)^2}{2}\right)\right).$

(11)

This prior has a number of desirable characteristics. First, if the density of a prior is too concentrated, its influence on the results will be very strong unless there are a lot of data. Each component normal distribution in Equation 11 has unit variance, which makes the distributions broad enough that the data collected in our experiments will largely determine the parameter estimates. Second, if *λ*_{i} ≈ 0, *F*_{i}'s output will not significantly influence *C*. In contrast, if any *λ*_{i} is too far from zero, a single active feature can make the response nearly deterministic. To avoid these extremes, most of the mass of the prior should be significantly far from zero, but not concentrated at large positive or negative values. The means of the component normals, −2 and 2, determined empirically, satisfy these constraints.^{2}

Because the best *γ* is largely determined by the *λ*_{i}s and the distributions of *F* and *S*, *γ* has a non-informative prior,

$P(\gamma) = 1.$

(12)
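As a quick numerical check of Equation 11, the mixture density below is symmetric about zero and integrates to approximately 1:

```python
import math

def p_lambda(x):
    """Eq. 11: equal mixture of two unit-variance normals centered at -2 and +2."""
    c = 1.0 / (2.0 * math.sqrt(2.0 * math.pi))
    return c * (math.exp(-(x + 2.0) ** 2 / 2.0) + math.exp(-(x - 2.0) ** 2 / 2.0))

# Symmetric about zero, bimodal with modes near -2 and +2.
print(p_lambda(1.5) == p_lambda(-1.5))  # True

# Riemann-sum approximation of the integral over [-10, 10).
dx = 0.001
total = sum(p_lambda(-10.0 + k * dx) * dx for k in range(20000))
print(round(total, 3))  # 1.0
```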

This constant prior indicates no preference for any particular value. Although this function does not integrate to 1 as *γ* ranges from negative to positive infinity, and is therefore not a true probability density, it can be used as an *improper prior* (Gelman et al., 2004). Improper priors are an accepted Bayesian statistical technique so long as they produce normalized posterior distributions. In GRIFT, *P*(*γ*) has no effect on the posterior distributions of the parameters, as demonstrated in Appendix 1, and therefore it is an acceptable improper prior. Analogously, *P*(*β*_{i}) = 1 for all *i*.

Because each *ω*_{i} vector has dimensionality equal to the number of pixels in a stimulus, these parameters present the biggest inferential challenge. As mentioned previously, human visual processing is sensitive to contrasts between image regions. If one image region is assigned positive *ω*_{ij}s and another is assigned negative *ω*_{ij}s, the feature detector will be sensitive to the contrast between them. This contrast between regions requires all the pixels within each region to share similar *ω*_{ij} values. To encourage this local structure and reduce the difficulty of recovering the *ω*_{i}s, the prior distribution was designed to favor assigning similar weights to neighboring pixels. Each *ω*_{i} parameter has a prior distribution given by

$P(\omega_i) \propto \left[\prod_j \left(\exp\left(-\frac{(\omega_{ij} - 1)^2}{2}\right) + \exp\left(-\frac{(\omega_{ij} + 1)^2}{2}\right)\right)\right]\left[\prod_{(j,k) \in A} \exp\left(-\frac{(\omega_{ij} - \omega_{ik})^2}{2}\right)\right],$

(13)

where *A* is the set of neighboring pixel locations in the stimulus. This density function has two elements. The first term is a mixture of two normal distributions. The components have modes at 1 and −1, respectively, and each has unit variance. The combination assigns roughly equal probability to *ω*_{ij} values between −1 and 1, but unlike a uniform prior, it places some probability mass at every value and therefore allows *ω*_{ij} values to lie outside that range. The second term increases as the weights assigned to neighboring pixels become more similar and decreases as they become more different. This type of probability function is known as a *Markov random field* (MRF; Besag, 1974, see also Bishop, 2006), a class of graphical model frequently used in computer vision and physics. Geman and Geman (1984) pioneered the use of MRFs in computer vision as a model for reconstructing noisy images. For the purpose of fitting the model, there is no need to normalize this distribution because the normalization is constant with respect to *ω*_{i}.
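The unnormalized prior in Equation 13 can be sketched for a tiny 2 × 2 feature with 4-connected pixel neighbors. The weight vectors are invented; the check confirms that the prior favors spatially smoother weight assignments.

```python
import math

def log_omega_prior(omega, neighbors):
    """Unnormalized log of Eq. 13: a per-pixel mixture term with modes at
    +1 and -1, plus a Markov-random-field smoothness term over neighbors."""
    mixture = sum(
        math.log(math.exp(-(w - 1.0) ** 2 / 2.0)
                 + math.exp(-(w + 1.0) ** 2 / 2.0))
        for w in omega)
    smoothness = sum(-(omega[j] - omega[k]) ** 2 / 2.0 for j, k in neighbors)
    return mixture + smoothness

# A 2x2 pixel grid flattened row-major; 4-connected neighbor pairs.
neighbors = [(0, 1), (2, 3), (0, 2), (1, 3)]
smooth = [1.0, 1.0, -1.0, -1.0]   # top/bottom contrast: locally smooth
rough = [1.0, -1.0, -1.0, 1.0]    # same weight values, checkerboard layout

# Both vectors get the same mixture term, so the MRF term decides:
# the prior assigns higher (log) density to the smoother assignment.
print(log_omega_prior(smooth, neighbors) > log_omega_prior(rough, neighbors))  # True
```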

Using Equation 1 to combine all of the parameters and variables into a single probability distribution, the model is described by

$P(S, F, C, \omega, \beta, \lambda, \gamma) = P(S)\, P(C \mid F, \lambda, \gamma)\, P(\gamma) \prod_{i=1}^{N} P(F_i \mid S, \omega_i, \beta_i)\, P(\omega_i)\, P(\beta_i)\, P(\lambda_i).$

(14)

Fitting GRIFT to data

Given experimental data (observed classifications for a set of stimuli), the GRIFT model, and priors on the model parameters, the next step is to determine the GRIFT parameter values that best satisfy the prior distributions and best account for the (*S*, *C*) sample pairs gathered from a human observer. The method used to find these parameter values (provided in Appendix 1) is an instance of the expectation-maximization (EM) algorithm, a powerful technique for fitting models with unobserved variables such as GRIFT's feature detectors (Dempster, Laird, & Rubin, 1977). EM chooses an initial value for all the parameters and uses it to compute better parameter values from an estimate of the hidden variables' values. The better estimate then replaces the initial parameter values, and the algorithm repeats until convergence. EM guarantees the discovery of locally optimal parameter values. By running EM many times with randomized initial parameter values, we can increase the chance that the best parameter values discovered are the globally optimal parameters for the data.

Fitting GRIFT to data requires choosing a value for *N*, the number of feature detectors. Determining the optimal *N* with a high degree of certainty is difficult. Increasing the number of model parameters, in this case increasing *N*, almost always improves the ability of the model to fit the data. At the same time, however, increasing the number of parameters also generally increases the chance that the model will overfit the data, i.e., explain noise in the data rather than produce an accurate representation of the classification process. Similar difficulties exist, for example, in determining the correct dimensionality for a multidimensional scaling solution (Borg & Groenen, 1997) or choosing the maximum degree to use in polynomial regression (Bishop, 2006). Because there is no generally accepted solution for determining the correct number of features, we recommend that any application of GRIFT proceed in three steps. The first two steps are outlined in this paper and involve using GRIFT to recover a potential set of features for a range of *N* and then evaluating each *N* based on a number of quantitative and qualitative measures discussed below. The final step involves performing additional experiments to verify the features recovered by GRIFT. The experimental results discussed below include the GRIFT models fit with a reasonable range of *N* values. We present supplementary statistics that, in most cases, either indicate the correct *N* or provide a strong indication of which values are likely.
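The EM-with-restarts procedure can be sketched on a toy problem. The code below fits a two-component 1-D Gaussian mixture rather than GRIFT itself (GRIFT's update equations depend on its hidden feature detectors and are not reproduced here); the data and all constants are invented for illustration.

```python
import math
import random

def em_gmm_1d(data, iters=100):
    """EM for a toy two-component 1-D Gaussian mixture with unit variances
    and equal mixing weights. This shows only the E-step/M-step pattern."""
    mu = [random.choice(data), random.choice(data)]  # random initialization
    for _ in range(iters):
        # E-step: soft responsibility of component 0 for each data point.
        r0 = []
        for x in data:
            p0 = math.exp(-(x - mu[0]) ** 2 / 2.0)
            p1 = math.exp(-(x - mu[1]) ** 2 / 2.0)
            r0.append(p0 / (p0 + p1))
        # M-step: re-estimate the means from the soft assignments.
        w0 = sum(r0)
        w1 = len(data) - w0
        mu = [sum(r * x for r, x in zip(r0, data)) / w0,
              sum((1.0 - r) * x for r, x in zip(r0, data)) / w1]
    return sorted(mu)

def log_likelihood(data, mu):
    return sum(math.log(0.5 * math.exp(-(x - mu[0]) ** 2 / 2.0)
                        + 0.5 * math.exp(-(x - mu[1]) ** 2 / 2.0))
               for x in data)

random.seed(0)
data = ([random.gauss(-3.0, 1.0) for _ in range(200)]
        + [random.gauss(3.0, 1.0) for _ in range(200)])
# Random restarts: keep the highest-likelihood solution, as recommended above.
best = max((em_gmm_1d(data) for _ in range(10)),
           key=lambda mu: log_likelihood(data, mu))
print(best)  # the two means should land near -3 and +3
```

Each EM run converges only to a local optimum; the restart loop plays the same role here as the randomized initializations used when fitting GRIFT.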

In some cases, it is obvious when *N* is too large, for example, when the model-fitting algorithm produces feature detectors that either never fire or always fire. Detectors that never fire cannot influence the classification output. If detector *F*_{i} always fires, regardless of the stimulus presented, a mathematically equivalent model can be constructed by removing that feature and adding *λ*_{i} to *γ*. In the Results and Discussion section, we will present the probability of each feature detector firing conditioned on the observers' responses, which can indicate when these useless feature detectors are present in a GRIFT model. The appearance of either type of feature indicates that *N* is set too high.^{3}

Another option is to compute a statistic that has been shown to be helpful in indicating model over-fitting. One commonly used statistic is the Akaike Information Criterion (AIC), which is 2*k* − 2 ln *L*, where *k* is the number of free parameters in the model and *L* is the likelihood of the parameter values given the data (Akaike, 1974). For our model, the statistic equals 2(*N*(∣*S*∣ + 2) + (*R* − 1)) − 2 ln(*P*(S, C ∣ *θ*)), where S and C are the observations from all the trials. Lower AIC scores indicate better models. This statistic penalizes increases in model complexity, in this case increases in *N*, that do not result in substantial increases in model likelihood.^{4}

Experiments

Five experiments were analyzed to validate the GRIFT model and discover potential multi-feature, non-linear classification strategies. All experiments used variations of the traditional classification image experimental design, in which participants are asked to classify a series of noise-corrupted images (e.g., Gold et al., 2006). The first four experiments (*light-dark*, *faces*, *four-square*, and *Kanizsa*) were classification experiments in which participants categorized stimuli into one of two classes. Each class contained one or more target images. The experiments differed in the number and type of targets in each class. To show that GRIFT may also be adapted to other experimental paradigms, the fifth experiment (*square-detection*) was a detection experiment in which participants were asked to determine if a single target was present and to respond with a confidence rating.

The light-dark and faces experiments were first described in Ross and Cohen (2008) and are reanalyzed here. The four-square and Kanizsa data were first presented in Gold et al. (2006) and Gold et al. (2000), respectively. GRIFT was initially applied to the four-square data in Ross and Cohen (2008), but has not been previously applied to the Kanizsa data. New GRIFT model fits and additional analyses for both the four-square and Kanizsa data sets are presented here. The square-detection data have not been previously published.

General method

Classification experiments

Design details of the previously published classification experiments are given in the papers cited above. On each trial, a participant saw a stimulus (a sample from *P*(*S*)) that consisted of a randomly selected target with independent, identically distributed noise added to each pixel. In particular, a stimulus (*S*) was produced by randomly selecting one of the available targets (*T*), multiplying it by a contrast level (*b*), and adding random independent truncated Gaussian noise^{5} (*G*) at every pixel: *S* = *bT* + *G*. The participant's task was to choose the class of the underlying target. Feedback was provided after each trial. In the Kanizsa experiment, participants completed between 9,512 and 9,814 trials. In the other three experiments, participants completed between 4,000 and 4,102 trials.

The light-dark and faces experiments were broken into two sessions. Each session consisted of 2,000 trials and lasted approximately 90 minutes or less. For the first 100 trials, the participant was reminded of the two target classes by a noise-free, high-contrast display of all the targets along with their class labels after every 20 trials. After the first 100 trials, the reminder displays appeared after every 100 trials. On each trial, the stimulus remained on the display until the participant responded. The participants were not instructed to answer as quickly as possible, but were told to trust their initial impression rather than spending many seconds or minutes studying each stimulus. Auditory feedback after each trial indicated whether the answer given was correct or incorrect. The experiments were implemented in MATLAB using the Psychtoolbox software (Brainard, 1997). Stimuli were presented on an Apple eMac computer positioned 1 m away from the participants, and their head positions were controlled with a chin rest. There was no ambient light, and the monitor was calibrated so stimuli could be presented at known brightness values. Observer responses were recorded using the computer keyboard.

In the light-dark and faces experiments, the contrast level was initialized to a high value and adjusted over the first 101 trials in each session using a *stair-casing* algorithm (Macmillan & Creelman, 2005). Stair-casing increased or decreased the target contrast to keep the participant's performance near the 71% correct level. This level was chosen as a good balance between the need to explore responses across a large volume of the image stimulus space and the need to keep the participants engaged in the task. To make the trials statistically independent of one another, the contrast level for the remainder of the experiment was fixed at the mean of the contrasts of the final 20 stair-cased trials. The initial stair-casing trials in each session were discarded when fitting GRIFT models to the data.

The stimulus generation and experimental procedures for the four-square and Kanizsa data sets were similar to those described above (see Gold et al., 2006, 2000, respectively). The one significant exception is that stair-casing adjustments to the signal contrast levels occurred throughout these experiments to ensure that participants' accuracy did not deviate significantly. Because the response on a trial can influence the contrast level of its successors, this method violates the assumption that trials are independent of one another. This dependence was ignored, however, when fitting GRIFT models to the results. In practice, after an initial adjustment period, participants' contrast levels stay nearly constant through an experimental session, so the trials can reasonably be treated as independent.
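A staircase that tracks roughly 71% correct can be sketched with the classic 2-down-1-up rule, which converges near 70.7% correct. This is an assumption for illustration: the text cites Macmillan and Creelman (2005) but does not specify the exact rule or step size used.

```python
class TwoDownOneUp:
    """Illustrative 2-down-1-up staircase: two consecutive correct
    responses lower the contrast (harder); any error raises it (easier).
    This rule converges near 70.7% correct, consistent with the paper's
    71% target, but the actual rule and step size are assumptions."""

    def __init__(self, contrast=1.0, step=0.05, floor=0.01):
        self.contrast = contrast
        self.step = step
        self.floor = floor          # keep contrast strictly positive
        self.correct_streak = 0

    def update(self, correct):
        if correct:
            self.correct_streak += 1
            if self.correct_streak == 2:          # two in a row -> harder
                self.contrast = max(self.floor, self.contrast - self.step)
                self.correct_streak = 0
        else:                                     # any error -> easier
            self.correct_streak = 0
            self.contrast += self.step
        return self.contrast
```

Fixing the contrast afterward at the mean of the last 20 stair-cased trials, as the experiments did, is what restores trial-to-trial independence.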

Detection experiment

The square-detection experiments used a stimulus-generation procedure similar to that of the four-square experiment, but there was only one potential target in each condition, and the observers' task was to determine whether this target was present or whether the image was composed purely of noise. Instead of responding with a binary classification, participants gave a rating from 1 (definitely absent) to 6 (definitely present).

Stimuli

Figure 3 shows the classes, targets, and a sample stimulus (target plus noise) from each response class or condition for each of the five experiments.Β

Figure 3

The light-dark experiment asked participants to distinguish between three strips that each had two light and one dark blob and three strips that each had one light and two dark blobs. Observers could successfully distinguish between the two groups either by relying on overall brightness (Class 1 stimuli were brighter than Class 2 stimuli) or by searching for individual light-dark patterns.Β

The faces task asked participants to distinguish between stimuli produced from two target faces (from Gold, Bennett, & Sekuler, 1999, unfiltered and down-sampled to 128 × 128 pixels). Classifying faces is a more natural visual task than classifying abstract patterns, and faces can be distinguished at a relatively low resolution, which keeps the total number of parameters tractable. We wanted to investigate whether participants would process the faces holistically (Sergent, 1984), i.e., using a single classifier, or through the detection of multiple parts.

In the four-square experiment, participants were asked to distinguish between two stimulus classes, one in which there were bright squares in the upper-left or upper-right corners and one in which there were bright squares in the lower-left or lower-right corners. These classes can be linearly discriminated by comparing the overall brightness of the top and bottom pixels in each image, but observers may also pursue a multi-feature, non-linear strategy and attempt to detect the four possible bright corners independently. Previous analyses (Cohen et al., 2007; Gold et al., 2006) provide evidence that observers used multi-feature strategies in this experiment.Β

The Kanizsa-square experiment required observers to differentiate between two figures that produce slightly different illusory contours, i.e., perceived contours that are not actually present in the stimulus. The corners of the Class 1 and Class 2 targets are tilted to produce illusory vertical contours that are bowed outwards and inwards, respectively. Although the pixels of the illusory contours are identically distributed in each stimulus class and therefore cannot provide useful discriminative information, Gold et al.'s (2000) classification-image analysis indicated that participants used the pixels of the illusory contours to classify the stimuli. The participants focused mainly on the vertical contours. These contours, however, are separated in space. GRIFT was applied to these data to determine if the two illusory contours comprise a single feature or two separate features.Β

Finally, the square-detection experiment had three conditions: full, incomplete, and incomplete-rotated. In each condition, participants judged whether a single target was present or absent and responded with a confidence rating. In the full condition, the target was a square; in the incomplete condition, the target was the corners of the full square; and in the incomplete-rotated condition, the target was four corners rotated to the same orientation in order to disrupt illusory contour effects.

Participants

The light-dark and faces experiments were each run on three observers. These participants were University of Massachusetts Amherst graduate students and the spouse of a UMass postdoctoral fellow. Participants were paid $11 per hour, with a $10 bonus for being the most accurate classifier for a particular experiment. The participants were naive to the underlying model and the fact that we were interested in finding multiple independent feature detectors.Β

The square-detection data were collected from two Indiana University undergraduates who were both naive to the purpose of the experiment. Four observers participated in the four-square experiment, two of whom (EA and RS) were naive to the purpose of discovering multiple independent detectors. Three observers participated in the Kanizsa experiment.Β

In addition to the human participants, three simulated observers were created to validate GRIFT's ability to recover feature detectors on the four-square task data. The *top-vs.-bottom* observer classified images by comparing the brightness of the top and bottom pixels.^{6} Bright pixels on the top indicated Class 1. The *corners* observer classified images using four features, each sensitive to a particular corner brightness pattern. Bright pixels in the top-left or top-right corners indicated Class 1; bright pixels in the bottom-left or bottom-right corners indicated Class 2. The *combo* observer used all the features from both the top-vs.-bottom and corners observers. These three simulated observers were given examples at a fixed noise level, and their parameters were adjusted so they classified the stimuli with accuracy similar to the human observers. Because the parameters were fixed throughout the experiments, no stair-casing of the stimuli was employed.

Results and Discussion
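The top-vs.-bottom simulated observer can be sketched as a one-line decision rule on the image halves. The optional decision noise below stands in for the parameter tuning the paper used to match human accuracy; the exact mechanism is an assumption, as is the function name.

```python
import numpy as np

def top_vs_bottom(stimulus, noise_sd=0.0, rng=None):
    """Sketch of the top-vs.-bottom simulated observer: compare the mean
    brightness of the top and bottom halves of the stimulus. A brighter
    top indicates Class 1. noise_sd is an assumed decision-noise knob
    used to degrade accuracy toward human levels."""
    if rng is None:
        rng = np.random.default_rng()
    h = stimulus.shape[0] // 2
    evidence = stimulus[:h].mean() - stimulus[h:].mean()
    evidence += rng.normal(0.0, noise_sd)
    return 1 if evidence > 0 else 2
```

A corners observer would instead apply four such comparisons, one per corner region, and combine their binary outputs, which is exactly the multi-feature structure GRIFT is designed to recover.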

Figures 4–8 and Tables 1–13 display the results of fitting GRIFT models to the data from the previously described experiments. The most informative parameter values are the *ω*_{i}s. Keep in mind that the *ω*_{i}s are not image pixel values that the features are attempting to match. Rather, they represent the weights of the linear classifiers that compose each recovered feature detector. The figures present these weights graphically for each value of *N* for each data set. By examining the pattern of positive and negative weights, it is possible to determine what average brightnesses and contrasts are computed by each feature detector. Although the difference between the weights is informative, the sign of the weights is usually not significant: given a fixed number of features, there are typically several sets of features with identical log likelihoods that differ from each other only in the signs of their *ω* terms and the associated *λ* and *β* values.

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Table 1

| Four-square | | Top vs. bottom | | Corners | | Combo | |
|---|---|---|---|---|---|---|---|
| | | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} |
| N = 1 | γ | 2.1 | — | −2.8 | — | −2.9 | — |
| | F_{1} | −4.0 | 0.78 ∣ 0.23 | 5.2 | 0.35 ∣ 0.67 | 5.7 | 0.24 ∣ 0.75 |
| N = 2 | γ | 1.1 | — | 4.7 | — | −5.4 | — |
| | F_{1} | 4.9 | 0.23 ∣ 0.78 | −3.7 | 0.76 ∣ 0.49 | 4.2 | 0.31 ∣ 0.79 |
| | F_{2} | −3.8 | 0.93 ∣ 0.96 | −4.0 | 0.75 ∣ 0.45 | 4.4 | 0.40 ∣ 0.84 |
| N = 3 | γ | −5.8 | — | 5.6 | — | 2.0 | — |
| | F_{1} | 3.8 | 0.26 ∣ 0.76 | −3.5 | 0.73 ∣ 0.46 | −4.5 | 0.75 ∣ 0.28 |
| | F_{2} | 4.3 | 0.26 ∣ 0.80 | −3.8 | 0.43 ∣ 0.20 | −4.7 | 0.72 ∣ 0.39 |
| | F_{3} | 4.4 | 0.43 ∣ 0.24 | −4.2 | 0.67 ∣ 0.47 | 4.9 | 0.47 ∣ 0.74 |
| N = 4 | γ | 0.8 | — | 0.2 | — | −1.8 | — |
| | F_{1} | −4.5 | 0.74 ∣ 0.21 | −3.6 | 0.75 ∣ 0.52 | −4.0 | 0.69 ∣ 0.31 |
| | F_{2} | 4.5 | 0.22 ∣ 0.76 | −3.6 | 0.47 ∣ 0.23 | −4.1 | 0.76 ∣ 0.40 |
| | F_{3} | −3.5 | 0.59 ∣ 0.63 | 3.3 | 0.53 ∣ 0.68 | 4.8 | 0.52 ∣ 0.82 |
| | F_{4} | 4.5 | 0.32 ∣ 0.19 | 3.8 | 0.23 ∣ 0.47 | 4.0 | 0.57 ∣ 0.83 |
| N = 5 | γ | −3.4 | — | −2.6 | — | 2.6 | — |
| | F_{1} | 5.0 | 0.17 ∣ 0.68 | −4.0 | 0.48 ∣ 0.24 | −4.5 | 0.39 ∣ 0.20 |
| | F_{2} | 5.6 | 0.24 ∣ 0.81 | 4.0 | 0.22 ∣ 0.47 | −4.5 | 0.53 ∣ 0.34 |
| | F_{3} | 3.9 | 0.71 ∣ 0.68 | 4.1 | 0.41 ∣ 0.64 | −4.2 | 0.72 ∣ 0.45 |
| | F_{4} | −4.0 | 0.36 ∣ 0.41 | 4.1 | 0.22 ∣ 0.45 | 3.9 | 0.18 ∣ 0.39 |
| | F_{5} | −3.8 | 0.66 ∣ 0.77 | −3.7 | 0.19 ∣ 0.28 | 4.1 | 0.22 ∣ 0.76 |
| N = 6 | γ | 3.2 | — | 2.0 | — | 2.4 | — |
| | F_{1} | −6.4 | 0.78 ∣ 0.21 | 4.6 | 0.26 ∣ 0.50 | −3.7 | 0.46 ∣ 0.15 |
| | F_{2} | −3.4 | 0.61 ∣ 0.54 | −4.1 | 0.76 ∣ 0.53 | −4.7 | 0.52 ∣ 0.35 |
| | F_{3} | 4.2 | 0.29 ∣ 0.15 | −4.4 | 0.61 ∣ 0.39 | −4.5 | 0.34 ∣ 0.20 |
| | F_{4} | −4.0 | 0.55 ∣ 0.22 | 4.2 | 0.55 ∣ 0.80 | 4.0 | 0.26 ∣ 0.78 |
| | F_{5} | 3.6 | 0.54 ∣ 0.52 | −4.0 | 0.17 ∣ 0.24 | 4.3 | 0.32 ∣ 0.59 |
| | F_{6} | 4.3 | 0.10 ∣ 0.20 | −3.9 | 0.27 ∣ 0.20 | −3.8 | 0.69 ∣ 0.47 |

Table 2

| Four-square: Simulated observer | Fit | N = 1 | N = 2 | N = 3 | N = 4 | N = 5 | N = 6 |
|---|---|---|---|---|---|---|---|
| Top vs. bottom | AIC | 3,758 | 3,813 | 3,812 | 3,845 | 3,864 | 3,995 |
| | LnL | −1,812 | −1,774 | −1,707 | −1,657 | −1,601 | −1,581 |
| Corners | AIC | 4,466 | 4,438 | 4,397 | 4,398 | 4,398 | 4,413 |
| | LnL | −2,166 | −2,086 | −2,000 | −1,934 | −1,868 | −1,810 |
| Combo | AIC | 3,534 | 3,549 | 3,510 | 3,553 | 3,585 | 3,623 |
| | LnL | −1,700 | −1,642 | −1,556 | −1,511 | −1,461 | −1,415 |
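The AIC and LnL rows in these fit tables are related by the standard definition AIC = 2*k* − 2 ln *L*, where *k* is the number of free parameters. A small helper makes the relationship concrete; note that the parameter count used in the test below is inferred from the table values themselves (it is consistent with 64 pixel weights plus one *β*, *λ*, and *γ* for a one-feature model) and is an assumption, not a figure stated in the text.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln L. Lower is better;
    the 2k term penalizes extra parameters, which is how the tables trade
    off raw fit quality (LnL) against model complexity."""
    return 2 * n_params - 2 * log_likelihood
```

For example, `aic(-1812, 67)` reproduces the 3,758 reported for the one-feature top-vs.-bottom fit, under the assumed count of 67 parameters.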

Table 3

| Four-square | | AC | | EA | | JG | | RS | |
|---|---|---|---|---|---|---|---|---|---|
| | | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} |
| N = 1 | γ | 2.5 | — | 2.1 | — | 2.9 | — | −2.6 | — |
| | F_{1} | −5.9 | 0.78 ∣ 0.27 | −4.2 | 0.72 ∣ 0.28 | −5.7 | 0.75 ∣ 0.27 | 4.6 | 0.32 ∣ 0.77 |
| N = 2 | γ | 3.3 | — | −2.8 | — | −1.6 | — | 2.9 | — |
| | F_{1} | −5.0 | 0.65 ∣ 0.19 | 3.2 | 0.25 ∣ 0.62 | −6.0 | 0.55 ∣ 0.16 | −4.6 | 0.50 ∣ 0.16 |
| | F_{2} | −5.3 | 0.55 ∣ 0.15 | 3.4 | 0.24 ∣ 0.66 | 5.2 | 0.47 ∣ 0.85 | −4.8 | 0.45 ∣ 0.14 |
| N = 3 | γ | −3.2 | — | −5.3 | — | −2.6 | — | −2.8 | — |
| | F_{1} | −5.5 | 0.65 ∣ 0.26 | 4.1 | 0.34 ∣ 0.75 | 5.8 | 0.26 ∣ 0.69 | −4.7 | 0.36 ∣ 0.11 |
| | F_{2} | 4.3 | 0.13 ∣ 0.57 | 4.1 | 0.35 ∣ 0.70 | 5.5 | 0.59 ∣ 0.89 | 5.1 | 0.37 ∣ 0.77 |
| | F_{3} | 5.1 | 0.57 ∣ 0.83 | 4.2 | 0.21 ∣ 0.27 | −5.4 | 0.90 ∣ 0.69 | 5.0 | 0.12 ∣ 0.26 |
| N = 4 | γ | 4.2 | — | 3.7 | — | 0.6 | — | −0.9 | — |
| | F_{1} | −5.0 | 0.56 ∣ 0.18 | −4.6 | 0.78 ∣ 0.56 | −4.5 | 0.92 ∣ 0.65 | −5.5 | 0.88 ∣ 0.78 |
| | F_{2} | −5.0 | 0.59 ∣ 0.14 | −3.9 | 0.62 ∣ 0.23 | −5.2 | 0.70 ∣ 0.27 | −4.4 | 0.64 ∣ 0.20 |
| | F_{3} | −6.2 | 0.89 ∣ 0.70 | 4.1 | 0.24 ∣ 0.59 | 4.8 | 0.09 ∣ 0.30 | 5.0 | 0.67 ∣ 0.89 |
| | F_{4} | 5.0 | 0.56 ∣ 0.83 | −5.2 | 0.10 ∣ 0.08 | 6.1 | 0.58 ∣ 0.86 | 4.1 | 0.65 ∣ 0.90 |
| N = 5 | γ | 1.6 | — | 2.7 | — | −1.7 | — | 2.3 | — |
| | F_{1} | 4.9 | 0.11 ∣ 0.33 | −4.8 | 0.72 ∣ 0.30 | 5.0 | 0.29 ∣ 0.78 | −5.0 | 0.87 ∣ 0.77 |
| | F_{2} | −5.0 | 0.72 ∣ 0.25 | −4.7 | 0.21 ∣ 0.16 | −5.4 | 0.90 ∣ 0.70 | −4.6 | 0.43 ∣ 0.18 |
| | F_{3} | 5.1 | 0.65 ∣ 0.88 | 4.5 | 0.39 ∣ 0.72 | −5.2 | 0.43 ∣ 0.16 | −4.5 | 0.70 ∣ 0.43 |
| | F_{4} | −4.7 | 0.89 ∣ 0.61 | 4.7 | 0.21 ∣ 0.37 | 4.9 | 0.14 ∣ 0.34 | 4.5 | 0.28 ∣ 0.71 |
| | F_{5} | −5.0 | 0.42 ∣ 0.17 | −4.3 | 0.74 ∣ 0.64 | 4.9 | 0.62 ∣ 0.86 | 4.4 | 0.71 ∣ 0.88 |
| N = 6 | γ | −7.4 | — | 1.4 | — | 2.9 | — | −6.7 | — |
| | F_{1} | 4.8 | 0.44 ∣ 0.78 | −5.0 | 0.52 ∣ 0.25 | −3.8 | 0.80 ∣ 0.35 | −4.9 | 0.68 ∣ 0.25 |
| | F_{2} | −4.4 | 0.54 ∣ 0.14 | −4.2 | 0.79 ∣ 0.74 | −4.9 | 0.86 ∣ 0.62 | 5.1 | 0.81 ∣ 0.92 |
| | F_{3} | 4.8 | 0.51 ∣ 0.81 | −4.2 | 0.81 ∣ 0.65 | −3.9 | 0.29 ∣ 0.12 | −4.7 | 0.73 ∣ 0.54 |
| | F_{4} | −4.8 | 0.87 ∣ 0.69 | 4.8 | 0.29 ∣ 0.69 | −3.7 | 0.72 ∣ 0.25 | 4.9 | 0.65 ∣ 0.85 |
| | F_{5} | 5.3 | 0.67 ∣ 0.89 | 4.8 | 0.67 ∣ 0.74 | 6.0 | 0.62 ∣ 0.87 | 4.0 | 0.59 ∣ 0.81 |
| | F_{6} | 5.4 | 0.15 ∣ 0.40 | 4.2 | 0.17 ∣ 0.38 | 5.0 | 0.11 ∣ 0.25 | 5.2 | 0.14 ∣ 0.30 |

Table 4

| Four-square: Participant | Fit | N = 1 | N = 2 | N = 3 | N = 4 | N = 5 | N = 6 |
|---|---|---|---|---|---|---|---|
| AC | AIC | 3,493 | 3,349 | 3,250 | 3,173 | 3,080 | 3,143 |
| | LnL | −1,680 | −1,542 | −1,426 | −1,322 | −1,209 | −1,174 |
| EA | AIC | 4,150 | 4,068 | 4,017 | 3,969 | 3,926 | 3,958 |
| | LnL | −2,008 | −1,901 | −1,810 | −1,720 | −1,632 | −1,582 |
| JG | AIC | 3,742 | 3,547 | 3,330 | 3,291 | 3,225 | 3,266 |
| | LnL | −1,804 | −1,640 | −1,466 | −1,381 | −1,282 | −1,236 |
| RS | AIC | 4,017 | 3,843 | 3,707 | 3,664 | 3,594 | 3,628 |
| | LnL | −1,942 | −1,788 | −1,655 | −1,567 | −1,466 | −1,417 |

Table 5

| Light-dark | | PL1 | | PL2 | | PL3 | |
|---|---|---|---|---|---|---|---|
| | | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} |
| N = 1 | γ | 2.0 | — | 2.3 | — | −2.0 | — |
| | F_{1} | −4.7 | 0.77 ∣ 0.23 | −5.4 | 0.68 ∣ 0.34 | 3.2 | 0.42 ∣ 0.52 |
| N = 2 | γ | −0.5 | — | 1.3 | — | −1.1 | — |
| | F_{1} | −6.0 | 0.78 ∣ 0.24 | 4.5 | 0.29 ∣ 0.63 | −3.7 | 0.30 ∣ 0.22 |
| | F_{2} | 3.9 | 0.95 ∣ 0.89 | −4.5 | 0.87 ∣ 0.69 | 2.8 | 0.47 ∣ 0.54 |
| N = 3 | γ | 2.8 | — | −3.0 | — | −2.0 | — |
| | F_{1} | −6.7 | 0.78 ∣ 0.23 | −5.7 | 0.53 ∣ 0.40 | −3.6 | 0.62 ∣ 0.58 |
| | F_{2} | 3.7 | 0.95 ∣ 0.89 | 6.1 | 0.32 ∣ 0.57 | 3.5 | 0.20 ∣ 0.24 |
| | F_{3} | −3.3 | 0.93 ∣ 0.97 | 5.5 | 0.41 ∣ 0.54 | 3.9 | 0.69 ∣ 0.77 |
| N = 4 | γ | −2.2 | — | −1.9 | — | −1.8 | — |
| | F_{1} | 6.7 | 0.22 ∣ 0.77 | −2.0 | 0.62 ∣ 0.29 | −3.4 | 0.18 ∣ 0.12 |
| | F_{2} | −3.7 | 0.05 ∣ 0.11 | −5.3 | 0.53 ∣ 0.40 | −3.6 | 0.14 ∣ 0.15 |
| | F_{3} | −3.3 | 0.93 ∣ 0.97 | 5.4 | 0.33 ∣ 0.57 | 3.4 | 0.22 ∣ 0.22 |
| | F_{4} | 2.0 | 1.00 ∣ 1.00 | 5.2 | 0.42 ∣ 0.53 | 3.7 | 0.35 ∣ 0.47 |
| N = 5 | γ | 1.2 | — | 0.0 | — | −1.5 | — |
| | F_{1} | −6.7 | 0.78 ∣ 0.23 | −1.9 | 0.62 ∣ 0.29 | −3.6 | 0.86 ∣ 0.80 |
| | F_{2} | −3.7 | 0.05 ∣ 0.11 | 5.3 | 0.47 ∣ 0.60 | 3.3 | 0.18 ∣ 0.20 |
| | F_{3} | 3.3 | 0.07 ∣ 0.03 | 5.4 | 0.33 ∣ 0.57 | −3.5 | 0.17 ∣ 0.11 |
| | F_{4} | −2.0 | 0.00 ∣ 0.00 | −5.2 | 0.58 ∣ 0.47 | 3.4 | 0.83 ∣ 0.87 |
| | F_{5} | 2.0 | 1.00 ∣ 1.00 | −2.0 | 1.00 ∣ 1.00 | 3.2 | 0.25 ∣ 0.27 |

Table 6

| Light-dark: Participant | Fit | N = 1 | N = 2 | N = 3 | N = 4 | N = 5 |
|---|---|---|---|---|---|---|
| PL1 | AIC | 3,427 | 3,452 | 3,524 | 3,624 | 3,724 |
| | LnL | −1,662 | −1,625 | −1,611 | −1,611 | −1,611 |
| PL2 | AIC | 4,088 | 4,147 | 3,999 | 4,095 | 4,195 |
| | LnL | −1,993 | −1,973 | −1,849 | −1,847 | −1,847 |
| PL3 | AIC | 5,029 | 5,081 | 5,131 | 5,214 | 5,288 |
| | LnL | −2,464 | −2,439 | −2,414 | −2,406 | −2,393 |

Table 7

| Faces | | PF1 | | PF2 | | PF3 | |
|---|---|---|---|---|---|---|---|
| | | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} |
| N = 1 | γ | −3.3 | — | −2.8 | — | 3.1 | — |
| | F_{1} | 6.8 | 0.38 ∣ 0.64 | 5.4 | 0.48 ∣ 0.51 | −5.8 | 0.52 ∣ 0.47 |
| N = 2 | γ | −1.5 | — | −0.9 | — | 2.9 | — |
| | F_{1} | 6.9 | 0.38 ∣ 0.64 | 5.4 | 0.48 ∣ 0.51 | −5.6 | 0.76 ∣ 0.73 |
| | F_{2} | −1.9 | 1.00 ∣ 1.00 | −1.9 | 1.00 ∣ 1.00 | 5.7 | 0.28 ∣ 0.34 |
| N = 3 | γ | 0.6 | — | −0.8 | — | 6.6 | — |
| | F_{1} | 6.9 | 0.38 ∣ 0.64 | 5.4 | 0.48 ∣ 0.51 | −5.6 | 0.76 ∣ 0.73 |
| | F_{2} | −1.9 | 1.00 ∣ 1.00 | −2.0 | 1.00 ∣ 1.00 | −5.7 | 0.72 ∣ 0.66 |
| | F_{3} | −2.0 | 1.00 ∣ 1.00 | −1.9 | 0.00 ∣ 0.00 | 1.9 | 1.00 ∣ 1.00 |

Table 8

| Faces: Participant | Fit | N = 1 | N = 2 | N = 3 |
|---|---|---|---|---|
| PF1 | AIC | 5,081 | 5,968 | 6,857 |
| | LnL | −2,095 | −2,095 | −2,095 |
| PF2 | AIC | 5,992 | 6,880 | 7,768 |
| | LnL | −2,551 | −2,551 | −2,551 |
| PF3 | AIC | 5,908 | 6,643 | 7,528 |
| | LnL | −2,509 | −2,432 | −2,431 |

Table 9

| Kanizsa | | AJR | | AMC | | JMG | |
|---|---|---|---|---|---|---|---|
| | | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} | λ_{i} | P_{i}^{1} ∣ P_{i}^{2} |
| N = 1 | γ | 4.5 | — | −3.6 | — | 4.1 | — |
| | F_{1} | −8.8 | 0.55 ∣ 0.40 | 7.6 | 0.43 ∣ 0.61 | −7.8 | 0.56 ∣ 0.37 |
| N = 2 | γ | −2.3 | — | 3.8 | — | 4.0 | — |
| | F_{1} | 8.8 | 0.45 ∣ 0.60 | −7.6 | 0.57 ∣ 0.39 | −7.8 | 0.56 ∣ 0.37 |
| | F_{2} | −2.0 | 1.00 ∣ 1.00 | 0.5 | 0.44 ∣ 0.45 | 0.6 | 0.18 ∣ 0.19 |
| N = 3 | γ | −2.9 | — | −3.7 | — | 2.2 | — |
| | F_{1} | 6.0 | 0.47 ∣ 0.57 | 7.6 | 0.43 ∣ 0.61 | −7.9 | 0.56 ∣ 0.37 |
| | F_{2} | 6.0 | 0.47 ∣ 0.57 | −2.0 | 0.00 ∣ 0.00 | 0.4 | 0.87 ∣ 0.87 |
| | F_{3} | −6.0 | 0.54 ∣ 0.43 | −2.0 | 0.00 ∣ 0.00 | 1.6 | 1.00 ∣ 1.00 |

Table 10

| Kanizsa: Participant | Fit | N = 1 | N = 2 | N = 3 |
|---|---|---|---|---|
| AJR | AIC | 12,765 | 14,019 | 15,379 |
| | LnL | −5,755 | −5,755 | −5,807 |
| AMC | AIC | 12,469 | 13,723 | 14,977 |
| | LnL | −5,606 | −5,606 | −5,606 |
| JMG | AIC | 12,538 | 13,793 | 15,046 |
| | LnL | −5,641 | −5,642 | −5,641 |

Table 11

| Square-detection | | PS1-full | | PS1-inc | | PS1-incrot | |
|---|---|---|---|---|---|---|---|
| | | λ_{i} | P_{i}^{1} ∣ P_{i}^{6} | λ_{i} | P_{i}^{1} ∣ P_{i}^{6} | λ_{i} | P_{i}^{1} ∣ P_{i}^{6} |
| N = 1 | γ_{1} | 0.8 | — | 1.0 | — | 3.4 | — |
| | γ_{2} | 0.3 | — | 0.1 | — | −0.1 | — |
| | γ_{3} | −1.9 | — | −2.3 | — | −2.9 | — |
| | γ_{4} | −5.3 | — | −6.0 | — | −5.6 | — |
| | γ_{5} | −6.1 | — | −7.5 | — | −9.3 | — |
| | F_{1} | 4.9 | 0.26 ∣ 0.70 | 5.6 | 0.34 ∣ 0.76 | 5.5 | 0.48 ∣ 0.79 |
| N = 2 | γ_{1} | 0.1 | — | 0.1 | — | 1.5 | — |
| | γ_{2} | −0.5 | — | −1.3 | — | −3.6 | — |
| | γ_{3} | −3.6 | — | −4.4 | — | −5.6 | — |
| | γ_{4} | −7.9 | — | −9.1 | — | −9.9 | — |
| | γ_{5} | −9.0 | — | −11.2 | — | −14.6 | — |
| | F_{1} | 6.3 | 0.28 ∣ 0.69 | 3.2 | 0.46 ∣ 0.45 | 4.1 | 0.79 ∣ 0.55 |
| | F_{2} | 2.7 | 0.37 ∣ 0.43 | 7.1 | 0.33 ∣ 0.76 | 7.3 | 0.40 ∣ 0.78 |
| N = 3 | γ_{1} | 5.4 | — | −0.6 | — | 9.5 | — |
| | γ_{2} | 4.3 | — | −3.5 | — | 4.5 | — |
| | γ_{3} | 0.3 | — | −7.4 | — | 2.1 | — |
| | γ_{4} | −4.7 | — | −13.6 | — | −2.5 | — |
| | γ_{5} | −6.3 | — | −16.9 | — | −7.3 | — |
| | F_{1} | −4.6 | 0.52 ∣ 0.55 | 2.8 | 0.14 ∣ 0.46 | −7.7 | 0.59 ∣ 0.22 |
| | F_{2} | 7.3 | 0.27 ∣ 0.69 | 5.9 | 0.51 ∣ 0.44 | −2.6 | 0.26 ∣ 0.11 |
| | F_{3} | −2.0 | 0.65 ∣ 0.37 | 8.8 | 0.33 ∣ 0.76 | 4.2 | 0.84 ∣ 0.57 |
| N = 4 | γ_{1} | 5.7 | — | 5.0 | — | 1.1 | — |
| | γ_{2} | 4.5 | — | 2.1 | — | −5.3 | — |
| | γ_{3} | 0.6 | — | −2.1 | — | −7.7 | — |
| | γ_{4} | −4.6 | — | −8.2 | — | −12.9 | — |
| | γ_{5} | −6.2 | — | −11.7 | — | −18.4 | — |
| | F_{1} | −1.3 | 0.20 ∣ 0.07 | 2.6 | 0.12 ∣ 0.44 | −1.7 | 0.84 ∣ 0.64 |
| | F_{2} | −4.6 | 0.50 ∣ 0.53 | −6.2 | 0.49 ∣ 0.57 | 2.3 | 0.76 ∣ 0.91 |
| | F_{3} | −1.9 | 0.72 ∣ 0.46 | 1.2 | 0.60 ∣ 0.80 | 5.6 | 0.80 ∣ 0.53 |
| | F_{4} | 7.3 | 0.27 ∣ 0.69 | 8.8 | 0.33 ∣ 0.76 | 8.5 | 0.40 ∣ 0.77 |

Table 12

| Square-detection | | PS2-full | | PS2-inc | | PS2-incrot | |
|---|---|---|---|---|---|---|---|
| | | λ_{i} | P_{i}^{1} ∣ P_{i}^{6} | λ_{i} | P_{i}^{1} ∣ P_{i}^{6} | λ_{i} | P_{i}^{1} ∣ P_{i}^{6} |
| N = 1 | γ_{1} | 2.4 | — | 2.7 | — | 2.5 | — |
| | γ_{2} | 0.3 | — | 0.5 | — | 0.2 | — |
| | γ_{3} | −1.7 | — | −1.5 | — | −1.4 | — |
| | γ_{4} | −4.6 | — | −4.8 | — | −4.3 | — |
| | γ_{5} | −7.7 | — | −7.6 | — | −7.0 | — |
| | F_{1} | 5.6 | 0.21 ∣ 0.79 | 5.8 | 0.19 ∣ 0.80 | 5.8 | 0.25 ∣ 0.75 |
| N = 2 | γ_{1} | 2.2 | — | 6.9 | — | 2.3 | — |
| | γ_{2} | −0.1 | — | 4.5 | — | −0.1 | — |
| | γ_{3} | −3.1 | — | 1.7 | — | −2.1 | — |
| | γ_{4} | −6.3 | — | −1.6 | — | −6.1 | — |
| | γ_{5} | −12.9 | — | −7.4 | — | −10.5 | — |
| | F_{1} | 4.5 | 0.21 ∣ 0.65 | −4.4 | 0.79 ∣ 0.31 | 4.0 | 0.14 ∣ 0.45 |
| | F_{2} | 7.4 | 0.16 ∣ 0.70 | 6.8 | 0.14 ∣ 0.68 | 7.4 | 0.23 ∣ 0.71 |
| N = 3 | γ_{1} | 9.4 | — | 2.2 | — | 3.8 | — |
| | γ_{2} | 7.1 | — | −0.4 | — | 1.0 | — |
| | γ_{3} | 3.9 | — | −3.3 | — | −1.2 | — |
| | γ_{4} | 0.9 | — | −7.1 | — | −5.2 | — |
| | γ_{5} | −6.0 | — | −12.3 | — | −9.6 | — |
| | F_{1} | 4.3 | 0.23 ∣ 0.68 | 4.9 | 0.05 ∣ 0.36 | −2.1 | 0.51 ∣ 0.30 |
| | F_{2} | −7.2 | 0.84 ∣ 0.31 | 6.6 | 0.21 ∣ 0.75 | 7.4 | 0.25 ∣ 0.72 |
| | F_{3} | 4.1 | 0.02 ∣ 0.24 | 2.6 | 0.31 ∣ 0.68 | 4.3 | 0.09 ∣ 0.37 |
| N = 4 | γ_{1} | 7.8 | — | 15.7 | — | 13.2 | — |
| | γ_{2} | 4.3 | — | 12.4 | — | 9.6 | — |
| | γ_{3} | 1.1 | — | 9.3 | — | 7.4 | — |
| | γ_{4} | −1.8 | — | 5.7 | — | 3.1 | — |
| | γ_{5} | −8.9 | — | −0.5 | — | −1.6 | — |
| | F_{1} | −4.6 | 0.14 ∣ 0.08 | −4.7 | 0.94 ∣ 0.64 | −1.8 | 0.45 ∣ 0.23 |
| | F_{2} | −4.2 | 0.73 ∣ 0.26 | −6.7 | 0.82 ∣ 0.31 | −8.0 | 0.75 ∣ 0.28 |
| | F_{3} | 7.2 | 0.16 ∣ 0.68 | −3.9 | 0.80 ∣ 0.38 | −4.3 | 0.91 ∣ 0.62 |
| | F_{4} | 4.2 | 0.02 ∣ 0.25 | 3.2 | 0.80 ∣ 0.95 | 3.9 | 0.85 ∣ 0.90 |

Table 13

| Square-detection: Participant | Fit | N = 1 | N = 2 | N = 3 | N = 4 |
|---|---|---|---|---|---|
| PS1-full | AIC | 11,930 | 12,002 | 12,092 | 12,221 |
| | LnL | −5,894 | −5,869 | −5,843 | −5,842 |
| PS1-inc | AIC | 12,232 | 12,277 | 12,302 | 12,422 |
| | LnL | −6,045 | −6,001 | −5,948 | −5,942 |
| PS1-incrot | AIC | 10,978 | 10,830 | 10,910 | 11,014 |
| | LnL | −5,418 | −5,278 | −5,252 | −5,238 |
| PS2-full | AIC | 11,612 | 11,510 | 11,568 | 11,662 |
| | LnL | −5,735 | −5,618 | −5,581 | −5,562 |
| PS2-inc | AIC | 11,582 | 11,474 | 11,517 | 11,610 |
| | LnL | −5,720 | −5,600 | −5,556 | −5,536 |
| PS2-incrot | AIC | 12,027 | 11,903 | 11,982 | 12,072 |
| | LnL | −5,942 | −5,814 | −5,788 | −5,767 |

The tables contain each model's *γ* value and the *λ*_{i} values associated with each feature detector. Large *λ*_{i} values indicate that a feature's detection will greatly influence the classification decision. Some feature detectors associated with large *λ*_{i}s, however, might have very little influence. For example, a feature detector's *ω*_{i} and *β*_{i} values might be such that the detector is never activated by any of the stimuli, rendering it useless no matter what *λ*_{i} is associated with it. Therefore, the tables also list estimates of *P*(*F*_{i} = 1 ∣ *C* = 1) and *P*(*F*_{i} = 1 ∣ *C* = 2), the probabilities of feature detector *F*_{i} firing given a particular classification response. If both of these values are near 0, the stimuli almost never activate the feature detector. If both are near 1, the feature detector is always active and therefore acts as an additional threshold term in the linear function in *P*(*C* ∣ *F*). Feature detectors whose firing probabilities differ depending on the observer's classification decision are the most useful for modeling those decisions. The *β*_{i} values are not reported because they are usually not informative without knowledge of the exact *ω*_{i} values, which are presented graphically. The tables also present the AIC and data log likelihood values for each GRIFT model and data set.

Four-square
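For the two-class experiments, the decision stage described above can be sketched as a logistic combination of the binary detector outputs. The logistic form and sign conventions below are assumptions consistent with the text's description of a linear function in *P*(*C* ∣ *F*); as noted earlier, the signs of individual parameters are generally not identifiable.

```python
import math

def p_class1(F, lam, gamma):
    """Sketch of GRIFT's decision stage: binary detector outputs F_i in
    {0, 1} are combined linearly with weights lambda_i plus a bias gamma,
    then passed through a logistic to give P(C = 1 | F). The exact sign
    conventions are assumed for illustration."""
    z = gamma + sum(l * f for l, f in zip(lam, F))
    return 1.0 / (1.0 + math.exp(-z))
```

This makes the table's diagnostic concrete: a large |*λ*_{i}| means that toggling *F*_{i} moves the class probability substantially, but a detector that fires with the same frequency for both response classes contributes nothing useful regardless of its *λ*_{i}.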

We present the results on the four-square data first because they are easy to understand and because the simulated observers provide an important validation of the GRIFT approach. The results of using GRIFT to recover the feature detectors for the four-square experimental data are given in Figure 4 and Tables 1–4. GRIFT models with 1–6 features were fit to the data from each of the three simulated and four human observers.

Note that the constant gray regions between the stimulus corners (see Figure 3) were discarded before the experimental data were analyzed by GRIFT. Because these stimulus regions are always constant, it is reasonable to assume that they are not used in the classification process. These areas, however, do provide significant visual separation between the corners. To incorporate this visual separation into GRIFT, the neighborhood functions for the *P*(*ω*_{i}) distributions (Equation 9) were adjusted so that they would not penalize the assignment of very different weights across corner boundaries.

Across all observers, the best one-feature model (left column of Figure 4) was based on the contrast between the top and bottom of the image. In the figure, positive and negative weights are represented by red and blue colors, respectively. Recall that, within a feature, the signs of the weights are generally not meaningful, so that, after an appropriate shift of the other parameters, a positive-top and negative-bottom feature is equivalent to a negative-top and positive-bottom feature. The important factor is that one area has a large weight and the other has a large weight of the opposite polarity, indicating a contrast feature that is sensitive to the presence or absence of relative brightness in those regions. This result is extremely similar to that produced by classification images of the data, reinforcing the strong similarity between one-feature GRIFT and that approach.

The results of fitting GRIFT to the three simulated observers, top-vs.-bottom, corners, and combo, demonstrate that when GRIFT generates the data, the correct features can be reliably recovered. For data generated by the top-vs.-bottom observer, GRIFT correctly recovered, for all values of *N*, one or more feature detectors sensitive to the contrast between the top and bottom of the stimulus. It is important to note that, even though the stimuli were generated from images with corners, GRIFT does not hypothesize the existence of any corner-sensitive feature detectors, even for large *N*. That is, GRIFT recovered the feature detectors used to generate the responses, not the features used to generate the stimuli. The minimum AIC value is for *N* = 1.

GRIFT also recovers the appropriate feature detectors for data generated by the corners observer. As *N* increases, corner-sensitive features appear. When *N* = 4, each of the four feature detectors is sensitive to the presence or absence of a different corner of the stimulus, matching the strategy of the corners observer. For *N* > 4, the GRIFT models find four corner feature detectors and fill the remaining slots with uninterpretable, noisy feature detectors. The AIC values in Table 2 are virtually identical for 3 ≤ *N* ≤ 5. The AIC value for *N* = 3 is the lowest by a very small margin, probably because, with the appropriate *γ*, it is difficult to distinguish between three- and four-corner strategies in these data. That is, if an observer uses three corner detectors and they all fail, he or she can default to assuming that the fourth would have succeeded without actually computing it (for a similar issue regarding top-vs.-bottom, see Footnote 6). The feature activation probabilities for *N* = 5 indicate that, compared to the activity frequencies of the four corner features, the extraneous fifth feature is seldom active for either response class.

Analysis of the data generated by the combo observer appropriately reveals the presence of both corner and top-vs.-bottom detectors, especially for *N* = 5 and 6. This recovery is significant because, although the features used in the two strategies spatially overlap, GRIFT was still able to separate them from the classification data. The minimum AIC was for *N* = 3, possibly indicating that *N* = 3 is a more compact representation of the combo strategy, or that AIC penalizes complexity too harshly in some cases, further supporting the view that features recovered by GRIFT should be used as a starting point for further experiments.

GRIFT revealed that all four human observers applied multi-feature strategies.^{7} The minimal AIC values (Table 4) were all for *N* = 5. Examining the GRIFT parameters, however, reveals important differences in strategies. Of all the human observers, JG's detector weights (Figure 4) show the clearest corner patterns, and this participant also had the largest improvement in data log likelihood as *N* increased. On the other end of the spectrum, the corner-detection patterns are least visible in EA's data, and this participant also exhibited the smallest improvement in data log likelihood as *N* increased. AC and RS fall between these two extremes on both the visual corner pattern and log likelihood spectra. Interestingly, this GRIFT analysis suggests that at least some of the human observers (e.g., AC and JG) used a hybrid strategy, i.e., both corner features and an overall difference in brightness between the top and bottom halves of the stimuli. It is potentially noteworthy that one of the non-naive observers, JG, exhibited the strongest indications of a corner-detection strategy, while one of the naive observers, EA, exhibited the weakest.

Light-dark

Three participants, PL1, PL2, and PL3, were run with the light-dark stimuli. Although PL1 and PL2 performed near the expected accuracy level (82% and 73%), PL3 performed near chance (55%). Because the noise levels were fixed after the first 101 trials, a participant with good luck at the end of that calibration period could experience very high noise levels for the remainder of the experiment, leading to poor performance.^{8} Regardless, all three participants appear to have used different classification methods, providing a very informative contrast, and so the data from all three are reported. The results of fitting the GRIFT model to the participants' data are given in Figure 5 and Tables 5 and 6.

The AIC values and feature detection probabilities indicate that PL1 used a one-feature strategy, linearly classifying the stimuli by measuring their overall brightness. Although we knew that the targets allowed for successful classification by this method, the result was surprising because it implies that the observer was able to maintain a roughly constant brightness threshold across the stimulus and across time. We expected such a strategy to be more challenging than the within-image comparisons that enabled a linear strategy on the four-square stimuli.

PL2, on the other hand, clearly employed a non-linear, multi-feature strategy. For *N* = 1 and *N* = 2, the most interpretable feature detector is sensitive to overall stimulus brightness. This brightness detector disappears when *N* = 3, and the best-fit model consists of three detectors, each sensitive to one of the three positions at which a bright or dark spot can appear. The detectors of the *N* = 3 model only outperform the overall brightness detector if they are all present: they are jointly, but not singly, informative. When *N* = 4, the overall brightness detector reappears alongside the three pattern detectors. Increasing to *N* = 5 adds a useless fifth feature detector. The AIC scores indicate that the *N* = 3 model is the best fit to the data, further confirming that this observer used a multi-feature strategy.

The GRIFT models of participant PL3 had minimum AIC for *N* = 1 and mostly recovered noisy weight patterns and detectors that exhibit only small differences in activation probabilities between the two classes. The one-feature model is probably the best fit, and because performance was extremely low, it can be assumed that the participant was reduced to near-random guessing much of the time.

The clear distinctions between the GRIFT fits for the three observers demonstrate GRIFT's effectiveness in distinguishing between different classification strategies.

Faces

The faces experiment presented the largest computational challenge. After the experiment, the stimuli were down-sampled further to 32 × 32, and the background surrounding the faces was removed by cropping, reducing the stimuli to 26 × 17. These steps were necessary to make the EM algorithm computationally feasible, and to reduce the number of model parameters so they would be sufficiently constrained by the samples.

The results for three participants are given in Figure 6 and Tables 7 and 8. Participants PF1 and PF2's data were clearly best fit by one-feature GRIFT models. Increasing the number of features simply caused the algorithm to add detectors that were never or always active. As explained previously, such feature detectors are superfluous because they can be eliminated or absorbed into the *γ* term. PF1's one-feature model clearly places significant weight near the eyebrows, nose, and other facial features. PF2's one-feature weights are much noisier and harder to interpret. This might be related to PF2's poor performance on the task: only 53% accuracy compared to PF1's 72% accuracy. Perhaps the noise level was too high and PF2 was guessing rather than using image information much of the time. PF1's detector was active for 38% of Class 1 responses and 64% of Class 2 responses, a relatively large difference in activation frequency indicating a very predictive feature. PF2's detector was active for 48% of Class 1 responses and 51% of Class 2 responses, a very small difference indicating that this feature is not very predictive of PF2's responses.

Participant PF3's data produced a genuine two-feature GRIFT model, albeit one that is difficult to interpret. The weights present in the two-feature model are very different from those in the one-feature model, and the weight patterns in the two detectors are subtly different from one another. The Class 1 face has a left eyebrow that is darker than its right eyebrow, and both feature detectors compute a brightness contrast between the left and right eye regions of an input stimulus. Both also place large weights near the nose and around parts of the mouth area, and the second feature detector places weights that correspond to the left boundary between the face and the gray surrounding pixels in the noise-free targets. The faces differ in nose and mouth structure as well as in the brightness of the forehead, cheek, and chin regions, and these weights may indicate sensitivity to those differences. Regardless, none of PF3's *N* = 2 detectors had large differences between their Class 1 and Class 2 activation frequencies and, as with PF1 and PF2, PF3's minimum AIC score was for the one-feature model.

Overall, the results on faces support the hypothesis that face classification is generally holistic and configural, rather than the result of individual part classification, especially when detection of individual features is difficult, as was the case in this experiment (Sergent, 1984).

Kanizsa

The GRIFT models fit to the Kanizsa experimental data confirm many of the results in Gold et al. (2000). These results can be seen in Figure 7 and Tables 9 and 10. The stimuli were downsampled to 25 × 25 pixels to make the model-fitting algorithm computationally tractable. According to the best-fit GRIFT models, the observers relied mainly on pixels from the vertical, but not the horizontal, illusory contours when classifying the stimulus: AJR strongly used both contours, JMG relied only on the left contour, and AMC appeared to make less overall use of the contour pixels. Gold et al. reached the same conclusions using a classification image analysis. Our results deviate from the previous work in assigning substantial weight to the horizontal lips of the four three-quarter circles, whereas the classification images of Gold et al. indicated that these pixels were not used in classification. It is possible that this difference arises from our decision to use a Markov random field prior probability distribution to smooth the weights during the parameter-fitting process, while Gold et al. applied a smoothing filter after calculating the classification image. This difference warrants further investigation to determine which model is more accurate.

We had speculated that the two contours might be detected separately and independently, but the GRIFT analysis does not support that hypothesis. GRIFT did not produce a multi-feature model for any of the observers, suggesting that, when present, both illusory contours were processed as a single feature. For all participants, AIC was lowest for *N* = 1. When *N* > 1, GRIFT generated models with only one useful feature, except for observer AJR with *N* = 3. However, examining Table 10 reveals that this model has worse AIC and log likelihood values than AJR's *N* = 1 model. Therefore, adding two feature detectors with *λ*_{2} = *λ*_{3} = 0 to the *N* = 1 model would result in a three-feature model with better AIC and log likelihood than the *N* = 3 model discovered by the EM optimization algorithm. The *N* = 1 model, however, would still be preferable because it uses fewer parameters to achieve the same likelihood, producing a better AIC score. AJR's *N* = 3 result is therefore a clear case of the EM procedure not finding the globally optimal parameter values, but it serves as a demonstration of the utility of the AIC and log likelihood values in detecting such problems.

Square detection

The results from fitting GRIFT to the square-detection data are reported in Figure 8 and Tables 11–13. The previous experiments were all classification experiments in which observers gave only one of two responses on every trial. The square-detection experiment required participants to provide a rating from 1 (target definitely absent) to 6 (target definitely present). Therefore, there are six *P*(*F*_{i} = 1 ∣ *C* = *r*) values for each feature. However, because the values tend to increase, decrease, or stay constant as *r* increases, we summarize them by reporting only *P*(*F*_{i} = 1 ∣ *C* = 1) and *P*(*F*_{i} = 1 ∣ *C* = 6) in Tables 11 and 12.

The AIC values indicate that participant PS1 used one feature detector in the full and incomplete conditions, but used two feature detectors in the incomplete-rotated condition. The AIC values for participant PS2 indicated two detectors in all conditions. In the full and incomplete conditions, both participants' feature detectors for their AIC-minimizing models were sensitive to the contours (real or illusory) connecting the corners. This corresponds well to the illusory contour sensitivity demonstrated in the Kanizsa data analysis. Participant PS2's two-feature models in these conditions consisted of detectors that were sensitive to different regions of the square, but neither contained detectors sensitive to particular corners.

For both participants, the models for the incomplete-rotated condition were qualitatively different from those observed for the other two conditions. Participant PS1's data were best fit by a two-feature model in which the largest weights were on the corners, although both features still placed some weight on the (supposedly disrupted) illusory contour regions. Participant PS2's best-fit model had one detector focused on detecting the presence of the upper-left corner and one feature sensitive to the tops of both upper corners. These results lead to the conclusion that rotating the contours did significantly disrupt the illusory contours and greatly reduced their effect on stimulus detection. The striking qualitative differences between this condition and the full and incomplete conditions indicate that the illusory contour influence discovered by Gold et al. (2000) is also present in detection tasks. In this experiment, the participants were sensitive to different pieces of the illusory contour, and the horizontal contours were influential, whereas they were not in the Kanizsa classification data. It is also interesting that participant PS1 only showed evidence of a multi-feature strategy in the incomplete-rotated case. This is a type of qualitative strategy change that would be invisible in a classification image analysis.

General discussion

This article has described the GRIFT model for determining potential features used in human image classification. GRIFT is a Bayesian network that describes classification as the combination of multiple, independently detected features. GRIFT provides a generative, probabilistic model of classification that can incorporate prior knowledge and assumptions about these features and account for human data.

GRIFT models classification as a two-stage process in which the outputs of a set of independent feature detectors are pooled to produce a classification. Such a two-stage organization is not unique to GRIFT and has been used in many other psychophysical and neurological models. For example, Pelli et al. (2003) created a two-stage model for word recognition in which the outputs of independent letter detectors are combined to create the perception of a word. Rust, Mante, Simoncelli, and Movshon (2006) developed a model that represented the response of MT neurons to motion as the result of combining the outputs of multiple V1 neurons. This model structure is analogous to GRIFT's assumption of multiple feature detectors that mediate between the raw visual input and the classification decision. Similarly, Anzai, Peng, and Van Essen (2007) model the receptive fields of V2 neurons as the result of different methods of combining V1 neuron outputs.

The experimental data used by GRIFT are compatible with the original classification-image method. In fact, the four-square and Kanizsa human participant data were originally analyzed using that algorithm. One of the advantages of GRIFT is that it allows the reanalysis of old data to reveal new information; fitting multi-feature GRIFT models can reveal previously hidden non-linear classification strategies.

As mentioned, a one-feature GRIFT model is very similar to the classic classification-image model of classification. In both cases, a linear combination of pixel values is compared to a threshold. There are, however, a number of differences between the two models. In the classification image model, the threshold is a normally distributed random variable, which accounts for human classification inconsistency. In GRIFT, the threshold is not random, but is wrapped, along with the weighted pixel sum, in a logistic regression function (Equations 3 and 4), which accounts for randomness in feature detection. The feature detector output is passed to a second logistic regression function that determines classification (Equations 5 and 6); this is a second source of randomness with no equivalent in the single-step classification process modeled by a classification image.
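To make the two-stage structure concrete, here is a minimal sketch of one GRIFT trial as a generative process. Because Equations 3–6 are not reproduced in this section, the sign conventions, parameter shapes, and the function name `grift_response` are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grift_response(stimulus, omegas, betas, lams, gamma, rng):
    """One trial of a hypothetical two-stage GRIFT generative process.

    Stage 1: each feature detector i fires stochastically with
    probability sigmoid(omega_i . stimulus - beta_i).
    Stage 2: the binary detector outputs are pooled by a second
    logistic regression to produce the class response.
    """
    # Stage 1: stochastic feature detection (in the spirit of Equations 3-4).
    p_fire = sigmoid(omegas @ stimulus - betas)
    features = (rng.random(p_fire.shape) < p_fire).astype(float)
    # Stage 2: stochastic classification from the detector outputs
    # (in the spirit of Equations 5-6).
    p_class2 = sigmoid(lams @ features + gamma)
    return int(rng.random() < p_class2), features
```

The two sources of randomness discussed above correspond to the two sampling steps: stochastic feature detection and stochastic pooling of the detector outputs.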

Another contrast between GRIFT and classification image analysis is that GRIFT parameters are fit using the full stimuli displayed to the participants, while the classification image algorithm only operates on the noise field present in the stimuli. Using the full stimuli is convenient because it removes the requirement of storing target-free noise fields, or the data necessary to construct them, during an experiment. The classification image algorithm also requires the true class label of each stimulus, while GRIFT only relies on the participants' responses. These advantages result from GRIFT's use of Bayesian networks and the EM optimization algorithm. It is possible to construct a Bayesian network describing the classification image model that could also be optimized using full stimuli and without requiring the true class labels. Further theoretical and empirical work would be necessary to determine if this style of optimization produces results equivalent to the traditional Ahumada (2002) method.

Perhaps the most salient difference in the one-feature case is the use of prior probabilities on the parameters. While the classification image algorithm aims to maximize the likelihood of the data, GRIFT, as described above, also factors in prior beliefs about the classification process. Such priors can be advantageous, particularly when the stimulus images have many pixels. In these cases, simply maximizing the likelihood might not sufficiently recover the true parameters, either because noise in the data will have too great an influence or because there are many possible solutions with nearly equivalent likelihoods. Gold et al. (2000) dealt with this problem by smoothing their classification images to eliminate noise. GRIFT achieves a similar result by applying the aforementioned Markov random field prior on the *ω*_{i} parameters. Whereas the practice of smoothing classification images requires some manual estimation of the correct amount of smoothing in each instance, in the prior probability approach, the influence of the prior automatically declines as more data are gathered. Despite these technical differences, in practice, we have found the result of fitting a one-feature GRIFT model to be extremely similar to the result of fitting a classification image.

One of the strengths of the Bayesian approach is that it allows researchers to alter the model to reflect their assumptions. The prior distributions on the parameters can easily be changed to reflect knowledge gained in previous experiments. Furthermore, extending the feature detector model to include simple non-linearities (squaring the weighted sum of the pixels, for example) or to use alternative probability distributions could be combined with the appropriate priors to encourage the formation of edge detectors, Gabor filters, or other biologically motivated features.
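The Markov random field smoothing prior mentioned above can be sketched as an unnormalized log density that penalizes differences between neighboring weights. The quadratic penalty and the `strength` parameter are assumptions for illustration; the paper's exact prior (and its normalization constant, which can be ignored during optimization) is not reproduced here:

```python
import numpy as np

def mrf_log_prior(omega, strength=1.0):
    """Unnormalized log probability of a pairwise Markov random field
    smoothness prior on a 2-D weight map: neighboring weights are
    penalized for differing. The quadratic form and 'strength' are
    hypothetical choices, not values from the paper."""
    horiz = np.diff(omega, axis=1)  # differences between left-right neighbors
    vert = np.diff(omega, axis=0)   # differences between up-down neighbors
    return -strength * (np.sum(horiz ** 2) + np.sum(vert ** 2))
```

A perfectly smooth weight map incurs no penalty, while a noisy map receives a lower log prior, which is exactly the pressure that stands in for post hoc smoothing of a classification image.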

The graphical model approach also allows new versions of GRIFT based on different feature parameterizations that may be useful in various situations. In the current implementation, the number of parameters scales linearly with the size of the stimulus images. Fitting the model to the classification of very large stimuli might require an impractical number of sample classifications or extraordinary computational resources. A possible solution to such problems is to adopt new feature parameterizations. One possibility would be to replace the per-pixel weights with a few parameters designating image regions in which all pixels should receive an identical weight. For example, in the four-square task, a feature sensitive to bright top corners might simply describe the height, width, and location of a rectangular region and assign all the pixels in this region a weight of −1 and all the pixels outside this region a weight of 1. This type of parameterization is highly compatible with the assumption that neighboring feature weights are similar. For our four-square stimuli, this parameterization would replace 64 independent parameters with 6; for larger stimuli, the savings are even greater. These changes in parameterization simply require changing the conditional probability functions of the features to use the new parameters, and calculating a few related derivatives so the optimization code functions correctly. Describing feature weights geometrically would also allow us to encode prior distributions on the feature positions and allow those positions to vary from trial to trial, which could imbue the features with greater and lesser degrees of translational and rotational invariance.
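A minimal sketch of the rectangular-region parameterization described above, assuming the six parameters are the rectangle's position (2), size (2), and the inside/outside weight values (2); the function name and this particular decomposition are hypothetical:

```python
import numpy as np

def rect_feature_weights(top, left, height, width, w_in, w_out, shape=(8, 8)):
    """Expand a hypothetical 6-parameter rectangular feature into a full
    per-pixel weight map: pixels inside the rectangle get w_in, all
    others w_out. For an 8x8 stimulus this trades 64 free weights
    for 6 parameters."""
    weights = np.full(shape, w_out, dtype=float)
    weights[top:top + height, left:left + width] = w_in
    return weights
```

Following the example in the text, a feature assigning the top strip of an 8 × 8 stimulus a weight of −1 and everything else a weight of 1 would be `rect_feature_weights(0, 0, 2, 8, -1.0, 1.0)`.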

GRIFT's success on traditional classification image data also leaves open the question of analyzing other types of experiments. The Bubbles method (Gosselin & Schyns, 2001), for example, uses a very different noise model that may, in some cases, be more natural than adding Gaussian noise to every pixel. Although the Bubbles technique has been criticized for lacking the theoretical rigor of classification images with white Gaussian pixel noise (Murray & Gold, 2004), the GRIFT model, which provides a mathematically clear model of classification and which does not assume that the noise is white and Gaussian, might provide a useful framework for analyzing Bubbles experiments.

GRIFT, like the classification image method, assumes that observer responses are the result of a consistent strategy. It is more likely, however, that participants continue to refine the features they use as an experiment progresses. Although invisible to a single-feature model, the multi-feature GRIFT model can indicate the presence of these changes. For example, in the square-detection experiment GRIFT recovered a two-feature model for participant PS1 in the incomplete-rotated condition. One of these features consisted of weights on all four corners, but was also sensitive to pixel values between the corners. The second feature was more exclusively focused on the corner pixels. Hypothesizing that these two detectors, which detect similar image structures, might be the result of PS1 pursuing different strategies at different times, we split the data chronologically in half and fit one-feature GRIFT models to each part. As demonstrated in Figure 9, the feature detector weights for the two halves strongly resemble the two features recovered from the full data set. This result suggests that the participant's search for evidence became more localized as the experiment progressed. GRIFT successfully detected evidence of this transition. Other data that were best fit by multi-feature GRIFT models were similarly examined, but they lacked a clear correspondence between full-data and half-data feature detectors. This result indicates that in some cases multiple feature detectors are used simultaneously, while in others they indicate shifts in strategy.

Figure 9

Explicit modeling of this change in classification strategy over time is a very promising direction for future research. One potential approach is to alter the model so that the feature detector weights and other parameters are allowed to change, subject to some reasonable constraints, between trials. By relaxing the assumption that observers employ a constant classification strategy across time, a dynamic model would provide a more realistic representation of the processes used in the task and could provide better explanations of many data sets. The success of this first version of GRIFT on human data provides a firm foundation for such future developments and we are optimistic about the model's future utility.

Appendix A

The GRIFT algorithm

The goal of the algorithm is to find the parameter values that best satisfy the prior distributions and best account for the (*S*, *C*) samples gathered from a human observer. Mathematically, this corresponds to finding the mode of *P*(*ω*, *β*, *λ*, *γ* ∣ S, C), where S and C refer to all of the observed samples. The algorithm is derived from the expectation-maximization (EM) method, a widely used optimization technique for dealing with hidden variables (Dempster, Laird, & Rubin, 1977; also see Bishop, 2006, or Gelman et al., 2004), in this case F, the feature detector outputs for all the trials. To maximize *P*(*θ* ∣ S, C), where *θ* = (*ω*, *β*, *λ*, *γ*), observe that

$P(\theta \mid S, C) = \dfrac{P(F, \theta \mid S, C)}{P(F \mid S, C, \theta)},$

(A1)

which in turn implies that

$\log(P(\theta \mid S, C)) = \log(P(F, \theta \mid S, C)) - \log(P(F \mid S, C, \theta)).$

(A2)

Assume that there is a prior estimate for the parameters, *θ**, which implies a distribution *P*(F ∣ S, C, *θ**) that can be calculated from the GRIFT model using Equation 14. If we compute the expectation of Equation A2 with respect to this distribution, the left-hand side is unaffected because it does not depend on F. On the right-hand side, *E*(log(*P*(F ∣ S, C, *θ*))) is maximal for *θ* = *θ**, so any choice of *θ* that increases *E*(log(*P*(F, *θ* ∣ S, C))) will increase log(*P*(*θ* ∣ S, C)).

Therefore, the EM algorithm for the GRIFT model consists of choosing random initial parameters *θ** = (*ω**, *β**, *λ**, *γ**) and then finding the *θ* that maximizes

$Q(\theta, \theta^{*}) = \sum_{F} P(F \mid S, C, \theta^{*}) \log(P(C, F, S \mid \theta)) + \log(P(\theta)),$

(A3)

which is proportional to *E*(log(*P*(F, *θ* ∣ S, C))) because log(*P*(F, *θ* ∣ S, C)) = log(*P*(C, F, S ∣ *θ*)) + log(*P*(*θ*)) − log(*P*(S, C)) and log(*P*(S, C)) does not depend on *θ* or F.

The *θ* that maximizes *Q* then becomes *θ** for the next iteration, and the process is repeated until convergence. The presence of both the *P*(C, F, S ∣ *θ*) and *P*(*θ*) terms encourages the algorithm to find parameters that explain the data and match the assumptions encoded in the parameters' prior distributions. As the amount of available data increases, the relative influence of the priors decreases, so it is possible to discover feature detectors that violate prior beliefs given enough evidence.

Using the joint probability distribution of the GRIFT model,

$Q(\theta, \theta^{*}) \propto \sum_{F} P(F \mid S, C, \theta^{*}) \left( \log(P(C \mid F, \lambda)) + \sum_{i=1}^{N} \log(P(F_{i} \mid S, \omega_{i}, \beta_{i})) \right) + \sum_{i=1}^{N} \left( \log(P(\omega_{i})) + \log(P(\lambda_{i})) \right),$

(A4)

dropping the *P*(S) term, which is independent of the parameters, and the log(*P*(*β*_{i})) and log(*P*(*γ*)) terms, which are 0 because *P*(*γ*) = *P*(*β*_{i}) = 1. As mentioned before, the normalization constants for the log(*P*(*ω*_{i})) elements can be ignored during optimization; the log makes them additive constants to *Q*. The functional form of every additive term is described in the GRIFT model section, and *P*(F ∣ S, C, *θ**) can be calculated using the model's joint probability function.

Each iteration of EM requires maximizing *Q*, but it is not possible to compute the maximizing *θ* in closed form. Fortunately, it is relatively easy to search for the best *θ*. Because *Q* separates into many additive components, it is possible to efficiently compute its gradient with respect to each of the elements of *θ* and use this information to find a locally maximum *θ* assignment with the scaled conjugate gradient descent algorithm (Bishop, 1995). Even a locally maximum value of *θ* usually provides good EM results; *P*(*ω*, *β*, *λ*, *γ* ∣ S, C) is still guaranteed to improve after every iteration.

The result of any EM procedure is only guaranteed to be a locally optimal answer, and finding the globally optimal *θ* is made more challenging by the large number of parameters. GRIFT adopts the standard solution of running EM many times, each instance starting with a random *θ**, and then accepting the final *θ* from the instance that produced the most probable parameters. For this model and the data presented in the paper, 20–30 random restarts were sufficient.

The results of the GRIFT-fitting algorithm are relatively insensitive to the procedure for randomly initializing *θ**. Some methods encouraged faster EM convergence or led to a greater percentage of successful restarts, depending on the stimulus. The *λ** parameters were initialized by random samples from a normal distribution, and then half were negated so the features would tend to start evenly assigned to the two classes; *γ** was initialized to 0. In the four-square, light-dark, Kanizsa, and square-detection experiments, the *ω** parameters were initialized from a uniform distribution. In the faces experiments, the *ω** parameters were initialized by adding normal noise to the optimal linear classifier separating the two targets. Because of the large number of pixels in the faces stimuli, the other initialization procedures frequently produced initial assignments with extremely low probabilities, which led to slow EM convergence and an excess of local maxima. The *β** were set to the optimal threshold for distinguishing the classes using the initial *ω** as a linear classifier (except when they were accidentally set to the negation of this value, which did not appear to cause any problems). Altering the initialization procedure did not appear to change results. In most cases, it only affected the speed of EM or the number of restarts required to reliably discover the best parameter values.

The GRIFT model is non-convex, so it is theoretically possible that there exist multiple near-optimal sets of parameters for a given data set, each represented by a different local maximum of the posterior distribution. This has not proven to be a practical problem; in our experience, when a set of restarts leads to multiple solutions with very similar posterior probabilities, those solutions are qualitatively similar. Typically, their parameters only differ by trivial amounts or by their signs. As described in the Results and Discussion section, parameter sets that are identical except for differences in sign are functionally equivalent to one another.
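The overall fitting procedure, EM iterations wrapped in random restarts with the most probable final parameters kept, can be sketched generically. The callables, the convergence test, and the toy problem in the test are placeholders, not the paper's scaled-conjugate-gradient implementation:

```python
import numpy as np

def em_with_restarts(e_step, m_step, log_posterior, init, n_restarts=20,
                     max_iters=100, tol=1e-6, seed=0):
    """Generic EM driver with random restarts, mirroring the procedure
    described above: run EM from many random initial parameter values
    and keep the run whose final parameters have the highest log
    posterior. The three callables stand in for the model-specific
    steps and are assumptions of this sketch."""
    rng = np.random.default_rng(seed)
    best_theta, best_lp = None, -np.inf
    for _ in range(n_restarts):
        theta = init(rng)
        lp = log_posterior(theta)
        for _ in range(max_iters):
            expectations = e_step(theta)   # stands in for P(F | S, C, theta*)
            theta = m_step(expectations)   # stands in for argmax of Q(theta, theta*)
            new_lp = log_posterior(theta)
            if new_lp - lp < tol:          # stop when improvement stalls
                lp = new_lp
                break
            lp = new_lp
        if lp > best_lp:
            best_theta, best_lp = theta, lp
    return best_theta, best_lp
```

Because each EM iteration cannot decrease the posterior, the inner loop only terminates when improvement falls below the tolerance, and the restart loop guards against the local maxima discussed above.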

Acknowledgments

This research was supported by NSF Grant SES-0631602 to A. L. Cohen. M. G. Ross was supported by NIMH grant MH16745.

The authors thank Florin Cutzu, Arnab Dhua, Jason Gold, Michelle Greene, Tom Griffiths, Erik Learned-Miller, Richard Murray, Adam Sanborn, Richard Shiffrin, Mark Steyvers, and Chen Yu for helpful discussions, information, ideas, and insights. We especially thank Jason Gold for co-designing, conducting, and providing the data for the square-detection experiment.

The authors also thank the editor and anonymous reviewers for many helpful suggestions that improved the article.

Portions of this research were previously published as "GRIFT: A graphical model for inferring visual classification features from human data" in Advances in Neural Information Processing Systems 20 (2008), the proceedings of the 2007 Neural Information Processing Systems (NIPS) conference.

Commercial relationships: none.

Corresponding author: Michael G. Ross.

Email: mgross@mit.edu.

Address: Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA.

Footnotes

^{1} Maintaining the requirement that *γ*_{r} ≤ *γ*_{r−1} during model fitting is inconvenient. An alternative parameterization is *γ*_{r} = *γ*_{1} − $\sum_{z=2}^{r} \alpha_{z}^{2}$ for all *r* ≥ 2. Any choice for *γ*_{1} and (*α*_{2}, *α*_{3}, …, *α*_{R−1}) will be equivalent to *γ*_{r} parameters with the desired ordering.
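A quick check of the reparameterization above, with hypothetical values: because each *α*_{z}² term is non-negative, the resulting thresholds are non-increasing by construction:

```python
def gammas_from_alphas(gamma1, alphas):
    """Expand unconstrained alphas into ordered thresholds via
    gamma_r = gamma_{r-1} - alpha_r^2. Each subtracted term is
    non-negative, so gamma_r <= gamma_{r-1} automatically."""
    gammas = [gamma1]
    for a in alphas:
        gammas.append(gammas[-1] - a ** 2)
    return gammas

# Example with hypothetical values: 2.0 - 0.5^2 = 1.75, 1.75 - 1.0 = 0.75, ...
# gammas_from_alphas(2.0, [0.5, -1.0, 0.0]) -> [2.0, 1.75, 0.75, 0.75]
```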


^{2} The results reported below are not specific to this particular choice of prior. We also investigated a version of GRIFT with a unimodal prior on *λ*. In particular, Equation 11 was replaced with a normal distribution, *P*(*λ*_{i}) = $\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{\lambda_{i}^{2}}{2\sigma^{2}}\right)$, with the standard deviation, *σ*, set to both 1 and 2. In both cases, GRIFT produced results (on the data from the corners simulated observer in the four-square experiment discussed below) that were qualitatively and quantitatively similar to GRIFT implemented with the prior defined by Equation 11.


^{3} If a zero-mean normal prior is applied to the *λ* parameters (see Footnote 2), always-active or never-active features do not tend to appear. Instead, GRIFT produces features with *λ*_{i} = 0 (the value that maximizes the *λ* prior). This result conveys the same information: that the model has too many feature detectors.


^{4} In addition to AIC, several alternative model-selection approaches were tried, and each approach was judged based on its ability to indicate the correct number of features on the four-square simulated data. Prior work (Ross & Cohen, 2008) measured the mutual information between the feature detectors and the classifications, but, unlike AIC, this approach required a subjective judgment of the model size at which the mutual information curve appeared to be leveling off. The Bayesian information criterion (BIC) (Schwarz, 1978) penalized model complexity too heavily. Four-fold and five-times-repeated two-fold (Dietterich, 1998) cross validation were unreliable given the size of the data set. Leave-one-out cross validation might have been successful, but was not computationally tractable given the current implementation. Because GRIFT uses improper priors (discussed previously), the Bayesian marginal likelihood (see Bishop, 2006) approach was not available.


^{5} The truncation ensured that the stimulus pixel values remained within the display's output range.


^{6} Our simulation employed two features, one that fires for Class 1 patterns (top brighter than bottom) and one that fires for Class 2 patterns (bottom brighter than top). It turns out, however, that these two features are logically equivalent to a GRIFT model that contains only one top-bottom contrast-sensitive feature and an appropriate *γ*.


^{7} Initials, rather than participant numbers, are used to facilitate comparison with past work.


^{8} Although the results of a poorly performing observer provide an informative contrast in this instance, we suggest that, in most cases, researchers should avoid this issue by continuing to stair-case stimulus contrast levels throughout an experiment.

References

Agresti, A. (2002). *Categorical data analysis*. New York: Wiley-Interscience.

Ahumada, A. J., Jr. (2002). Classification image weights and internal noise level estimation. *Journal of Vision*, 2(1):8, 121–131, http://journalofvision.org/2/1/8/, doi:10.1167/2.1.8.

Akaike, H. (1974). A new look at the statistical model identification. *IEEE Transactions on Automatic Control*, 19, 716–723.

Anzai, A., Peng, X., & Van Essen, D. C. (2007). Neurons in monkey visual area V2 encode combinations of orientations. *Nature Neuroscience*, 10, 1313–1321.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. *Journal of the Royal Statistical Society: Series B (Methodological)*, 36, 192–236.

Bishop, C. M. (1995). *Neural networks for pattern recognition*. New York: Oxford University Press.

Bishop, C. M. (2006). *Pattern recognition and machine learning*. New York: Springer.

Borg, I., & Groenen, P. (1997). *Modern multidimensional scaling: Theory and applications*. New York: Springer.

Brainard, D. H. (1997). The Psychophysics Toolbox. *Spatial Vision*, 10, 433–436.

Cohen, A. L., Shiffrin, R. M., Gold, J. M., Ross, D. A., & Ross, M. G. (2007). Inducing features from visual noise. *Journal of Vision*, 7(8):15, 1–14, http://journalofvision.org/7/8/15/, doi:10.1167/7.8.15.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. *Journal of the Royal Statistical Society, Series B (Methodological)*, 39, 1–38.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. *Neural Computation*, 10, 1895–1923.

Forsyth, D. A., & Ponce, J. (2003). *Computer vision: A modern approach*. Upper Saddle River: Prentice Hall.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). *Bayesian data analysis*. Boca Raton: Chapman & Hall/CRC.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 6, 721–741.

Gold, J., Bennett, P. J., & Sekuler, A. B. (1999). Identification of band-pass filtered letters and faces by human and ideal observers. *Vision Research*, 39, 3537–3560.

Gold, J. M., Cohen, A. L., & Shiffrin, R. (2006). Visual noise reveals category representations. *Psychonomic Bulletin & Review*, 13, 649–655.

Gold, J. M., Murray, R. F., Bennett, P. J., & Sekuler, A. B. (2000). Deriving behavioural receptive fields for visually completed contours. *Current Biology*, 10, 663–666.

Gosselin, F., & Schyns, P. G. (2001). Bubbles: A technique to reveal the use of information in recognition tasks. *Vision Research*, 41, 2261–2271.

Macmillan, N. A., & Creelman, C. D. (2005). *Detection theory: A user's guide*. Philadelphia: Lawrence Erlbaum Associates.

Murray, R. F., & Gold, J. M. (2004). Troubles with bubbles. *Vision Research*, 44, 461–470.

Palmer, S. E. (1999). *Vision science: Photons to phenomenology*. Cambridge: The MIT Press.

Pearl, J. (1988). *Probabilistic reasoning in intelligent systems: Networks of plausible inference*. San Mateo, CA: Morgan Kaufmann.

Pelli, D. G., Farell, B., & Moore, D. C. (2003). The remarkable inefficiency of word recognition. *Nature*, 423, 752–756.

Ross, M. G., & Cohen, A. L. (2008). GRIFT: A graphical model for inferring visual classification features from human data. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), *Advances in neural information processing systems* (Vol. 20, pp. 1217–1224). Cambridge: The MIT Press.

Rust, N. C., Mante, V., Simoncelli, E. P., & Movshon, J. A. (2006). How MT cells analyze the motion of visual patterns. *Nature Neuroscience*, 9, 1421–1431.

Schwarz, G. (1978). Estimating the dimension of a model. *The Annals of Statistics*, 6, 461–464.

Sergent, J. (1984). An investigation into component and configural processes underlying face perception. *British Journal of Psychology*, 75, 221–242.