To clarify the presentation, it is important to define the terms “pixel,” “image,” and “feature.” A pixel is a square region of a stimulus that has a homogeneous brightness value. In an experiment, each stimulus pixel may be displayed using one or more computer monitor pixels depending on the spatial scaling applied to the stimulus—all references to “pixels” will refer to units of the stimulus, rather than those of the display hardware. An image is a collection of grayscale pixels arranged in a two-dimensional matrix. Images can be indexed by row and column. The two-dimensional location of particular pixels, however, is often unimportant. In this work, it is sometimes simpler to use a single pixel index. For example, the 20th pixel in a stimulus will be referred to as S_20.
The definition of the final term, “feature,” is undoubtedly the most controversial. Researchers have defined feature in many different ways, but in this paper a feature is a pixel pattern detected by a linear classifier. A linear classifier is a set of weights assigned to each image pixel along with a threshold value. If the weighted sum of an image's pixels is less than the threshold, the feature is considered present in the image; otherwise the feature is absent. Linear classifiers can define many useful types of features, including those based on the contrast between different image regions. Because the human visual system is contrast-sensitive (Palmer, 1999) and contrast features can indicate the presence or absence of specific pixel patterns, such linear features are particularly informative. Extensions of GRIFT to different types of features are discussed below. Under this definition of feature, the classification image model, which models classification as the result of a single linear classifier, is a single-feature model, while GRIFT is a multi-feature model.
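To make the definition concrete, the following minimal sketch (in Python/NumPy, purely for illustration) implements a deterministic linear-classifier feature detector; the pixel values, weights, and threshold are invented for the example and follow the convention above that the feature is present when the weighted sum falls below the threshold.

```python
import numpy as np

def feature_present(stimulus, weights, threshold):
    """Deterministic linear classifier: the feature is 'present' when the
    weighted sum of the stimulus pixels falls below the threshold."""
    return float(np.dot(weights, stimulus)) < threshold

# Toy example: a 4-pixel stimulus and an arbitrary weight vector that is
# sensitive to the contrast between pixel pairs.
stimulus = np.array([0.2, 0.8, 0.1, 0.9])   # grayscale pixel values
weights = np.array([1.0, -1.0, 1.0, -1.0])
print(feature_present(stimulus, weights, threshold=0.0))  # True: sum = -1.4 < 0
```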
GRIFT represents the classification process as a Bayesian network (Pearl, 1988), which is a type of graphical model. Physicists, statisticians, and computer scientists developed graphical models to describe the probabilistic interactions between variables. Graphical models are fundamental to modern research in artificial intelligence (e.g., Bishop, 2006) and computer vision (e.g., Forsyth & Ponce, 2003) and are playing a larger role in psychology (e.g., Rehder, 2003). The adjective “graphical” refers to the fact that these models are commonly represented by a graph in which nodes indicate variables and edges connecting the nodes indicate the influence of one variable on another. Bishop (2006) contains full details on all the types of graphical models discussed in this paper.
A Bayesian network (typically shortened to “Bayes net”) is a causal graphical model. It represents each variable as a node, and connects them with directed edges (lines with arrows at one end) that point from a cause to an effect. The diagram in Figure 1 represents the causal relationships between the variables of the GRIFT model. The GRIFT model describes each classification, C, as the result of a stimulus, S, which is processed by a set of N feature detectors, F = {F_1, F_2, …, F_N}, that are instantiated as linear classifiers. Because there is an arrow from the stimulus S to each F_i, S is a parent of each F_i, and S directly influences the probability distribution of their values. Similarly, each F_i directly influences C. No other variables directly influence each other. In particular, S only influences C through the F_i's, and the F_i's do not directly influence each other. Variables that do not directly influence each other are called conditionally independent. Bayes nets efficiently model the interaction of many variables by encoding assumptions about their conditional independence.
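The causal structure can also be read as a generative recipe: fix or sample S, evaluate each feature detector given S alone, and then choose C given only the feature activations. The sketch below illustrates this ancestral-sampling order in Python; the two placeholder conditional-probability functions are arbitrary stand-ins for the logistic forms defined later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_classification(stimulus, feature_probs, class_prob):
    """Ancestral sampling through the GRIFT structure: each F_i depends only
    on the stimulus S, and the response C depends only on the features F."""
    # Each feature detector fires independently given the stimulus.
    f = np.array([int(rng.random() < p(stimulus)) for p in feature_probs])
    # The response depends on the stimulus only through the detected features.
    c = 1 if rng.random() < class_prob(f) else 2
    return f, c

# Placeholder conditional distributions (stand-ins for the logistic forms
# defined later in this section).
feature_probs = [lambda s: 0.9 if s.mean() > 0.5 else 0.1,   # P(F_1 = 1 | S)
                 lambda s: 0.2]                              # P(F_2 = 1 | S)
class_prob = lambda f: 0.8 if f[0] == 1 else 0.3             # P(C = 1 | F)

print(sample_classification(np.array([0.7, 0.6, 0.9]), feature_probs, class_prob))
```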
Because an image can be represented as a vector of pixel values, each image is a point in a high-dimensional space. The linear classifier associated with each feature describes a boundary in this image space that separates images that have that feature from images that do not. That is, each feature detector is sensitive to a particular type of input—for example, feature detector F_1 might respond best to the vertical line of a ‘P’, while feature detector F_2 might respond best to the circle in a ‘Q’. The F_i variables are binary and indicate if a feature has been detected (F_i = 1) or not (F_i = 0). Whereas a typical linear classifier is deterministic, the feature detectors in GRIFT are probabilistic. An image that falls near the boundary will have an approximately 50% chance of activating the feature detector, an image far to one side of the boundary will produce nearly a 100% probability of activation, and an image far to the other side will produce nearly a 0% probability of activation. Therefore, a clear, low-noise, high-contrast image of a ‘P’ will almost certainly cause F_1 = 1, but a noisy, low-contrast image of a ‘Q’ might only have a small chance of causing F_2 = 1.
The feature activations influence the classification, C, of a stimulus. In most of this paper, we assume two response classes, but an extension to ratings is described below and other generalizations are possible. The presence of some features increases the probability of choosing Class 1 (C = 1) and the presence of others increases the probability of choosing Class 2 (C = 2). Returning to the ‘P’ and ‘Q’ example, activation of the vertical line feature would increase the probability of responding ‘P’. Activation of a circle feature would increase the probability of responding ‘Q’.
Mathematically, the Bayes net representation is useful because it is an efficient representation for the joint probability distribution of all the model variables. For a Bayes net with variables V = {V_1, V_2, …, V_M}, the joint probability distribution is

P(V_1, V_2, \ldots, V_M) = \prod_{i=1}^{M} P(V_i \mid \mathrm{parents}(V_i)),   (1)

where P(V_i ∣ parents(V_i)) is the conditional probability distribution of V_i given its parents in the graph. Therefore, in the GRIFT model,

P(S, F, C) = P(S)\, P(C \mid F) \prod_{i=1}^{N} P(F_i \mid S).   (2)

This factorization generally provides a much more efficient representation than would be possible if the joint distribution were represented without any assumptions about causality or conditional independence.
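For example, with N = 2 feature detectors, Equation 2 expands to

P(S, F_1, F_2, C) = P(S)\, P(F_1 \mid S)\, P(F_2 \mid S)\, P(C \mid F_1, F_2),

which requires specifying only the three kinds of conditional distributions described next, rather than an unrestricted joint distribution over all four variables.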
Now that the structure of the model is established, the next task is to specify its component conditional probability distributions: P(S), P(C ∣ F), and P(F_i ∣ S). The distribution of the stimuli, P(S), is under the control of the experimenter. Fitting the GRIFT model to experimental data only relies on the assumption that, across trials, the stimuli sampled from this distribution are independent and identically distributed.
The conditional distribution of each feature detector's value, P(F_i ∣ S), is modeled as a logistic regression function on the pixel values of S. Because the feature detectors are assumed to be linear classifiers, each feature distribution is governed by two parameters, a weight vector ω_i and a threshold −β_i, such that

P(F_i = 1 \mid S) = \frac{1}{1 + \exp\!\left(\sum_{j=1}^{|S|} \omega_{ij} S_j + \beta_i\right)}   (3)

and

P(F_i = 0 \mid S) = 1 - P(F_i = 1 \mid S),   (4)

where ∣S∣ is the number of pixels in a stimulus and ω_ij is the jth element of vector ω_i. The logistic regression function satisfies the probabilistic classification properties outlined above: Stimuli near the boundary are classified less deterministically than those far from the boundary. In image pixel space, the weights define the orientation of the boundary that determines the presence or absence of the feature. They also determine the degree of probabilistic behavior that the detector will exhibit—weights with larger absolute values lead to more deterministic output. The weights and threshold jointly determine the probability that a feature is detected for a particular image. There is a 50% probability that a feature will be detected when Σ_j ω_ij S_j = −β_i. When Σ_j ω_ij S_j > −β_i, i.e., the image is on the “absent” side of the linear boundary, there is less than a 50% chance that the feature will be detected. Likewise, when Σ_j ω_ij S_j < −β_i, i.e., the image is on the “present” side of the linear boundary, there is a greater than 50% chance that the feature will be detected.
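As a concrete, hypothetical illustration of Equation 3, the following NumPy sketch evaluates a feature detector's activation probability. The weights, threshold, and stimuli are invented for the example; the convention follows the text above, so the detection probability exceeds 50% when the weighted pixel sum falls below −β_i.

```python
import numpy as np

def p_feature_given_stimulus(stimulus, omega_i, beta_i):
    """P(F_i = 1 | S): logistic function of the weighted pixel sum.
    The probability exceeds 0.5 when the weighted sum is below -beta_i."""
    return 1.0 / (1.0 + np.exp(np.dot(omega_i, stimulus) + beta_i))

omega_i = np.array([0.5, -0.5, 0.5, -0.5])   # illustrative weights
beta_i = 0.0

print(p_feature_given_stimulus(np.array([0.0, 1.0, 0.0, 1.0]), omega_i, beta_i))  # ~0.73
print(p_feature_given_stimulus(np.array([1.0, 0.0, 1.0, 0.0]), omega_i, beta_i))  # ~0.27
```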
The conditional distribution of C is represented by a logistic regression function on the feature outputs. Therefore, the conditional distribution of C is determined by a weight vector λ and a threshold −γ that determine the impact of feature activations on the probability of a particular response, such that

P(C = 1 \mid F) = \frac{1}{1 + \exp\!\left(\sum_{i=1}^{N} \lambda_i F_i + \gamma\right)}   (5)

and

P(C = 2 \mid F) = 1 - P(C = 1 \mid F).   (6)

Detecting a feature with negative λ_i increases the probability that the observer will respond “Class 1” and detecting a feature with positive λ_i increases the probability that the observer will respond “Class 2.” Note that γ serves the same role as β_i in Equation 3.
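A matching sketch for Equations 5 and 6, again with purely illustrative parameter values: the feature with λ_1 = −2 pushes responses toward Class 1, and the feature with λ_2 = +2 pushes them toward Class 2.

```python
import numpy as np

def p_class1_given_features(f, lam, gamma):
    """P(C = 1 | F): logistic function of the weighted feature activations.
    Negative lambda_i values push the response toward Class 1."""
    return 1.0 / (1.0 + np.exp(np.dot(lam, f) + gamma))

lam = np.array([-2.0, 2.0])     # F_1 favors Class 1, F_2 favors Class 2
gamma = 0.0

print(p_class1_given_features(np.array([1, 0]), lam, gamma))  # ~0.88
print(p_class1_given_features(np.array([0, 1]), lam, gamma))  # ~0.12
```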
Equations 5 and 6 can be generalized to represent a conditional probability distribution over ratings rather than classifications. Instead of only allowing two responses based on the feature detector values, ratings allow an observer to respond with an integer between 1 and R, where 1 indicates “definitely class 1,” R indicates “definitely class 2,” and the values in between indicate intermediate degrees of belief. Because they are ordered, rating probabilities can be represented with ordinal logistic regression (Agresti, 2002). The probability of responding with a rating less than or equal to r, where 1 ≤ r ≤ R − 1, is given by

P(C \le r \mid F) = \frac{1}{1 + \exp\!\left(\sum_{i=1}^{N} \lambda_i F_i + \gamma_r\right)},

in which γ is a vector with R − 1 elements for which γ_r ≤ γ_{r−1}. The probability of rating R is therefore

P(C = R \mid F) = 1 - P(C \le R - 1 \mid F),

the probability of rating 1 is

P(C = 1 \mid F) = P(C \le 1 \mid F),

and, for any other rating,

P(C = r \mid F) = P(C \le r \mid F) - P(C \le r - 1 \mid F).

If R = 2, these equations correspond exactly to the binary classification probabilities described in Equations 5 and 6.
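The ordinal rule can be implemented by differencing the cumulative logistic terms. The sketch below does this for a hypothetical R = 4 rating scale; the λ and γ values are invented for illustration, and with a single threshold the output reduces to the binary case.

```python
import numpy as np

def rating_probabilities(f, lam, gamma):
    """P(C = r | F) for ratings 1..R, built by differencing the cumulative
    logistic P(C <= r | F); gamma holds R-1 non-increasing thresholds."""
    z = np.dot(lam, f)
    cumulative = 1.0 / (1.0 + np.exp(z + np.asarray(gamma)))   # P(C <= r), r = 1..R-1
    cumulative = np.concatenate(([0.0], cumulative, [1.0]))
    return np.diff(cumulative)

lam = np.array([-2.0, 2.0])
gamma = [1.0, 0.0, -1.0]                     # R = 4 ratings, gamma_r <= gamma_{r-1}
probs = rating_probabilities(np.array([1, 0]), lam, gamma)
print(probs, probs.sum())                    # four probabilities summing to 1

# With R = 2 (a single threshold), this reduces to Equations 5 and 6.
print(rating_probabilities(np.array([1, 0]), lam, [0.0]))
```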
Figure 2 shows the full GRIFT model, including the parameters. To avoid clutter, the figure uses plate notation, in which duplicated model structures are drawn once and enclosed in a box. The ‘N’ in the lower right corner indicates that all the variables in the box and their connections to other variables are duplicated once for each of the features in the model. Note that in Figure 2 the parameters are represented as parents of the previously described GRIFT variables. In accordance with the techniques of Bayesian statistics (Gelman, Carlin, Stern, & Rubin, 2004), the parameters are themselves treated as random variables.
Given data from an observer, i.e., a set of trials, each represented by a stimulus, S, and a response, C, the goal is to find the GRIFT parameter values that best account for the data. This parameter search is computationally complex. Even for small images and few features, this model has many parameters: Each ω_i is a vector with as many dimensions as there are pixels in S, and each feature also contributes a β_i and a λ_i parameter. The fact that the F_i variables are hidden, i.e., not directly measurable, also substantially increases the challenge of fitting this model to data. The primary advantage of the Bayesian approach is that it provides a principled way to place constraints on the parameters that make model fitting practical given a reasonable amount of data.
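For concreteness, under the purely illustrative assumption of a 32 × 32-pixel stimulus and N = 4 features, the binary-response model has

N(|S| + 2) + 1 = 4(1024 + 2) + 1 = 4105

free parameters: each feature contributes the |S| weights in ω_i plus β_i and λ_i, and the response model adds the single threshold γ.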
In Bayesian models, constraints on parameters are represented by prior probability distributions. The priors represent our beliefs about the parameters before any experimental evidence is gathered. After data are gathered, knowledge about each parameter is represented by a posterior distribution, which is conditioned on all the observed data. The posterior describes the combined influence of the prior assumptions and the model likelihood given the observed data. As more data are gathered, the influence of the priors decreases.
The prior on each λ_i parameter reflects the assumption that each feature should have a significant impact on the classification, but no single feature should make the classification deterministic. In particular, the prior is a mixture of two normal distributions with means at ±2,

P(\lambda_i) = \tfrac{1}{2}\,\mathcal{N}(\lambda_i; -2, 1) + \tfrac{1}{2}\,\mathcal{N}(\lambda_i; 2, 1).   (7)

This prior has a number of desirable characteristics. First, if the density of a prior is too concentrated, its influence on the results will be very strong unless there are a lot of data. Each component normal distribution in Equation 7 has unit variance, which makes the distributions broad enough that the data collected in our experiments will largely determine the parameter estimates. Second, if λ_i ≈ 0, F_i's output will not significantly influence C. In contrast, if any λ_i is too far from zero, a single active feature can make the response nearly deterministic. To avoid these extremes, most of the mass of the prior should be significantly far from zero, but not concentrated at large positive or negative values. The means of the component normals, −2 and 2, determined empirically, satisfy these constraints.
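A small sketch of this prior density, assuming (as one reading of the description above) that the two components are mixed with equal weight:

```python
import numpy as np
from scipy.stats import norm

def lambda_prior_density(lam_i):
    """Mixture of two unit-variance normals centered at -2 and +2
    (equal mixture weights assumed for illustration)."""
    return (0.5 * norm.pdf(lam_i, loc=-2.0, scale=1.0)
            + 0.5 * norm.pdf(lam_i, loc=2.0, scale=1.0))

for value in (-2.0, 0.0, 2.0, 6.0):
    print(value, lambda_prior_density(value))
# Most mass lies near +/-2: features matter, but a single active feature
# rarely makes the response deterministic.
```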
Because the best γ is largely determined by the λ_i's and the distributions of F and S, γ has a non-informative prior,

P(\gamma) = 1.

This constant prior indicates no preference for any particular value. Although this function does not integrate to 1 as γ ranges from negative to positive infinity, and is therefore not a true probability density, it can be used as an improper prior (Gelman et al., 2004). Improper priors are an accepted Bayesian statistical technique so long as they produce normalized posterior distributions. In GRIFT, P(γ) has no effect on the posterior distributions of the parameters, as demonstrated in the Appendix, and therefore it is an acceptable improper prior. Analogously, P(β_i) = 1 for all i.
Because each ω_i vector has dimensionality equal to the number of pixels in a stimulus, these parameters present the biggest inferential challenge. As mentioned previously, human visual processing is sensitive to contrasts between image regions. If one image region is assigned positive ω_ij's and another is assigned negative ω_ij's, the feature detector will be sensitive to the contrast between them. This contrast between regions requires all the pixels within each region to share similar ω_ij values. To encourage this local structure and reduce the difficulty of recovering the ω_i's, the prior distribution was designed to favor assigning similar weights to neighboring pixels. Each ω_i parameter has a prior distribution given by

P(\omega_i) \propto \prod_{j=1}^{|S|} \left[ \tfrac{1}{2}\,\mathcal{N}(\omega_{ij}; 1, 1) + \tfrac{1}{2}\,\mathcal{N}(\omega_{ij}; -1, 1) \right] \exp\!\left( -\sum_{(j,k) \in A} (\omega_{ij} - \omega_{ik})^2 \right),

where A is the set of neighboring pixel locations in the stimulus. This density function has two elements. The first term is a mixture of two normal distributions. The components have modes at 1 and −1, respectively, and each has unit variance. The combination assigns roughly equal probability to ω_ij values between −1 and 1, but unlike a uniform prior, it places some probability mass at every value and therefore allows ω_ij values to lie outside that range. The second component increases as the weights assigned to neighboring pixels become more similar and decreases as they become more different. This type of probability function is known as a Markov random field (Besag, 1974; see also Bishop, 2006), a class of graphical model frequently used in computer vision and physics. Geman and Geman (1984) pioneered the use of MRFs in computer vision as a model for reconstructing noisy images. For the purpose of fitting the model, there is no need to normalize this distribution because the normalization is constant with respect to ω_i.
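The sketch below evaluates an unnormalized log density with this two-part structure for a toy one-dimensional “image.” The equal mixture weights and the unit coefficient on the squared-difference smoothness term are assumptions made for illustration; the exact constants in the fitted model may differ.

```python
import numpy as np
from scipy.stats import norm

def log_omega_prior(omega, neighbors):
    """Unnormalized log prior on one feature's weight vector omega.
    Per-pixel term: mixture of unit-variance normals centered at +1 and -1.
    Pairwise term: Markov random field smoothness that rewards similar
    weights at neighboring pixel locations."""
    mixture = np.log(0.5 * norm.pdf(omega, loc=1.0, scale=1.0)
                     + 0.5 * norm.pdf(omega, loc=-1.0, scale=1.0)).sum()
    smoothness = -sum((omega[j] - omega[k]) ** 2 for j, k in neighbors)
    return mixture + smoothness

# Toy 1 x 4 "image": neighbors are horizontally adjacent pixels.
neighbors = [(0, 1), (1, 2), (2, 3)]
print(log_omega_prior(np.array([1.0, 1.0, -1.0, -1.0]), neighbors))  # one smooth edge
print(log_omega_prior(np.array([1.0, -1.0, 1.0, -1.0]), neighbors))  # heavily penalized
```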
Using Equation 1 to combine all of the parameters and variables into a single probability distribution, the model is described by

P(S, F, C, \omega, \beta, \lambda, \gamma) = P(S)\, P(\gamma)\, P(C \mid F, \lambda, \gamma) \prod_{i=1}^{N} P(\omega_i)\, P(\beta_i)\, P(\lambda_i)\, P(F_i \mid S, \omega_i, \beta_i).
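To tie the pieces together, the following sketch assembles an unnormalized log version of this joint distribution for the binary-response model, using the same illustrative conventions and prior constants as the earlier sketches (equal mixture weights, unit smoothness coefficient). The flat priors on β and γ contribute only a constant, and P(S) is omitted because the stimulus distribution is fixed by the experimenter.

```python
import numpy as np
from scipy.stats import norm

def sigmoid_low(x):
    """Logistic that assigns probability > 0.5 when x is below zero, matching the
    'present when the weighted sum is below threshold' convention."""
    return 1.0 / (1.0 + np.exp(x))

def log_joint(stimuli, responses, features, omega, beta, lam, gamma, neighbors):
    """Unnormalized log joint of parameters, hidden features, and responses.
    stimuli: (T, |S|), responses: (T,) in {1, 2}, features: (T, N) in {0, 1},
    omega: (N, |S|), beta: (N,), lam: (N,), gamma: scalar."""
    lp = 0.0
    # Priors: lambda mixture at +/-2; omega mixture at +/-1 plus MRF smoothing.
    # The flat (improper) priors on beta and gamma add only a constant.
    lp += np.log(0.5 * norm.pdf(lam, -2.0, 1.0) + 0.5 * norm.pdf(lam, 2.0, 1.0)).sum()
    lp += np.log(0.5 * norm.pdf(omega, 1.0, 1.0) + 0.5 * norm.pdf(omega, -1.0, 1.0)).sum()
    lp += -sum(((omega[:, j] - omega[:, k]) ** 2).sum() for j, k in neighbors)
    # Likelihood: P(F | S) for every trial and feature, then P(C | F).
    # P(S) is fixed by the experimenter, so it is omitted as a constant.
    p_f = sigmoid_low(stimuli @ omega.T + beta)         # (T, N) detection probabilities
    lp += np.log(np.where(features == 1, p_f, 1.0 - p_f)).sum()
    p_c1 = sigmoid_low(features @ lam + gamma)          # (T,) probability of Class 1
    lp += np.log(np.where(responses == 1, p_c1, 1.0 - p_c1)).sum()
    return lp
```

In an actual fitting procedure this quantity would still have to be combined with some treatment of the hidden F_i values, for example summing or sampling over them, which is what makes the parameter search computationally demanding.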