We present new experimental and mathematical techniques aimed at determining the features used in visual object recognition. We conceive of these features as the parts of an object that are treated as unitary wholes when recognizing or discriminating visual objects. For example, consider a task classifying a visual target presented in pixel noise as a “P” or a “Q”. The features may correspond to particular shapes of the target letters. Two such features for “P”, for example, might be a vertical line and an upper-right-facing curve. The decision may be encoded in terms of particular values of such features, and an appropriate combination of these values may determine how the expression is perceived. We utilize recent advances in statistical machine learning techniques to uncover the features used by human observers.

[…] *k* may be nonzero for many different *k*s, as long as the sum of these probabilities equals one. GMMs are a natural way to discover visual similarities that allow a collection of images to be grouped, and, as we shall see, these groups correspond closely to the image features.

Given *N* images, *D* pixels each and *K* clusters, let $p_{nk}$ be the (posterior) probability that image $\tilde{Y}_n$ belongs to cluster *k*; ($\mu_k$, $\Sigma_k$) be the mean and covariance of Gaussian *k*, respectively; and $\pi_k$ be the prior probability that a point is generated by cluster *k*. Then,

- For each data image $\tilde{Y}_n$ and each cluster *k*, compute the probability of image $\tilde{Y}_n$ under cluster *k* as the multivariate Gaussian density given by

  $P(\tilde{Y}_n \mid \mu_k, \Sigma_k) = (2\pi)^{-D/2} \, |\Sigma_k|^{-1/2} \exp\left( -\frac{1}{2} (\tilde{Y}_n - \mu_k)^T \Sigma_k^{-1} (\tilde{Y}_n - \mu_k) \right)$ (2)

- For each cluster *k*, update the parameters for the corresponding Gaussian by computing the mean and covariance of the data points, where the contribution of each point is weighted by the probability with which it belongs to *k*.

  $\mu_k = \dfrac{\sum_{n=1}^{N} p_{nk} \tilde{Y}_n}{\sum_{n=1}^{N} p_{nk}}$ (4)

  $\Sigma_k = \dfrac{\sum_{n=1}^{N} p_{nk} (\tilde{Y}_n - \mu_k)(\tilde{Y}_n - \mu_k)^T}{\sum_{n=1}^{N} p_{nk}}$ (5)

- Repeat, starting at Step 1, until the parameter estimates do not change significantly from one repetition to the next.
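The iteration above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study. The function names (`gaussian_density`, `fit_gmm`), the random initialization, the small ridge term `reg`, and the convergence tolerance are all hypothetical choices. The posterior and prior updates use the standard expectation-maximization formulas, which are assumed here because they are not written out in the equations above.

```python
import numpy as np

def gaussian_density(Y, mu, cov):
    """Equation 2: multivariate Gaussian density of each row of Y."""
    D = Y.shape[1]
    diff = Y - mu
    # Mahalanobis term (Y - mu)^T cov^{-1} (Y - mu), via solve rather than inverse.
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(cov, diff.T).T)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (D * np.log(2 * np.pi) + logdet + maha))

def fit_gmm(Y, K, n_iter=200, tol=1e-6, reg=1e-6, seed=0):
    """Fit a K-component GMM to the rows of Y (N images x D pixels)."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    # Assumed initialization: K random images as means, pooled covariance.
    mu = Y[rng.choice(N, size=K, replace=False)].copy()
    pooled = np.atleast_2d(np.cov(Y.T)) + reg * np.eye(D)
    cov = np.array([pooled.copy() for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # Density of every image under every cluster (Equation 2).
        dens = np.column_stack(
            [gaussian_density(Y, mu[k], cov[k]) for k in range(K)]
        )
        # Assumed standard posterior: prior times density, normalized over k.
        # For many pixels this should be done in log space to avoid underflow.
        weighted = dens * pi
        p = weighted / weighted.sum(axis=1, keepdims=True)
        Nk = p.sum(axis=0)
        # Weighted mean (Equation 4) and weighted covariance (Equation 5).
        new_mu = (p.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - new_mu[k]
            cov[k] = (p[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)
        pi = Nk / N  # assumed standard prior update
        shift = np.abs(new_mu - mu).max()
        mu = new_mu
        if shift < tol:  # parameters no longer change significantly
            break
    return mu, cov, pi, p

# Hypothetical two-pixel data: two noisy clusters of 500 images each.
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0.3, 0.05, (500, 2)), rng.normal(0.7, 0.05, (500, 2))])
mu, cov, pi, p = fit_gmm(Y, K=2)
```

Each row of `p` holds the posterior probabilities $p_{nk}$ for one image and sums to one, so an image can belong partly to several clusters.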

The *x*- and *y*-axes of the figure correspond to the grayscale values for the first and second pixels, respectively. Targets A and B are shown in blue and red, respectively. As described previously, 1,000 stimuli were generated, 500 from each target.

[…] $L(F_l)$, that each feature, $F_l$, is present in the stimulus (relative to random noise) is calculated. Using a Bayesian-inspired decision rule in which the likelihood ratio of each feature is weighted based on how well the feature discriminates the two targets, these likelihood ratios are then combined within each response category (*L*(A) and *L*(B) for Response Categories A and B, respectively). More weight is given to more diagnostic features. The observer selects the response with the greatest combined likelihood ratio. Some features may be sensitive only to a part of a target, as is the case for the Category A feature, which ignores Pixel 1 and matches a stimulus based solely on Pixel 2. Details of this decision model for the simulated observer are given in the 1.

[…] *P*(A) = *L*(A)/(*L*(A) + *L*(B)), and likewise, for “B”, *P*(B) = *L*(B)/(*L*(A) + *L*(B)) = 1 − *P*(A). Assuming the features are sufficiently distinct, high-confidence stimuli tend to be produced by high activation of a single feature.
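As a small illustration of the two expressions for *P*(A) and *P*(B), the sketch below turns a pair of combined likelihood ratios into response probabilities and picks the response with the larger ratio. The function names are hypothetical, the inputs are assumed to be the already combined, positive values *L*(A) and *L*(B), and breaking ties in favor of "A" is an arbitrary assumption.

```python
def response_probabilities(L_A, L_B):
    """Return (P(A), P(B)) from the combined likelihood ratios L(A) and L(B)."""
    total = L_A + L_B
    if L_A < 0 or L_B < 0 or total <= 0:
        raise ValueError("likelihood ratios must be non-negative and not both zero")
    p_a = L_A / total
    return p_a, 1.0 - p_a  # P(B) = 1 - P(A)

def choose_response(L_A, L_B):
    """Select the category with the greater combined likelihood ratio."""
    return "A" if L_A >= L_B else "B"  # tie goes to "A" (assumption)

# Hypothetical values for one stimulus.
p_a, p_b = response_probabilities(3.0, 1.0)
choice = choose_response(3.0, 1.0)
```

A stimulus with *P*(A) close to 1 or close to 0 would count as a high-confidence stimulus in this scheme.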

Targets included? | High confidence only? | Category | ln(*L*) |
---|---|---|---|
No | Yes | A | 34,759 |
No | Yes | B | 30,961 |
No | No | A | 185,655 |
No | No | B | 164,865 |
Yes | Yes | A | 55,602 |
Yes | Yes | B | 30,058 |
Yes | No | A | 176,828 |
Yes | No | B | 157,883 |

Observer | Category | Number of features per category: 1 | Number of features per category: 2 |
---|---|---|---|
Observer 1 | Top | 3,348 | 3,481 |
Observer 1 | Bottom | 3,292 | 3,421 |
Observer 2 | Top | 3,582 | 3,631 |
Observer 2 | Bottom | 3,918 | 3,970 |
Observer 3 | Top | 3,327 | 3,482 |
Observer 3 | Bottom | 3,376 | 3,571 |
Participant 1 | Top | −108,955 | −108,893 |
Participant 1 | Bottom | −104,369 | −104,315 |
Participant 2 | Top | −101,314 | −101,253 |
Participant 2 | Bottom | −106,749 | −106,688 |
Participant 3 | Top | −121,100 | −121,040 |
Participant 3 | Bottom | −87,088 | −87,032 |
Participant 4 | Top | −99,750 | −99,689 |
Participant 4 | Bottom | −107,971 | −107,911 |

Let $X_{mK}$ be the *m*th target from response category *K*. Let $X_{mK}^{i,j}$ be the (*i*, *j*)th gray-level pixel value from $X_{mK}$.

[…] *K* response categories. Let $\tilde{Y}_n$ be the *n*th test image. Let $\tilde{Y}_n^{i,j}$ be the (*i*, *j*)th gray-level pixel value from test image $\tilde{Y}_n$. Let $Y_n$ be the original image from which […] *σ* < .25 and *N* is distributed as a Gaussian with mean 0 and SD *σ*. To ensure that […] *σ*, 1 − 2 · *σ*], and second, resample if *N*(0, *σ*) ∉ (−2*σ*, 2*σ*).
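The resampling rule can be sketched as follows. This is an illustration under loudly stated assumptions rather than the procedure actually used: the function names are hypothetical, the noise is assumed to be added independently to each pixel, and the original image is assumed to be linearly rescaled into [2*σ*, 1 − 2*σ*] (the lower bound is a guess that mirrors the upper bound). With *σ* < .25 that interval is non-empty, and noise confined to (−2*σ*, 2*σ*) keeps every noisy pixel strictly between 0 and 1.

```python
import numpy as np

def truncated_gaussian_noise(shape, sigma, rng):
    """Draw N(0, sigma) noise, resampling any value outside (-2*sigma, 2*sigma)."""
    noise = rng.normal(0.0, sigma, size=shape)
    outside = np.abs(noise) >= 2 * sigma
    while outside.any():
        # Redraw only the offending entries until all lie inside the interval.
        noise[outside] = rng.normal(0.0, sigma, size=int(outside.sum()))
        outside = np.abs(noise) >= 2 * sigma
    return noise

def make_test_image(original, sigma, rng):
    """Rescale an image into [2*sigma, 1 - 2*sigma] (assumed), then add noise."""
    if not 0 < sigma < 0.25:
        raise ValueError("sigma must lie in (0, 0.25)")
    lo, hi = original.min(), original.max()
    if hi > lo:
        unit = (original - lo) / (hi - lo)  # map to [0, 1]
    else:
        unit = np.full(original.shape, 0.5)  # flat image: use mid-gray
    scaled = 2 * sigma + unit * (1 - 4 * sigma)
    return scaled + truncated_gaussian_noise(original.shape, sigma, rng)

# Hypothetical 2 x 2 target with sigma = .22, the two-pixel simulation value.
rng = np.random.default_rng(0)
target = np.array([[0.0, 1.0], [0.5, 0.25]])
test_image = make_test_image(target, 0.22, rng)
```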

[…] $X_{mK}$ and […] *K*, $X_{mK}$ and […] $F_l$ be feature *l*. Let $F_l^{i,j}$ be the (*i*, *j*)th gray-level pixel value of feature *l*. Each $F_l$ is the same size as $X_{mK}$ and […] $F_l^{i,j}$ contributes differently to the detection of $F_l$. Let $w_{F_l}^{i,j}$ be the weight of pixel (*i*, *j*) in feature *l*, where $0 \le w_{F_l}^{i,j}$ and $w_{F_l}^{i,j} = 0$ means that $F_l^{i,j}$ does not contribute to $F_l$. For simplicity, we may want to assume that $w_{F_l}^{i,j}$ is 0 or 1, but it is not necessary. Note that as currently implemented, there is no advantage to having both *i* and *j* indices. They were both included because later we may want to include constraints such as spatial contiguity of feature pixels. For simplicity, it is assumed that there is no internal noise in the detection of features.

[…] $Y_n$ contained $F_l$ divided by the probability that $F_l$ was produced from uniform random noise in a range defined by the observer's internal estimate, *σ*.

[…] *N*(*x* ∣ 0, […] *x* of a normal probability density function with mean 0 and SD […] *K* and *J*) of one exemplar each described in the text, the relative degree of match of a feature to category *K* is given by […] *J* is the contrast category. One potential generalization of Equation A4 to multiple categories and multiple exemplars per category can be generated by summing over all of the targets for each response category and then summing over each response category in the denominator.

[…] *K* can be given by […] *α* and *β* are parameters used to scale how evidence affects the decision and *J* ranges over all categories. The probability given in Equation A5 was used as a stand-in for a measure of confidence. In the simulations reported above, category *K* was selected if […] *σ* = .22 in the two-pixel simulations, .235 in the four-square simulations, or .2239 for the more complicated feature structure; *α* = 1; and *β* = 1.