We present a Bayesian version of J. Lacroix, J. Murre, and E. Postma's (2006) Natural Input Memory (NIM) model of saccadic visual memory. Our model, which we call NIMBLE (NIM with Bayesian Likelihood Estimation), uses a cognitively plausible image sampling technique that provides a foveated representation of image patches. We conceive of these memorized image fragments as samples from image class distributions and model the memory of these fragments using kernel density estimation. Using these models, we derive class-conditional probabilities of new image fragments and combine individual fragment probabilities to classify images. Our Bayesian formulation of the model extends easily to handle multi-class problems. We validate our model by demonstrating human levels of performance on a face recognition memory task and high accuracy on multi-category face and object identification. We also use NIMBLE to examine the change in beliefs as more fixations are taken from an image. Using fixation data collected from human subjects, we directly compare the performance of NIMBLE's memory component to human performance, demonstrating that using human fixation locations allows NIMBLE to recognize familiar faces with only a single fixation.

*recognition* in the sense used in the experimental psychology literature. It refers to the ability to discriminate previously seen faces from novel faces, based on a study list. In contrast, we use face *identification* to refer to the ability to identify face images as particular individuals). NIM is an exemplar model of memory (Raaijmakers & Shiffrin, 2002), in that it stores memories as points in a vector space and compares memories based on distances in this space. However, NIM differs from standard mathematical psychology models in that (a) it uses actual facial images as input and (b) it is based on the idea of storing fixation-based face fragments, rather than whole face exemplars. The NIM model's memory is reminiscent of a kernel density estimator but differs in important details from a true probabilistic model in the way that the estimates from individual fragments are combined. In this paper, we present a Bayesian version of the NIM model that uses naive Bayes to combine the likelihood estimates from individual fragments. We further extend the model to perform multi-class visual memory tasks and to use a variety of kernels for density estimation. Our model, which we call NIMBLE (for NIM with Bayesian Likelihood Estimation), achieves human levels of performance on a standard face recognition task and also performs multi-class face and object identification tasks with high accuracy. Bayesian combination of individual fragment likelihoods outperforms the combination method from the original NIM model in most cases, and the new kernels far outperform those used in NIM.

$G(i, j, \theta)$ is the magnitude response of a Gabor filter with orientation $\theta$ centered at pixel $(i, j)$, and $\mu_G(i, j)$ is the mean response across all eight orientations. A similar technique developed by Renninger, Coughlan, Verghese, and Malik (2005) defines salience as the entropy, rather than the variance, of local image contours.
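
The variance-based salience described here can be sketched in a few lines of NumPy. This is a minimal illustration rather than the original implementation: the array layout `(H, W, n_orient)`, the function name, and the decision to leave the variance unnormalized are all assumptions made for the example.

```python
import numpy as np

def orientation_variance_salience(gabor_mag: np.ndarray) -> np.ndarray:
    """Salience as the spread of Gabor magnitude responses across orientations.

    gabor_mag: array of shape (H, W, n_orient) holding G(i, j, theta).
    Returns an (H, W) map. The sum of squared deviations is left
    unnormalized (assumption; the source's exact scaling is not shown).
    """
    gabor_mag = np.asarray(gabor_mag, dtype=float)
    mu = gabor_mag.mean(axis=-1, keepdims=True)   # mu_G(i, j)
    return ((gabor_mag - mu) ** 2).sum(axis=-1)   # spread over theta
```

A pixel whose response is the same at every orientation gets zero salience under this definition, while a pixel dominated by a single orientation scores highly.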

$r$ (a model parameter) of the new fragment in the memory space. Averaging these familiarities over all samples from a new image produces an estimate of the probability that the image is from the class known to the memory. The memory space introduced by the NIM model has been shown to achieve the best known correlation with human judgments of perceptual similarity (Lacroix et al., 2006), and the retrieval methods exhibit human performance effects (such as list length and list strength) on face recognition memory tasks (Lacroix et al., 2004).
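
As a rough illustration of this retrieval rule, the sketch below assumes that the familiarity of a fragment is the number of stored fragments closer than $r$ in Euclidean distance, and that image familiarity is the mean over the image's fragments. The function names, the strict inequality, and the choice of Euclidean distance are hypothetical.

```python
import numpy as np

def fragment_familiarity(f, memory, r):
    """Count stored fragments within distance r of fragment f (assumed rule)."""
    memory = np.asarray(memory, dtype=float)
    dists = np.linalg.norm(memory - np.asarray(f, dtype=float), axis=1)
    return int(np.sum(dists < r))

def image_familiarity(fragments, memory, r):
    """Mean fragment familiarity over all fragments sampled from one image."""
    return float(np.mean([fragment_familiarity(f, memory, r) for f in fragments]))
```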

$m_1, \ldots, m_M\}$, lie within a radius $r$ of the new image fragment. Thus, the familiarity of the new fragment, $f$, is defined by

$N$ fragments $F = \{f_1, \ldots, f_N\}$. In the NIM model, Lacroix et al. (2006) define the familiarity of a test image as the mean of the familiarities of all $N$ fragments taken from that image:

$\beta$ and $\theta$ are parameters of the model used to fit the performance to human data.

$N$ fragments, $F = \{f_1, \ldots, f_N\}$, under the models for each of a number of image classes. For instance, we handle the previously described familiar/unfamiliar faces task as a two-class problem and can additionally handle other classification tasks such as Alice/Bob/Carol/Dan/unknown or dogs/not dogs. For each class, $c$, we use Bayes' rule to compute the posterior distribution:

$p(F \mid c)$ is the likelihood of the set of image fragments under the density model for class $c$, and $p(c)$ is the class prior, which may be learned from experience with training data.

$p(F \mid c)$, by combining the likelihoods of each individual fragment, $p(f_i \mid c)$, as explained in the Naive Bayes fragment combination section. Each of these class-conditional fragment likelihoods is computed using kernel density estimation (see the Kernel density estimation section).
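
To make the combination concrete, here is a small sketch of the posterior computation under naive Bayes. It assumes that the per-fragment log likelihoods $\log p(f_i \mid c)$ have already been obtained from some density model. Working in log space is a choice made here for numerical stability and is not specified by the source. All names are illustrative.

```python
import numpy as np

def naive_bayes_log_posterior(frag_log_lik, log_prior):
    """Combine fragment log likelihoods into log p(c | F).

    frag_log_lik: (N, C) array, entry [i, c] = log p(f_i | c).
    log_prior:    (C,) array of log p(c).
    Assumes fragments are conditionally independent given the class,
    so log p(F | c) is the sum over fragments.
    """
    frag_log_lik = np.asarray(frag_log_lik, dtype=float)
    joint = frag_log_lik.sum(axis=0) + np.asarray(log_prior, dtype=float)
    shifted = joint - joint.max()                  # guard against underflow
    return shifted - np.log(np.exp(shifted).sum())

def classify(frag_log_lik, log_prior):
    """Pick the class with the largest posterior."""
    return int(np.argmax(naive_bayes_log_posterior(frag_log_lik, log_prior)))
```

For example, with two classes, equal priors, and fragment likelihoods of (0.5, 0.4) under the first class and (0.1, 0.2) under the second, the products are 0.2 and 0.02, so the first class receives a posterior of about 0.91.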

$f_i \in F$, given the class, and take the product of the individual fragment likelihoods to obtain an estimate of the overall likelihood function:

$c$ versus all other images. The Bayes decision rule classifies the image as coming from class $c$ when Equation 9 is positive and from class

$f$ under each of these kernels. The sum of these probabilities forms the overall estimate of the likelihood of the new fragment, $p(f \mid c)$. The choice of kernel function and the parameters that control its shape are design features of the model, which we will consider below.

$r$, with uniform density, at the location of each stored exemplar in memory space. The familiarity of a new fragment, $f$, can be viewed as summing its density under all of these uniform kernels. By casting the problem of memory retrieval as a kernel density estimation task, we can explore the model's performance under a variety of kernel functions beyond the hypersphere in Equation 2. Indeed, this NIM kernel prohibits using the naive Bayes combination of fragment likelihoods (Equation 7), since if a test fragment $f$ were to find no stored points within radius $r$, it would be assigned zero likelihood. In that case, even if all other fragments were strongly predictive of the class, the resulting product of fragment likelihoods would be $p(F \mid c) = 0$.
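
This failure mode is easy to reproduce on toy data. The sketch below uses an unnormalized uniform kernel (the volume constant is omitted, an assumption that does not affect the zero) with made-up two-dimensional fragments. Two test fragments fall inside the stored cluster and one falls far outside it.

```python
import numpy as np

def uniform_kernel_likelihood(f, memory, r):
    """Fraction of stored fragments within radius r of f (unnormalized)."""
    d = np.linalg.norm(np.asarray(memory, dtype=float) - np.asarray(f, dtype=float), axis=1)
    return float(np.mean(d < r))

# Hypothetical stored fragments and test fragments.
memory = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1]])
fragments = np.array([[0.05, 0.0], [0.1, 0.05], [3.0, 3.0]])

liks = np.array([uniform_kernel_likelihood(f, memory, r=0.5) for f in fragments])
product = float(np.prod(liks))   # naive Bayes style combination
mean = float(np.mean(liks))      # mean-familiarity style combination
```

The single outlying fragment drives the product to zero even though the other two fragments match memory perfectly, whereas the mean remains at two thirds. A kernel with unbounded support, such as a Gaussian, would avoid the exact zero.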

$N(x, \mu, \sigma)$ represents the normal probability density function of $x$ with mean $\mu$ and variance $\sigma$, and $M_c = \{m_1, m_2, \ldots, m_{|M_c|}\}$ is the set of previously memorized fragments from class $c$. The second is a $k$-nearest-neighbor (kNN) kernel:

$V$ is the minimum volume centered at $f$ that contains $k$ stored memories, of which $k_c$ are from class $c$ (Bishop, 1995).
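
The two kernels described above can be sketched as follows. Several details are assumptions rather than facts from the text: the Gaussian kernel is taken to be isotropic with `sigma` used as a standard deviation and the estimate averaged over stored fragments, and the kNN estimate is taken to be the textbook form $k_c / (|M_c| V)$ with $V$ the volume of the ball reaching the $k$-th nearest stored fragment. The small `eps` floor on the radius is added here only to avoid division by zero.

```python
import math
import numpy as np

def gaussian_kde_likelihood(f, memory_c, sigma):
    """Estimate p(f | c) with an isotropic Gaussian at each stored fragment."""
    memory_c = np.asarray(memory_c, dtype=float)
    dim = memory_c.shape[1]
    sq_dist = np.sum((memory_c - np.asarray(f, dtype=float)) ** 2, axis=1)
    norm = (2.0 * np.pi * sigma ** 2) ** (-dim / 2.0)
    return float(norm * np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2))))

def knn_likelihoods(f, memory, labels, k, n_classes, eps=1e-12):
    """Estimate p(f | c) for every class as k_c / (|M_c| * V)."""
    memory = np.asarray(memory, dtype=float)
    labels = np.asarray(labels, dtype=int)
    dist = np.linalg.norm(memory - np.asarray(f, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]                 # indices of the k neighbors
    radius = max(dist[nearest[-1]], eps)           # distance to the k-th one
    dim = memory.shape[1]
    unit_ball = math.pi ** (dim / 2.0) / math.gamma(dim / 2.0 + 1.0)
    volume = unit_ball * radius ** dim             # V
    k_c = np.bincount(labels[nearest], minlength=n_classes)
    m_c = np.bincount(labels, minlength=n_classes)
    return k_c / (np.maximum(m_c, 1) * volume)
```

Under this reading, any class with no members among the $k$ neighbors receives zero likelihood for that fragment. How the original model treats that case is not stated in this section.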

*Naive Bayes* and *Mean familiarity*, respectively. In each table, we also indicate the best parameter setting (value of $\sigma$ or $k$) for each kernel, where optimization over the parameters was performed using 10 random trials.

Kernel | Face ID accuracy (%), Naive Bayes | Face ID accuracy (%), Mean familiarity | Object ID accuracy (%), Naive Bayes | Object ID accuracy (%), Mean familiarity
---|---|---|---|---
Gaussian (σ = 1, 10) | 85.6 ± 2 | 72.2 ± 2 | 87 ± 1 | 73.7 ± 2
kNN (k = 1) | 89.2 ± 0.6 | 85.8 ± 2 | 92.7 ± 0.7 | 87 ± 0.4

Kernel | Fragment combination | ROC area, 10-D BG | ROC area, 80-D BG
---|---|---|---
Gaussian (σ = 1) | Naive Bayes | 0.94 ± 0.03 | 0.58 ± 0.02
Gaussian (σ = 1) | Mean familiarity | 0.97 ± 0.02 | 0.62 ± 0.13
kNN (k = 1) | Naive Bayes | 0.93 ± 0.05 | 0.97 ± 0.02
kNN (k = 1) | Mean familiarity | 0.96 ± 0.02 | 0.96 ± 0.01

$N = 10$ fragments to represent 3 images (with different lighting, expressions, or orientations) from 29 different FERET face identities or 20 COIL-100 object classes and tested on 3 unseen images from each of these classes. In this *identification* task, the model is presented with a novel test image that it has never seen before, and it must identify which category this novel image belongs to, based on previously studied images from the same face or object category. This is unlike the face *recognition* task described below, in which the model must recognize the exact face images that it has previously studied.

$\sigma$, for the Gaussian kernel depends on the class of images to be identified, since the within-class variance of patches taken from rotating objects (COIL-100) is much higher than the variance across patches sampled from aligned faces (FERET). We fit this parameter by 10-fold cross validation on randomly sampled image sets. Identification task results are shown in Table 1. Our model demonstrates high performance on these multi-class tasks. For example, our best object recognition model (kNN with Naive Bayes) achieves a respectable performance of almost 93%. A state-of-the-art computer vision system for object recognition, Belongie's shape context system (Belongie et al., 2002), achieves 97.6% accuracy on the same task. However, that system uses far more complex (and less biologically plausible) methods for selecting and matching correspondence points.

$p(c \mid F)$, as each fragment is added to $F$. With more information, the posterior for the correct class using naive Bayes likelihood combination (Equation 7) rises toward 1, while the posterior calculated using mean familiarity (Equation 8) remains roughly constant. The posterior probabilities of the 28 incorrect classes are not shown, but since the sum over all 29 classes must equal unity, it is clear that each incorrect class has very low probability, and therefore, the Bayes decision rule (Equation 9) almost always results in correct classification. For comparison, random guessing would set $p(c \mid F) = 1/29$.

$N = 10$ fragments (approximating the number of saccades a human makes in 3 s) from each of 32 target images of faces. NIMBLE samples each of the 32 target faces and stores the resulting 320 fragments in the model's memory space. During the testing phase, NIMBLE extracts a new set of $N$ fragments from 64 test face images, of which 32 are the original targets and 32 are novel distracters, known as lures. The model's task is to classify each image in the test phase as target (familiar) or lure (unfamiliar).


$k = 1$) uses only one data point, unlike the Gaussian model, which takes input from every point in memory. As a result, the kNN model is less affected by noise.

$p(c \mid F)$ and lure distributions $p(F)$. Computing the (log) ratio of these probabilities (as in Equation 9) for each image provides a ranking of the images in order of how likely they are to be a familiar target image; larger values of … $p(c)$ and $p$( … $A'$). For example, when $p(c) = 0$, all images are deemed to be lures, whereas when $p$( … $A'$ (a bias-free, nonparametric estimate of ROC area) in the range of 0.9 to 1.0 for this task (e.g., Duchaine & Nakayama, 2005; Hsiao & Cottrell, in press), and NIMBLE performs similarly. A more detailed analysis of NIMBLE's performance in comparison to humans is given in the NIMBLE using human fixations section.
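
One common way to turn such a ranking into an ROC area is the rank-based (Mann-Whitney) estimate: the probability that a randomly chosen target scores higher than a randomly chosen lure, with ties counted as one half. The sketch below assumes that each test image already has a scalar score, such as a log ratio. Whether the reported ROC areas were computed this way is not stated here, so treat it as an illustration only.

```python
import numpy as np

def roc_area(target_scores, lure_scores):
    """Rank-based ROC area: P(target score > lure score), ties count 0.5."""
    t = np.asarray(target_scores, dtype=float)[:, None]   # column of targets
    l = np.asarray(lure_scores, dtype=float)[None, :]     # row of lures
    wins = (t > l).astype(float) + 0.5 * (t == l)         # all target/lure pairs
    return float(wins.mean())
```

A value of 0.5 corresponds to chance ranking and 1.0 to every target outranking every lure.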

$A'$, a bias-free nonparametric measure of sensitivity that estimates ROC area, showed that the optimal human recognition performance was achieved with two fixations; performance did not improve with additional fixations. This is illustrated in Figure 4: The $A'$ in the two-fixation condition was significantly larger than that in the one-fixation condition ($F(1, 15) = 44.435$, $p < 0.001$); in contrast, the $A'$ values in the two-fixation, three-fixation, and no-restriction (4+ fixations) conditions were not significantly different from each other (no pairwise difference among the three was statistically significant).
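
For readers unfamiliar with $A'$, the sketch below computes it from a hit rate and a false-alarm rate using the commonly cited closed form. The text does not give the formula used in the analysis, so this is an assumption about the standard definition, and the function and argument names are hypothetical.

```python
def a_prime(hit_rate, fa_rate):
    """Nonparametric sensitivity A' from hit and false-alarm rates.

    Uses the commonly cited closed form (assumed, not taken from the source):
      A' = 0.5 + (H - F)(1 + H - F) / (4 H (1 - F))   if H > F
      A' = 0.5 - (F - H)(1 + F - H) / (4 F (1 - H))   if H < F
    """
    h, f = float(hit_rate), float(fa_rate)
    if not (0.0 <= h <= 1.0 and 0.0 <= f <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    if h == f:
        return 0.5                      # chance performance
    if h > f:
        return 0.5 + ((h - f) * (1.0 + h - f)) / (4.0 * h * (1.0 - f))
    return 0.5 - ((f - h) * (1.0 + f - h)) / (4.0 * f * (1.0 - h))
```

Under this form, a hit rate of 0.9 with a false-alarm rate of 0.1 gives an $A'$ of about 0.94, which falls in the 0.9 to 1.0 range typical of human performance on this kind of task.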

$\sigma = 0.1$ to best fit the human data. (Note that the results for the original NIMBLE face recognition experiments in Table 2 are very insensitive to the value of this parameter, and setting $\sigma = 0.1$ with the computed salience map provides similar results to those shown in Table 2.)