This paper describes a model for item selection in visual search, which proposes that the visual system first clusters the items in the display, primarily on the basis of inter-item distance, and then searches those clusters for the target item. The search is accelerated by attaching “object files” to each cluster, which list the features possessed by the items in the cluster. Only those clusters whose object files contain the target's features are searched. This model can account for search asymmetry and the search rate for conjunction search. It is better at accounting for target-present RTs than target-absent RTs.
The idea that search might be influenced by clustering of items has been previously proposed by many others (Bundesen & Pedersen, 1983; Duncan, 1995; Duncan & Humphreys, 1989; Humphreys et al., 1989; Kim & Cave, 1999; Poisson & Wilkinson, 1992; Treisman, 1982). The main contribution of the model proposed here is defining exactly how the clustering of items affects search (namely, via the object files attached to the clusters) and specifying a simple algorithm that does a good job of replicating human performance over a range of different search experiments.
Similarly, the hierarchical nature of the clusters is not new. Cluster analysis is, of course, a standard statistical procedure. The search procedure is similar to some algorithms for searching high-dimensional feature spaces (Brin, 1995; Hjaltason & Samet, 2003), although these operate over partitioning trees rather than clusters. One prominent search model that uses a hierarchical architecture is that of Tsotsos (Tsotsos, 1990; Tsotsos et al., 1995). However, the search procedure in the current model relies on the object files that are attached to the clusters, and entire clusters are rejected or inhibited based on those object files. The Tsotsos model is instead aimed at efficiently finding the maximum of a saliency measure over the image. The advantage of the Tsotsos model compared to the model presented here is that it is specified as a neural net and can be applied to raw image data. However, the Tsotsos model does not seem to have been used to predict RT versus set size data over a range of different experiments. In addition, the Tsotsos model does not use object files. FACADE (Grossberg et al., 1994) is another theory of visual search that proposes that items be organized into a hierarchy of groups. In FACADE, the items are partitioned into groups based on a commonality of features, and those groups are then selected (or ignored) and searched in a serial manner. The FACADE model appears to be a roundabout way of clustering items based on proximity and features. FACADE also lacks a concept of object files.
One popular idea in search that this model does not use is salience. Salience is a mapping from feature space into the real line; items with greater salience are more likely to be selected. The cluster search model can generate good fits to RT data without referring to salience at all. This is not to say that some things do not irresistibly grab our attention, but rather that this grabbing of attention might be a different phenomenon from efficient (i.e., constant-time) search.
The cluster algorithm depends on comparing target features to the cluster object files. The object files treat features as if they were either present or absent. However, features are often present to some degree. For example, a red item may have features that are qualitatively different from those of a green item, but only quantitatively different from those of a more intense red item. If object files record the degree to which a feature is present, rather than merely its presence or absence, a different procedure for comparing target and cluster object files is needed. This could be implemented as follows. Let f = {z1, z2, …, zn} be an object file, where zi represents the intensity or amplitude of feature i. Two object files are combined by taking the maximum of each feature intensity. That is, the object file for the merger of clusters Ca and Cb is {max(z1a, z1b), max(z2a, z2b), …, max(zna, znb)}, where zia and zib are the intensities of feature i in clusters Ca and Cb, respectively. A target might be in a cluster if all of its feature intensities were less than or equal to the corresponding intensities in the cluster object file.
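As a minimal illustration, the max-combination rule and the membership test just described can be sketched in a few lines of code (the function names and intensity values here are ours, purely for exposition):

```python
# Sketch of intensity-based object files. An object file is a vector of
# feature intensities; merging two clusters takes the element-wise maximum,
# and a target *might* be in a cluster only if every one of its feature
# intensities is <= the corresponding cluster intensity.

def merge_object_files(fa, fb):
    """Object file for the merger of clusters Ca and Cb."""
    return [max(za, zb) for za, zb in zip(fa, fb)]

def target_might_be_in(target, cluster):
    """True if the cluster's object file could contain the target."""
    return all(zt <= zc for zt, zc in zip(target, cluster))

# Illustrative example with two features (say, redness and size):
ca = [0.9, 0.2]   # cluster Ca: intensely red, small items
cb = [0.3, 0.8]   # cluster Cb: dull red, large items
merged = merge_object_files(ca, cb)        # [0.9, 0.8]

target = [0.5, 0.5]
print(target_might_be_in(target, merged))  # True: search this cluster
print(target_might_be_in(target, ca))      # False: Ca can be rejected
```

Note that the test is one-sided: a cluster of weak intensities can never pass for a stronger target, which is what produces the pop-out asymmetry discussed below.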
This change would allow the model to deal with searches where the target is more or less intense along some feature dimension, e.g., redness. It suggests that a more intense target would pop out relative to less intense distractors, but not vice versa. However, feature intensities would inevitably be represented with error. Thus the decision about whether a target might be in a cluster is a form of intensity discrimination task. The theory behind such tasks is signal detection theory, and this suggests that many of the ideas in SDT that have been applied to visual search (Eckstein et al., 2000; Palmer et al., 1993; Verghese, 2001; Vincent, 2011) could, in the future, be integrated into a cluster search procedure.
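One hedged sketch of how such SDT ideas might enter the cluster comparison, assuming equal-variance Gaussian noise on each feature readout (the noise level sigma, the criterion, and all names below are illustrative assumptions, not part of the model):

```python
import random
from statistics import NormalDist

def noisy_readout(z, sigma, rng=random):
    """A feature intensity observed through Gaussian noise."""
    return z + rng.gauss(0.0, sigma)

def might_contain(target, cluster, sigma, criterion=0.0):
    """Noisy version of the membership test: accept the cluster if every
    observed cluster intensity exceeds the observed target intensity minus
    a criterion. Larger criteria trade misses for false alarms."""
    return all(noisy_readout(zc, sigma) >= noisy_readout(zt, sigma) - criterion
               for zt, zc in zip(target, cluster))

def d_prime(z_target, z_cluster, sigma):
    """Discriminability of one feature difference under equal-variance SDT."""
    return (z_cluster - z_target) / sigma

def accept_probability(z_target, z_cluster, sigma, criterion=0.0):
    """Probability that the noisy one-feature test accepts the cluster:
    the difference of two independent Gaussian readouts has standard
    deviation sigma * sqrt(2)."""
    return NormalDist().cdf((z_cluster - z_target + criterion)
                            / (sigma * 2 ** 0.5))
```

On this sketch, rejecting a cluster becomes probabilistic rather than certain, which is one route to the realistic miss errors discussed in the concluding paragraph.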
To conclude, this model is a good but incomplete model of visual search. It is primarily about the process that might drive the selection of items in search, and the other aspects of the model are sketchy or nonexistent. Thus there are a number of things the model does not do well. First, it makes no errors. Given that humans make relatively few errors under most search conditions, this is a problem but not a fatal one. It is possible that weakening some of the assumptions underlying search termination would introduce a realistic level of miss errors, but there is no obvious mechanism for false alarms. Second, the distribution of reaction times is poorly specified. The model is good at predicting average reaction times, but the overall distribution contains important information about search processes (Wolfe, 1998). Third, although the clustering process in the model relies on a measure of distance between items, that measure is incompletely specified. Fortunately, the Euclidean distance appears to be the dominant component of the true distance measure, so using the Euclidean distance yields realistic results. More interesting results could be obtained by implementing a distance measure based upon, say, edge co-occurrence (Geisler, Perry, Super, & Gallogly, 2001), which could lead to a unification of search models with models of perceptual organization. Finally, the model, like many others, assumes that a set of features has been defined and extracted from the image before search commences. To some extent, this is not a serious problem: regardless of the features chosen, the model predicts search asymmetry and the effects of feature conjunction. However, the occurrence of search asymmetry or feature conjunction cannot be predicted from a raw image until the features are specified in enough detail that they can be extracted from the image. This is a promising avenue for future research.