Abstract
Goal: Psychophysics (e.g., Rivest and Cavanagh, 1996) has shown that humans combine multiple cues to detect and localize boundaries in images. We use a dataset of natural images to learn the optimal combination of local brightness, texture, and color cues, and to quantify the relative power of each cue.

Methods: Cue combination is formulated as supervised learning. A large dataset (∼1000) of natural images, each segmented by multiple human observers (∼10), provides the ground-truth label for each pixel: whether or not it contains an oriented boundary element. The task is to model the posterior probability that a pixel lies on a boundary, at a particular orientation, conditioned on local features derived from brightness, texture, and color. Our features are based on computing directional gradients of the outputs of V1-like mechanisms. Texture gradients are computed as differences between histograms of oriented filter outputs, and color gradients as differences between histograms of a* and b* features in CIE L*a*b* space. Several classifiers, ranging from logistic regression to support vector machines, were trained. Performance was evaluated on a separate test set using a precision-recall curve, a variant of the ROC curve. This curve can be summarized by its optimal F-measure, the harmonic mean of precision and recall.

Results: (1) The precise form of the classifier does not matter: equally good results were obtained with logistic regression (a weighted linear combination of features) as with more complicated classifiers. (2) Individually, brightness, texture, and color yield F-measures of 0.62, 0.61, and 0.60, respectively. The optimal gray-scale combination of brightness and texture has an F-measure of 0.65, and adding color boosts it to 0.67. These results indicate that the different cues are correlated but carry independent information. Inter-human consistency gives a gold-standard F-measure of 0.8, quantifying the gap left for more global and high-level processing.
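The texture and color gradients above are described as differences between histograms computed on either side of a candidate boundary. As a minimal sketch, the snippet below compares two normalized histograms with the chi-squared distance; this particular distance function is an assumption for illustration, since the abstract only specifies "differences in histograms".

```python
import numpy as np

def chi_squared_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two histograms after normalization.

    NOTE: the chi-squared form is an illustrative assumption; the abstract
    only says gradients are computed as differences in histograms.
    """
    # Normalize each histogram to sum to one.
    h1 = h1 / (h1.sum() + eps)
    h2 = h2 / (h2.sum() + eps)
    # 0.5 * sum of squared bin differences over bin sums (in [0, 1]).
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

# Hypothetical histograms of oriented filter responses on the two
# half-neighborhoods of a pixel; a large distance suggests a texture boundary.
left = np.array([1.0, 2.0, 3.0])
right = np.array([3.0, 2.0, 1.0])
gradient = chi_squared_distance(left, right)
```

The same comparison would apply to histograms of a* and b* values for the color gradient.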
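The logistic-regression classifier mentioned in the results models the boundary posterior as a sigmoid of a weighted linear combination of the cue features. A minimal sketch, with placeholder feature and weight values (the actual learned weights are not given in the abstract):

```python
import math

def boundary_posterior(features, weights, bias):
    """Logistic model of P(boundary | local cue features).

    Computes a weighted linear combination of the features and maps it
    through a sigmoid to a probability. All numeric values used with
    this function below are illustrative placeholders.
    """
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical brightness, texture, and color gradient features at a pixel.
p = boundary_posterior([0.3, 0.7, 0.1], [1.0, 1.5, 0.8], -1.0)
```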
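The evaluation summarizes a precision-recall curve by its optimal F-measure, the harmonic mean of precision and recall. A short sketch (the precision/recall values below are illustrative, not the paper's data):

```python
import numpy as np

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative points along a precision-recall curve; the curve is
# summarized by the maximal F-measure over its points.
precisions = np.array([0.9, 0.8, 0.7, 0.6])
recalls = np.array([0.3, 0.5, 0.6, 0.7])
best_f = max(f_measure(p, r) for p, r in zip(precisions, recalls))
```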