Abstract
Goal: Wertheimer (1923) proposed visual similarity as a key grouping factor, but a precise definition has proved elusive. We formalize similarity by designing a function W(i,j) whose value is the probability that points i and j belong to the same visual group. Our goal is to learn an optimal functional form for W(i,j) based on brightness, texture, and color measurements, and to quantify the relative power of these cues.

Methods: A large dataset of natural images (∼1000), each segmented by multiple human observers (∼10), provides the ground truth S(i,j) for pairs of pixels: S(i,j) = 1 if the pair lies in the same segment and 0 otherwise. S(i,j) serves as the target function for W(i,j). We consider both region and boundary cues for computing W(i,j). Region cues are based on brightness, color, and texture differences between image patches at i and j, each characterized by histograms of the outputs of V1-like mechanisms: oriented filter responses are used for texture, and the a*, b* coordinates of CIE L*a*b* space for color. Boundary cues are incorporated by looking for the presence of an “intervening contour”, a large gradient (in brightness, texture, or color) along the straight line connecting the two pixels. The parameters of the patch and gradient features are calibrated on the human-segmented images. Performance is evaluated on a separate test set using precision-recall curves as well as the mutual information between W(i,j) and S(i,j) for the various cues.

Results: For brightness, gradients yield better results than patch differences; for color, however, patches outperform gradients. Texture is the single most powerful cue, with both patches and gradients carrying significant independent information. The mutual information between S(i,j) and W(i,j) using all similarity cues is 0.19 nats, just 0.06 nats short of that between different human subjects. Proximity of the two pixels adds no information beyond that provided by the similarity cues.
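As a rough illustration of how such a W(i,j) might be assembled, the Python/NumPy sketch below combines a chi-squared distance between patch histograms (a region cue) with an intervening-contour cue (a boundary cue), mapped to a probability by a logistic model. All function names, patch sizes, and weights here are hypothetical placeholders for exposition, not the calibrated parameters described above, which are fit to the human-segmented training images.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalized histograms (region cue)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def patch_histogram(responses, pixel, radius=7, bins=16):
    """Normalized histogram of filter responses in a square patch around `pixel`.
    `responses` is an (H, W) map of one V1-like mechanism's output."""
    r, c = pixel
    patch = responses[max(r - radius, 0):r + radius + 1,
                      max(c - radius, 0):c + radius + 1]
    hist, _ = np.histogram(patch, bins=bins,
                           range=(responses.min(), responses.max()))
    return hist / max(hist.sum(), 1)

def intervening_contour(gradient_map, i, j, n_samples=32):
    """Largest gradient magnitude sampled along the straight line from i to j:
    a strong intervening contour suggests the pixels lie in different groups."""
    rows = np.linspace(i[0], j[0], n_samples).round().astype(int)
    cols = np.linspace(i[1], j[1], n_samples).round().astype(int)
    return gradient_map[rows, cols].max()

def similarity_W(features, weights, bias):
    """Logistic model mapping cue values to P(same group). In the study the
    functional form is learned from S(i,j); these parameters are stand-ins."""
    return 1.0 / (1.0 + np.exp(-(np.dot(weights, features) + bias)))

# Hypothetical demo on a synthetic brightness channel.
rng = np.random.default_rng(0)
bright = rng.random((64, 64))            # stand-in brightness image
gy, gx = np.gradient(bright)
grad_bright = np.hypot(gx, gy)           # gradient magnitude map

i, j = (10, 10), (40, 40)
features = np.array([
    chi2_distance(patch_histogram(bright, i), patch_histogram(bright, j)),
    intervening_contour(grad_bright, i, j),
])
# Negative weights: larger cue values (bigger histogram differences, stronger
# intervening contours) should lower the probability of sharing a group.
w_ij = similarity_W(features, weights=np.array([-2.0, -3.0]), bias=1.5)
print(f"W(i,j) = {w_ij:.3f}")
```

Evaluation of such a model against S(i,j), via precision-recall curves or mutual information as described above, would proceed by scoring W(i,j) over many sampled pixel pairs in held-out test images.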