The initial stages of visual analysis are mediated by neurones that are sensitive primarily to structure falling within a restricted region of visual space known as their classical receptive field (e.g., Hubel & Wiesel, 1968, 1977). For example, in primate visual cortex, neurones in area V1 representing central (foveal) vision have orientation-selective classical receptive fields that often cover less than 1.0 deg². Optical imaging studies (e.g., Bonhoeffer & Grinvald, 1991) have revealed that primary visual cortex generates a retinotopic (local) representation of contour orientation. The local image structure that best drives V1 neurones is typically simple in spatial form: perhaps simple enough to be well approximated by local elongated structures varying in just orientation and spatial scale. If so, a set of templates (e.g., filters or receptive fields) can be used to create image descriptions of such local structure.
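As an illustration of what such a template set might look like (a minimal sketch; the function name and all parameter choices are ours, not drawn from the studies cited above), each template is a sinusoidal carrier windowed by a Gaussian envelope, and the bank varies only in orientation and spatial scale:

```python
import numpy as np

def gabor(size, wavelength, orientation, sigma, phase=0.0):
    """An oriented Gabor template: a sinusoidal carrier windowed by a
    Gaussian envelope; a common idealization of a V1 classical
    receptive field (illustrative parameterization only)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(orientation) + y * np.sin(orientation)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength + phase)
    return envelope * carrier

# A small bank varying in just orientation and spatial scale; the local
# image description is then the dot product of each template with an
# image patch centred on the location of interest.
bank = [gabor(31, wl, th, sigma=wl / 2.0)
        for wl in (4.0, 8.0, 16.0)
        for th in np.linspace(0.0, np.pi, 8, endpoint=False)]
```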
These strictly local measurements must subsequently be linked, as appropriate, across space to encode the overall structure of spatially extensive objects in the visual scene. Since these larger scale structures are essentially unconstrained, the linking processes must be similarly unconstrained. It is important that the process is capable of generating completely novel outputs (Ullman, 1984; Watt & Phillips, 2000), as this indicates the need for processes that are not readily modelled as templates.
Longer range interactions between local sensors have been implicated in a wide range of perceptual grouping phenomena. These include global stereoscopic form detection (Julesz, 1971), global motion perception (Chang & Julesz, 1984, 1985; Williams, Phillips, & Sekuler, 1986), contextual modulation of contrast sensitivity (Polat & Sagi, 1993), image segmentation (Kovács & Julesz, 1993), and detection of extended motion trajectories in noise (Bex, Simmers, & Dakin, 2003; Verghese, Watamaniuk, McKee, & Grzywacz, 1999).
The issue of how local orientation measurements are combined across space to describe extended spatial contours has received considerable interest in the last decade or so, driven in no small part by the development of the contour-detection paradigm of Field, Hayes, and Hess (1993). Here, the observer's psychophysical task is to detect the presence of a smoothly curved contour (path), composed of a set of spatially separated oriented Gabor patches (elements), in an array of similar but randomly oriented background elements. Performance on this task depends on many stimulus parameters that have been summarized elsewhere (for reviews, see Hess & Field, 1999; Kovács, 1996), but the main findings are described below.
The most widely considered stimulus parameter is the contour curvature (path angle): the change in orientation between adjacent Gabor elements. For example, Field et al. (1993) report that detection performance is best (∼100% correct) for straight paths and declines gradually as the curvature of the target contour increases, reaching chance levels for path angles greater than about 60°. However, this upper limit on performance can vary between 30° and 60° depending on various stimulus factors.
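As a concrete illustration of the path-angle manipulation (our sketch, not the authors' stimulus-generation code; in Field et al., 1993, elements are in fact placed at the midpoints of invisible path segments, a detail omitted here), a path can be built by stepping a fixed distance and rotating the direction of travel by the path angle at each step:

```python
import numpy as np

def make_path(n_elements, path_angle_deg, separation, element_offset_deg=0.0):
    """Generate positions and orientations for a smoothly curved path of
    Gabor elements.  At each step the direction of travel changes by
    +/- path_angle_deg (sign chosen at random).  element_offset_deg
    rotates every element relative to the local contour direction
    (0 = 'snakes', 90 = 'ladders', 45 = oblique)."""
    rng = np.random.default_rng()
    direction = rng.uniform(0.0, 2.0 * np.pi)   # initial contour direction
    positions = [np.zeros(2)]
    directions = [direction]
    for _ in range(n_elements - 1):
        direction += np.deg2rad(path_angle_deg) * rng.choice([-1.0, 1.0])
        step = separation * np.array([np.cos(direction), np.sin(direction)])
        positions.append(positions[-1] + step)
        directions.append(direction)
    orientations = np.asarray(directions) + np.deg2rad(element_offset_deg)
    return np.asarray(positions), orientations

# e.g., a 10-element aligned ('snake') path with 20 deg of curvature per step:
pos, ori = make_path(10, path_angle_deg=20.0, separation=1.5)
```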
Performance on the psychophysical task is highly dependent on the density of elements, at least for moderately and highly curved contours (Li & Gilbert, 2002; Pennefather, Chandna, Kovács, Polat, & Norcia, 1999), and fails at inter-element separations of about 4 to 6 times the Gabor wavelength (Kovács & Julesz, 1993).
Observers' ability to detect the presence of a target also depends upon the orientation of the Gabor elements with respect to the local orientation of the contour they form (Bex, Simmers, & Dakin, 2001; Field et al., 1993; Ledgeway, Hess, & Geisler, 2005). Performance is best when the element orientations are aligned with the local contour orientation (snakes), moderately worse when they are orthogonal to it (90°, ladders), and worst when they are oriented obliquely (45°) with respect to the contour. All three stimuli provide the same basic statistical quality of cue for observers; that snakes are nonetheless easier to detect than the others suggests that the mechanism involved is specialized for contour integration.
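In terms of the illustrative make_path sketch above, the three conditions differ only in a fixed offset added to each element's orientation while the path geometry is held constant:

```python
# Same path geometry throughout; only the element orientations change.
snake_pos, snake_ori = make_path(10, 20.0, 1.5, element_offset_deg=0.0)
ladder_ori = snake_ori + np.deg2rad(90.0)    # 'ladders': orthogonal elements
oblique_ori = snake_ori + np.deg2rad(45.0)   # obliques: hardest to detect
```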
Field et al. (1993) interpreted their original results in terms of an “association field” that combines responses from neighboring local filters or receptive fields tuned to similar orientations. The notion is that there is an association field about each element that produces a scalar association output, which increases with the amount of nearby correctly oriented and aligned contour structure. The strength of the association is greatest when nearby elements are collinear and reduces with increasing distance, curvature, and misalignment from co-circularity. Stated in this way, the association field does not make explicit the relationship between elements and does not identify contours. However, elements that are part of a smooth contour will tend to have higher associations.
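A minimal sketch of such a pairwise association follows. It illustrates the general idea rather than Field et al.'s actual model: the Gaussian fall-offs, the parameter values, and the treatment of orientations as directed rather than axial are all our simplifying assumptions.

```python
import numpy as np

def association_strength(p_i, theta_i, p_j, theta_j, sigma_d=2.0,
                         sigma_c=np.deg2rad(60.0), sigma_a=np.deg2rad(30.0)):
    """Pairwise association: greatest for nearby collinear pairs, falling
    off with distance, curvature, and misalignment from co-circularity."""
    v = np.asarray(p_j, dtype=float) - np.asarray(p_i, dtype=float)
    d = np.linalg.norm(v)                 # inter-element distance
    phi = np.arctan2(v[1], v[0])          # direction of the connecting line
    # Departure of each element's orientation from the connecting line;
    # for a co-circular pair these departures are equal and opposite.
    a_i = np.arctan2(np.sin(theta_i - phi), np.cos(theta_i - phi))
    a_j = np.arctan2(np.sin(theta_j - phi), np.cos(theta_j - phi))
    distance_term = np.exp(-d**2 / (2.0 * sigma_d**2))
    curvature_term = np.exp(-(a_i**2 + a_j**2) / (2.0 * sigma_c**2))
    cocircularity_term = np.exp(-(a_i + a_j)**2 / (2.0 * sigma_a**2))
    return distance_term * curvature_term * cocircularity_term

# The scalar association output for one element is then simply the sum
# of its pairwise associations with all other elements within range.
```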
There are, however, a number of potential limitations to the association field concept as a general model of contour description in vision. First, the structure of the association field may be over-constrained. It is essentially a template for stimuli that have been found to be detectable and, as such, may be more descriptive of experimental results than explanatory in nature. There could be many other templates (not necessarily so closely modelled on psychophysically detectable stimuli) that are equally able to distinguish smooth contours from noisy backgrounds by virtue of some combination of the differences between contour and background.
Second, one fundamental property of the association field, as originally conceived by Field et al. (1993), is that adjacent elements are strongly linked only if they satisfy the joint constraints of position and orientation along smooth curves. There is, however, evidence that contours are readily detected when defined by other types of position and orientation relationship. For example, a single association field cannot explain the (non-monotonic) patterns of performance found when the orientations of the Gabor elements along a path are systematically misaligned with respect to the axis of the contour they depict (e.g., Ledgeway et al., 2005). The notion of composite association fields, simultaneously sensing multiple different contour forms, may appear to deal with this, but such models cannot subsequently distinguish which pattern they have responded to. The consequence is that multiple association fields with different orientation and positional properties (e.g., Yen & Finkel, 1998) are required, each essentially a further template.
Although there is some suggestion that the shapes of contours in natural images may be somewhat statistically constrained (Geisler, Perry, Super, & Gallogly, 2001; see also Sigman, Cecchi, Gilbert, & Magnasco, 2001), there is in principle no limit on the form of a contour in an image. If the range of possible contour shapes is unlimited, then template methods of contour description are difficult and generative descriptive methods are more appropriate. A generative process is like a language: it uses a small number of symbols (representing, say, local piecewise straight segments of contours), combined by a small number of rules (such as “connected to”), in unlimited numbers, to achieve an infinite descriptive capacity. Any continuous line can be reasonably described by a concatenation of local pieces. Watt (1991) has described an image description language of this type.
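The following sketch conveys the flavour of such a generative scheme (it is our illustration, not Watt's, 1991, actual language): one symbol type, a local straight piece, and one combining rule, “connected to”, suffice to describe any polyline, however long or irregular:

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    """One symbol of the language: a local straight piece of contour."""
    x: float
    y: float
    orientation: float                   # radians
    length: float
    next: Optional["Segment"] = None     # the combining rule: 'connected to'

def describe(polyline):
    """Describe an arbitrary polyline as a chain of local pieces.
    Because pieces can be concatenated without limit, the scheme is
    generative: its descriptive capacity is not bounded by any fixed
    set of templates."""
    head, prev = None, None
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        seg = Segment(x0, y0, math.atan2(y1 - y0, x1 - x0),
                      math.hypot(x1 - x0, y1 - y0))
        if prev is None:
            head = seg
        else:
            prev.next = seg
        prev = seg
    return head
```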
Consequently, there is a need to consider alternative schemes that may also be able to account for human performance measured with the path-detection paradigm. The local output of an association field is a measure of how good the relationship is between the element at that point and any others that fall within the association field's range. In effect, it combines two logically distinct operations: linking an element to those around it and simultaneously assessing the goodness of the resultant contour configuration of all linked elements. In what follows, we compare this type of model with a link-then-describe type of model.
In a link-then-describe model, the two types of spatial relation, relative position and relative orientation, could be used to constrain either the link or the description. The aim of the present study is to investigate the separate contributions of each of these two processes to performance. The results of psychophysical experiments are compared to the performance of three different families of model, each using constraints based on a different combination of element position and orientation information. The comparison illuminates the relative significance of the different information sources.
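A minimal sketch of this decomposition is given below. It is an illustration of the distinction only; the placement of the constraints and all names are our own choices, with position alone constraining the link stage and orientation alone constraining the description stage:

```python
import numpy as np

def link_by_proximity(positions, radius):
    """Stage 1 (link): connect each element to every neighbour within
    'radius'.  Only relative position is used; orientation plays no
    part in forming the links."""
    positions = np.asarray(positions, dtype=float)
    links = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if np.linalg.norm(positions[i] - positions[j]) <= radius:
                links.append((i, j))
    return links

def describe_smoothness(chain, orientations):
    """Stage 2 (describe): assess a linked chain of elements by the total
    change in orientation along it; smooth contours score low, jagged
    configurations score high."""
    o = np.unwrap([orientations[i] for i in chain])
    return float(np.sum(np.abs(np.diff(o))))
```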
Specifically, three families of models are considered. The first family, association fields, has selectivity both for where elements are with respect to each other and for what form their combination makes: a part of a circle or something similar. The other two families have selectivity only for where elements are with respect to each other, regardless of the resultant form. In the light of the computational need for generativity and for the ability to recognize a wide range of forms, the second and third families of models might be computationally more desirable.
To anticipate the results, we find that all of the models exhibit similar performance to human observers, despite their qualitative differences. In this regard, there is at this stage little empirical basis on which to select one type of model over another, other than a general preference for simpler models.