We based our model on recent models of disparity extraction that use a population of local cross-correlators to compare information between the two eyes' views (Banks et al., 2004; Filippini & Banks, 2009; Goutcher & Hibbard, 2014; Nienborg et al., 2004; see also Ohzawa, DeAngelis, & Freeman, 1990). This style of model has primarily been used to explore stereo resolution, the smallest spatial variations in depth that can be discriminated, and thus the focus has been on finding the smallest useful correlation window. Recently, data showing that human observers can perceive depth from sinusoidal oscillations as veridically as that from square-wave oscillations (Allenmark & Read, 2010) have provided a challenge to the original model (Banks et al., 2004). A modified model developed by Allenmark and Read (2011), in which larger correlation windows are used to detect larger binocular disparities, exploits the so-called "size-disparity correlation" (Smallman & MacLeod, 1994) and can better account for human performance. Smallman and MacLeod found that optimal disparities were small for high-spatial-frequency information and larger for low-spatial-frequency information, leading to the idea that large-scale (low spatial frequency) receptive fields are involved in processing large disparities, and small-scale (high spatial frequency) receptive fields are involved in processing small disparities (e.g., see Filippini & Banks, 2009; Harris, McKee, & Smallman, 1997).
Here we made the assumption that large correlation windows would specialize in processing low spatial frequency information (and vice versa for small windows). Note that we implemented this assumption in a different way from Allenmark and Read (2011).
We started with a pair of images containing a blank background of intensity 0.5 and disparate line elements of intensity 1, generated in the same way as for our experiments. Images were 230 × 230 pixels, and the disparity separation between the bounding planes was 16 pixels. Here we chose to implement a simple cross-correlation model that contains some of the key features of these other models. Some models (Allenmark & Read, 2010, 2011; Filippini & Banks, 2009) have aimed to emulate the front end of the visual system as closely as possible, filtering images to account for the eye's optics, because they were primarily used to address questions about the limits of stereo resolution. Goutcher and Hibbard (2014) used a correlation-based disparity extraction model to study depth perception from ambiguous random dot stereograms. To model their data, they required an initial spatial frequency filtering stage, followed by cross-correlation. We adopted that idea here, combining an initial spatial frequency filtering stage with the constraint that larger cross-correlation windows are used for lower ranges of spatial frequency, in line with the size-disparity correlation. Goutcher and Hibbard used a single, large correlation window in their model, but their stimuli contained constant disparity across the scene, making a large correlation window optimal. That would clearly not be optimal for our volume stimuli, which contain elements at many different depths.
We implemented bandpass filtering of our images, IL(x,y), in the same way as Goutcher and Hibbard (2014). We conducted bandpass filtering using a filter of bandwidth b around a central spatial frequency fc. To achieve this, the Fourier amplitude spectrum of the image was multiplied by a bandpass mask.
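Assuming an ideal (all-or-none) annular mask over radial spatial frequency \(f=\sqrt{f_x^2+f_y^2}\), such a mask takes the form

\[
M(f_x,f_y)=\begin{cases}1, & f_c\,2^{-b}\le f\le f_c\,2^{\,b}\\ 0, & \text{otherwise,}\end{cases}
\]

so that only components within b octaves of the center frequency fc are retained.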
Here we chose the bandwidth, b, to be ±1 octave. For each spatial scale that we explored, a window size (wsize) was chosen for the cross-correlation. We next needed to choose a relationship between window size and the frequency band being explored. To fit with the size-disparity correlation, there should be an inverse relationship between center frequency and window size. The central frequency of the bandpass filter was chosen to have one of four relationships: f = 0.25/wsize, f = 0.5/wsize, f = 1/wsize, or f = 2/wsize. In the Results section, we discuss the implications of these choices.
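As an illustration, here is a minimal numpy sketch of this filtering stage, assuming the ideal annular mask above; the function name bandpass_filter, the example window size, and the random placeholder image are ours, not the authors'.

```python
import numpy as np

def bandpass_filter(img, fc, b=1.0):
    """Multiply the image's Fourier spectrum by an annular mask passing
    radial frequencies within +/- b octaves of fc (cycles/pixel).
    An illustrative reconstruction; the published mask may differ."""
    ny, nx = img.shape
    fy = np.fft.fftfreq(ny)[:, None]   # vertical frequency of each row
    fx = np.fft.fftfreq(nx)[None, :]   # horizontal frequency of each column
    f = np.hypot(fx, fy)               # radial spatial frequency
    mask = (f >= fc * 2.0 ** -b) & (f <= fc * 2.0 ** b)
    return np.real(np.fft.ifft2(np.fft.fft2(img) * mask))

# The four centre-frequency/window-size relationships described in the text:
wsize = 32                             # example window size (pixels)
image = np.random.rand(230, 230)       # placeholder for IL(x,y)
filtered = {k: bandpass_filter(image, fc=k / wsize) for k in (0.25, 0.5, 1.0, 2.0)}
```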
Examples of these prefiltered images are shown in Figure 5a. At this point, we also added random monocular luminance noise of intensity 0.001, independently to the right and left images. We then ran a cross-correlator of window size wsize across the left and right images at each pixel location (x,y) within the images. The windows for the two eyes were at the same vertical position but could be at different horizontal positions. For each location (x,y) in the image, we held the left-eye window at that location. The correlation window, Lw, was defined as the set of image values ILf(i,j) such that |x − i| < wsize/2 and |y − j| < wsize/2. The right-eye window was at the same vertical location as the left-eye window but could have a horizontal offset, or disparity, disp. The correlation window for this eye, Rw, was defined as the set of image values IRf(i,j) such that |x + disp − i| < wsize/2 and |y − j| < wsize/2.
The correlation for any disparity, disp, was then defined as a normalized cross-correlation between the two windows, Lw and Rw.
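Assuming the standard windowed Pearson form, this is

\[
C(x,y,disp)=\frac{\sum_{i,j}\big(Lw(i,j)-\overline{Lw}\big)\big(Rw(i,j)-\overline{Rw}\big)}{\sqrt{\sum_{i,j}\big(Lw(i,j)-\overline{Lw}\big)^{2}\;\sum_{i,j}\big(Rw(i,j)-\overline{Rw}\big)^{2}}},
\]

where the sums run over corresponding window locations (paired by the offset disp) and \(\overline{Lw}\), \(\overline{Rw}\) are the window means.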
We can think of the function C(x,y,disp) as representing the output of a set of disparity detectors, each centered at location (x,y) and tuned to disparity disp.
The next choice was the range of disparities to test. In line with the size-disparity correlation described above, we chose a range proportional to the window size, in this case twice the window size. Hence, for any image location (x,y), there were 2 × wsize correlations recorded, with possible disparity values from −wsize to +wsize (illustrated in the middle panel of Figure 5a).
We then used the output of the correlator to decide what disparity should be represented at each location (x,y). Figure 5b illustrates an example, showing correlation as a function of disparity across the chosen range. We found the peak of this correlation function and chose the disparity corresponding to that peak to represent the disparity between left and right images at location (x,y). This is akin to choosing the peak response from a population of disparity-tuned neurons.
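A minimal sketch of this correlate-and-pick-the-peak step, assuming the Pearson correlation given above; the function name disparity_at is ours, and this is an illustration rather than the authors' code.

```python
import numpy as np

def disparity_at(ILf, IRf, x, y, wsize):
    """For one location (x, y): correlate the fixed left-eye window with
    right-eye windows offset by each candidate disparity, then return the
    disparity at the peak of the correlation function."""
    h = wsize // 2
    Lw = ILf[y - h:y + h, x - h:x + h].ravel()   # left-eye window
    disps = np.arange(-wsize, wsize + 1)         # range proportional to wsize
    corr = np.full(disps.size, -np.inf)
    for k, d in enumerate(disps):
        xr = x + d                               # shifted right-eye window centre
        if xr - h < 0 or xr + h > IRf.shape[1]:
            continue                             # window would fall off the image
        Rw = IRf[y - h:y + h, xr - h:xr + h].ravel()
        corr[k] = np.corrcoef(Lw, Rw)[0, 1]      # Pearson correlation
    return disps[np.argmax(corr)]                # peak-picked disparity
```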
Our aim here was to represent disparity across the whole image, so we repeated the correlation process for all locations (x,y). The family of disparities produced can be presented as a histogram showing the frequency of occurrence of each disparity.
Figure 5c shows an example histogram for a stimulus composed of a pair of planes, located at disparities −8 and +8 pixels. Because of the restricted window size, range, and filtering, the histogram does not show disparities only at those values but instead shows a broad range with noticeable peaks at −8 and +8.
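Building on the sketches above, the full-image pass and the histogram might look as follows; the coarse sampling grid and the random placeholder images are ours, added to keep the sketch fast and self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-ins for the filtered left/right image pair
ILf = bandpass_filter(np.random.rand(230, 230), fc=1 / 32)
IRf = bandpass_filter(np.random.rand(230, 230), fc=1 / 32)
wsize = 32
margin = wsize                        # keep every window fully on the image

# Sweep locations (on a coarse grid for speed) and keep the winning disparity
disp_map = np.array([[disparity_at(ILf, IRf, x, y, wsize)
                      for x in range(margin, 230 - margin, 4)]
                     for y in range(margin, 230 - margin, 4)])

# Histogram of extracted disparities (cf. Figure 5c)
plt.hist(disp_map.ravel(), bins=np.arange(-wsize, wsize + 2) - 0.5)
plt.xlabel('disparity (pixels)')
plt.ylabel('frequency')
plt.show()
```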
Finally, the overall goal was to model how human observers might represent the depth of the volume of elements. To do this, one must choose a decision rule for the model to implement. In other words, how is the distribution of disparities, such as that in Figure 5c, used to decide which of two stimuli has the deeper volume? We chose the simplest possible decision rule that avoids prior knowledge of the actual image disparity distribution: we recorded the thickness of the distribution as twice the standard deviation of the distribution of disparities delivered by the disparity-extraction stage of the model (arrowed line in Figure 5c).
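A sketch of this decision stage, using the disparity maps produced above; the two-interval comparison at the end is one illustrative reading of the rule, not a quote from the paper.

```python
import numpy as np

def thickness(disp_map):
    """Thickness of the recovered volume: twice the standard deviation of
    the disparities delivered by the disparity-extraction stage."""
    return 2.0 * np.std(np.asarray(disp_map, dtype=float))

# Two-interval decision: the stimulus yielding the larger thickness is
# judged to contain the deeper volume, e.g.:
# deeper = 'B' if thickness(disp_map_B) > thickness(disp_map_A) else 'A'
```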