Depth discrimination from occlusions in 3D clutter
Michael S. Langer, Haomin Zheng, Shayan Rezvankhah
Journal of Vision, September 2016, Vol. 16(11):11. doi:https://doi.org/10.1167/16.11.11
Abstract

Objects such as trees, shrubs, and tall grass consist of thousands of small surfaces that are distributed over a three-dimensional (3D) volume. To perceive the depth of surfaces within 3D clutter, a visual system can use binocular stereo and motion parallax. However, such parallax cues are less reliable in 3D clutter because surfaces tend to be partly occluded. Occlusions provide depth information, but it is unknown whether visual systems use occlusion cues to aid depth perception in 3D clutter, as previous studies have addressed occlusions for simple scene geometries only. Here, we present a set of depth discrimination experiments that examine depth from occlusion cues in 3D clutter, and how these cues interact with stereo and motion parallax. We identify two probabilistic occlusion cues. The first is based on the fraction of an object that is visible. The second is based on the depth range of the occluders. We show that human observers use both of these occlusion cues. We also define ideal observers that are based on these occlusion cues. Human observer performance is close to ideal using the visibility cue but far from ideal using the range cue. A key reason for the latter is that the range cue depends on depth estimation of the clutter itself, which is unreliable. Our results provide new fundamental constraints on the depth information that is available from occlusions in 3D clutter, and on how occlusion cues are combined with binocular stereo and motion parallax cues.

Introduction
The human visual system is said to have evolved over millions of years, predominantly in cluttered three-dimensional (3D) environments such as forests and grasslands. Such environments contain objects such as trees, shrubs, and tall grasses that consist of thousands of individual surfaces scattered in 3D space. Such 3D clutter creates occlusions that reduce object visibility (Changizi & Shimojo, 2008). A key consequence is that depth from binocular stereo and motion parallax cues is less reliable in 3D clutter, since the reduced visibility makes it more difficult and sometimes impossible to solve the associated correspondence problems. 
Previous studies of depth perception in 3D clutter have addressed the clutter itself. For example, many studies using random dot or random line stimuli have asked how many discrete depth planes can be perceived using binocular stereo (Akerstrom & Todd, 1988; Tsirlin, Allison, & Wilcox, 2008) or motion parallax (Andersen, 1989). Others have asked how well the visual system can perceive the depth-to-width ratio of the 3D clutter (van Ee & Anderson, 2001; Harris, 2014). These studies have provided key insights into depth perception from binocular stereo and motion parallax cues in 3D cluttered scenes. However, these studies are incomplete since they use only points or lines for the clutter rather than surfaces and, as such, they ignore occlusion effects that are very important in 3D cluttered scenes. 
In this paper we consider 3D cluttered scenes that consist of 2D surface elements that are randomly placed in a volume and that produce a significant amount of occlusion. We examine the depth cues that are provided by the occlusions and how well observers use these depth cues. Specifically, we examine how well observers can discriminate the depths of identifiable surfaces that are located within the 3D clutter. 
Consider the examples shown in Figure 1. Each scene consists of a large number of random distractors that define the clutter, along with two identifiable objects (targets). The distractors are random gray-colored squares that are distributed over a cube volume. The two targets are red rectangles that lie at different depths within the clutter. Figure 1a shows an example of two short bar targets in the left and right half of the clutter, and Figure 1b shows an example of two long bar targets in the upper and lower halves of the volume. 
Figure 1
Examples of the 3D cluttered scenes. (a) Short bar targets; (b) long bar targets.
We present experiments that examine how well observers can discriminate the depth of such targets. The observer's task is to decide which of the two targets is closer to the observer. Our main goal is to understand the depth information that is available from occlusions, and also how occlusion cues are combined with parallax cues, namely binocular stereo and motion parallax. Our key contribution is to identify two new occlusion-based depth cues, and to show that human observers use these cues. Classic occlusion-based depth cues define an ordering constraint only (although see Burge, Fowlkes, & Banks, 2010). The new cues that we identify are metric depth cues that arise from the random nature of 3D clutter. 
The first occlusion cue, which we call the “visibility” cue, is based on a probabilistic relationship between the depth of a target and the visibility of the target. Intuitively, as a target surface moves deeper into the clutter, it tends to be less visible, that is, more occluded (see Figure 2a). Formally, we define the visibility of a target as the fraction of the target that is visible, that is, not occluded. Assuming the elements of the 3D clutter are uniformly distributed over the cube volume, the probability that a point is visible decreases exponentially with depth in the clutter (Langer & Mannan, 2012). It follows that the expected value of the visibility of a target decreases exponentially with depth as well. Figure 2b shows the average visibility of short bar targets (Figure 1a) as a function of depth within the clutter over many randomly generated scenes. The error bars in Figure 2b show the standard deviation of visibility. Visibility is a cue for depth discrimination because, when two targets are presented, the one that is less visible is likely to be deeper. Because of the variability in the relationship between depth and visibility, however, this cue does not always produce the correct response in comparing the depths of two targets, and the visual system would benefit from using other cues as well. 
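For concreteness, the exponential falloff can be written out as follows. This is only a restatement of the relationship cited above; the symbol η is our own shorthand for a positive constant that depends on the density and mean projected size of the distractors, and the second equality follows because a target's visibility is the average of its per-point visibilities.

```latex
% Exponential visibility falloff (restatement; \eta is shorthand for a constant
% set by the distractor density and mean projected distractor size)
P(\text{point at depth } z \text{ is visible}) = e^{-\eta z},
\qquad
\mathbb{E}\left[V(z)\right] = e^{-\eta z}.
```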
Figure 2
Visibility cue. (a) As a target's depth increases, the target tends to be more occluded and hence less visible. (b) Mean and standard deviation of visibility of short bar targets for one eye's view. The plot for the long bar targets is similar (not shown) but the standard deviations are about one-third as large. (c) Blue curve shows probability that a target point is visible to a monocular observer, say the right eye. Red curve shows probability that a target point is visible to both eyes of a binocular observer. Gray curve is the ratio of these binocular to monocular probabilities, which is the conditional probability that a target point is visible to both eyes, given that it is visible to the right eye (see figure 8 of Langer & Mannan, 2012).
Binocular disparity is another cue that could be used. However, perceiving the depth of a target from binocular disparity requires that points on the target can be matched in the two eyes. In particular, disparity is defined by a position difference in the left and right eye images and so this cue requires that points are visible to both eyes. Figure 2c (red curve) shows the fraction of target points that are visible to both the left and right eyes, assuming an interocular distance of 6.5 cm and the viewing parameters defined later. If a target is near the front of the clutter, then it tends to be almost entirely visible to both eyes. As the target descends into the clutter, though, the fraction of surface points that are visible to both eyes falls off faster than the fraction that is visible to one eye. For example, near the center of the clutter (10 cm), the binocular visibility of the target is just over half that of the monocular visibility of the target. Thus, to discriminate the depths of two targets that are both near the center of the clutter using binocular disparity, the visual system would need to deal with many false matches in solving the correspondence problem, which limits performance. A similar limitation exists with motion parallax cues. 
The second occlusion cue is based on a probabilistic relationship between the depth of a target and the depths of the surfaces that occlude the target. We call it the “range” cue. The idea is that, if the depth of the occluders can be estimated using binocular stereo or motion parallax cues, then these occluder depths provide a lower bound on the depth of the target. In turn, the lower bounds for two targets constrain which target is more likely to be deeper. To illustrate, consider Figure 3a, which shows two targets that are each partly occluded by one occluder. Suppose the left target's occluder was at depth d1 cm and the right target's occluder was at depth d2 cm, where d1 < d2. Given this information about the occluders, the observer should infer that the right target is more likely to be deeper. The reason is that the right target must lie at a depth beyond d2 whereas the left target could lie at depths between d1 and d2 or beyond d2. More generally, in a cluttered scene, if an observer can perceive the depth ranges of multiple occluders of each of two targets, then the observer should infer that the deeper target is the one with the deepest occluder. 
Figure 3
Range cue. (a) If the occluder on the right is deeper, then it provides a greater lower bound on the target depth, and so the target on the right is more likely to be deeper. (b) The mean and standard deviation of the maximum depth of an occluder for a short bar target of depth Z. The corresponding plot for the long bar targets is not shown because the “max occluder” curve is nearly identical to the target curve and error bars nearly vanish.
Figure 3b shows the mean and standard deviation of the deepest occluder points of a short bar target at depth Z in a cluttered scene (see Figure 1a). There is a strong correlation between the maximum occluder depth and the target depth. Note that to use this range cue, the observer must be able to estimate accurately the depth of the occluders, for example, from binocular stereo or motion parallax cues. 
We have defined two probabilistic occlusion cues, namely visibility and range. We examine whether these two cues are used by carrying out a set of depth discrimination experiments using scenes such as those rendered in Figure 1. See Figure 4 for an illustration of the viewing situation. Each scene consists of two targets that are embedded in 3D clutter and separated in depth. Each scene is viewed using a combination of occlusion cues (visibility and range) and parallax cues (stereo and motion). For each combination of cues, we measured depth discrimination thresholds. On the one hand, we expect the presence of 3D clutter to raise thresholds, relative to a baseline condition of no 3D clutter. The reason is that the occlusions created by the 3D clutter should reduce the reliability of stereo and motion cues when they are available. On the other hand, occlusions can provide depth information via the visibility and range cues, and so we would expect thresholds to be lower when these cues are present. 
Figure 4
The observer's task is to judge which of the two red targets is closer. The 3D clutter is contained in a cube. The uniform grid of distractors illustrates that a uniform probability distribution was used. See text for details about the 3D clutter and viewing parameters.
Method
Apparatus
The rendering and control software ran on a Dell Precision T7610 workstation (Dell, Round Rock, TX) equipped with an NVIDIA Quadro 4000K graphics card (NVidia, Santa Clara, CA). Scenes were rendered in real time using a head coupled perspective model of the observer's left and right 3D eye positions (Sutherland, 1968). We used two different displays for our experiments, both of which provided binocular stereo and motion parallax capability. 
The first display was an Oculus Rift DK2 (Oculus VR, Menlo Park, CA). For this display, scenes were rendered using Unity 3D, with C# as the scripting language. The Rift comes with a motion sensing camera and accelerometer for position tracking, and a gyroscope and magnetometer for orientation tracking. For a description of the Oculus Rift DK1 version, see LaValle, Yershova, Katsev, & Antonov (2014). Observer position tracking was achieved using the Unity plugin provided in the Oculus Rift SDK. The position update rate for the Rift is 60 Hz and the display refresh rate is 75 Hz. The Rift has an organic light-emitting diode display with a resolution of 1920 × 1080 pixels (960 × 1080 per eye), and a nominal diagonal field of view of 100°. This yields about 11 pixels/°, or 5.5 arcmin separation between pixels. This resolution is relatively low; individual pixels are visible on this display. Moreover, chromatic separation of individual pixels is common. Such limitations would make this display unsuitable for precise depth discrimination experiments, but for our task this resolution was sufficient. To be sure, though, we repeated the experiment with a higher-resolution display. 
The second display was a 1080p Acer GD235HZ stereo monitor (23.6 inches; Acer, New Taipei, Taiwan) viewed through NVIDIA 3D Vision shutter glasses. At a viewing distance of 60 cm, the screen's interpixel distance was 1.55 arcmin. The screen was refreshed at 120 Hz, so the frame rate for each eye was 60 frames/s. Scenes were rendered using OpenGL. To render the scene using head coupled perspective, we used the “fishtank VR” method (Arthur, Booth, & Ware, 1993). We tracked head position and orientation using a midrange 3D Guidance trakSTAR transmitter (Ascension Technology, Shelburne, VT) with magnetic sensors, which were attached to the two arms of the 3D glasses. The position update rate was 80 Hz. The virtual eye position for rendering was set to lie along the line segment connecting the two sensors. For the binocular disparity conditions, an interocular distance of 6.5 cm was used. To achieve head coupled perspective, we measured the screen position and orientation relative to the trakSTAR coordinate system, which was defined by the magnetic field transmitter. We then combined the emitter, screen, and glasses coordinate systems and rendered each 3D scene in real time from the modeled viewpoint of the observer. 
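The off-axis projection underlying such fishtank VR rendering is standard, and a minimal sketch is given below. It assumes a screen-aligned coordinate frame with the screen in the plane z = 0 and the tracked eye at z > 0; the tracker-to-screen calibration and the per-eye offsets used in the actual experiment are not shown, and the function name and parameters are illustrative only.

```python
def off_axis_frustum(eye, screen_lo, screen_hi, near):
    # eye: (x, y, z) eye position in screen coordinates; the screen lies in z = 0,
    # x to the right, y up, and the eye looks toward the screen from z > 0.
    # screen_lo, screen_hi: (x, y) of the screen's lower-left / upper-right corners.
    # Returns (left, right, bottom, top) frustum extents at the near plane,
    # i.e., the arguments one would pass to an asymmetric frustum call such as glFrustum.
    d = eye[2]                 # perpendicular eye-to-screen distance
    s = near / d               # similar triangles: project screen corners onto the near plane
    left   = (screen_lo[0] - eye[0]) * s
    right  = (screen_hi[0] - eye[0]) * s
    bottom = (screen_lo[1] - eye[1]) * s
    top    = (screen_hi[1] - eye[1]) * s
    return left, right, bottom, top

# Example: a 52 x 29 cm screen centered at the origin, with the eye 60 cm away
# and displaced 3.25 cm to the right (half of a 6.5 cm interocular distance).
# print(off_axis_frustum((3.25, 0.0, 60.0), (-26.0, -14.5), (26.0, 14.5), near=1.0))
```

The scene is then drawn with the camera translated to the tracked eye position, once per eye in the stereo conditions.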
Stimuli
Scenes were rendered either with or without binocular stereo cues, and we refer to these two conditions as “stereo” or “mono,” respectively. For the experiment that used the Oculus Rift display, the mono condition presented the same image to both eyes in each frame. The image was rendered from the midpoint between the two eyes. For the experiment that used the fishtank VR display, the mono condition presented an image to one eye only. This was achieved by rendering both eye views, and inserting a large black square or virtual eye patch in front of the scene for one eye. 
Scenes also were rendered either with or without head coupled perspective. We refer to these as “motion” or “no motion” conditions, respectively. The depth information in the motion condition is not merely due to relative image velocity (motion parallax), however. More generally, the information comes from having multiple views, that is, head coupled perspective. For the “no motion” conditions, we instructed observers not to move their heads and we discarded trials in which they moved their heads by more than a small amount. For this condition, we also turned off the head coupled perspective when rendering and instead we fixed the virtual observer's position and orientation to a standard view, 60 cm from the front and center of the clutter volume. 
Scenes were generated as follows. On each trial, the XYZ positions of the two targets were initialized to be at the center of a bounding XYZ volume of size 20 × 20 × 20 cm. The standard observer position thus was 70 cm from the center of this volume. The targets were separated in depth by an interval ΔZ, namely they were positioned at depths

Znear = Zcenter − ΔZ/2  and  Zfar = Zcenter + ΔZ/2,

where Zcenter denotes the depth of the center of the volume.
The value ΔZ was chosen using a staircase procedure that is described below (see Procedure). The short bar targets then were separated horizontally (X) by 10 cm, and the XY position of each was randomly perturbed by up to 1 cm in the X direction and up to 1.5 cm in the Y direction. The long bar targets always were separated vertically by 6.7 cm. The ends of the long bar targets were hidden behind two large flanking occluders (Figure 1b). 
The 3D clutter in each trial was defined by generating 1,331 (11³) distractors. Each distractor was a square of width 12 mm, and was assigned a random gray level and random 3D orientation. The position of each distractor within the XYZ bounding volume was chosen according to one of the four probability distributions, which are illustrated in Figure 5.
Figure 5
Four combinations of occlusion cues: (a) “visibility + range,” (b) “visibility,” (c) “range,” and (d) “neither.” The combinations illustrate an XZ slice through a short bar scene or YZ slice through a long bar scene. In each case illustrated here, the left target in the figure is closer to the observer than the right target (recall viewing geometry of Figure 4). Distractors are illustrated with uniform gray squares on a set of uniform grids. In the actual stimuli, the gray levels are random and the positions and orientations are random as well (uniform in distribution within each marked region).
The 3D clutter was rendered under perspective projection, with various combinations of binocular stereo and motion parallax as described earlier. The only exception to perspective projection is that each target was rescaled to remove its perspective information about depth—the so-called size cue. Each target was rescaled in real time such that its visual angle always equaled the visual angle of the unscaled target at a depth of Z = 70 cm. Each short bar target subtended 1° × 2° and each long bar target was 1° high. Removing the size cue from the targets is a standard manipulation in depth discrimination experiments (e.g., Blakemore, 1970). 
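The rescaling that removes the size cue amounts to a single scale factor per frame. A minimal sketch follows, under the assumption that the target is scaled about its own center and that distance is measured from the virtual eye to the target; the function name and argument are illustrative.

```python
def size_cue_removal_scale(eye_to_target_cm, reference_cm=70.0):
    # Under perspective projection, a target's visual angle is (to first order)
    # inversely proportional to its distance from the eye. Scaling the target's
    # physical size by distance / reference therefore keeps its visual angle
    # equal to that of the unscaled target at the reference distance (70 cm).
    return eye_to_target_cm / reference_cm

# A target 5 cm beyond the reference plane is drawn 75/70 (about 1.07x) larger,
# so it subtends the same visual angle as it would at 70 cm.
```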
To examine occlusion cues, we manipulated the distributions of the clutter (see Figure 5). The two cues were visibility and range, as described earlier, and we combined these two cues in four ways as follows: 
  •  
    “visibility + range”: The distractor distribution is uniform over the 3D bounding volume. The occlusions provide a depth cue because the expected visibility of the target decreases with depth (Figure 2). Occlusions provide a range cue because the expected maximum occluder depth of a target increases with target depth (Figure 3).
  •  
    “visibility” (and no range): The distractor distribution is spatially nonuniform. The probability that a distractor appears in the left or the right half volume is 0.5 each, as in the uniform case. For the half volume that contains the near target, the probability in the depth interval [Znear, Zfar] from (a) is moved to the interval [Zfar, Zmax]. For the half volume containing the far target, the distractor probability in the depth interval [Znear, Zfar] is moved to the interval [Zmin, Znear]. This new distribution provides a visibility cue as in (a), but the range cue is removed since the depth range of foreground occluders is approximately the same for the near and far targets. We will discuss this distribution again later in the paper when we examine ideal observers.
  •  
    “range” (and no visibility): Here, we make the probability density nonuniform but in a different way from (b). For each left/right half volume, the probability that a distractor has depth less than the target is 0.5 and the probability that the distractor has depth greater than the target is 0.5. There is no visibility cue since the expected visibilities of the near and far targets are the same. There is a range cue since the distractors that can occlude the near target all must have depth less than Znear whereas the distractors that can occlude the far target can have depths up to Zfar.
  •  
    “neither” (neither visibility nor range): Here, each distractor has depth less than Znear with probability 0.5 and depth greater than Zfar with probability 0.5, regardless of in which half volume the distractor lies. There is neither a visibility nor range cue since the distractor distributions are the same in the two half volumes.
One further manipulation to the probability distributions was used. If a distractor's position and orientation were such that the distractor intersected a target, then the distractor was shifted slightly in depth to remove this intersection. (We allowed intersections between distractors, but not between distractors and targets.) The reason for avoiding intersections between distractors and targets was that the intersection lines could have provided edge features that directly indicate the depth of the target from binocular stereo or motion parallax cues. It was important to remove these intersection features so that we could isolate the range cue, in particular. 
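The four distractor depth distributions above can be summarized by a single sampling rule. The sketch below is a simplified reconstruction under the assumption, stated in the Figure 5 caption, that the distribution is uniform within each marked region; distractor X and Y positions, orientations, gray levels, and the target-intersection adjustment are omitted, and all names are illustrative.

```python
import random

def sample_distractor_depth(condition, half, z_min, z_max, z_near, z_far):
    """Sample one distractor depth for the half volume containing the near or
    far target (half is 'near' or 'far'), following the four distributions
    sketched in Figure 5. Depths are in the same units as z_min..z_max."""
    if condition == 'visibility+range':   # (a) uniform over the whole depth range
        return random.uniform(z_min, z_max)
    if condition == 'visibility':         # (b) mass in [z_near, z_far] is displaced
        z = random.uniform(z_min, z_max)
        if z_near < z < z_far:
            # near-target half: push the mass behind z_far; far-target half: in front of z_near
            return random.uniform(z_far, z_max) if half == 'near' else random.uniform(z_min, z_near)
        return z
    if condition == 'range':              # (c) half the mass before the target, half after
        z_target = z_near if half == 'near' else z_far
        if random.random() < 0.5:
            return random.uniform(z_min, z_target)
        return random.uniform(z_target, z_max)
    if condition == 'neither':            # (d) same distribution in both half volumes
        if random.random() < 0.5:
            return random.uniform(z_min, z_near)
        return random.uniform(z_far, z_max)
    raise ValueError(condition)
```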
Design
Four depth cues were combined, namely two parallax cues (stereo and motion) and two occlusion cues (visibility and range). This yielded 16 possible conditions. For some of these combinations, such as when none of the four cues was present, the task was considered to be impossible and so we did not test these conditions. For the short bar targets, there were two such impossible conditions, namely when there were no parallax cues and no visibility cue—regardless of whether there was a range cue. (The task was considered to be impossible for the “range” condition (Figure 5c) when no parallax cues were available since the range cue requires an estimate of the depth of the occluders.) For the long bar targets, the task was considered to be impossible for the conditions just mentioned and also for conditions where both the visibility and range cues were removed, even if stereo or motion cues were present. In those conditions, the parallax cues provide no depth information about the long targets. 
We also tested a “baseline” condition for the short bar targets in which both parallax cues were present, but the clutter was removed. This gave a total of 26 depth cue combination conditions, namely 25 clutter conditions and one baseline condition. 
Each observer ran all 26 conditions in a blocked design, with one staircase per block. The staircases will be described below (see Procedure). The ordering of the blocks was randomized for each observer. 
Observers
Thirty observers participated in the experiment. Fifteen used the Oculus Rift display, and 15 used the fishtank VR display. Each observer was a student at McGill University and was paid $10. Observers had little or no experience with psychophysics experiments. Each had normal or corrected-to-normal vision. To participate, each observer was required to discriminate 50 arcsec of disparity, namely level 6 of the Randot Stereo Test (Precision Vision). Observers were unaware of the purpose of the experiments. Informed consent was obtained following the guidelines of the McGill Research Ethics Board, which are consistent with the Declaration of Helsinki. 
Procedure
In each trial, observers indicated which of the two targets was closer to them. They responded by pressing keys on the keyboard: the left and right arrow keys for the short bar targets, and the up and down arrow keys for the long bar targets. 
As mentioned previously, a blocked design was used such that the combination of cues was fixed for each block. A one-up/one-down staircase was used for each block, with different step sizes for down steps versus up steps. The ratio of the log of the down-step size to the log of the up-step size was 0.2845 (Garcia-Perez, 1998). Whenever the observer answered correctly, the distance ΔZ between targets was reduced by a factor of 0.8. When the observer answered incorrectly, ΔZ was increased by a factor of 2.19. This rule aimed for approximately 78% correct. If ΔZ increased beyond 20 cm, which normally would put the targets outside the bounding box of the clutter, the near target was presented just in front of the front face (at Zmin) and the far target just beyond the back face (at Zmax). This configuration made the task trivial since the near target was unoccluded and the far target was highly occluded. If the observer still answered incorrectly in this case, the same rule as above was used for choosing the next staircase level, but in the next trial the targets again were displayed at these clamped depths, that is, just in front of Zmin and just beyond Zmax. Each staircase began at level ΔZ = 12 cm and terminated after 12 reversals. To compute the threshold for a given staircase, we averaged the log of the ΔZ values for the last 10 reversals. 
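A minimal sketch of this weighted one-up/one-down staircase is shown below. It omits the special handling of levels above 20 cm described above, and run_trial is a stand-in for presenting a stimulus and collecting a response; the exact bookkeeping in the experiment may differ.

```python
import math

def run_staircase(run_trial, dz_start=12.0, n_reversals=12):
    """run_trial(dz) presents a trial with target separation dz (cm) and
    returns True if the observer answered correctly. Returns the geometric
    mean of dz over the last 10 reversals (i.e., exp of the mean log dz)."""
    dz = dz_start
    reversals = []
    prev_correct = None
    while len(reversals) < n_reversals:
        correct = run_trial(dz)
        if prev_correct is not None and correct != prev_correct:
            reversals.append(dz)                    # staircase direction reversed here
        dz = dz * 0.8 if correct else dz * 2.19     # down after correct, up after error
        prev_correct = correct
    return math.exp(sum(math.log(r) for r in reversals[-10:]) / 10)
```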
For blocks in which the motion cue was present, observers were instructed to move their heads left and right. If they did not move, then a warning message was displayed reminding them to move, and the trial was discarded. When rendering head coupled perspective, we clipped the actual human observer's position to a horizontal region of size 30 × 0 × 10 cm (X × Y × Z), which was centered at position (0, 0, 60) relative to the center of the front face of the clutter cube. This restricted the viewing position used for rendering to always have the same Y value, which removed any possibility that the observers could use vertical motion parallax from the target's upper and lower (horizontal) edges. For blocks in which there was no motion cue, if observers moved their heads then a message was presented telling them not to move their heads (see Stimuli section) and the trial at that level was repeated with a new stimulus. 
The response time in each trial was limited to 4 s. If the observer did not respond, then the trial was discarded and another scene was generated using the same target distance. A prompt was displayed to remind the subject to respond in time. The experiment typically lasted close to 1 hr. 
Before running the experiment, each observer ran a short practice session with three conditions, each with stereo present: the short bar targets with and without motion parallax, and the long bar targets with motion parallax. There was no time limit in each trial of the practice session. As in the real experiment, the initial ΔZ was 12 cm and a staircase was used to determine the next level. Since the purpose of the practice session was merely to familiarize the subjects with the requirements of the task, we kept the session short: Each condition terminated with the first incorrect answer. 
Results
Our main goal was to examine whether the two types of occlusion cues are used to discriminate depth in 3D clutter, and how these visibility and range cues would interact with binocular stereo and motion parallax cues. If both visibility and range cues are used by human observers, then we expect performance to be best when both cues are present, and we expect performance to be better if one of these cues is present than if neither is present. We also expect that, within any of the four combinations of the visibility and range cues, performance would be best if both stereo and motion cues were present since previous studies have shown that stereo and motion cues combine to give better performance. Such studies traditionally use scenes containing isolated surfaces (Johnston, Cumming, & Landy, 1994; Bradshaw & Rogers, 1996) and some studies also have used scenes containing 3D clutter (Sollenberger & Milgram, 1993; Arthur et al., 1993). Note that the latter studies did not examine interactions between stereo and motion cues and occlusion cues explicitly, which was the goal of our experiments. 
Figure 6
For each row, the pairs on the left should be cross-fused and the pairs on the right should be viewed divergently. The four rows correspond to the conditions shown in Figures 5a through d. For each row, the closer target is on the left, and ΔZ = 8.
Figure 7
The same conditions as in Figure 6, except now the upper target is closer.
Figure 8 shows the depth difference (ΔZ) thresholds for all 3D clutter conditions, namely for the two types of display (upper vs. lower rows) and for the short and long bar conditions (left vs. right columns). For conditions that we did not test, we plotted a threshold of 20 cm. These were conditions in which the task was considered impossible for ΔZ values less than 20 cm and trivial when ΔZ was greater than 20 cm, since in that case the near target was entirely visible. We verified by simulation that an observer who guesses when ΔZ < 20 cm and who is always correct when ΔZ ≥ 20 cm achieves a threshold of over 20 cm. Specifically, we ran 15 simulated observers for each of these conditions, and mean thresholds for each condition were between 20 and 22 cm (not shown). 
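As an illustration of this check, such a guessing observer can be simulated with the staircase sketch from the Procedure section. The authors' actual simulation code is not given, so the snippet below is only an assumed reconstruction.

```python
import random

def guessing_observer(dz, dz_trivial=20.0):
    # Guesses (50% correct) whenever dz < 20 cm; always correct otherwise,
    # mimicking an observer who benefits only from the trivial configuration.
    return dz >= dz_trivial or random.random() < 0.5

# thresholds = [run_staircase(guessing_observer) for _ in range(15)]
# print(sum(thresholds) / len(thresholds))   # tends to land a little above 20 cm
```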
Figure 8
Depth discrimination thresholds for the short and long bar targets (left and right columns) and for the Oculus Rift and Fishtank VR displays (upper and lower rows). The black line in each plot on the left shows the baseline threshold. Error bars show standard error of the mean.
Thresholds in Figure 8 are plotted as ΔZ values in cm. To convert these thresholds to stereo disparities, we use the standard small-angle approximation

disparity ≈ I ΔZ / Z²,

where I = 6.5 cm is the interocular distance and Z is the viewing distance from the observer to the targets.
For example, the minimum and maximum thresholds for the clutter conditions are roughly ΔZ = 2 and 20 cm, and these correspond to about 9 and 90 arcmin of disparity, respectively. The conversion assumes the observer is at 60 cm from the front face of the clutter, so that targets near the center of the clutter are at Z ≈ 70 cm. 
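These numbers can be checked directly with the approximation above (a small sketch; the value Z = 70 cm is our assumption that the targets sit near the clutter center):

```python
import math

def depth_to_disparity_arcmin(dz_cm, iod_cm=6.5, z_cm=70.0):
    # Small-angle approximation: disparity ~ IOD * dZ / Z^2 radians, converted to arcmin.
    return iod_cm * dz_cm / z_cm ** 2 * (180.0 / math.pi) * 60.0

print(round(depth_to_disparity_arcmin(2.0), 1))    # ~9.1 arcmin
print(round(depth_to_disparity_arcmin(20.0), 1))   # ~91.2 arcmin
```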
To compare thresholds across a pair of selected conditions, we used paired two-sided t tests (Microsoft Excel). We tested the null hypothesis that the means of two conditions across 15 observers are the same. Rather than choosing a particular p value threshold to reject the null hypothesis, we just state the p values. 
We first present the results for the two baseline conditions, which consisted of short bar targets, no clutter, and both stereo and motion cues (see black lines in Figures 8a, c). For the Oculus Rift display, the mean baseline threshold was ΔZ = 1.6 cm, which corresponds to a binocular disparity of about 7.3 arcmin. This disparity is slightly greater than the nominal interpixel distance (5.5 arcmin) of the Rift, although the Rift's resolution may be slightly worse than this nominal value because of chromatic aberration, which can occur if the eye position is misaligned with the center of the Rift lens. For the fishtank VR display, the mean baseline threshold was 0.2 cm, which corresponds to a binocular disparity of 0.9 arcmin. This disparity is slightly lower than the interpixel distance for the Acer monitor (1.55 arcmin at a screen viewing distance of 60 cm). 
The thresholds for the baseline condition were generally much lower than the thresholds for the 3D clutter conditions. The only exceptions occurred for the Oculus Rift display when both stereo and motion cues were present. For example, when the visibility and range cues also were present (Figure 8a, leftmost yellow bars), the difference between the baseline and 3D clutter thresholds was only marginally significant (t = 1.94; p = 0.07). 
Although the 3D clutter did generally reduce performance, the clutter also provided occlusion cues that observers could use to improve their performance. To demonstrate that both the visibility and range cues improved performance, we carried out six two-way ANOVAs with repeated measures, for the short bar targets. Table 1(a) shows three ANOVAs for the Oculus Rift condition, namely one ANOVA for each combination of stereo and motion in which at least one was present. We required at least one of stereo and motion to be present so that the observers could perceive the distractor depths to some extent. Each ANOVA tested two factors, namely visibility and range. Table 1(b) shows similar ANOVAs for the fishtank VR display. 
Table 1
Two-way ANOVAs testing the effect of visibility and range cues for short bar targets.
Main effects of visibility and range cues were found for all six ANOVAs at the p < 0.05 level, except for one case, namely the range cue with the fishtank VR display and with stereo + motion. In this case the p value was 0.09, which approaches significance. We did not expect interactions between the visibility and range cues, and indeed p values for interactions were greater than 0.05 in all but one of the six ANOVAs. 
We did not carry out ANOVAs for the long bar target conditions because the task was impossible when neither the visibility nor the range cue was present. We know that observers were using the visibility and range cues with the long bars, though, since observers were able to perform the task well above chance whenever the visibility cue was present, and also when the range cue was present provided that either stereo or motion also was present. 
We next focus on the “visibility” only condition (Figure 5b), and ask whether adding stereo and motion provided any additional benefit. For example, for the short bar targets, thresholds in the “visibility” only condition were lower when stereo and motion cues were both present than when neither stereo nor motion were present (compare yellow vs. gray bars: Oculus Rift: t = 4.35; p = 0.0007; Fishtank VR: t = 4.63; p = 0.0004). This is not a trivial result, since the targets have no texture on them and only the left and right edges of the short targets could provide accurate disparity or motion parallax information, and these left and right target edges often were occluded. As an aside, we note that adding texture to the targets could lower thresholds somewhat, but limited target visibility would still be a problem. For example, Figure 2c (gray curve) shows that when ΔZ is small, namely when both targets are near the center of the clutter at Z = 10, only about one fifth of each target's area is binocularly visible. 
For the long targets, stereo and motion cues did not improve performance for the “visibility” condition. For example, the yellow versus gray bars in the “visibility” condition of Figures 8b and d indicate roughly equal thresholds. In this condition, stereo and motion provide no direct information about depths of the long targets, since the left and right vertical edges of the long targets are hidden. Although having stereo and motion does provide some extra visibility information because there are multiple views, this extra information is very small as we will see later when we present the visibility-based ideal observer. 
For the “range” conditions, stereo and motion cues obviously were used since observers could do the task if one of these two cues was present. Interestingly, thresholds in the “range” condition were lower for the motion cue than for the stereo cue (green vs. orange bars, respectively) for all Figures 8a through d, with p values for t tests being less than 0.04 in all four cases. The reason for this result may be that stereo is less reliable than motion parallax when depth differences are large, because of stereo fusion limits (Blakemore, 1970; Wilcox & Allison, 2009). That is, stereo suffers beyond Panum's fusional area, but motion parallax does not. 
Discussion
Our experiments have shown that humans use both visibility and range cues for depth discrimination in 3D clutter. We have also described examples of how these cues interact with binocular stereo and motion parallax cues. We next turn to the questions of how much information is available from the visibility and range cues, and how well the visual system uses this information. 
Ideal observers
We define ideal observers for each of the visibility and range cues. Rather than computing thresholds by running staircases, we use the method of constant stimuli and plot percent correct scores. Ideal observer thresholds are defined as the depth difference at which the observer achieves 78% correct. 
To define the ideal observers, we generated the same random scenes that were used in the experiments. For each scene, a set of rays was cast from a standard eye position to a regular sampling grid of positions on each of the two targets. The sampling grid resolution was 0.5 mm. The visibility of a target was defined as the fraction of these cast rays that did not intersect a distractor. This was the computation used for the plots in Figures 2 and 3 as well. The visibility-based ideal observer then chose the target with larger visibility to be the closer one. The range-based ideal observer used the same set of cast rays, but for each target it considered the rays that hit an occluder and it computed the maximum Z value of these occlusion rays. This maximum Z value is a lower bound on the target depth. The range-based observer chose the target with the smaller maximum Z value to be the closer target. 
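To make the two decision rules concrete, the sketch below shows how the visibility-based and range-based choices could be computed from one set of cast rays per target. It is a simplified reconstruction of the procedure just described; the rendering and ray-casting code itself is not shown, and the data format is assumed.

```python
def ideal_observer_choices(hits_a, hits_b):
    """hits_a, hits_b: one entry per ray cast from the standard eye position to a
    0.5 mm grid on targets A and B; an entry is None if the ray reaches the target
    unoccluded, otherwise the depth Z of the distractor that occludes it.
    Returns (visibility_choice, range_choice), each 'A' or 'B', the target judged
    closer by that cue."""
    def visibility(hits):
        return sum(h is None for h in hits) / len(hits)
    def max_occluder_depth(hits):
        occluded = [h for h in hits if h is not None]
        return max(occluded) if occluded else float('-inf')
    # Visibility cue: the more visible target is judged closer.
    vis_choice = 'A' if visibility(hits_a) > visibility(hits_b) else 'B'
    # Range cue: the maximum occluder depth is a lower bound on the target depth,
    # so the target whose deepest occluder is shallower is judged closer.
    range_choice = 'A' if max_occluder_depth(hits_a) < max_occluder_depth(hits_b) else 'B'
    return vis_choice, range_choice
```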
We considered eight stimulus conditions, namely the four combinations of visibility and range cues and the two types of target (short and long). For each of these eight conditions, we used 20 levels of ΔZ and 5,000 example scenes for each level. 
Figure 9 shows percent correct scores for the visibility-based ideal observer. The 78% thresholds were about 5 cm and 2 cm for the (a) short and (b) long bar targets, respectively, both for the “visibility” and “visibility + range” conditions. This performance is similar to that of the human observers in these conditions, in particular when neither stereo nor motion cues were present (two leftmost gray bars in each plot of Figure 8). This similarity between the ideal and human observers suggests that humans made nearly full use of the visibility cue. Performance of the visibility-based ideal observer was at chance for the conditions without the visibility cue, namely the “range” and “neither” conditions (Figure 5c, d), which supports our claim that there was no visibility cue information present for these conditions. 
Figure 9
Visibility-based ideal observers. (a) Short bar targets; (b) long bar targets.
Finally, note that performance of the visibility-based ideal observer generally was better for the long bar targets than the short bar targets. The reason is simply that the number of samples was greater for the long bar targets, which reduced the variance of the visibility estimate of each target. 
Figure 10 shows the percent correct scores for the range-based ideal observers. The 78% thresholds in the “range” condition were approximately 0.6 cm and 0.2 cm for the short and long bar targets, respectively. These thresholds are very low, which shows that the maximum occluder depths for the two targets carry strong information about the target depths. As we have discussed, however, to use this range cue the observer must be able to estimate the depths of the occluders. Human observers are not given the occluder depths, but rather must estimate them from stereo, motion, or other cues. This is presumably a key reason why human observer thresholds are much greater than the range-based ideal observer's thresholds in the “range” condition. For example, with the long bar targets, human observer thresholds in the “range” condition were over 4 cm, even when both stereo and motion were present, which is much greater than the 0.2 cm ideal observer threshold mentioned previously. 
Figure 10
Range-based ideal observers. (a) Short bar targets; (b) long bar targets.
Further experiments are needed to isolate the factors that contribute to the relatively large human observer thresholds in the “range” condition. We believe there are two distinct factors to consider here. The first is the uncertainty in the depth estimates of the occluders. To model this uncertainty, one could add depth noise to the ideal observer, and there are a few ways one could do this. For example, one could add noise independently to the depth estimates at each image pixel, or one could add a single depth noise value to each occluder. The former might be preferred if one were modeling a population code of stereo or motion detectors, whereas the latter might be preferred if one were assuming that the visual system groups the pixels of each occluder. Note that the latter case is perhaps optimistic, since occluders are sometimes difficult to visually segment from each other. Another important consideration when modeling depth noise is that deeper occluders likely have noisier depth estimates, since deeper occluders are more likely to be partly occluded by shallow occluders. 
The second factor that one needs to consider is how humans combine these noisy depth estimates. We have defined the “range” cue according to the maximum depth of the occluders. However, human observers might be using other statistical properties of the occluder depths, such as the mean (rather than maximum) occluder depth, or the mean depth of distractors in the neighborhood of each target. Further experiments are needed to determine what information observers are in fact using. 
We next turn to the question of when the various cues are independent, and when there is some information “leakage” between cues. First consider the visibility and range cues. An example of information leakage is shown in Figure 10, where the range-based ideal observer benefits slightly from the visibility cue when the latter is present. For example, the range-based observer performed above chance in the “visibility” condition (Figure 5b). We believe the reason is that in this condition the maximum occluder depth for the far target tends to be slightly greater than that of the near target, since the number of occluded points for the far target tends to be greater and so there are more depth samples over which the maximum depth is computed. Similarly, the range-based observer's performance was slightly greater in the “visibility + range” condition (Figure 5a) than in the “range” condition (Figure 5c), since the difference in maximum occluder depths for the near and far targets tends to be smaller in the latter condition. We do not believe that this information leakage contributed to the human observer performance, since it made only a small difference to the range-based ideal observer and since humans performed so much more poorly than the ideal observer using the range cue. To state this another way, we believe that when the visibility cue is present, humans benefit from it much more as a visibility cue than through the additional range information that it provides. 
A second possible source of information leakage is from the stereo or motion conditions into the visibility or range cues. To examine this possibility, we defined binocular versions of the above ideal observers and asked whether having a second view itself provides extra information about visibility and range. Binocular visibility-based ideal observers were defined by computing the visibilities of each target in two eye views (separated by 6.5 cm), averaging the visibility of each of the two targets across the two eyes, and then comparing the average visibilities of the two targets. Similarly, binocular range-based ideal observers were defined by pooling the occlusion rays for the left and right eyes for each target, and comparing the maximum Z value for the pooled occlusion rays for each target. 
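In the same assumed ray-cast format as the earlier sketch, the binocular observers pool information across the two eyes as follows (again a reconstruction, not the authors' code):

```python
def binocular_ideal_choices(left_hits, right_hits):
    """left_hits, right_hits: dicts mapping 'A'/'B' to that eye's list of ray
    outcomes for each target, in the format used in the earlier sketch."""
    def visibility(hits):
        return sum(h is None for h in hits) / len(hits)
    def max_occ(hits):
        occluded = [h for h in hits if h is not None]
        return max(occluded) if occluded else float('-inf')
    # Visibility: average each target's visibility across the two eyes.
    vis = {t: 0.5 * (visibility(left_hits[t]) + visibility(right_hits[t])) for t in ('A', 'B')}
    # Range: pool both eyes' occlusion rays before taking the maximum occluder depth.
    rng = {t: max_occ(left_hits[t] + right_hits[t]) for t in ('A', 'B')}
    vis_choice = 'A' if vis['A'] > vis['B'] else 'B'
    range_choice = 'A' if rng['A'] < rng['B'] else 'B'
    return vis_choice, range_choice
```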
The monocular and binocular ideal observers performed very similarly to each other. For example, Figure 11 compares the two visibility-based ideal observers for the short bar targets. This was the condition in which the monocular and binocular ideal observers had the biggest difference, and this difference was quite small. To understand why the monocular and binocular performance was so similar, consider what happens at various ΔZ. For small ΔZ, the expected visibilities of the two targets are nearly the same, and so the monocular visibility-based observer performs near chance. The binocular observer has little advantage here, since averaging the visibilities of the two eyes reduces the variability of the visibilities only slightly, and the variability remains much greater than ΔZ (recall the error bars in Figure 2b). For example, if the visibilities in the two eyes were independent, then averaging them would reduce the standard deviation of visibility only by a factor of 1/√2; but, in fact, the visibilities of a target in the two eyes are not independent, and so averaging reduces the standard deviation by less than that factor. As ΔZ increases, the performance of both the monocular and binocular visibility-based ideal observers increases. However, the binocular observer gains even less from the second view than it did when ΔZ was small: as ΔZ increases, the disparity differences between the near target and its occluders become smaller, because the near target on average is closer to its occluders. It follows that the visibilities of the left and right views of the near target become more similar, and hence the second view of the near target becomes more correlated with the first. A different argument holds for the far target, but the result is the same: as ΔZ increases, the visibility of the far target approaches zero in both the left and right views, so again the binocular visibility-based ideal observer benefits little from the second view.  
Figure 11
Monocular versus binocular visibility-based ideal observer, for short bar target scenes rendered with “visibility” cue only.
Similar arguments can be made for the range cue. The second view provides additional occluder depth values from which the observer computes the maximum occluder depths, but the depth values from the second view tend to be the same as those of the first view and hence uninformative. Similar arguments also can be made for the motion parallax ideal observers, both for the visibility and range cues. Assuming the moving observer's position is horizontal along the line segment joining the positions of the two eyes of the stereo condition, these additional views provide mostly redundant information. 
We should emphasize that the ideal observers are only ideal with respect to the specific information we have identified. There are other sources of depth information present in our stimuli, however, which may have influenced the performance of the human observers. For example, since the distractors are rendered using perspective projection, there is monocular information available about the distractor distributions from perspective cues. Distractors near the front of the volume have a larger image size and smaller image density than distractors at the back of the volume. We believe that this image size information is difficult for observers to use, since the distractors are randomly oriented and occlude one another. Nonetheless, the information is there. Another type of perspective cue occurs in conditions in which the 3D density of the distractors differs between the two foreground regions (Figure 5b, c). For example, in the “range” condition (Figure 5c), the density of distractors in front of the near target is greater than the density in front of the far target. This density difference is apparent when ΔZ is large, even in monocular viewing, since the edge of the foreground region is fuller in the denser half. If one looks for this cue, one can perform the task at a level above chance in the “range” condition by choosing the closer target to be the one behind the denser occluders. Similarly, in the “visibility” condition, there is also a difference in the density of distractors in front of the two targets. However, in this case, to perform above chance one would need to choose the target behind the less dense occluders. We believe it is unlikely that our observers used these perspective cues, since the observers were naïve, blocks were short (12 reversals only), the density differences just described were apparent only for large ΔZ, and the relationship between density and the correct answer in each condition is not obvious. Nonetheless, such density differences from perspective could have played some role. 
Another perspective cue was the X and Y positions of the short bar targets in the image. Recall that we jittered the 3D X and Y positions of these targets in each scene. The reason for doing so was that, for any fixed (X, Y) position, the eccentricity of the point in the image varies inversely with Z, and so this eccentricity would have been a cue to depth. By jittering the (X, Y) position of each target, we reduced the reliability of this cue. We computed an ideal observer that is based on the maximum eccentricity of a target point (not shown). This observer performs slightly above chance for large ΔZ, but below the 78% threshold. Thus we believe that this cue provided no benefit for human observers. 
Target size
Recall that we controlled the image size of the target to remove the perspective cue. This created a cue conflict in the sense that, under normal linear perspective, the solid angle of a target should vary inversely with the distance squared from the observer to the target. That is, the size cue specified that the targets were at the same depth, but the other cues specified that the targets were at different depths. What if we had not held the visual angle of each target constant and instead let the visual angle vary according to perspective so that a target size cue was present? How would this change the information available for doing the task? 
Figure 12 plots the performance of a monocular ideal observer that compares the visible solid angles of the two targets, rather than the fractional visibilities of the targets. Three rendering conditions are tested. In the “visibility, size” condition (light green), the distractor distribution of Figure 5b is used, and the target size cue is present, i.e., the target is not resized to hide the size cue. In the “visibility” condition (yellow), the size cue is not present, which gives the same result as in previous figures for this condition since the visible solid angle is equivalent to the fraction visible in the case that the targets are rescaled to remove the size cue. The “size” condition (dark green) uses the distractor distribution of Figure 5d in which neither the visibility nor range cue is present; now the target is rendered with the size cue present so that closer targets tend to have larger visible solid angle even though there is no (fractional) visibility cue. The three plots show that the size cue does improve performance, but that the improvement is small. For example, adding the size cue to the “neither” condition yields only 70% correct at a very large depth difference (16 cm). Adding the size cue in the “visibility” condition lowers thresholds only by a few centimeters. We conclude that in our 3D cluttered scenes, the visibility cue as we have defined it (fraction visible) provides much more information for an ideal observer than does the size cue. 
Figure 12
Target solid angle based ideal observer for short bar targets.
Finally, it has been suggested to us that an alternative way to reduce the effect of the angular size cue from perspective would be to randomize the target sizes in 3D. This would be analogous to what has been done in shape from texture studies where one varies the size of texture elements to reduce the reliability of the size or “scale” cue (e.g., Knill, 1998). However, there is a subtle issue that arises with such a manipulation in the case of 3D clutter, namely that it would also reduce the reliability of the visibility cue as we have defined it, namely fraction visible. A key issue here in general is that, when a target is partly occluded, observers cannot be sure what the image size (or shape) of each target would have been without the occluders present, and so they cannot know for sure what the fractional visibility is. Indeed in our experiment, there is no distinction between the fractional visibility cue and the visible solid angle cue, and so in fact we do not know which of these cues observers were using. By holding the projected size of the target to be constant in our experiment, we are really only showing that human observers are responding to an image property that is correlated with the fractional visibility. 
Other cue conflicts
As we discussed already, controlling the size cue created a cue conflict since the image sizes of the targets did not vary with depth as they should according to the laws of perspective. The main disadvantage of cue conflicts is that the visual system might adopt arbitrary strategies to reconcile the conflicts, and such strategies might not reflect what the visual system does in natural situations where the cues do not conflict. In general, cue combination studies try to avoid cue conflicts by adding cue noise that varies the reliability of each cue or by perturbing the levels of the cues to keep the conflicts small in magnitude (Landy, Maloney, Johnston, & Young, 1995). We did not take this approach in our experiments, and so here we provide more details about the cue conflicts that are present in our stimuli, and how these conflicts may limit what we can conclude from our results. 
First consider the stereo and motion cues. For stimuli presented on the Oculus Rift display, we removed the stereo cue by showing both eyes the same image at all times. Such monoscopic viewing introduced a cue conflict when motion was present, since the stereo cue implied that all surfaces were at the same depth whereas the motion cue implied that surfaces were at multiple depths. Observers could have reconciled this conflict in several ways, for example, by ignoring the depth information from either the stereo or the motion cue, or by taking a linear combination of the depth estimates from the two cues. Our results suggest that observers used the motion cue and simply ignored the conflicting stereo cue, which in any case provided no information for the task. For the fishtank VR display, we avoided this cue conflict by “virtually” patching one eye. 
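For concreteness, the linear combination mentioned above is usually formulated as a reliability-weighted average, the standard “weak fusion” rule (Landy et al., 1995). The sketch below states that rule; it is not a claim about which strategy our observers actually adopted.

    def combine_depths(d_stereo, var_stereo, d_motion, var_motion):
        """Reliability-weighted ('weak fusion') combination of two depth estimates.
        Each cue is weighted by its inverse variance, which is the maximum-likelihood
        rule when the single-cue estimates are unbiased and Gaussian."""
        w_s, w_m = 1.0 / var_stereo, 1.0 / var_motion
        return (w_s * d_stereo + w_m * d_motion) / (w_s + w_m)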
Another cue conflict existed in the conditions in which we removed either the visibility cue or the range cue. Recall that in these conditions we set the foreground density of the distractors to be different for the two targets (see Figures 5b, c). This created a cue conflict because the visibility and range cues assume that the density of the foreground distractors is spatially uniform. In fact, the distractor density information provided by the stereo or motion cues could have informed observers that the visibility or range cue was less reliable. Indeed, for our stimuli, the difference in density between the two half volumes is noticeable when ΔZ is large, even under monocular viewing. In this case, observers may have ignored the visibility cue or range cue (whichever was present) and instead selected the closer target as the one behind the denser (or less dense) foreground clutter. Although we cannot rule out such a strategy, we think it is unlikely, since it would have led to systematically incorrect responses when there was only a visibility cue (Figure 5b) and to systematically correct responses when there was only a range cue (Figure 5c), and we did not observe such behavior. Nonetheless, the possibility exists and should be considered when designing future experiments. 
A related question is: To what extent does the visibility cue require that the 3D clutter density be uniform? It is unclear how to answer this question. To appreciate the difficulty, consider the analogous question for shape from texture models, which typically assume that the texture element positions are “homogeneous” on the surface (Blake, Bülthoff, & Sheinberg, 1993; Knill, 1998). The question there would be: To what extent do shape from texture cues require that the texture is homogeneous? We are not aware of any shape from texture studies that have addressed this question. Another important yet difficult question is whether the homogeneity assumption is realistic for describing 3D clutter in the natural world. (The analogous question could again be asked about 2D texture on natural surfaces.) Some specific models of 3D clutter in tree canopies have been proposed (e.g., see citations in Langer & Mannan, 2012), and one avenue to explore might be how well observers can distinguish such distributions from each other. 
Target depths from stereo and motion parallax
Finally, we turn to a more traditional question: What depth information is provided directly by stereo and motion parallax cues in 3D clutter? This information is important for estimating the depths of the targets, and it is also important for the range cue, which requires information about the depths of the occluders. We keep the discussion at a high level here and try to connect it to previous work. We also restrict the discussion to stereo, noting that analogous points could be made for motion parallax since the two cues are geometrically related. 
Binocular stereo cues can provide at least three types of information for discriminating the target depths, in particular for the short bar targets. The first is the positional disparity of those points on the targets that can be identified in both the left and right eye views. As we discussed earlier, such points lie only on the left and right vertical edges of the short bars, since the surfaces are not textured. If we were to texture the targets with random dots, then the disparities of points internal to the target could also be used. The disparity information would still be limited, however, since as the depth of a target increases into the clutter, the binocular visibility decreases faster than the monocular visibility (Figure 2c). 
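As a reminder of the underlying geometry, the disparity available from such matched points follows the standard small-angle relation between relative disparity and distance. The sketch below uses a typical interocular distance of 6.3 cm, which is an illustrative assumption rather than one of our display parameters.

    import math

    def relative_disparity_arcmin(z_near_cm, z_far_cm, ipd_cm=6.3):
        """Relative horizontal disparity (in arcmin) between two points at distances
        z_near and z_far from the observer, using the small-angle approximation
        disparity ~ IPD * (1/z_near - 1/z_far)."""
        disparity_rad = ipd_cm * (1.0 / z_near_cm - 1.0 / z_far_cm)
        return disparity_rad * (180.0 / math.pi) * 60.0

    # Example: two targets at 100 cm and 108 cm differ by about 16 arcmin of disparity.
    print(round(relative_disparity_arcmin(100.0, 108.0), 1))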
A second type of disparity cue is da Vinci stereopsis (Nakayama & Shimojo, 1990; Harris & Wilcox, 2009). For our stimuli, this cue arises when the vertical side edge of a short bar target is visible to one eye but not to the other. Although most studies of da Vinci stereopsis consider only simple scene geometries, the cue is available in 3D cluttered scenes as well. Note, however, that the cue can inform the observer only about the depth of the target relative to the depth of the occluder, so it could be useful for our task only when the occluder's depth can be perceived reliably. 
A third type of disparity information could come from the envelope (or, say, the convex hull) of the target points that are visible in each eye. The observer could compare the spatial envelope of each visible target in the left versus right images and estimate the disparity of these envelopes for each target. This cue is reminiscent of the Gaussian envelope cue that is used in studies of second-order stereopsis (Wilcox & Allison, 2009). It is plausible that the mechanism for perceiving depth from second-order stereopsis could be used for depth perception in 3D clutter. Future experiments and modeling efforts are needed to determine how much information is available from the above cues, and from the corresponding motion parallax cues. 
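To make the envelope idea concrete, one could compare the horizontal centroids of the visible target pixels in the two eyes' images. The sketch below assumes that per-eye binary visibility masks for each target are available (e.g., from the renderer's object identifiers); this is an assumption about the analysis pipeline, not a description of our experiments.

    import numpy as np

    def envelope_disparity(left_mask, right_mask):
        """Crude 'envelope' disparity of a partly occluded target: the horizontal
        offset (in pixels) between the centroids of its visible pixels in the
        left- and right-eye images. Returns None if the target is completely
        occluded in either eye. Both masks are boolean arrays (rows x columns)."""
        if not left_mask.any() or not right_mask.any():
            return None
        x_left = np.nonzero(left_mask)[1].mean()    # mean column index of visible pixels
        x_right = np.nonzero(right_mask)[1].mean()
        return x_left - x_right                     # sign convention is arbitrary

Comparing this quantity across the two targets would mimic an envelope-based (second-order) depth comparison of the kind described above.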
Conclusions
Our experiments and analyses have provided new insights into depth perception in 3D cluttered scenes, in particular, scenes in which the clutter is dense and so occlusion effects cannot be ignored. We have identified two new metric occlusion cues to depth in 3D clutter, namely a visibility cue and a range cue. We have shown how humans combine these depth cues with stereo and motion parallax. One might have expected that 3D clutter simply interferes with depth perception by reducing the information from binocular disparity and motion cues, but our experiments have shown that the situation is more complicated than that. Occlusions also provide information that observers use to discriminate the depths of identifiable targets that are embedded within the 3D clutter. A clear example of this is that observers are able to discriminate the depths of long bar targets, which carry no information about depth from binocular stereo and motion parallax. 
More generally, 3D cluttered scenes provide a rich and natural but neglected domain for studying depth perception. We have concentrated on how occlusion cues are combined with stereo and motion parallax but other cues should be examined as well, including perspective and shading. Finally, 3D clutter is common in natural scenes, but there has been little work in vision science to quantify how common it is and what the implications are (Changizi & Shimojo, 2008). We hope that some of the ideas of this paper could stimulate the community to address these questions. 
Acknowledgments
This research was supported by a Team Grant from FQRNT (Fonds québécois de la recherche sur la nature et les technologies). The authors would like to acknowledge Roy Breidi for implementing the fishtank VR setup, and many others for helpful discussions including Fred Kingdom, Curtis Baker, Milena Scaccia, and Jim Clark. 
Commercial relationships: none. 
Corresponding author: Michael S. Langer. 
Email: michael.langer@mcgill.ca. 
Address: School of Computer Science, McGill University, Montreal, Canada. 
References
Akerstrom, R. A., Todd, J. T. (1988). The perception of stereoscopic transparency. Perception & Psychophysics, 5, 421–432.
Andersen, G. (1989). Perception of three-dimensional structure from optic flow without locally smooth velocity. Journal of Experimental Psychology: Human Perception and Performance, 2, 363–371.
Arthur, K. W., Booth, K. S., Ware, C. (1993). Evaluating 3D task performance for fish tank virtual worlds. ACM Transactions on Information Systems, 3, 239–265.
Blake, A., Bülthoff, H. H., Sheinberg, D. (1993). Shape from texture: Ideal observers and human psychophysics. Vision Research, 33(12), 1723–1737.
Blakemore, C. (1970). The range and scope of binocular depth discrimination in man. Journal of Physiology, 211, 599–622.
Bradshaw, M. F., Rogers, B. J. (1996). The interaction of binocular disparity and motion parallax in the computation of depth. Vision Research, 36(21), 3457–3468.
Burge, J., Fowlkes, C. C., Banks, M. S. (2010). Natural-scene statistics predict how the figure-ground cue of convexity affects human depth perception. The Journal of Neuroscience, 30, 7269–7280.
Changizi, M. A., Shimojo, S. (2008). X-ray vision and the evolution of forward-facing eyes. Journal of Theoretical Biology, 254, 756–767.
Garcia-Perez, M. A. (1998). Forced-choice staircases with fixed step sizes: Asymptotic and small-sample properties. Vision Research, 38(12), 1861–1881.
Harris, J., Wilcox, L. (2009). The role of monocularly visible regions in the perception of three-dimensional scenes. Vision Research, 49, 2666–2685.
Harris, J. M. (2014). Volume perception: Disparity extraction and depth representation in complex three-dimensional environments. Journal of Vision, 14(12), 11, 1–16. doi:10.1167/14.12.11.
Johnston, E. B., Cumming, B. G., Landy, M. S. (1994). Integration of stereopsis and motion shape cues. Vision Research, 34, 2259–2275.
Knill, D. C. (1998). Surface orientation from texture: Ideal observers, generic observers and the information content of texture cues. Vision Research, 38, 1655–1682.
Landy, M. S., Maloney, L. T., Johnston, E. B., Young, M. J. (1995). Measurement and modeling of depth cue combination: In defense of weak fusion. Vision Research, 35, 389–412.
Langer, M. S., Mannan, F. (2012). Visibility in three dimensional cluttered scenes. Journal of the Optical Society of America A, 9, 1794–1807.
LaValle, S. M., Yershova, A., Katsev, M., Antonov, M. (2014). Head tracking for the Oculus Rift. IEEE International Conference on Robotics and Automation (ICRA), 187–194.
Nakayama, K., Shimojo, S. (1990). Da Vinci stereopsis: Depth and subjective occluding contours from unpaired image points. Vision Research, 30(11), 1811–1825.
Sollenberger, R. L., Milgram, P. (1993). Effects of stereoscopic and rotational displays in a three-dimensional path-tracing task. Human Factors: The Journal of the Human Factors and Ergonomics Society, 3, 483–499.
Sutherland, I. E. (1968). A head-mounted three dimensional display. Proceedings of the AFIPS Fall Joint Computer Conference, 757–764.
Tsirlin, I., Allison, R. S., Wilcox, L. M. (2008). Stereoscopic transparency: Constraints on the perception of multiple surfaces. Journal of Vision, 8(5), 5, 1–10. doi:10.1167/8.5.5.
van Ee, R., Anderson, B. (2001). Motion direction, speed, and orientation in binocular matching. Nature, 410, 690–694.
Wilcox, L. M., Allison, R. S. (2009). Coarse-fine dichotomies in human stereopsis. Vision Research, 49, 2653–2665.