The Bayesian framework provides an optimal way of combining the information contained within an image with prior assumptions about the nature of objects in the world. This approach has successfully been used to model human behavior in a range of visual tasks. In the current experiment, the sources of information in the image are the disparity and texture cues to shape. These are combined with a prior for convexity and a prior for frontoparallel. A detailed description of our model is provided in Appendix 1. The model is similar in spirit to that presented in van Ee, Adams, and Mamassian (2003).
Texture information can only be exploited by making assumptions about the original texture distribution on the surface in the world. In our model we assume that the distribution of lines is homogeneous over the surface (lines are equally likely to be present at any point on the original surface). However, the shape of a ridge means that at different points on the ridge, a given patch size on the surface projects to differently sized patches in the image. This results in systematic changes in texture density across the image: for ridges with large depths, the left and right sides of the resulting image will have a higher density than the middle of the image (
Figure 2). At long viewing distances, where the effects of perspective projection are small, this is largely a function of the local surface slant (see Appendix 1). Similarly, the orientation of a texture line in the image can be calculated from the line’s orientation on the original surface, its position, and the local slant of the surface. Because we assume a uniform distribution of texture line orientations on the original surface (the isotropy assumption), the orientation of lines in the image contains information about the probable shape of the object. For example, as the local slant of the surface gets larger, the projection of the texture lines becomes closer to vertical; this can be seen at the left and right edges of the images in
Figure 1. Given these assumptions of homogeneity and isotropy, we can calculate, for any ridge depth, the likelihood of a line appearing in the image at a particular position with a particular orientation. By treating each line in the image as independent and multiplying these likelihoods together, we can calculate the overall perspective likelihood for any image. The perspective likelihoods for the four textures used are shown in
Figure 2. There are no free parameters in our model for determining the texture likelihood. Any internal noise is assumed here to be negligible in the context of interpreting a randomly generated texture, whose information content is limited.
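To make these computations concrete, the following is a minimal sketch of a texture likelihood of this form. It assumes orthographic projection and, purely for illustration, an elliptical ridge cross-section with a half-width of 3 cm; the names used (HALF_WIDTH, local_slant, and so on) are illustrative only, and the line positions and orientations would be measured from the image.

```python
import numpy as np

HALF_WIDTH = 3.0  # cm; half of the ridge width (an assumed value for this sketch)

def local_slant(x, depth, half_width=HALF_WIDTH):
    """Unsigned surface slant (radians) at horizontal image position x for a
    ridge of the given peak depth, assuming an elliptical cross-section
    z(x) = depth * sqrt(1 - (x / half_width) ** 2)."""
    x = np.clip(x, -0.999 * half_width, 0.999 * half_width)
    dzdx = -depth * x / (half_width ** 2 * np.sqrt(1.0 - (x / half_width) ** 2))
    return np.arctan(np.abs(dzdx))

def orientation_likelihood(theta, slant):
    """p(image orientation | slant) when line orientations are uniform
    (isotropic) on the surface; theta is measured from horizontal, the axis
    foreshortened by the slant."""
    c = np.cos(slant)
    return c / (np.pi * (np.cos(theta) ** 2 + (c * np.sin(theta)) ** 2))

def position_likelihood(x, depth, half_width=HALF_WIDTH, n_grid=501):
    """p(image position | depth): a homogeneous surface texture projects to an
    image density proportional to 1 / cos(slant), normalized over the image."""
    grid = np.linspace(-half_width, half_width, n_grid)
    density = 1.0 / np.cos(local_slant(grid, depth))
    norm = np.sum(density) * (grid[1] - grid[0])
    return (1.0 / np.cos(local_slant(x, depth))) / norm

def texture_log_likelihood(xs, thetas, depth):
    """Summed log likelihood of the imaged lines for one candidate ridge depth,
    treating each line independently.  The value is identical for +depth and
    -depth, which is what makes the texture likelihood bimodal."""
    xs, thetas = np.asarray(xs, float), np.asarray(thetas, float)
    slants = local_slant(xs, depth)
    return (np.log(position_likelihood(xs, depth)).sum()
            + np.log(orientation_likelihood(thetas, slants)).sum())
```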
The second distribution is the disparity likelihood. This is simply defined as a Gaussian centered on the correct ridge depth. The width of the Gaussian is the first free parameter of the model and reflects the internal noise of the observer. This component of the model is omitted when the “texture only” stimuli are considered. We must entertain the possibility that errors exist in estimates of depth from stereo due to mis-scaling of retinal disparity with an incorrect viewing distance. In our set-up, there were multiple sources of information (vergence, accommodation, known distance to screen, and vertical disparities), all consistent with our viewing distance of 164 cm. We therefore chose not to incorporate separate biases into the disparity and texture likelihoods. Our observers’ depth judgments in both experiments were well modeled by incorporating a single prior for frontoparallel and/or residual flatness cues.
The third distribution is a prior for convexity. There is varied evidence that, in the absence of other information, the visual system “assumes” a convex rather than a concave shape. We have implemented this as a Gaussian centered on a ridge depth of 3 cm, corresponding to half of the ridge width. In other words, the prior assumption here is for a near-circular cylindrical shape. The spread of this Gaussian is the second free parameter and reflects the strength of the prior assumption.
Finally, the fourth distribution is a prior for frontoparallel. In limited-cue situations, depth is often underestimated. This has been interpreted as reflecting a prior for flatness and/or the presence of residual cues, such as accommodation and blur, that arise from using a flat screen to present visual stimuli. The final distribution in our model incorporates both this possible prior and any residual information. It is modeled as a Gaussian centered on zero depth. The width of the distribution (the third and final free parameter) reflects the relative strength of the prior and the reliability of the residual cues.
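As a minimal sketch, the disparity likelihood and the two priors can all be written as Gaussians over candidate ridge depth, with their three widths as the model’s free parameters; the parameter names below are illustrative only.

```python
import numpy as np

def gaussian(depth, mu, sigma):
    """Gaussian density over candidate ridge depth (cm)."""
    return np.exp(-0.5 * ((depth - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def disparity_likelihood(depth, disparity_depth, sigma_disparity):
    """Gaussian centered on the disparity-specified ridge depth; this term is
    dropped altogether for the "texture only" stimuli."""
    return gaussian(depth, disparity_depth, sigma_disparity)

def convexity_prior(depth, sigma_convex):
    """Gaussian centered on +3 cm, i.e. a near-circular convex ridge."""
    return gaussian(depth, 3.0, sigma_convex)

def frontoparallel_prior(depth, sigma_flat):
    """Gaussian centered on zero depth, standing in for a flatness prior
    and/or residual flatness cues from the display."""
    return gaussian(depth, 0.0, sigma_flat)
```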
All of the information (the likelihoods and the priors) is combined by multiplication. This is the optimal combination rule within a Bayesian framework and results in the posterior distribution. It is in this multiplication of the two likelihoods that disparity essentially serves to disambiguate the texture information. For example, consider the case in which the texture cue corresponds to a ridge with a depth of ±7.5 cm and the stereo cue corresponds to a ridge depth of −7.5 cm. The texture cue is ambiguous, and its likelihood distribution has peaks at both −7.5 cm and +7.5 cm (see
Figure 1). In contrast, the stereo cue is not ambiguous, and its likelihood distribution has only one peak at −7.5 cm. Their product will have a single peak, located at −7.5 cm. In this sense, the stereo cue has disambiguated the texture cue.
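The following small numerical illustration makes this concrete. The texture likelihood is stylized here as a mixture with peaks at ±7.5 cm rather than computed from an image, and all of the widths are arbitrary values chosen only for the example.

```python
import numpy as np

def bump(depth, mu, sigma):
    """Unnormalized Gaussian; normalization does not affect the peak location."""
    return np.exp(-0.5 * ((depth - mu) / sigma) ** 2)

depths = np.linspace(-10.0, 10.0, 2001)                    # candidate depths (cm)
texture_like   = bump(depths, -7.5, 1.0) + bump(depths, 7.5, 1.0)  # bimodal
disparity_like = bump(depths, -7.5, 2.0)                           # unimodal
posterior = (texture_like * disparity_like
             * bump(depths, 3.0, 5.0)       # convexity prior
             * bump(depths, 0.0, 8.0))      # frontoparallel prior
posterior /= posterior.sum() * (depths[1] - depths[0])     # normalize

print(depths[np.argmax(posterior)])  # single peak near -7.5 cm, pulled
                                     # slightly toward zero by the priors
```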
We considered two decision rules, one where the output of the model is the maximum of the posterior distribution (MAP) and one where the response is the mean of the posterior. These are equivalent to having either very narrow or very broad gain functions, but in this instance they produce very similar results. Maloney (2002) provides an analysis of gain functions. The model fits presented here were calculated using the mean of the posterior distribution.
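Applied to a discretized posterior, such as the one in the previous sketch, the two decision rules reduce to the following.

```python
import numpy as np

def map_estimate(depths, posterior):
    """Maximum of the posterior (MAP): the candidate depth at which it peaks."""
    return depths[np.argmax(posterior)]

def mean_estimate(depths, posterior):
    """Mean of the posterior, the read-out used for the fits reported here."""
    return np.sum(depths * posterior) / np.sum(posterior)
```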
Figure 6 shows the individual observers’ data and the best fit from our model. It can be seen that the model provides a good fit for both stimulus conditions. For each observer we found the single set of three parameters that provided the best fit (least squared error) to both the “texture and disparity” data and the “texture only” data. These are given in
Table 1. The model’s predictions for the “texture and disparity” data show a kink at around 0 cm of disparity-specified depth (e.g., observer ML in the ±7.5-cm texture condition). This corresponds to the point at which the peak in the posterior distribution moves from being close to the “concave” peak in the bimodal texture likelihood to being closer to the “convex” peak.
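As a sketch of the fitting procedure, the three free parameters can be found by minimizing the squared error between the model’s depth estimates and an observer’s settings; predict_depth below stands for the full model (texture and disparity likelihoods multiplied by the two priors, read out with the posterior mean) and is assumed rather than defined here, and the starting values are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def squared_error(params, stimuli, observed_depths, predict_depth):
    """Sum of squared differences between the model's depth estimates and one
    observer's settings, for a given triple of free parameters."""
    sigma_disparity, sigma_convex, sigma_flat = params
    predicted = np.array([predict_depth(s, sigma_disparity, sigma_convex, sigma_flat)
                          for s in stimuli])
    return np.sum((predicted - np.asarray(observed_depths)) ** 2)

# Example call (stimuli, observed_depths, and predict_depth are assumed):
# fit = minimize(squared_error, x0=[2.0, 5.0, 8.0],
#                args=(stimuli, observed_depths, predict_depth),
#                method="Nelder-Mead")
```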