Traditionally, the view of 3D representation in human vision is a geometric one, in which images from the two eyes, or changes in a monocular image, are used to deduce the most likely 3D scene that could generate those views. In computer vision, this process is known as photogrammetry (Hartley & Zisserman, 2000; Longuet-Higgins, 1981). The output of this process is a description of the 3D location of points in the scene up to an unknown scale factor, which can be provided by knowledge of the interocular separation or the distance traveled by the observer. In general, it is assumed that these scale factors are available to, and used by, the visual system in judging distances.
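To illustrate the role of the scale factor, consider the standard small-angle approximation for binocular viewing (a textbook relation, not a result from the papers cited above): for a point at distance $D$ viewed by eyes separated by $I$, the vergence angle is approximately

$$\mu \approx \frac{I}{D},$$

and a relative disparity $\delta$ between two points corresponds to a depth interval of approximately

$$\Delta D \approx \frac{\delta D^{2}}{I}.$$

The retinal measurements $\mu$ and $\delta$ therefore specify the scene only up to the unknown scale set by $I$; for a moving monocular observer, the equivalent scale factor is the distance traveled between views.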
This view of visual reconstruction predominates in the literature, even though there is debate about the extent to which the representation of space is distorted (Foley, 1980; Gogel, 1990; Johnston, Cumming, & Parker, 1993; Luneburg, 1950). There have been suggestions that there may be no single visual representation of a 3D scene that can account for performance in all tasks (Brenner & van Damme, 1999; Glennerster, Rogers, & Bradshaw, 1996). Instead, observers' responses may depend on separable ‘modules’ (Landy, Maloney, Johnston, & Young, 1995) and may not require a globally consistent map of space (Foo, Warren, Duchon, & Tarr, 2005; Glennerster, Hansard, & Fitzgibbon, 2001, 2009). In a very different context, cue combination approaches have been applied to judgements of surface slant (Hillis, Ernst, Banks, & Landy, 2002; Hillis, Watt, Landy, & Banks, 2004; Knill & Saunders, 2003) and object shape (Ernst & Banks, 2002; Johnston et al., 1993). As we shall see, the predictions of a cue combination model for object location can be quite different from those of a 3D model of the scene (whether that model is distorted or not).
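For reference, the linear combination rule typically assumed in these cue combination studies (e.g., Ernst & Banks, 2002; the notation here is generic rather than taken from any one of the papers cited) treats two cues as providing independent, Gaussian-distributed estimates $\hat{S}_{1}$ and $\hat{S}_{2}$ with variances $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$, which are combined as a reliability-weighted average

$$\hat{S} = w_{1}\hat{S}_{1} + w_{2}\hat{S}_{2}, \qquad w_{i} = \frac{1/\sigma_{i}^{2}}{1/\sigma_{1}^{2} + 1/\sigma_{2}^{2}},$$

whose variance, $\sigma^{2} = \sigma_{1}^{2}\sigma_{2}^{2}/(\sigma_{1}^{2} + \sigma_{2}^{2})$, is lower than that of either cue alone. Because the weights depend on the reliability of each cue at the moment of judgement, such a rule need not produce estimates consistent with any single reconstruction of the scene.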