**Humans are remarkably well tuned to the statistical properties of natural images. However, quantitative characterization of processing within the domain of natural images has been difficult because most parametric manipulations of a natural image make that image appear less natural. We used generative adversarial networks (GANs) to constrain parametric manipulations to remain within an approximation of the manifold of natural images. In the first experiment, seven observers decided which one of two synthetic perturbed images matched a synthetic unperturbed comparison image. Observers were significantly more sensitive to perturbations that were constrained to an approximate manifold of natural images than they were to perturbations applied directly in pixel space. Trial-by-trial errors were consistent with the idea that these perturbations disrupt configural aspects of visual structure used in image segmentation. In a second experiment, five observers discriminated paths along the image manifold as recovered by the GAN. Observers were remarkably good at this task, confirming that observers are tuned to fairly detailed properties of an approximate manifold of natural images. We conclude that human tuning to natural images is more general than detecting deviations from natural appearance, and that humans have, to some extent, access to detailed interrelations between natural images.**

*by chance* shows the desired manipulation. Although this approach guarantees that the resulting “manipulations” remain natural, it depends heavily on the indexing mechanism used to select the manipulated image: every new feature requires a new indexing mechanism. More importantly, if the database does not contain sufficiently many exemplars for a given feature, a new database is needed. This dependence limits how far the selective sampling approach generalizes to higher levels of visual processing.

*generator*, from an isotropic Gaussian distribution to the space of images. One defining feature of GANs is the use of an auxiliary classification function, often called the *critic*, to judge how good the generator mapping is. Specifically, the critic attempts to predict whether a given image was generated by mapping isotropic Gaussian noise through the generator or is an instance from the training database. Generator and critic are trained in alternation: the generator is trained to increase the errors of the critic, and the critic is trained to decrease its own error (see Goodfellow et al., 2014, for details). In principle, generator and critic can be arbitrary transformations, but they are typically implemented as artificial neural networks with multiple hidden layers (Goodfellow et al., 2014; Radford et al., 2016). Although this has never been studied quantitatively, images generated by GANs look quite similar to natural images, and manipulations in a GAN's latent space seem to correspond in a meaningful way to perceptual experience. For example, Radford et al. (2016) start with a picture of a smiling woman, subtract the average latent representation of a neutral woman's face, and add that of a neutral man's face to arrive at a picture of a smiling man. Similarly, Zhu, Krähenbühl, Shechtman, and Efros (2016) illustrate that projecting perceptually meaningful constraints back to a GAN's latent space allows creation of random images with specified features (e.g., edges or colored patches) in the specified locations. Together, these observations suggest that GANs recover a reasonably good approximation to the manifold of natural images.

*G* that maps a latent vector **z** to image space and a critic network *D* that takes an image as input and predicts whether that image is a real image from the training dataset or an image that was generated by mapping a latent vector through the generator network (see Figure 2 and Gulrajani et al., 2017, for details of the architecture of the two networks). The generator network and the critic network were trained in alternation using stochastic gradient descent. Specifically, training alternated between five updates of the critic network and one update of the generator network. Updates of the critic network were chosen to minimize the loss

**z** or training images **y**, respectively. Furthermore, ∇_{y} denotes the gradient with respect to image pixels **y**, which was evaluated at random points along straight line interpolations between real and generated images (see Gulrajani et al., 2017, for details). We set λ = 10 during training.
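The loss expression itself does not appear in the text above. Assuming the networks were trained exactly as in Gulrajani et al. (2017), which the description of the gradient term and of λ suggests, the critic loss would take the gradient penalty form sketched below. The symbols ŷ (interpolation point) and ε (interpolation weight) are notation borrowed from that formulation and are not defined in this text.

```latex
\mathcal{L}_D
  = \mathbb{E}_{\mathbf{z}}\bigl[ D(G(\mathbf{z})) \bigr]
  - \mathbb{E}_{\mathbf{y}}\bigl[ D(\mathbf{y}) \bigr]
  + \lambda \, \mathbb{E}_{\hat{\mathbf{y}}}\Bigl[ \bigl( \lVert \nabla_{\hat{\mathbf{y}}} D(\hat{\mathbf{y}}) \rVert_2 - 1 \bigr)^2 \Bigr],
\qquad
\hat{\mathbf{y}} = \epsilon \, \mathbf{y} + (1 - \epsilon) \, G(\mathbf{z}),
\quad \epsilon \sim U[0, 1]
```

Under this reading, the first two terms compare critic outputs for generated and real images, and the third term penalizes deviations of the critic's gradient norm from 1 at points between real and generated images.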

*N* in Figure 2) were trained for 200,000 epochs using an Adam optimizer (Kingma & Ba, 2015) with learning rate 10^{−4} and *β*_{0} = 0, *β*_{1} = 0.9. Specifically, we trained networks with *N* = 40, 50, 60, 64, 70, 80, 90, and 128 (see Figure 2). Wasserstein-2 error (Arjovsky et al., 2017) on a validation set (the CIFAR10 test dataset) was lowest with *N* = 90, in agreement with visual inspection of sample quality, so we chose the network with *N* = 90 for all remaining analyses. Example images generated from this final network are shown in Figure 1B.

*r* = 0.82, *p* < 10^{−60}), but they are not exactly the same. Therefore, we constructed latent noise by manipulating the latent vector **z** from which an image was generated. To generate perturbed images with a predefined difference in pixel space, we started by adding independent Gaussian noise **ζ** to **z** and determining the corresponding image *G*(**z** + **ζ**). We then used gradient descent on **ζ** such that the final difference between the target and the perturbed target had a predefined pixel space difference of *δ*.
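The search for a latent perturbation with a prescribed pixel space difference can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the squared-error objective, the use of the root-mean-square difference, the finite-difference gradient, the step size, and the linear `toy_generator` are all hypothetical stand-ins for the trained network and its exact optimization settings.

```python
import math


def rms_diff(a, b):
    """Root-mean-square difference between two images given as flat lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def match_pixel_difference(G, z, zeta, delta, lr=0.5, n_steps=200, h=1e-5):
    """Gradient descent on zeta so that rms_diff(G(z + zeta), G(z)) approaches delta.

    G is any callable mapping a latent vector (list) to an image (list).
    The objective (rms - delta)**2 is an assumption made for this sketch.
    """
    target = G(z)
    zeta = list(zeta)

    def loss(zt):
        perturbed = G([zi + ti for zi, ti in zip(z, zt)])
        return (rms_diff(perturbed, target) - delta) ** 2

    for _ in range(n_steps):
        grad = []
        for k in range(len(zeta)):
            # central finite difference along latent dimension k
            up = zeta[:k] + [zeta[k] + h] + zeta[k + 1:]
            down = zeta[:k] + [zeta[k] - h] + zeta[k + 1:]
            grad.append((loss(up) - loss(down)) / (2 * h))
        zeta = [t - lr * g for t, g in zip(zeta, grad)]
    return zeta


def toy_generator(z):
    """Hypothetical linear stand-in for G: 2 latent dimensions to 3 'pixels'."""
    return [z[0], z[1], z[0] + z[1]]


z = [0.3, -0.1]
zeta = match_pixel_difference(toy_generator, z, [0.5, -0.2], delta=0.1)
perturbed = toy_generator([zi + ti for zi, ti in zip(z, zeta)])
achieved = rms_diff(perturbed, toy_generator(z))
```

With a real generator network, the gradient would normally come from automatic differentiation rather than finite differences; the loop structure stays the same.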

^{2}) on a Sony Trinitron Multiscan G520 CRT monitor in a dimly illuminated room. The monitor was carefully linearized using a Minolta LS-100 photometer (Konica Minolta, Ramsey, NJ). Maximum stimulus luminance was 106.9 cd/m^{2}, and minimum stimulus luminance was 1.39 cd/m^{2}. If the nominal stimulus luminance exceeded that range, it was clipped (for subsequent analyses, we also used the clipped stimuli). On every frame, the stimuli were rerendered using random dithering to generate a quasi-continuous luminance resolution (Allard & Faubert, 2008). At a viewing distance of approximately 87 cm, each stimulus image subtended approximately 0.65° of visual angle, and the images were separated by approximately 0.13° of visual angle. One pixel subtended approximately 0.02° of visual angle.
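As a quick check on the viewing geometry, the visual angles quoted above can be converted to physical sizes on the screen. The sketch below uses the standard relation between size, distance, and visual angle; the function names are chosen for illustration only.

```python
import math


def size_from_angle(angle_deg, distance_cm):
    """Physical extent (cm) that subtends angle_deg at the given viewing distance."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def angle_from_size(size_cm, distance_cm):
    """Visual angle (degrees) subtended by an extent of size_cm at the given distance."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


image_cm = size_from_angle(0.65, 87.0)  # roughly 0.99 cm on screen
pixel_cm = size_from_angle(0.02, 87.0)  # roughly 0.03 cm per pixel
```

The ratio of the two rounded angles, 0.65° / 0.02° ≈ 32, is consistent with images of about 32 pixels across, the size of CIFAR10 images.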

*γ* = 0.5 is the probability to guess the stimulus correctly by chance, *λ* is the lapse probability, *σ* is the logistic function, and *a* and *b* govern the offset and the slope of the psychometric function. Here, *x* is the root-mean-square level of the noise applied to perturb the respective images, in dB relative to the screen's background luminance. We note, however, that different ways of scaling the noise (other than dB) did not impact our main results. We adopted a Bayesian perspective on estimation of the psychometric function (Fründ, Haenel, & Wichmann, 2011) and used weak priors *λ* ∼ Beta(1.5, 20), *a* ∼ *N*(0, 100), *b* ∼ *N*(0, 100), where *a* and *b* are expressed on the dB scale of the noise. Mean a posteriori estimates of the critical noise level *x*_{c} at which *ψ*(*x*_{c}) = 0.75 and of the slope of the psychometric function at *x*_{c} were obtained using numerical integration of the posterior (Schütt, Harmeling, Macke, & Wichmann, 2016).
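To make the roles of the parameters concrete, the sketch below evaluates a psychometric function of the commonly used form ψ(x) = γ + (1 − γ − λ) σ(a + b x) and solves it for the critical noise level. The exact parametrization of the argument of σ is an assumption, since the equation is not reproduced here, and the parameter values in the example are made up for illustration.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def psi(x, a, b, lapse, gamma=0.5):
    """Probability correct at noise level x (assumed form, see lead-in)."""
    return gamma + (1.0 - gamma - lapse) * logistic(a + b * x)


def critical_level(a, b, lapse, gamma=0.5, target=0.75):
    """Noise level x_c at which psi(x_c) equals the target performance."""
    q = (target - gamma) / (1.0 - gamma - lapse)  # required value of the logistic
    return (math.log(q / (1.0 - q)) - a) / b


def slope_at(x, a, b, lapse, gamma=0.5):
    """Derivative of psi with respect to x, in proportion correct per dB."""
    s = logistic(a + b * x)
    return (1.0 - gamma - lapse) * b * s * (1.0 - s)


# Hypothetical parameter values: performance falls as noise increases (b < 0).
a, b, lapse = 2.0, -0.4, 0.02
x_c = critical_level(a, b, lapse)
slope_c = slope_at(x_c, a, b, lapse)
```

In a Bayesian treatment as described above, `x_c` and `slope_c` would be averaged over the posterior of (*a*, *b*, *λ*) rather than computed from a single parameter triple.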

**t** denote the noise-free target stimulus and *d* is a suitably defined distance measure. We used either the Euclidean distance, the radial distance, or the cosine distance between two vectors **x** and **y**. These distances were applied either in the GAN's latent space or directly in pixel space, after concatenating the respective stimulus' pixel intensities into one long vector. We then determined receiver operating characteristic (ROC) curves for predicting correct versus incorrect responses based on *c*. The area under the ROC curve is a measure of how well the respective distance measure predicts the observer's trial-by-trial responses (Green & Swets, 1966). To test whether the area under the curve (AUC) was significantly different from chance, we performed a permutation test, randomly reshuffling the correct/incorrect labels 1,000 times and taking the 95th percentile of the resulting distribution as the critical value. We also used permutation tests to determine whether the AUCs for two different distance measures were significantly different. For these pairwise comparisons, there are 128 possible reassignments of AUC values to the two conditions, and we computed all of them. The *p* values for these post hoc comparisons were corrected for multiple comparisons to control the false discovery rate (Benjamini & Hochberg, 1995).
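The statistical steps in this paragraph can be sketched in plain Python. This is an illustrative reading rather than the analysis code of the study: the AUC is computed from pairwise comparisons, the 128 reassignments are interpreted as the 2^7 ways of swapping the two conditions within each of seven observers, and the test statistic (the absolute summed difference) is an assumption.

```python
import itertools
import random


def auc(scores, labels):
    """Area under the ROC curve: probability that a randomly drawn positive
    trial has a higher score than a randomly drawn negative one (ties = 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def null_critical_value(scores, labels, n_perm=1000, quantile=0.95, seed=0):
    """Shuffle the labels n_perm times; return the requested quantile of the
    resulting null distribution of AUC values."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(auc(scores, shuffled))
    null.sort()
    return null[int(quantile * (n_perm - 1))]


def sign_flip_p_value(differences):
    """Exhaustive paired permutation test. Swapping the two conditions for one
    observer flips the sign of that observer's AUC difference."""
    observed = abs(sum(differences))
    n_total = 0
    n_extreme = 0
    for signs in itertools.product((1, -1), repeat=len(differences)):
        n_total += 1
        stat = abs(sum(s * d for s, d in zip(signs, differences)))
        n_extreme += stat >= observed - 1e-12
    return n_extreme / n_total, n_total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With seven observers whose differences all share one sign, only the observed assignment and its mirror image are as extreme, which gives the smallest attainable two-sided *p* value of 2/128.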

*c* between luminance histograms of the respective images. Distance measures were computed over vectors of length 50, in which each entry denotes a bin count. Second, to determine the local dominant orientation at each pixel, we first filtered the image with horizontal and vertical Scharr filters (Scharr, 2000) as implemented in scikit-image (van der Walt et al., 2014), giving local horizontal structure *h* and vertical structure *v*. The local orientation *ϕ* was extracted from these two responses such that *h* = *r* cos(*ϕ*) and *v* = *r* sin(*ϕ*), where

*c* as the distance difference between these orientation histograms. As a third feature, we calculated the edge densities of the two images by using the Canny edge detector from scikit-image with a standard deviation of 2 pixels and calculating the fraction of pixels labeled as edges by this algorithm. As a fourth feature, we determined the slope of the power spectrum in double logarithmic coordinates.
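The first two features can be illustrated without external dependencies. The sketch below is a plain Python stand-in, not the scikit-image code used in the study: the Scharr kernels are written out by hand, the normalization by 32 and the assignment of the two kernels to *h* and *v* are assumptions, and images are nested lists of intensities.

```python
import math

# Hand-written 3x3 Scharr kernels (assumed orientation convention).
SCHARR_H = [[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]]   # change along columns
SCHARR_V = [[-3, -10, -3], [0, 0, 0], [3, 10, 3]]   # change along rows


def filter_at(img, kernel, i, j):
    """Correlate the 3x3 kernel with the neighborhood centered on pixel (i, j)."""
    total = sum(kernel[di][dj] * img[i - 1 + di][j - 1 + dj]
                for di in range(3) for dj in range(3))
    return total / 32.0


def local_orientation(img):
    """Return (r, phi) for every interior pixel, with h = r cos(phi), v = r sin(phi)."""
    result = []
    for i in range(1, len(img) - 1):
        for j in range(1, len(img[0]) - 1):
            h = filter_at(img, SCHARR_H, i, j)
            v = filter_at(img, SCHARR_V, i, j)
            result.append((math.hypot(h, v), math.atan2(v, h)))
    return result


def histogram(values, n_bins, lo, hi):
    """Bin counts over [lo, hi]; values equal to hi fall into the last bin."""
    counts = [0] * n_bins
    for value in values:
        k = min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[k] += 1
    return counts


# A vertical step edge: intensity changes only along the columns.
step = [[0, 0, 1, 1] for _ in range(4)]
orientations = local_orientation(step)
luminance_hist = histogram([p for row in step for p in row], 50, 0.0, 1.0)
```

Histograms of this kind, computed for each image, can then be compared with any of the distance measures described earlier.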

**t**, **s** and (*i*, *j*) of pixels *s*_{i} = *s*_{j} implies **s** and *s*_{i} = *s*_{j} and *A* is true and *i*, *j*. If the two segmentations define exactly the same regions (but possibly with different labels), *d*_{segm} will be 0; if the two segmentations are completely different, in the sense that one has only one region (the entire image) and the other assigns each pixel to its own region, then *d*_{segm} will be 1.
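One pair-counting distance that has both limiting cases described above is the fraction of pixel pairs on which the two segmentations disagree about whether the pixels belong to the same region. The sketch below implements that measure as an assumption; the exact definition of *d*_{segm} is not recoverable from this text and may differ in detail.

```python
from itertools import combinations


def segmentation_distance(s, t):
    """Fraction of pixel pairs (i, j) for which 's_i == s_j' and 't_i == t_j'
    disagree. s and t are flat lists of region labels, one per pixel."""
    pairs = list(combinations(range(len(s)), 2))
    disagreements = sum((s[i] == s[j]) != (t[i] == t[j]) for i, j in pairs)
    return disagreements / len(pairs)


same_regions = segmentation_distance([0, 0, 1, 1], [5, 5, 2, 2])
opposite = segmentation_distance([0, 0, 0, 0], [0, 1, 2, 3])
```

Because only label equality within each segmentation is compared, the measure does not depend on which labels the two segmentations happen to use.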

*M* ± *SEM*). It was comparable for Fourier noise (6.77 ± 0.83 dB; paired *t*-test pixel versus Fourier noise, *t*(6) = −1.59, *ns*), and it decreased significantly for latent noise (3.47 ± 0.36 dB; paired *t*-test pixel versus latent noise, *t*(6) = 3.59, *p* = 0.011, and Fourier versus latent noise, *t*(6) = 3.89, *p* = 0.0080, respectively; see Figure 4B). Thus, overall, observers were most affected by noise that was applied approximately within the manifold of natural images by perturbing the GAN's latent representation of the stimulus. We verified that this result also held for every individual observer. We further found that psychometric functions tended to fall off more steeply when noise was applied in the GAN's latent space (average slope at critical noise level for latent noise was −0.066 ± 0.0084/dB; see Figure 4C) than when noise was applied in pixel space (average slope at critical noise level for pixel noise was −0.024 ± 0.0041/dB, for Fourier noise −0.017 ± 0.0027/dB), replicating the observations from Figure 4A.

*SEM*; pixel space, 0.82 ± 0.014; permutation test, *p* = 0.17).

*first* take the standard deviation of the target and flanker images and then compare differences in standard deviations. In latent space, the norm of the latent vector seems to be related to contrast as well, but the relationship is more complex. Radial distance received considerably lower AUC than Euclidean distance in both latent and pixel space and was a much less reliable predictor of trial-by-trial performance. Radial distance in latent space was significantly less predictive than radial distance in pixel space (latent space, 0.57 ± 0.022; pixel space, 0.68 ± 0.016; permutation test, *p* < 0.05 corrected), and for four out of seven observers, radial distance in latent space did not predict trial-by-trial choices significantly better than chance.^{1} Cosine distance was a better predictor when applied in latent space than when applied in pixel space (latent space, 0.82 ± 0.012; pixel space, 0.78 ± 0.019; permutation test, *p* = 0.05 corrected).
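For reference, the three distance measures compared in this section can be written compactly. The Euclidean and cosine distances follow their standard definitions; the radial distance is taken here to be the absolute difference between the two vector norms, which is an assumption based on the discussion of vector norm and contrast above.

```python
import math


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def euclidean_distance(x, y):
    return norm([a - b for a, b in zip(x, y)])


def radial_distance(x, y):
    """Assumed definition: difference in vector length, ignoring direction."""
    return abs(norm(x) - norm(y))


def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y, ignoring length."""
    return 1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))
```

Two vectors that point in the same direction but differ in length have zero cosine distance and nonzero radial distance, whereas two orthogonal vectors of equal length have zero radial distance and a cosine distance of 1. The two measures therefore isolate complementary aspects of the Euclidean distance.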

*p* < 0.05 corrected). In fact, differences in segmentation were about as predictive of trial-by-trial behavior as cosine distance in latent space (permutation test, *p* = 0.60). This suggests that distortions of the images' mid-level structure might indeed be responsible for the decline in image-matching performance when noise was constrained to stay within the recovered manifold of natural images by applying it in the GAN's latent space.

*α* of 15°, 30°, 60°, or 90° after the first 30 frames. In the following, we refer to this angle as the *path angle*. At a frame rate of 60 Hz, each video had a duration of 1 s, and if the video contained a turn in latent space, that turn happened after 500 ms. Otherwise, the setup for Experiment 2 was the same as in Experiment 1.
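A path with a turn of a given path angle can be constructed as in the sketch below. How the original paths were generated is not fully described in this text, so the construction is an assumption: the path moves in equal steps along a unit direction for the first 30 frames and then continues along a direction rotated by *α* within a randomly chosen plane. The step size, the latent dimensionality, and all function names are illustrative.

```python
import math
import random


def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def orthogonal_unit(u, rng):
    """Random unit vector orthogonal to the unit vector u (Gram-Schmidt step)."""
    w = [rng.gauss(0.0, 1.0) for _ in u]
    proj = sum(a * b for a, b in zip(w, u))
    return unit([a - proj * b for a, b in zip(w, u)])


def latent_path(z0, direction, alpha_deg, n_frames=60, turn_after=30,
                step=0.1, seed=0):
    """Latent vectors for n_frames video frames; the direction of travel
    rotates by alpha_deg once turn_after frames have been produced."""
    rng = random.Random(seed)
    u = unit(direction)
    v = orthogonal_unit(u, rng)
    alpha = math.radians(alpha_deg)
    turned = [math.cos(alpha) * a + math.sin(alpha) * b for a, b in zip(u, v)]
    path = [list(z0)]
    for frame in range(1, n_frames):
        d = u if frame < turn_after else turned
        path.append([p + step * di for p, di in zip(path[-1], d)])
    return path


# 60 frames at 60 Hz give a 1 s video; the turn falls at 500 ms.
path = latent_path([0.0] * 8, [1.0] + [0.0] * 7, alpha_deg=60)
```

Each latent vector in `path` would then be mapped through the generator to obtain one video frame; a path angle of 0° yields a straight path.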

*f* power spectrum by multiplying their Fourier transform with 1/(0.1 + *f*). The exact size of the root-mean-square difference between successive video frames varied somewhat from trial to trial, but there were no statistically significant differences between frame-by-frame differences in pixel space, in Fourier space, or in latent space (*p* > 0.1). Animated examples are available as supplementary material (Supplementary Files S1–S18).

*d*′ at each path angle. For single observers, confidence intervals for *d*′ were determined by bootstrap with 1,000 samples.
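A bootstrap confidence interval for *d*′ can be sketched as follows. The sketch assumes a yes/no design in which observers report on each trial whether the video contained a turn; the trial structure, the log-linear correction for extreme rates, and the percentile interval are assumptions that are not stated in this text.

```python
import random
from statistics import NormalDist


def d_prime(hits, n_signal, false_alarms, n_noise):
    """Sensitivity index z(hit rate) - z(false alarm rate). Adding 0.5 to the
    counts keeps the rates away from 0 and 1, where z would be infinite."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (n_signal + 1)
    fa_rate = (false_alarms + 0.5) / (n_noise + 1)
    return z(hit_rate) - z(fa_rate)


def bootstrap_ci(signal_responses, noise_responses, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for d'. Responses are lists of 0/1 values,
    where 1 means the observer reported a turn."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        s = rng.choices(signal_responses, k=len(signal_responses))
        n = rng.choices(noise_responses, k=len(noise_responses))
        estimates.append(d_prime(sum(s), len(s), sum(n), len(n)))
    estimates.sort()
    lower = estimates[int((1.0 - level) / 2.0 * (n_boot - 1))]
    upper = estimates[int((1.0 + level) / 2.0 * (n_boot - 1))]
    return lower, upper


# Hypothetical observer: 80 hits in 100 turn trials, 20 false alarms in 100 straight trials.
turn_trials = [1] * 80 + [0] * 20
straight_trials = [1] * 20 + [0] * 80
estimate = d_prime(80, 100, 20, 100)
interval = bootstrap_ci(turn_trials, straight_trials)
```

Resampling the turn and no-turn trials separately keeps the number of trials of each type fixed across bootstrap samples.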

*d*′ = 2.23 ± 0.59). However, even for path angles of 15°, the smallest path angle tested, three out of five observers performed above chance (average *d*′ = 0.20 ± 0.090; one-sided *t*-test against zero, *t*(4) = 2.25, *p* = 0.0436). This indicates that observers detected even small changes in latent space direction.

*M* ± *SEM*) at the largest turn.

*d*′ = −0.16 ± 0.29, *M* ± *SEM*; Figure 7) and were moderately sensitive to turns in Fourier space paths (average *d*′ = 1.09 ± 0.24). However, sensitivity to turns in paths through the GAN's latent space was highest (average *d*′ = 1.89 ± 0.20). The pairwise comparisons between performance for pixel space versus Fourier space paths (*p* < 0.05, permutation test) and for Fourier space versus latent space paths (*p* < 0.05, permutation test) were both statistically significant, with the observed configuration having the largest difference of all permutations. This confirms that sensitivity to directions in latent space is a specific property of the representation recovered by generative adversarial networks.

*Journal of Vision*, 14 (8): 22, 1–38, https://doi.org/10.1167/14.8.22.

*Behavior Research Methods*, 40 (3), 735–743.

*Wasserstein GAN*. https://arXiv:1701.07875.

*Journal of the Royal Statistical Society: Series B (Methodological)*, 57 (1), 289–300.

*Proceedings of SPIE, Human Vision and Electronic Imaging XII: Vol. 6492* (p. 64920A). Bellingham, WA: SPIE.

*Journal of Vision*, 10 (2): 23, 1–15, https://doi.org/10.1167/10.2.23.

*International Journal of Computer Vision*, 59 (2), 167–181.

*Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete*, 57 (4), 453–476.

*Nature Neuroscience*, 14, 1195–1201.

*PLoS One*, 3 (2): e1675.

*Journal of Vision*, 13 (9), 119–119.

*Journal of Vision*, 11 (6): 16, 1–19. https://doi.org/10.1167/11.6.16.

*Advances in neural information processing systems 28* (pp. 262–270). Red Hook, NY: Curran Associates, Inc.

*Annual Review of Psychology*, 59, 167–192.

*PLoS Computational Biology*, 9 (1), e1002873, https://doi.org/10.1371/journal.pcbi.1002873.

*Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Journal of Machine Learning Research: Vol. 15* (pp. 315–323). Fort Lauderdale, FL: PMLR.

*Advances in neural information processing systems 27* (pp. 315–323). Red Hook, NY: Curran Associates.

*Signal detection theory and psychophysics*. New York, NY: Wiley.

*Improved training of Wasserstein GANs*. https://arXiv:1704.00028.

*Proceedings of the 2015 IEEE International Conference on Computer Vision* (pp. 1026–1034). Washington, DC: IEEE Computer Society.

*International Conference on Learning Representations*. https://arXiv:1702.08431v4.

*Neuron*, 76, 1210–1224.

*Proceedings of the 32nd International Conference on Machine Learning, Journal of Machine Learning Research: Vol. 37* (pp. 448–456).

*Vision Research*, 46, 2535–2545.

*PLoS Computational Biology*, 10 (11), e1003915.

*International Conference on Learning Representations*. https://arXiv:1412.6980v9.

*Learning multiple layers of features from tiny images* [Technical Report]. Toronto, Canada: University of Toronto.

*Vision Research*, 46, 3098–3104.

*International Conference on Learning Representations*. https://arXiv:1802.05957v1.

*International Journal of Computer Vision*, 40 (1), 49–71.

*International Conference on Learning Representations*. https://arXiv:1511.06434v2.

*Advances in Neural Information Processing Systems 30* (pp. 2018–2028). Red Hook, NY: Curran Associates.

*International Journal of Computer Vision*, 115 (3), 211–252.

*Optimale Operatoren in der digitalen Bildverarbeitung* (Unpublished doctoral dissertation). IWR, Fakultät für Physik und Astronomie, University of Heidelberg, Heidelberg, Germany.

*Vision Research*, 122, 105–123.

*Multivariate density estimation: Theory, practice, and visualization*. New York: Wiley.

*Proceedings of the National Academy of Sciences, USA*, 114, E5731–E5740.

*Annual Review of Neuroscience*, 24, 1193–1216.

*Frontiers in Psychology*, 4: 455, 1–9.

*Nature*, 381, 520–522.

*PeerJ*, 2, e453.

*Journal of Vision*, 16 (2): 4, 1–30. https://doi.org/10.1167/16.2.4.

*Journal of Vision*, 12 (7): 6, 1–19. https://doi.org/10.1167/12.7.6.

*Journal of Vision*, 17 (12): 5, 1–29. https://doi.org/10.1167/17.12.5.

*Vision Research*, 46, 1520–1529.

*Journal of Vision*, 10 (4): 6, 1–27. https://doi.org/10.1167/10.4.6.

*Computer Vision—ECCV 2016. Lecture Notes in Computer Science, Vol. 9909*. New York, NY: Springer.

*d*_{segm} across all trials.

*t*(6) = −3.86, *p* < 0.05 corrected, but not for Euclidean distance, *t*(6) = −2.46, *p* = 0.048 uncorrected.