Open Access
Article  |   March 2016
Testing models of peripheral encoding using metamerism in an oddity paradigm
Author Affiliations
  • Thomas S. A. Wallis
    Werner Reichardt Center for Integrative Neuroscience, Eberhard Karls Universität Tübingen, Tübingen, Germany
    Neural Information Processing Group, Faculty of Science, Eberhard Karls Universität Tübingen, Tübingen, Germany
    thomas.wallis@uni-tuebingen.de
    www.tomwallis.info
  • Matthias Bethge
    Werner Reichardt Center for Integrative Neuroscience, Eberhard Karls Universität Tübingen, Tübingen, Germany
    Bernstein Center for Computational Neuroscience, Tübingen, Germany
    Institute for Theoretical Physics, Eberhard Karls Universität Tübingen, Tübingen, Germany
    Max Planck Institute for Biological Cybernetics, Tübingen, Germany
    matthias@bethgelab.org
    www.bethgelab.org
  • Felix A. Wichmann
    Neural Information Processing Group, Faculty of Science, Eberhard Karls Universität Tübingen, Tübingen, Germany
    Bernstein Center for Computational Neuroscience, Tübingen, Germany
    Max Planck Institute for Intelligent Systems, Empirical Inference Department, Tübingen, Germany
    felix.wichmann@uni-tuebingen.de
    http://www.nip.uni-tuebingen.de/home.html
Journal of Vision March 2016, Vol.16, 4. doi:10.1167/16.2.4
Thomas S. A. Wallis, Matthias Bethge, Felix A. Wichmann; Testing models of peripheral encoding using metamerism in an oddity paradigm. Journal of Vision 2016;16(2):4. doi: 10.1167/16.2.4.
© ARVO (1962-2015); The Authors (2016-present)
Abstract

Most of the visual field is peripheral, and the periphery encodes visual input with less fidelity than the fovea. What information is encoded, and what is lost, in the visual periphery? A systematic way to answer this question is to determine how sensitive the visual system is to different kinds of lossy image changes compared to the unmodified natural scene. If modified images are indiscriminable from the original scene, then the information discarded by the modification is not important for perception under the experimental conditions used. We measured the detectability of modifications of natural image structure using a temporal three-alternative oddity task, in which observers compared modified images to original natural scenes. We consider two lossy image transformations: Gaussian blur and Portilla and Simoncelli texture synthesis. Although our paradigm demonstrates metamerism (physically different images that appear the same) under some conditions, in general we find that humans can be remarkably sensitive to deviations from natural appearance. The representations we examine here do not preserve all the information necessary to match the appearance of natural scenes in the periphery.

Introduction
As retinal eccentricity increases, the human visual system becomes less sensitive to a host of information, including spatial contrast (Baldwin, Meese, & Baker, 2012; Kelly, 1984; Rovamo, Virsu, & Näsänen, 1978), vernier acuity (De Valois & De Valois, 1990), and spatial distortions (Bex, 2010). The perception of image structure is limited by the optics of the eye (Artal, 2014) and by visual crowding (Andriessen & Bouma, 1975; Bouma, 1970; Pelli & Tillman, 2008; Toet & Levi, 1992). These sensitivity losses are mirrored by density profiles of the retina (Curcio & Allen, 1990; Curcio, Sloan, Kalina, & Hendrickson, 2004; Watson, 2014) and the number of cortical neurons sensitive to different retinotopic locations (Cowey & Rolls, 1974; J. Freeman & Simoncelli, 2011; Gattass, Sousa, & Gross, 1988; Motter, 2009). The periphery can therefore usefully be thought of as a compressed or lossy representation relative to the fovea (Balas, Nakano, & Rosenholtz, 2009; Geisler & Perry, 1998; Hocke, Dorr, & Barth, 2012; Perry & Geisler, 2002; Rosenholtz, Huang, & Ehinger, 2012) in that information in the input is discarded and no longer available to perception.
What information is lost, and what is retained? Image-based models of visual processing make specific and testable predictions of the type of encoding performed by the human visual system. In an influential study, J. Freeman and Simoncelli (2011) introduced a powerful way to psychophysically test image-based models of visual processing through the creation of metamers: discarding as much information from an image as possible without changing the appearance of model-matched inputs. In this paper, we build on their approach, but we modify their psychophysical paradigm for measuring appearance matching to make it more stringent. Furthermore, we extend their approach to a larger set of arbitrary photographic images (natural scenes) as well as to two image-based models. 
Several studies argue that a significant amount of information can be discarded from images of natural scenes without affecting appearance (Geisler & Perry, 1998; Perry & Geisler, 2002; Watson, Ahumada, & Farrell, 1986). For example, the model of Geisler and Perry (1998; see also Loschky, McConkie, Yang, & Miller, 2005; Peli & Geri, 2001; Peli, Yang, & Goldstein, 1991; Séré, Marendaz, & Hérault, 2000) capitalizes on the blur of the optical transfer function and reduced peripheral sampling by progressively discarding high spatial frequency content as distance from the intended fixation point increases. If the discarded spectral content remains outside the range of the human contrast sensitivity function for all eccentricities, then the blurred and original images should appear identical. That is, the model discards high frequency image content that should be imperceptible to the observer. The resulting images should be metameric for the originals: physically different but perceptually the same. 
A second set of features argued to match the peripheral appearance of scenes are those describing a parametric model of visual texture (Portilla & Simoncelli, 2000). Balas et al. (2009) found that performance decrements in object identification caused by peripheral crowding are correlated with performance decrements in synthetic stimuli generated from the texture model to match those crowding displays, suggesting that the information discarded by the texture model and information lost in crowding may be similar. J. Freeman and Simoncelli (2011) showed that two synthetic images matched under a more complex eccentricity-scaled, foveated extension of the texture model were indiscriminable from each other under certain conditions. The J. Freeman and Simoncelli model can therefore generate metamers when all other potential cues are minimized. J. Freeman and Simoncelli's model of feature combination has been argued to provide a physiologically plausible account of transformations instantiated in V1, V2, and V4 of the primate visual system (J. Freeman & Simoncelli, 2011; J. Freeman, Ziemba, Heeger, Simoncelli, & Movshon, 2013; Movshon & Simoncelli, 2014; Okazawa, Tajima, & Komatsu, 2015).
The logic of metamerism as a criterion for testing models of scene perception (J. Freeman & Simoncelli, 2011) is that if a candidate model captures all the image features important for appearance under some condition, then two images that produce the same model output (i.e., the same values for all features the model encodes) should also appear identical to a human observer (Figure 1). A candidate model can be used to predict which images should be metamers. This logic allows the simultaneous experimental testing of a large number of image-based cues (that is, all cues encoded by a given model), complementing approaches to natural scene perception in which at most a handful of cues are manipulated at once (e.g., Alam, Vilankar, Field, & Chandler, 2014; Bex, 2010; Bex, Mareschal, & Dakin, 2007; Bex, Solomon, & Dakin, 2009; Bradley, Abrams, & Geisler, 2014; Dorr & Bex, 2013; Haun & Peli, 2013; McDonald & Tadmor, 2006; Tadmor & Tolhurst, 1994; Thomson, Foster, & Summers, 2000; To, Gilchrist, Troscianko, & Tolhurst, 2011; Vilankar, Golden, Chandler, & Field, 2014; Vilankar, Vasu, & Chandler, 2011; Wallis & Bex, 2012; Wallis, Dorr, & Bex, 2015; F. Wichmann, Drewes, Rosas, & Gegenfurtner, 2010; F. A. Wichmann, Braun, & Gegenfurtner, 2006). Because a potentially large set of features is available in natural scenes, and humans are very sensitive to many of them (Balas & Conlin, 2015; Emrith, Chantler, Green, Maloney, & Clarke, 2010; Gerhard, Wichmann, & Bethge, 2013; F. A. Wichmann et al., 2006), demonstrating that human observers cannot discriminate modified (but model-matched) images from the original image is a strong test of the degree to which the transformations from retinal input to perception are captured by a candidate model. If images producing matching model responses also look the same, then any image properties not encoded by the model are perceptually unimportant.
Figure 1
 
Model predictions and metamerism. (A) Eight physically different images lie at different points in pixel space (in which each dimension corresponds to the intensity of a pixel; two pixels are shown). (B) When expressed in model space (in which dimensions are model parameters), the images lie on two points; that is, they have identical model responses. The images linked by the yellow curve in pixel space have one model response; the images linked by the blue curve have another. If the model corresponds to perception, then images with identical model responses should appear the same despite being physically different. Two of the images above are natural scene patches, and the rest are synthesized textures generated to match those patches.
Of course, the criterion of behavioral indiscriminability alone is not enough for progress. First, a model that discards no information from the source images will trivially produce matching images that are indiscriminable. Second, showing that two modified images are indiscriminable from one another (as done by J. Freeman & Simoncelli, 2011; J. Freeman et al., 2013) does not necessarily mean that the model features can match natural scene appearance. For example, two white noise samples matched to the mean luminance and contrast of a natural scene may be hard to discriminate from each other but easy to discriminate from the scene. A “blind” model that discarded all visible information would produce images that were metameric for each other but would no doubt fail to match scene appearance.
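The white noise example can be made concrete with a toy sketch. It assumes a deliberately impoverished, hypothetical "model" that encodes only mean luminance and RMS contrast; this is not one of the models tested in the paper, and the random "scene" is only a stand-in for a natural image patch. The function names are illustrative.

```python
import numpy as np

def toy_model_response(img):
    """Hypothetical two-feature model: mean luminance and RMS contrast (SD/mean)."""
    mean = img.mean()
    return np.array([mean, img.std() / mean])

def match_to(sample, target):
    """Rescale `sample` so that its toy-model response equals that of `target`."""
    t_mean, t_contrast = toy_model_response(target)
    z = (sample - sample.mean()) / sample.std()   # zero mean, unit SD
    return z * (t_contrast * t_mean) + t_mean

rng = np.random.default_rng(0)
scene = rng.random((64, 64)) ** 2                 # stand-in for a scene patch
noise_a = match_to(rng.standard_normal((64, 64)), scene)
noise_b = match_to(rng.standard_normal((64, 64)), scene)

# Both noise samples share the scene's model response but differ pixel by pixel.
same_response = np.allclose(toy_model_response(noise_a), toy_model_response(noise_b))
physically_different = not np.allclose(noise_a, noise_b)
```

Under this toy model all three images are predicted to be metamers, yet a human observer would presumably tell the noise from the scene at a glance, which is why comparison against the original image is the more stringent test.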
Therefore, the model-testing goal is to find the most parsimonious set of features that produce appearance matches to arbitrary natural scenes. If two models produce perceptually lossless transformations of arbitrary images, then the model whose representation is more compressed provides the more parsimonious account of the information that remains perceptually important. The larger model must retain some information that could be discarded without affecting appearance. Comparing and refining models in this fashion could be used to design increasingly lean descriptions of the features that are perceptually important for scene appearance.
To rigorously test these criteria, we want to maximize the sensitivity of human observers to differences between real and modified images. The experimental paradigm should ensure that any difference in appearance between the images (caused because some cues not captured by the candidate model are perceptually important) is reflected in behavioral sensitivity. Conversely, we want to minimize the influence of limits imposed by cognitive factors, such as memory or attention, to ensure that images that are behaviorally indiscriminable are so because they appear identical. Finally, a forced-choice paradigm (Blackwell, 1952; Jäkel & Wichmann, 2006) is preferable to a rating method (as used in e.g., To et al., 2011) to measure objective sensitivity. 
In this paper, we use a three-alternative temporal oddity paradigm, in which observers indicate which of three consecutively presented images was different from the other two (the “oddball”). Variants of this paradigm have been applied in similar settings before (e.g., Balas, 2012; Balas, 2006; Tadmor & Tolhurst, 1994; Thomson et al., 2000; Tolhurst & Tadmor, 1997). Unlike a two-alternative forced choice (2AFC), an oddity task allows the combination of several target categories (e.g., original vs. modified image or two synthesized examples against one another) using the same instruction set. An oddity task has a minimum of three response alternatives. Because chance performance in this oddity task is 33% rather than 50% (as in 2AFC or ABX, for example), the design has more power to detect deviations from chance (because 0.5 is the proportion with highest binomial variability, see e.g., Jäkel & Wichmann, 2006).¹ Although we are ultimately interested in instantaneous spatial appearance matches (as for metamerism in color vision), we chose a temporal oddity task because it is difficult to spatially compare three or more alternatives while maintaining the surrounding scene context (see below) due to screen size.
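The remark about binomial variability can be checked with a short calculation. The sketch below compares the standard error of an observed proportion at the two chance levels; the trial count of 100 is an illustrative assumption, and the comparison addresses only sampling variability at chance, not how sensitivity maps onto proportion correct in each task.

```python
import math

def binomial_se(p, n):
    """Standard error of an observed proportion with true probability p over n trials."""
    return math.sqrt(p * (1 - p) / n)

n_trials = 100                                # illustrative trial count (assumption)
se_oddity = binomial_se(1 / 3, n_trials)      # chance level in three-alternative oddity
se_2afc = binomial_se(1 / 2, n_trials)        # chance level in 2AFC or ABX

print(f"SE at chance, oddity: {se_oddity:.4f}")
print(f"SE at chance, 2AFC:   {se_2afc:.4f}")
```

Because p(1 − p) peaks at p = 0.5, the sampling noise around chance is smaller in the oddity task for the same number of trials.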
In this paper, we apply the approach outlined above to two of the models discussed. We refer to the model of Geisler and Perry (1998; Perry & Geisler, 2002) throughout the paper as the blur model and the Portilla and Simoncelli (2000) texture synthesis features as the texture synth model. Note that the texture synth model scrambles the content within a local region using Portilla and Simoncelli's algorithm without the additional constraints of the overlapping pooling regions as imposed by J. Freeman and Simoncelli (2011), a point we return to in the General discussion. To preview our results, we find that human observers are highly sensitive to differences between original and modified images. 
Experiment 1: Image blur
In Experiment 1, we quantify the amount of image blur required for observers to discriminate a peripherally presented image patch from an unmodified original image. We apply Gaussian blur to the whole image patch, meaning that our stimuli are only an approximation to the Perry and Geisler (2002) model (which contains a gradient of blur as retinal eccentricity increases). Nevertheless, we believe this is a reasonable approximation for the eccentricity and patch sizes we employ here.²
The discriminability of image patches depends on their spatial extent. To examine the influence of stimulus size, in this experiment, we tested two sizes of target image patches. Furthermore, the context in which a target patch is presented is likely to affect discriminability, either negatively via contrast gain control (Bex et al., 2007; Geisler & Albrecht, 1992; Heeger, 1992) and crowding (Levi, 2008; Wallis & Bex, 2012) or positively through mechanisms such as contour facilitation or contextual guidance (Larson & Loschky, 2009; Neri, 2011, 2014; Oliva & Torralba, 2006). To examine the effect of scene context, we measured the discriminability of patches with and without surrounding structure included. 
Methods
All stimuli, data, and code to reproduce the figures and statistics reported in this paper are provided online (Raw data and stimuli at http://doi.org/10.5281/zenodo.32784, code at http://doi.org/10.5281/zenodo.34218). 
Observers
Three observers (one female) participated in the experiment (ages 23, 25, and 32). S1 is an author; other observers were recruited from our lab group. Throughout this paper, we maintain the same anonymous code for each individual (i.e., S1 is the same observer in all experiments). The numbers are not consecutive due to participation in related experiments not reported in this paper. Observers had normal or corrected-to-normal acuity. Observers provided written consent to participate, and all procedures conformed to the Declaration of Helsinki.
Apparatus
Stimuli were displayed on a VIEWPixx LCD (VPIXX Technologies; spatial resolution 1920 × 1200 pixels, temporal resolution 120 Hz). Outside the stimulus image, the monitor was set to mean gray. Observers viewed the display from 60 cm (maintained via a chin rest) in a darkened chamber. At this distance, pixels subtended approximately 0.023° on average (43 pixels/° of visual angle). The monitor was carefully linearized (maximum luminance 212 cd/m²) using a Gamma Scientific S470 Optometer. Stimulus presentation and data collection were controlled via a desktop computer (12 core i7 CPU, AMD HD7970 graphics card) running Kubuntu Linux (14.04 LTS), using the Psychtoolbox Library (Kleiner, Brainard, & Pelli, 2007, version 3.0.12) and our internal iShow library (http://dx.doi.org/10.5281/zenodo.34217) under MATLAB (The Mathworks, Inc., R2013B). Gaze position was monitored via an Eyelink 1000 desk-mounted video eye tracker. Following a five-point calibration and validation routine, the left eye position was sampled at 250 Hz.
Stimuli
We cropped three adjacent square image patches from source images (see Figure 2). The middle patch was always cropped from the center of the 768 × 768 pixel image and varied in size. Middle image patches were square crops with side length 32 or 256 pixels (blue circles in Figure 2A and B, respectively). The other two patches abutted the middle patch on the left (inner) and right (outer) side respectively and were always 256 × 256 pixels.
Figure 2
 
Experiment 1 stimuli and task. (A) Three patches were cropped from a source image. The middle patch (blue circle) was always the target to be discriminated. This image depicts a target subtending 0.74° in diameter. The surround patches (inner and outer; orange circles) abutted the target patch and always subtended ≈6°. The middle patch was always centered at 10° to the right of the fixation spot. The angle between the patches and the fixation spot was randomized in each interval to make motion cues uninformative (white dashed arrows; see text). (B) Illustration of the second target size condition (subtending ≈6°). (C) Different blur kernels (standard deviation in pixels) applied to the target patch from (B). (D) Procedure. Gray rectangle depicts the monitor; central white point is the fixation spot. The stimulus was presented with the middle (target) patch centered 10° in the right visual field. One interval contained a physically different target patch to the other two; observers indicated which interval contained the oddball. In this example, the blurred image is the oddball and occurs in interval 3. (E) Depiction of the surround condition. Only the right half of the monitor and the three stimulus intervals are shown to improve visibility. The surround patches are the same image in each interval. In this example, the oddball image is the unmodified patch and occurs in the second interval.
Images were sourced from the MIT 1003 database (Judd, Durand, & Torralba, 2012; Judd, Ehinger, Durand, & Torralba, 2009; see http://people.csail.mit.edu/tjudd/WherePeopleLook/index.html), which consists of 1,003 JPEG images compiled from Flickr Creative Commons and from the LabelMe database (Russell, Torralba, Murphy, & Freeman, 2008), and which contains 779 landscapes and 228 portraits as well as several nonnatural images (visual search displays). These original images were converted to gray scale using scikit-image's rgb2gray function (van der Walt et al., 2014) and standardized to have a mean intensity of 0.5 (on a [0, 1] scale) and an RMS contrast (SD/mean) of 0.3.³ Note that these values refer to the global mean and contrast of the larger source image, not the local mean and contrast of the resulting patches. After discarding images that had at least one dimension smaller than 768 pixels (n = 362), we pseudorandomly⁴ selected 100 different images for each of the two middle patch sizes.
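The standardization step can be sketched as follows. This is a minimal illustration, assuming the image has already been converted to a floating-point grayscale array; the function name is hypothetical, and the sketch does not address how values falling outside [0, 1] after rescaling were handled, which is not stated in this section.

```python
import numpy as np

def standardize(gray, mean=0.5, rms_contrast=0.3):
    """Rescale a grayscale image to a target global mean and RMS contrast (SD/mean)."""
    z = (gray - gray.mean()) / gray.std()    # zero mean, unit SD
    return z * (rms_contrast * mean) + mean  # target SD is contrast * mean

rng = np.random.default_rng(0)
source = rng.random((768, 768))              # stand-in for a grayscale source image
standardized = standardize(source)
achieved_mean = standardized.mean()
achieved_contrast = standardized.std() / standardized.mean()
```

Because the statistics are set on the whole source image, a patch cropped from it will generally have a different local mean and contrast, as the text notes.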
For each unique image, we created blurred middle patches using the gaussian_filter function from scikit-image (Figure 2C), by blurring the 768 × 768 source image for each of eight blur levels (pixel SDs of 0.5, 1, 2, 3, 5, 8, 12, and 16) and then cropping the middle patch from the center of this larger blurred image as for the original patch above. Blurring and then cropping ensures that border effects resulting from the convolution do not affect the middle patch. Gaussian blurring produces attenuation of high spatial frequencies as in the Geisler and Perry (1998) model and avoids the noticeable ringing artifacts produced by ideal low-pass filters. 
To ensure that observers could not perform the discrimination based on a mean luminance change, we renormalized the mean pixel intensity of the blurred middle patch to the mean pixel intensity of the unmodified middle patch. Contrast was not renormalized because blurring can be understood as reducing the contrast of the image preferentially for higher spatial frequencies. 
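The blur, crop, and mean-renormalization sequence described above can be sketched in a few lines. To stay self-contained, this sketch uses a NumPy-only separable Gaussian with reflected borders rather than the scikit-image function named in the text, and it assumes the mean was matched by an additive shift; both choices, and all function names, are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(sigma, truncate=4.0):
    """Normalized 1-D Gaussian kernel, truncated at `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur with reflected borders; output has the input's shape."""
    k = gaussian_kernel(sigma)
    padded = np.pad(img, len(k) // 2, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def center_crop(img, size):
    """Square crop of side `size` taken from the center of the image."""
    r0 = (img.shape[0] - size) // 2
    c0 = (img.shape[1] - size) // 2
    return img[r0:r0 + size, c0:c0 + size]

def blurred_middle_patch(source, sigma, size):
    """Blur the whole source, crop the middle patch, then match the original patch mean."""
    original = center_crop(source, size)
    patch = center_crop(blur(source, sigma), size)
    return patch + (original.mean() - patch.mean())  # additive shift (assumption)
```

Blurring before cropping keeps the padded border far from the middle patch, so the choice of border handling has little influence on the stimulus.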
Each of the resulting 2,200 patches (200 unique images, each with an unmodified middle, inner, and outer patch, plus 1,600 blurred middle patches) was then windowed in a circular aperture with a cosine profile, smoothly blending the contrast of the image patch from its unaltered values down to zero (background gray) in the space of 10 pixels and then saved to an 8-bit .png file. 
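A circular aperture with a cosine edge might look like the sketch below. The 10-pixel ramp and the gray background of 0.5 come from the text; placing the ramp inside the outer 10 pixels of the patch radius, and the function names, are assumptions.

```python
import numpy as np

def cosine_aperture(size, ramp=10):
    """Circular window: 1 in the center, falling to 0 over `ramp` pixels at the rim."""
    c = (size - 1) / 2
    y, x = np.mgrid[:size, :size]
    r = np.hypot(x - c, y - c)                    # distance from the patch center
    t = np.clip((r - (size / 2 - ramp)) / ramp, 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * t))        # raised-cosine falloff

def apply_window(patch, background=0.5, ramp=10):
    """Blend the patch contrast down to the background gray at the aperture edge."""
    w = cosine_aperture(patch.shape[0], ramp)
    return background + w * (patch - background)

rng = np.random.default_rng(0)
windowed = apply_window(rng.random((256, 256)))
```

Scaling the deviation from the background, rather than the raw intensity, is what makes the patch fade to gray instead of to black.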
Procedure
This experiment tested observers' ability to discriminate unmodified from blurred image patches in a temporal three-alternative oddity paradigm (Figure 2D). Observers fixated a spot in the center of the screen (the design found to be best for steady fixation by Thaler, Schütz, Goodale, & Gegenfurtner, 2013) while three consecutive middle patches (see above) were presented centered 10° of visual angle to the right of fixation. One image patch (the oddball) was different from the other two (which were the identical image repeated) and could appear first, second, or third with equal probability (counterbalanced). Each image patch was presented for 150 ms at full contrast with three frames preceding and following this interval used to smoothly ramp the contrast on/off approximating a cosine window. The interstimulus interval was 500 ms. After the third image patch was presented, the observer was given 1200 ms to respond before the next trial commenced. If no response was registered, it was recorded as missing. Feedback was provided by a change in fixation cross brightness (for 100 ms) and a low beep for an incorrect response. Prior to the next trial, the state of the observer's eye position was monitored for 50 ms; if the eye position was reported as more than 2° away from the fixation spot, a recalibration was triggered.
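The timing of a single image interval can be worked out from the 120 Hz refresh rate. The sketch below builds a per-frame contrast envelope; the frame counts follow from the text, whereas the exact contrast values on the three ramp frames are an assumption (evenly spaced samples of a raised cosine).

```python
import numpy as np

REFRESH_HZ = 120  # monitor refresh rate (from the Apparatus section)

def contrast_envelope(plateau_ms=150, ramp_frames=3):
    """Per-frame contrast multipliers: cosine ramp up, full-contrast plateau, ramp down."""
    plateau_frames = round(plateau_ms * REFRESH_HZ / 1000)
    phase = np.arange(1, ramp_frames + 1) / (ramp_frames + 1)
    ramp_up = 0.5 * (1 - np.cos(np.pi * phase))   # assumed sample points on the ramp
    return np.concatenate([ramp_up, np.ones(plateau_frames), ramp_up[::-1]])

envelope = contrast_envelope()
interval_ms = len(envelope) * 1000 / REFRESH_HZ
```

With 18 plateau frames and 3 ramp frames on either side, each interval spans 24 frames, or 200 ms, which agrees with the 200-ms image intervals referred to in the data analysis.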
To reduce the possibility that observers could use motion signals to discriminate the oddball patch, the angle of the patches to the fixation spot was randomly selected from a uniform distribution of ±0.1 radians, centered on the horizontal meridian, in each temporal interval. This angular offset produced a maximal tangential movement of ±1° of visual angle at the middle patch eccentricity of 10°. The image patches were rotated (using Psychtoolbox) about their midpoint to compensate for the angular offset, such that contours extending across the inner, middle, and outer patches remained collinear. 
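The size of the positional jitter follows from simple geometry. The sketch below draws an angular offset and converts it to a patch position in degrees of visual angle; the eccentricity and the ±0.1 radian range come from the text, while the function name and coordinate convention (fixation at the origin, x to the right) are illustrative assumptions.

```python
import math
import random

ECCENTRICITY_DEG = 10.0   # middle patch eccentricity (from the text)
MAX_OFFSET_RAD = 0.1      # half-width of the uniform angular jitter (from the text)

def jittered_center(rng=random):
    """Draw an angular offset and return it with the patch center (x, y) in degrees."""
    theta = rng.uniform(-MAX_OFFSET_RAD, MAX_OFFSET_RAD)
    x = ECCENTRICITY_DEG * math.cos(theta)
    y = ECCENTRICITY_DEG * math.sin(theta)
    return theta, (x, y)

# Largest displacement along the arc, and its vertical component.
max_arc_deg = ECCENTRICITY_DEG * MAX_OFFSET_RAD
max_vertical_deg = ECCENTRICITY_DEG * math.sin(MAX_OFFSET_RAD)
```

For such a small angle the arc length and the vertical component are nearly equal, both close to 1°.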
The experiment was constructed in a fully crossed 2 (middle patch size) × 2 (context condition) design. The middle patch size could be either 32 or 256 pixels in diameter, corresponding to approximately 0.75° and 6° of visual angle. In addition, the middle patch could be presented either alone on a gray background (no surround condition; Figure 2D) or with the inner and outer image patches (surround condition; Figure 2E). When surround patches were shown, they were identical in every temporal interval and so could change task performance only to the extent that they changed observers' ability to discriminate the middle patches. The oddball could be either the unmodified or blurred middle patch (counterbalanced within a block; randomly permuted). 
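The pixel-to-degree conversions quoted above follow from the display resolution of roughly 43 pixels per degree. A small sketch, treating that figure as constant across the screen (the text describes it as an average):

```python
PIXELS_PER_DEGREE = 43  # approximate average resolution (from the Apparatus section)

def pixels_to_degrees(pixels):
    """Convert a length in pixels to degrees of visual angle."""
    return pixels / PIXELS_PER_DEGREE

small_patch_deg = pixels_to_degrees(32)    # small middle patch
large_patch_deg = pixels_to_degrees(256)   # large middle patch and surround patches
max_blur_sd_deg = pixels_to_degrees(16)    # largest blur standard deviation tested

print(f"{small_patch_deg:.2f} deg, {large_patch_deg:.2f} deg, {max_blur_sd_deg:.2f} deg")
```

The same conversion gives the blur levels in degrees, which is the more natural unit when relating them to contrast sensitivity.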
Trials of the same middle patch size were blocked together to allow observers to deploy spatial attention at the appropriate size (Herrmann, Montaser-Kouhsari, Carrasco, & Heeger, 2010). Surround conditions and blur level were also blocked (i.e., a block consisted of 100 trials). Blocks were completed in an arbitrarily chosen order. Because we used unique images for every middle patch size, familiarity effects could not transfer between sizes. A break screen was shown after 50 trials; it also reported the observer's mean performance in the previous block. Each block took approximately 7 min to complete. Observers completed a variable number of blur levels (see Figure 3). All observers had significant psychophysical experience in related experiments, so we did not include practice trials.
Figure 3
 
Experiment 1 results and model fits for three observers. The panels are arranged with observers in rows and surround conditions in columns (see labels above panels). Each panel shows discrimination performance as a function of the standard deviation of the blur kernel (pixels). Points show the proportion correct; error bars show 95% beta distribution confidence intervals. Patch size (diameter in degrees) is coded in green for the large patch and blue for the small patch condition. Dashed gray line shows chance performance (0.33). Curves show 100 predictions drawn from the posterior distribution over model parameters. Fainter regions show areas of higher uncertainty. Observers can discriminate images with little high spatial frequency loss from natural images with above chance accuracy, and large patches are easier than small patches.
Data analysis
How much can the middle patch be blurred before it fails to match the appearance of the original? To quantify the largest blur at which the unmodified and blurred images are perceptually matched, we adapted the model fitting logic used by J. Freeman and Simoncelli (2011). Discriminability (d′) as a function of stimulus level (in this case, blur) is parameterized with a “critical stimulus level” below which d′ is zero (i.e., the patches are not discriminable at or below this blur). We used a three-parameter function consisting of a critical stimulus level, a maximum sensitivity (gain), and a slope parameter that controlled the rate of performance improvement for stimulus levels between the critical stimulus value and the maximum. We estimated the posterior distribution of these model parameters simultaneously for all observers and conditions using a multilevel model. This procedure has a number of advantages over traditional methods, one of which is that uncertainty is preserved through all stages of inference (see Appendix 1 for details). 
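The idea of a critical stimulus level can be illustrated with a toy function. The functional form below (a saturating exponential above the critical level) is an assumption chosen only to show how the three parameters interact; it is not the function fitted in the paper, which is specified in Appendix 1, and the parameter values are arbitrary.

```python
import numpy as np

def dprime(level, critical, gain, slope):
    """Illustrative d': zero up to `critical`, then rising toward `gain` at rate `slope`."""
    level = np.asarray(level, dtype=float)
    excess = np.maximum(level - critical, 0.0)    # only blur beyond the critical level counts
    return gain * (1.0 - np.exp(-slope * excess))

blur_levels = np.array([0.5, 1, 2, 3, 5, 8, 12, 16])   # pixel SDs used in the experiment
example = dprime(blur_levels, critical=1.0, gain=3.0, slope=0.5)  # arbitrary parameters
```

Any function of this kind predicts chance performance at and below the critical level, so the estimated critical level is the largest blur at which the blurred and original patches are predicted to be metamers.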
Before fitting the data, we discarded trials in which the observers' eyes moved further than 2° from the fixation spot or the observer blinked during any of the three 200-ms image intervals. We excluded trials with eye movements or blinks to ensure that the nominal viewing conditions reported above corresponded somewhat to reality; the overall pattern of results was much the same with these trials included. We discarded 0.05% of trials for S1 (a single trial), 7.3% for S4, and 0.7% for S10. In addition, we used the eye tracker time stamps to discard trials with timing errors caused by dropped frames or other irregularities (45 trials). The final data set consisted of 1,999 trials from S1, 1,668 trials from S4, and 2,285 trials from S10. 
Finally, before fitting and plotting the data, we applied a rule-of-succession correction to each cell of binomial trials by adding one success and one failure. This ensures that graphical depictions of the data do not misrepresent the uncertainty in data points with few observations and hardly affects the model fits. 
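The rule-of-succession correction amounts to a one-line adjustment of each cell. A minimal sketch, with illustrative counts rather than data from the experiment:

```python
def rule_of_succession(n_correct, n_trials):
    """Corrected proportion after adding one success and one failure to the cell."""
    return (n_correct + 1) / (n_trials + 2)

# Illustrative cells (not data from the paper):
few_trials = rule_of_succession(3, 3)       # raw proportion 1.0 from only three trials
many_trials = rule_of_succession(95, 100)   # raw proportion 0.95 from one hundred trials
```

The correction pulls sparse cells noticeably toward 0.5 while leaving well-sampled cells almost unchanged, which is why it matters for plotting but barely affects the fits.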
Results
Discrimination performance as a function of blur level is shown in Figure 3. For most experimental conditions, our observers could tell the difference between blurred image patches and their unmodified source images for even small amounts of blur. Across all observers and conditions, it was easier to discriminate blurred from unmodified images for the large patch sizes (approximately 6° in diameter) than for small patch sizes (0.74° in diameter). Including scene context information in surrounding patches (second column, Figure 3) made the task more difficult. For the large patch sizes, this effect is subtle whereas including context for small target patches made the task essentially impossible. Our three-parameter discrimination function (curves in Figure 3) adequately captures the relationship between blur level and performance across conditions. 
Parameter estimates for the model fits are shown in Figure 4 as violin plots (kernel density smoothing mirrored about the vertical axis). Thicker areas of the violins show greater sample density (more probability mass). The median of the samples is denoted by the black horizontal lines. The violin distributions have been truncated at the 2.5th and 97.5th percentiles of the distribution (i.e., the extent of the violin shows the 95% credible interval of the parameter). For the posterior distributions (solid colors), 95% credible intervals show the range with a 95% probability to contain the “true” parameter value given the data, model, and priors. All posterior distributions (darker colors) deviate significantly from the prior distributions (lighter colors), indicating that the priors we assume for the model parameters (see Appendix 1) are not overly restrictive.⁵ Note that the slight variability in the widths and medians of the prior densities is due to sampling variability; all experimental conditions and subjects were given the same priors.
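The summary shown by each violin (median plus the 2.5th and 97.5th percentiles) can be computed directly from posterior samples. In the sketch below the samples are simulated placeholders, not output of the fitted model, and the function name is illustrative.

```python
import numpy as np

def summarize_posterior(samples):
    """Return the 2.5th percentile, median, and 97.5th percentile of the samples."""
    lower, median, upper = np.percentile(samples, [2.5, 50, 97.5])
    return lower, median, upper

rng = np.random.default_rng(0)
fake_samples = rng.normal(loc=2.0, scale=0.5, size=4000)   # placeholder, not real data
lower, median, upper = summarize_posterior(fake_samples)
inside = np.mean((fake_samples >= lower) & (fake_samples <= upper))
```

Computing the same summary on draws from the prior, as in the lighter violins, gives a direct visual check of how much the data have narrowed each parameter.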
Figure 4
 
Parameter estimates for Experiment 1 model fits. (A) Estimates of the critical blur parameter for each observer in each experimental condition. Violin plots show the 95% credible distributions of the posterior (darker foreground colors) and prior (lighter background colors). Horizontal black lines show the medians of prior and posterior distributions. Note that the prior intervals extend to 16; the y-axis has been truncated to show the posteriors more clearly. Critical blur estimates are difficult to interpret due to a floor effect (see Figure 11 for more precise measurement of metamerism in the no surround large patch condition). (B) As for (A) for the maximum sensitivity parameter (in d′ units). Larger patches have higher asymptotic performance than small patches, and the addition of surrounds slightly reduces gain. (C) As for (A) for the slope parameter.
Observers are able to discriminate the patches above chance from small blur levels (Figure 4A). There is no strong evidence that the critical blur level varies greatly across conditions or patch sizes. The medians remain at approximately the same level for each subject irrespective of condition, and parameter difference scores were all centered on zero with no credible differences (not shown). We relate these results to the change in spectral content induced by Gaussian blur and the human contrast sensitivity function below. 
There is a large difference in the maximum sensitivity (gain) observers attain across the experimental conditions (Figure 4B). Gain in the large patch size conditions approached a d′ of 5 whereas sensitivity in the small patch size condition with no surround remains about half that value (see Appendix 2 for a complementary analysis of gain effects using Bayes factors). We explore the extent to which these average performances hold for individual scenes in Appendix 5. Sensitivity in small patches with surrounds credibly exceeds zero, but only barely. This corroborates our observation above that in this condition the task was essentially impossible. These differences are quantified in Figure 5 by taking the difference in the gain parameter for each Markov chain Monte Carlo (MCMC) sample (no surround − surround). Positive difference scores correspond to higher sensitivities in the no surround condition. Corroborating the estimates from Figure 4, it is clear that observers were more sensitive with no surround patches for small patch sizes. Although there is some evidence that this pattern holds for the larger patch sizes, these differences are not credible at the 95% level. Finally, slopes in the large patch sizes tended to be steeper than for small patch sizes. 
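The difference scores can be illustrated with a short sketch. It assumes two arrays of posterior samples for the gain parameter, one per surround condition, with matching sample indices; the simulated samples, function name, and summary format are hypothetical and stand in for real MCMC output.

```python
import numpy as np


def difference_summary(samples_a, samples_b, mass=0.95):
    """Summarize the sample-wise difference between two sets of posterior samples."""
    diff = np.asarray(samples_a, dtype=float) - np.asarray(samples_b, dtype=float)
    tail = (1.0 - mass) / 2.0 * 100.0
    lower, upper = np.percentile(diff, [tail, 100.0 - tail])
    return {
        "median": float(np.median(diff)),
        "lower": float(lower),
        "upper": float(upper),
        # The difference is called credible if the interval excludes zero.
        "credible": bool(lower > 0 or upper < 0),
    }


# Simulated stand-ins for gain samples (d' units); not real data.
rng = np.random.default_rng(0)
gain_no_surround = rng.normal(2.5, 0.3, size=4000)
gain_surround = rng.normal(0.5, 0.3, size=4000)
print(difference_summary(gain_no_surround, gain_surround))
```

Positive medians correspond to higher sensitivity in the first condition passed to the function (here, no surround).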
Figure 5
 
Difference in gain (maximum performance in d′ units) between the no surround and surround conditions. Computed by subtracting the surround from the no surround condition; positive scores mean higher sensitivity in the no surround condition. Patch sizes are colored as in Figure 3. Violin plots show the 95% credible distributions of the posterior difference score in the gain parameters. Solid gray line shows zero difference. Asymptotic performance is reliably higher in the no surround condition than the surround condition for small patches but not for large patches.
Spectral content and filter cutoffs
Geisler and Perry (1998) based the parameters of their foveated blurring model on detection thresholds for peripherally presented grating stimuli. The contrast threshold for small flashed grating stimuli of some frequency and retinal eccentricity was estimated as $CT(f, e) = CT_0 \exp\left(\alpha f \frac{e + e_2}{e_2}\right)$, where f is spatial frequency in cycles per degree, e is the retinal eccentricity (degrees), CT0 is the minimum contrast threshold, α is a spatial frequency decay constant, and e2 is the eccentricity at half resolution. Geisler and Perry report that this function gives a good description of the results of Arnow and Geisler (1996); Banks, Sekuler, and Anderson (1991); and Robson and Graham (1981). By setting the left side of the function to 1.0 (the maximum contrast), one can solve for the critical eccentricity beyond which a given frequency will fall below threshold or, conversely, the upper frequency limit at a given retinal eccentricity.
The large patches in our Experiment 1 spanned retinal eccentricities from approximately 7° to 13°. Using the parameters reported by Geisler and Perry (1998; α = 0.106, e2 = 2.3, and CT0 = 1/76), we determined that the critical frequencies resolvable within this range of eccentricities were 10 c/° at 7° and 6 c/° at 13° (lighter gray shaded region of Figure 6). That is, observers may be able to detect amplitude attenuations within the light gray zone but should be insensitive to amplitude attenuation for spatial frequencies larger than 10 c/° (darker gray shaded region of Figure 6). 
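The critical frequencies can be checked with a short calculation. The sketch below assumes the threshold function has the form CT(f, e) = CT0 · exp(α f (e + e2) / e2), as we understand it from Geisler and Perry (1998); setting CT to 1.0 and solving gives f = e2 · ln(1/CT0) / (α (e + e2)). The function names are illustrative.

```python
import math

# Parameters reported by Geisler and Perry (1998).
ALPHA = 0.106    # spatial frequency decay constant
E2 = 2.3         # eccentricity at half resolution (degrees)
CT0 = 1.0 / 76   # minimum contrast threshold


def contrast_threshold(f, e, ct0=CT0, alpha=ALPHA, e2=E2):
    """Assumed form of the threshold for frequency f (c/deg) at eccentricity e (deg)."""
    return ct0 * math.exp(alpha * f * (e + e2) / e2)


def critical_frequency(e, ct0=CT0, alpha=ALPHA, e2=E2):
    """Highest frequency (c/deg) detectable at maximum contrast at eccentricity e."""
    return e2 * math.log(1.0 / ct0) / (alpha * (e + e2))


def critical_eccentricity(f, ct0=CT0, alpha=ALPHA, e2=E2):
    """Eccentricity (deg) beyond which frequency f falls below threshold."""
    return e2 * math.log(1.0 / ct0) / (alpha * f) - e2


for ecc in (7.0, 13.0):
    print(f"{ecc:.0f} deg: {critical_frequency(ecc):.1f} c/deg")
```

Under these assumptions the cutoffs come out close to the 10 c/° and 6 c/° values quoted above for the inner and outer borders of the large patches.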
Figure 6
 
Change in spectral power caused by blurring in Experiment 1 for 256-pixel square images. Log amplitude (base 10, arbitrary units) as a function of spatial frequency (cycles per degree) for unmodified images and three blur levels chosen according to perceptual discriminability (see legend). In addition, we examined the attenuation caused by processing our stimuli using the Geisler and Perry (1998) foveated blurring model (red curve). Amplitudes are averaged over images and orientations. The light gray shaded region shows the frequency range corresponding to Geisler and Perry's critical frequency cutoff for the inner and outer borders of our stimuli. The darker gray shaded region denotes frequencies that should be undetectable in our experiment. These comparisons suggest the Geisler and Perry foveated stimulus blur would be readily discriminable (d′ > 3) by observers in our experiment.
We selected three blur levels based on their behavioral effects in our experiment: the critical blur (median across observers 0.92 pixel SD, see Figure 4), a blur leading to a d′ of 1 (median 1.22 pixels), and a d′ of 3 (median 2.12 pixels). Under our viewing conditions, these kernel standard deviations correspond to 0.021°, 0.028°, and 0.049°, respectively. The frequency at which the spectral content is 50% smaller than its maximum (i.e., the full width at half maximum) for each filter in frequency space is 17.6, 13.2, and 7.6 c/°. Thus, a substantial amount of the attenuation caused by the filters falls well above the critical frequencies estimated above. Depending on the size of the contrast decrement required for discrimination, this comparison suggests that observers are sensitive to spatial frequency information higher than the cutoffs in the Geisler and Perry (1998) model—at least for complex, broadband stimuli such as natural images. 
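The filter widths can be approximated analytically. A spatial Gaussian with standard deviation σ (degrees) has a Gaussian amplitude spectrum with standard deviation 1/(2πσ) cycles per degree, and the full width at half maximum of a Gaussian is 2√(2 ln 2) times its standard deviation. The sketch below applies this to the three kernels; the pixels-per-degree value is an assumption derived from a 256-pixel patch subtending approximately 5.95°, and the function name is illustrative.

```python
import math

# Assumption: 256 pixels subtend approximately 5.95 degrees.
PIXELS_PER_DEGREE = 256 / 5.95


def gaussian_fwhm_cpd(sigma_pixels, pixels_per_degree=PIXELS_PER_DEGREE):
    """Full width at half maximum (c/deg) of a Gaussian blur kernel's amplitude spectrum."""
    sigma_deg = sigma_pixels / pixels_per_degree
    sigma_freq = 1.0 / (2.0 * math.pi * sigma_deg)  # SD of the spectrum in c/deg
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma_freq


for label, sd_pixels in [("critical blur", 0.92), ("d' = 1", 1.22), ("d' = 3", 2.12)]:
    print(f"{label}: {gaussian_fwhm_cpd(sd_pixels):.1f} c/deg")
```

With this conversion the widths land within a fraction of a cycle per degree of the values given in the text; small discrepancies are expected from rounding of the viewing geometry.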
To compare our results to the Geisler and Perry (1998) model, in Figure 6, we show the spectral content of our images for the blurs we used and for the foveated model of Geisler and Perry. The average log amplitude of the original images (in the 5.95° patch condition) is shown in black and roughly follows the ≈1/f linear (in log–log) falloff typical for natural scenes. The blue curve shows the average spectral content of the scenes after blurring with the critical blur kernel from our experiments; that is, these attenuations are the most the image can be filtered before observers begin to notice. The red curve shows the average spectral content after applying the foveated model of Geisler and Perry.6 The attenuation caused by this model is stronger than the critical blur levels from our experiment, stronger in fact than attenuations leading to d′ values of 3 (see Figure 6). This result implies that images filtered via the foveated model of Geisler and Perry would be easily discriminable from unmodified natural scenes in our experiment. 
Experiment 2: Texture syntheses
The blur model examined in Experiment 1 is a lossy linear transformation of the input image. Now, we consider a lossy nonlinear transformation in the form of the parametric texture model of Portilla and Simoncelli (2000). We measured the degree to which observers could discriminate between synthesized images and their source images. This experiment bears several similarities to that of J. Freeman and Simoncelli (2011), and indeed, the texture synth features are equivalent to those computed in their midventral model. There are two primary differences. First, in our stimuli the features are matched over the entire image patch (as in J. Freeman et al., 2013), whose size is varied, rather than within overlapping pooling regions that scale with retinal eccentricity as in J. Freeman and Simoncelli. Second, in this experiment, we assess the discriminability of the synthesized images from the original source images whereas observers in J. Freeman and Simoncelli compared two synthesized images to each other. To examine the effect of these comparisons in the context of our stimuli, we had observers perform both natural versus synthesized (as in Experiment 1) and synthesized versus synthesized (as in J. Freeman & Simoncelli, 2011) comparisons.
The Portilla and Simoncelli (2000) parametric texture model
The texture synth model is based on the output of band-limited filters tuned to multiple scales and orientations (W. T. Freeman & Adelson, 1991; Heeger & Bergen, 1995). This architecture is appealing for the study of human vision because a similar framework of band-limited channels is thought to underlie early visual processing in humans (a point made by Balas, 2006; see Blakemore & Campbell, 1970; Campbell & Robson, 1968; Graham & Nachmias, 1971). The full texture synth model can consist of 1,000 parameters or more (depending on the choice of the number of orientation bands, scale bands, and neighborhood size) that can be grouped into four sets. The first set of parameters summarizes the histograms of image intensities; it captures the relative amounts of each intensity in the image and ensures that the overall contrast and brightness of the images are matched (i.e., the marginal pixel statistics). The second set of parameters captures the autocorrelations between low-pass versions of the stimulus; here, globally oriented or periodic patterns are captured (raw autocorrelations). The third set of parameters is the product of the filter responses at different positions, orientations, and scales (magnitude correlations). These statistics capture large-scale lines and edges in the original image. The final set of parameters captures the relative phase of the band-limited channels between spatial scales to ensure that the contrast polarity of image structure is preserved (phase correlations). 
Once these parameters are determined for a target image, new images can be synthesized that match the statistics of the original. Starting with a white noise image, the intensities are iteratively adjusted to match the statistics of the target image via a gradient descent/projection procedure. The resulting synthesized image has the same model response (i.e., the same coefficients) as the original image but is very likely to be physically different. Portilla and Simoncelli (2000) presented numerous examples of these synthesized textures along with qualitative visual demonstrations that the four sets of parameters are independently required for subjectively appealing texture synthesis. 
The psychophysical properties of the texture model have been most extensively tested by Balas and colleagues (Balas, 2006, 2008, 2012; Balas & Conlin, 2015). The most relevant of these to the present study is Balas (2006), in which the author performed a psychophysical comparison (spatial three-alternative oddity) between synthesized texture images and their original texture images, investigating the effect of “lesioning” one or more of the parameter sets by excluding it from the synthesis procedure. The results showed that the marginal statistics (histogram-matching parameters) were the most important parameters to include and that the importance of autocorrelation and magnitude correlations depended on the type of texture examined. J. Freeman et al. (2013) found similar results in hundreds of human observers in an online crowdsourced experiment (but note that they compared synthesized patches to phase-randomized noise rather than to the original images). Failures to produce exact metamerism can therefore be useful in quantifying the relative importance of model parameters. 
More generally, texture synth features can admirably match the appearance of natural textures under certain conditions (Balas, 2006, 2012; although see Balas & Conlin, 2015), capture features that can discriminate animal from nonanimal scenes (Crouzet & Serre, 2011), and are correlated with the degree of performance impairment in visual crowding (Balas et al., 2009; J. Freeman & Simoncelli, 2011). 
Methods
Four observers (ages 22, 22, 25, and 32; two female) contributed data to the analysis. An additional observer was recruited but withdrew after completing one block; we discarded her data from further analysis. All observers were recruited from a university student mailing list and paid 10 Euro per hour for their participation with the exception of S1, an author. Observers had normal or corrected-to-normal acuity. 
Stimuli
As in Experiment 1, three adjacent square image patches (see Figure 7) were cropped from source images derived from the MIT1003 database. The middle patch was always cropped from the center of the 768 × 768 pixel image and varied in size. In this experiment, we varied the size of the central patch because this is similar to varying the size of a single pooling region (in degrees subtended) for the texture synth model (see General discussion). Specifically, middle image patches were square crops with side length 32, 64, 128, 192, 256, 384, and 512 pixels (concentric colored circles in Figure 7A). Because the texture synth algorithm requires power-of-two source images, the 192- and 384-pixel patches were generated by first synthesizing 256- and 512-pixel patches and then cropping them (see below). The other two patches abutted the middle patch on the left (inner) and right (outer) side, respectively, and were always 256 × 256 pixels. For the smallest five middle-patch sizes, the inner and outer patches were cropped from adjoining regions of the source image, and for the largest two sizes, the inner and outer crop regions overlapped with the middle patch crop region. 
Figure 7
 
Experiment 2 stimuli (procedure as in Experiment 1). (A) Three patches were cropped from a source image. The middle patch was always the target to be discriminated and varied in size (concentric blue/green/yellow colored circles), white labels show diameter in degrees. The surround patches (inner and outer; orange circles) shifted position so that they always abutted the middle patch (orange arrows). Here they are shown at their largest distance, such that the largest two target patches were partially occluded by the surrounds. The middle patch was always centered at 10° to the right of the fixation spot. (B) Natural inner, middle, and outer patches cropped from the example image in (A). (C) Synthetic patches matching those shown in (B) under the Portilla and Simoncelli (2000) texture representation. (D) A depiction of a nat versus synth trial with no surround. (D–G) Depictions of four experimental conditions (six were run in total; see text). The correct response in all examples in this figure is “interval 3.”
After discarding images that had at least one dimension smaller than 768 pixels (n = 362), the images were converted to gray scale (using scikit-image's rgb2gray function) and standardized to have a mean intensity of 0.5 (on a [0, 1] scale) and an RMS contrast (SD/mean) of 0.3. Each image was then assigned to one of five presynthesis middle patch sizes of 32, 64, 128, 256, and 512 pixels. Double the number were assigned to the largest two sizes to allow cropping to intermediate sizes (see above). To ensure that each inner, middle, and outer patch contained visible image content, we excluded images with an RMS contrast of less than 0.1 in any patch at a given size. Each of the resulting 1,863 patches (621 images, each with three patches) was saved to an 8-bit .png file. 
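The standardization step can be sketched as follows. This is a minimal illustration assuming a grayscale image already on a [0, 1] scale; the function names are hypothetical, and the sketch does not clip values that fall outside [0, 1] after rescaling because the handling of such values is not specified here.

```python
import numpy as np


def standardize_image(gray, target_mean=0.5, target_rms=0.3):
    """Rescale a grayscale image to a given mean and RMS contrast (SD / mean)."""
    gray = np.asarray(gray, dtype=float)
    sd = gray.std()
    if sd == 0:
        raise ValueError("Image has zero contrast and cannot be standardized.")
    z = (gray - gray.mean()) / sd
    # RMS contrast is SD / mean, so the target SD is target_rms * target_mean.
    return z * (target_rms * target_mean) + target_mean


def rms_contrast(image):
    """RMS contrast defined as standard deviation divided by mean intensity."""
    image = np.asarray(image, dtype=float)
    return image.std() / image.mean()


# Random stand-in for a grayscale source image; not real data.
rng = np.random.default_rng(1)
example = rng.uniform(0.1, 0.9, size=(64, 64))
standardized = standardize_image(example)
print(standardized.mean(), rms_contrast(standardized))
```

The same `rms_contrast` helper could be applied to each cropped patch to implement an exclusion rule such as the 0.1 criterion described above.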
We then used the publicly available texture synthesis MATLAB toolbox (http://www.cns.nyu.edu/lcv/texture/) to generate synthesized images matched to our source patches under the Portilla and Simoncelli (2000) texture representation. The texture synth representation we used consisted of four orientations, a spatial neighborhood of 9 pixels, and either three (for patches of size 32) or four spatial scales. These are the most common settings used in the literature (e.g., Balas et al., 2009; J. Freeman & Simoncelli, 2011). For each inner and outer patch, we generated one synthesized texture using 50 iterations of the gradient descent procedure whereas for middle patches we synthesized three unique textures, using 100 iterations to ensure close convergence to the statistics of the original image. Seven images were discarded because at least one patch could not be synthesized (each of 10 attempts failed to converge). The resulting images were saved to 8-bit .png files.
Each original and synthesized inner, middle, and outer patch was then windowed in a circular aperture with a cosine profile, smoothly blending the image patch into the background in the space of 10 pixels (Figure 7B, C). To produce the 192- and 384-pixel patch sizes, we cropped the center from half of the 256- and 512-pixel patches, respectively. Because a texture is shift invariant (the absolute position of features does not matter), the Portilla and Simoncelli (PS) statistics are always a property of the entire source image rather than any subregion of the image. Generating larger patches and then cropping them means that the cropped images could contain image structure not present in the correspondingly sized original image (for example, a feature outside the crop region of the original could shift into the center of the synthesized image). Therefore, synthesizing and then cropping means that the statistics of the 192- and 384-pixel patch sizes are not exactly matched to the original images of the same size. These data points do not appear to produce consistently different results from their matched counterparts (see points with black borders in Figure 8), a point we return to in the General discussion.
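The windowing step can be illustrated with a raised-cosine aperture. This is a sketch under stated assumptions: the ramp is taken to occupy the outermost 10 pixels inside the patch edge, the background is mid-gray (0.5), the patch is square, and the function names are hypothetical.

```python
import numpy as np


def cosine_aperture(size, ramp=10):
    """Circular mask that is 1 in the center and falls to 0 over `ramp` pixels."""
    coords = np.arange(size) - (size - 1) / 2.0
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    radius = np.hypot(xx, yy)
    inner = size / 2.0 - ramp  # radius at which the ramp begins
    t = np.clip((radius - inner) / ramp, 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * t))


def window_patch(patch, background=0.5, ramp=10):
    """Blend a square patch into a uniform background through the aperture."""
    patch = np.asarray(patch, dtype=float)
    mask = cosine_aperture(patch.shape[0], ramp)
    return background + mask * (patch - background)


# A uniform white patch fades to the gray background toward its corners.
windowed = window_patch(np.ones((64, 64)))
print(windowed[32, 32], windowed[0, 0])
```

Cropping a 192-pixel patch from a 256-pixel synthesis would then correspond to slicing the central region before applying `window_patch`.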
Figure 8
 
Experiment 2 results and model fits for four observers. The panels are arranged with observers in rows and surround conditions in columns (see labels above panels). Each panel shows discrimination performance as a function of patch size (diameter in degrees). Points show the proportion correct; error bars show 95% beta distribution confidence intervals. Points with black borders show patch size levels created by down-sampling, which do not have precisely matched statistics (see text). Dashed gray line shows chance performance (0.33). The comparisons between natural and synthesized images are in blue (with squares) and the synthesized versus synthesized comparisons are shown in green (with circles). Curves show 100 predictions drawn from the posterior distribution over model parameters. More faint regions show areas of higher uncertainty. Discriminating natural from synthesized images is easier than synthesized images against other synthesized images for large patch sizes; performance in the two conditions was similar (but still largely greater than chance) for the smallest patch sizes we tested. Adding surrounds reduced asymptotic performance.
The procedure just described resulted in 91, 91, 91, 88, 91, 81, and 81 unique source images (614 in total) for each middle patch size of 32, 64, 128, 192, 256, 384, and 512 pixels, respectively. Under our viewing conditions, these stimuli subtend approximately 0.75°, 1.5°, 3°, 4.5°, 6°, 9°, and 12° of visual angle. Each unique source image yielded four middle patches (the original plus three synthesized images), two inner, and two outer patches (original and one synthesis). 
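The stimulus counts and approximate sizes can be tabulated in a few lines. The pixels-per-degree value below is an assumption (roughly 43, from a 256-pixel patch subtending about 5.95°), so the computed angles are approximate; the variable and function names are illustrative.

```python
# Unique source images per middle patch size (pixels), as listed in the text.
IMAGES_PER_SIZE = {32: 91, 64: 91, 128: 91, 192: 88, 256: 91, 384: 81, 512: 81}

# Assumption: approximate display resolution in pixels per degree.
PIXELS_PER_DEGREE = 43.0

# Each source image yields 4 middle, 2 inner, and 2 outer patches.
PATCHES_PER_IMAGE = 4 + 2 + 2


def size_in_degrees(size_pixels, pixels_per_degree=PIXELS_PER_DEGREE):
    """Approximate visual angle subtended by a patch of the given width."""
    return size_pixels / pixels_per_degree


total_images = sum(IMAGES_PER_SIZE.values())
total_patches = total_images * PATCHES_PER_IMAGE

for size, count in IMAGES_PER_SIZE.items():
    print(f"{size:3d} px: {size_in_degrees(size):5.2f} deg, {count} images")
print(total_images, total_patches)
```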
Procedure
As in Experiment 1, we used a temporal three-alternative oddity paradigm. This experiment was constructed in a fully crossed 2 (comparison type) × 3 (context condition) design. Comparison type refers to the middle patch type compared. In nat versus synth trials, the middle patch was either the natural (source) patch or a synthesized patch matched to that natural image patch (Figure 7D and F). This condition is similar to that for blur in Experiment 1. The oddball image could be either the natural or a synthesized patch (randomly permuted). In synth versus synth trials, the middle patch could be one of two synthesized patches sourced from the same natural image, similar to J. Freeman and Simoncelli (2011) (Figure 7E and G). The identity of the oddball was randomly permuted for each experimental block, meaning that one observer could see patch A as the oddball and patch B as the foils, and for another observer, this could be reversed. To minimize the potential for learning, the synthesized patch used for nat versus synth trials was always different from the synthesized patches used in synth versus synth trials (but all were matched to the original image under the model). 
We varied the scene context similarly to Experiment 1 with an additional condition: In synth surround trials, the synthesized inner and outer patches were presented to either side of the middle patch (Figure 7G). As in Experiment 1, in no surround trials, the middle patches were presented isolated on a gray background (Figure 7D, E), and in natural surround trials, the original inner and outer patches were presented to either side of the middle patch (Figure 7F). 
Trials of the same middle patch size were blocked together. The order in which observers completed each stimulus size was determined randomly. Surround conditions were also blocked. Within each stimulus size level, observers first completed a no surround block followed by a surround natural or surround synth block in alternating order for each size. This ordering ensured that any effect of learning the middle patches should improve performance in the conditions in which surrounds were present; this is not what we found, so any learning/familiarity effect is likely to minimally affect our conclusions. Recall from our stimulus generation procedure that each unique source image was only used in one middle patch size; that is, familiarity effects could not transfer between stimulus sizes. Different comparison type trials (nat vs. synth, synth vs. synth) were interleaved within each surround condition block. Within each block, patches sourced from each unique image were repeated twice (once for each comparison type). The target appeared with equal frequency in each presentation interval across a block, ensuring that an interval bias could not produce differences between conditions. For nat versus synth trials, the oddball image was natural or synthetic with equal frequency; for synth versus synth trials, the oddball was synthetic version A or B with equal frequency. To reduce the possibility of interaction in repeated images (even though the middle patches were different), trial order was pseudorandomly determined with the constraint that trials from at least three other images separated two trials from the same source image. For size levels with 91 unique images, this meant that one block consisted of 182 trials. Trial-to-trial feedback was provided as in Experiment 1, and a break screen was shown every 50 trials. Each block took approximately 12 min to complete. A full data set consisted of 21 blocks (3,684 trials) for a total of around 4 hr of testing per observer. 
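The trial-order constraint and the trial count can be sketched as follows. The rejection-sampling approach (reshuffle until the constraint holds) is an assumption made for illustration and is not necessarily how the experiment code generated its orders; all names are hypothetical.

```python
import random

IMAGES_PER_SIZE = {32: 91, 64: 91, 128: 91, 192: 88, 256: 91, 384: 81, 512: 81}
COMPARISONS = ("nat_vs_synth", "synth_vs_synth")
N_SURROUND_CONDITIONS = 3


def satisfies_gap(trials, min_gap=3):
    """True if repeats of an image are separated by at least `min_gap` other trials."""
    last_seen = {}
    for position, (image_id, _) in enumerate(trials):
        if image_id in last_seen and position - last_seen[image_id] - 1 < min_gap:
            return False
        last_seen[image_id] = position
    return True


def constrained_trial_order(image_ids, min_gap=3, rng=None, max_attempts=10000):
    """Shuffle one block's trials until the separation constraint is met."""
    rng = rng or random.Random()
    trials = [(image_id, comparison) for image_id in image_ids for comparison in COMPARISONS]
    for _ in range(max_attempts):
        rng.shuffle(trials)
        if satisfies_gap(trials, min_gap):
            return list(trials)
    raise RuntimeError("No valid trial order found; try more attempts.")


# One block at a size level with 91 unique images.
block = constrained_trial_order(range(91), rng=random.Random(1))

# Two comparison types per image, three surround conditions per size.
total_trials = sum(len(COMPARISONS) * n for n in IMAGES_PER_SIZE.values()) * N_SURROUND_CONDITIONS
total_blocks = len(IMAGES_PER_SIZE) * N_SURROUND_CONDITIONS
print(len(block), total_blocks, total_trials)
```

Because each image appears only twice per block, a random shuffle satisfies the constraint often enough that reshuffling is cheap; a constructive algorithm would be preferable if the constraint were much stricter.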
Prior to commencing data collection, observers performed 150 practice trials. The practice trials used different source images (from Kienzle, Franz, Scholkopf, & Wichmann, 2009). We set an a priori threshold requiring observers to achieve at least 50% correct performance (a d′ of 1.5) on the largest patch size before proceeding. All observers except one achieved this in their first practice run (S7 required two runs). 
Data analysis
We quantified the critical stimulus size for discriminating the target patches by fitting the same three-parameter function as in Experiment 1 (see also Appendix 1). We discarded trials with invalid eye movements (see below) and timing errors (29 trials) as described in Experiment 1. We discarded 0.2% of trials for S1, 38% for S2, 15% for S7, and 21% for S8.7 Including these invalid trials did not substantively change the pattern of results. The final data set consisted of 3,674 trials from S1, 2,283 trials from S2, 3,116 trials from S7, and 2,908 trials from S8. 
Results
Results and model fits are summarized in Figure 8. All observers could discriminate natural and synthesized images more easily than two synthesized images; this pattern held on average at all patch sizes except the smallest we tested. In terms of the critical patch size, three of four observers could discriminate natural from synthetic images and synthetic images from each other even for the smallest patch sizes we could generate when the patches were presented on an isolated background. When scene context was included in the form of surrounding patches, performance at the smallest patch size fell to chance for all observers. This finding corroborates the data from Experiment 1 (they are the same patch size of 0.74°) and suggests that observers are simply unable to discriminate images of such small spatial extent when surrounded by large patches at least for the two model manipulations we examined here. More generally, the addition of surrounding patches made performance worse. Finally, performance in the synth versus synth condition appeared tuned in that performance first improved and then deteriorated once more at very high scale factors. The three-parameter fit failed to capture this relationship. We return to this effect in Appendix 3.
Estimates of the posterior distribution over the critical size parameter for all observers and conditions are shown in Figure 9A. In this case, the prior distributions extend up to a critical size of 12, but the y-axis has been truncated to show the most relevant range. The parameter estimates confirm impressions from visually inspecting the data in Figure 8: The critical size estimates for all observers except S7 are essentially zero, meaning that those observers could discriminate even the smallest patch sizes we presented. There are no meaningful differences evident between experimental conditions (difference score plots not shown). The critical size estimates remain low whether observers compared synthesized images to each other or to their source image. 
Figure 9
 
Parameter estimates for Experiment 2. (A) Estimates of the critical patch size parameter for each observer in each experimental condition (surround condition in panel columns). Violin plots show the 95% credible distributions of the posterior (darker foreground colors) and prior (lighter background colors). Horizontal black lines show the median. Note that the prior intervals extend to 12; the y-axis has been truncated to show the posteriors more clearly. Critical size estimates are difficult to interpret due to a floor effect. (B) As for (A) but for the maximum sensitivity parameter (in d′ units). Nat versus synth yields higher asymptotic performance than synth versus synth. (C) As for (A) but for the slope parameter.
Gain (maximum sensitivity) varies considerably over experimental conditions. As in Experiment 1, the addition of scene context reduced asymptotic performance (see difference scores in Figure 10). For most observers, asymptotic performance was higher in the no surround condition than in the natural surround condition; the corresponding difference for synthesized surrounds is less convincing (with the exception of S7). The reduction was thus stronger for natural image surrounds than for synthesized surrounds: performance was slightly worse in the natural than in the synthesized surround conditions, although these direct comparisons are not credibly different from zero (Figure 10, lower panel). Finally, the slope of performance improvement is generally steeper for the synth versus synth comparison condition, but this difference is likely associated with the lower maximum sensitivity under these conditions and the paucity of data points informing slope. 
Figure 10
 
Difference in gain (maximum performance in d′ units) between surround conditions. Comparisons shown in facet titles. Violin plots show the 95% credible distributions of the posterior difference score in the gain parameters. Solid gray line shows zero difference. Asymptotic performance is generally higher in the no surround condition than the natural surround condition for both comparison types; there is weaker evidence for any difference between the no surround and synthetic surround condition or the surround conditions from each other.
Figure 11
 
Gaussian blur can produce convincing metamers. (A) Changes in spectral content produced by six blur levels (Gaussian standard deviation in pixels) and the average amplitude spectrum of the unmodified image patches (100 images, 256 pixels square). All blur levels cause physical modifications to image content. (B) Proportion correct as a function of the blur levels in (A) (Gaussian standard deviation in pixels; note logarithmic x-axis) for two observers. Error bars show 95% beta distribution confidence intervals, rule-of-succession corrected. Each data point represents at least 100 trials (S1 has done 200 for the four lowest blurs). The smallest blur levels are indiscriminable from unmodified images despite being physically different (i.e., they are metamers).
General discussion
In this paper, we find that the periphery is highly sensitive to two types of image manipulation applied to natural scene patches. Here, we consider the primary results and their relevance to other work. 
Blur and metamerism
In Experiment 1, we quantified the largest amount of Gaussian blur that could be applied before modified scenes were no longer metameric for their source images (the critical blur). Generally, these results showed that our observers could peripherally discriminate even small amounts of Gaussian image blur under our experimental conditions (see Tadmor & Tolhurst, 1994; Tolhurst & Tadmor, 1997, for similar studies in parafoveal viewing). Although performance for the smallest blur levels we tested in Experiment 1 was at or near chance, we wished to confirm that our experimental setting could produce convincing metamers; after all, there should be some blur level that is undetectable in the periphery. We conducted a supplementary experiment using a new randomly selected set of 100 images and even smaller blur levels, but it was otherwise identical to the 5.95°, no surround condition of Experiment 1. Figure 11A shows that these small blur levels produced physical image changes (reducing amplitude at higher spatial frequencies), and Figure 11B shows that these blur levels could not be discriminated from the original images. Blur metamers can be measured in our experimental setting. 
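The attenuation of high spatial frequencies by blur can be illustrated with a minimal sketch. It assumes an ideal continuous Gaussian kernel, whose amplitude gain at spatial frequency f (cycles per pixel) for standard deviation σ (pixels) is exp(−2π²σ²f²). The function name and the σ values below are hypothetical and are not the blur levels used in the experiment.

```python
import math


def gaussian_blur_gain(freq_cpp, sigma_px):
    """Amplitude gain of an ideal Gaussian blur at one spatial frequency.

    freq_cpp: spatial frequency in cycles per pixel (0.5 is the Nyquist limit).
    sigma_px: standard deviation of the Gaussian kernel in pixels.
    Assumes a continuous, untruncated Gaussian; a sampled kernel only
    approximates this.
    """
    return math.exp(-2.0 * (math.pi * sigma_px * freq_cpp) ** 2)


# Hypothetical blur levels in pixels (NOT the levels tested in the paper).
example_sigmas = [0.25, 0.5, 1.0, 2.0]
# Frequencies from DC up to the Nyquist limit.
example_freqs = [0.0, 0.125, 0.25, 0.5]

for sigma in example_sigmas:
    gains = [gaussian_blur_gain(f, sigma) for f in example_freqs]
    print(sigma, [round(g, 4) for g in gains])
```

Under this assumption, any nonzero σ yields a gain below 1 at every nonzero frequency, so even the smallest blur alters the image physically; whether that change is detectable is the empirical question addressed by the supplementary experiment.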
The comparisons in Figure 6 suggest that humans are more sensitive to image blur in peripheral natural scenes than predicted from grating detection thresholds (specifically, the instantiation used by Geisler & Perry, 1998). This is consistent with the recent findings of Sebastian, Burge, and Geisler (2015), who showed that observers were more sensitive to defocus blur (caused by the optics of the eye rather than applied to images on screen as in our study) in natural image patches than predicted from studies of image blur in artificial stimuli. However, our results are superficially inconsistent with several studies that have measured discrimination of eccentricity-dependent blur from unmodified scenes (Loschky et al., 2005; Peli & Geri, 2001; Séré et al., 2000). These studies suggest that grating experiments provide quite good predictions for detectability in natural scenes; if anything, observers are slightly less sensitive to eccentricity-dependent blur than might be predicted from contrast or acuity thresholds, a result attributed to masking or crowding in natural scenes. 
The discrepancy between our results and these studies may be due to the specific form of the assumed foveal contrast sensitivity function (which can matter even in the periphery; Peli & Geri, 2001), the specific implementation of eccentricity-dependent blurring (consider that we used simple Gaussian blur and our comparison to Geisler & Perry, 1998, is indirect), differences in experimental paradigms (temporal oddity in our case; spatial 2AFC, Peli & Geri, 2001; temporal 2AFC, Séré et al., 2000; yes/no detection of gaze-contingent blurring injected on single fixations in dynamic viewing, Loschky et al., 2005), screen luminance (our display had a higher mean luminance than is typical, and sensitivity to high spatial frequencies continues to improve even for high retinal illuminance; Rovamo et al., 1994), our use of localized patches rather than “full-field” scenes (perhaps improving performance via focused spatial attention), and even the definition of “detectability” (the point at which d′ rises above zero in our case and in Loschky et al., 2005 and Séré et al., 2000 vs. a d′ of ≈1.3 in Peli & Geri, 2001). Directly comparing these models within a single experimental setting may help to resolve cause(s) of the discrepancy. 
Finally, observers in our experiment could have used a number of cues to discriminate modified from unmodified patches. As pointed out in the Introduction, this can be viewed as an advantage in this paradigm in that many potential cues can be tested simultaneously to the extent that a model predicts they should not matter. For example, in Experiment 1, observers could be sensitive to the selective attenuation of high spatial frequency content (i.e., a change in the slope of the amplitude spectrum) or to changes in the total contrast energy of the stimulus (for example, by detecting contrast decrements at lower spatial frequencies than the cutoffs calculated above; a necessary consequence of Gaussian blur). Although our experiment does not determine the relative importance of these cues, discriminating blurred from naturalistic spectra remains highly sensitive despite carefully controlling the total contrast energy (Sebastian et al., 2015; Tadmor & Tolhurst, 1994; Tolhurst & Tadmor, 1997), implying that broadband contrast discrimination will have limited influence on our results. Irrespective of the relative importance of these cues, our results suggest that human peripheral vision is more sensitive to blur than one might expect from the Geisler and Perry (1998) model (although see paragraph above). 
Texture statistics and metamerism
In Experiment 2, we manipulated the size of the image region over which texture synth statistics were computed by varying the size of the target patch. Most observers could discriminate synthesized patches from the unmodified images and different synthesized patches from each other even for the smallest patch size we could generate. Observers were only reliably at chance performance when task-irrelevant surrounding structure was added to the display (Figure 8). The two comparison types (natural vs. synth and synth vs. synth) differed strongly in their asymptotic performance in that it was easier to tell apart a synthesized image from its unmodified source than two synthesized patches from each other.⁸ Finally, there was evidence of a tuning effect in the synth versus synth comparison type, such that performance decreased again at large patch sizes (an effect that our model cannot capture). In Appendix 3, we show that this effect disappears when smaller stimuli are synthesized from down-sampled (rather than cropped) patches, suggesting that the fidelity of the texture synthesis interacts with the scale of features in the scenes. 
J. Freeman and Simoncelli (2011) developed a model in which the texture synth features were computed within overlapping pooling regions that were relatively small near the fovea and became larger in the periphery. The rate of this receptive field size change (scaling; the diameter of the pooling region divided by the eccentricity at its center) was parametrically varied to generate stimuli with different scaling factors. J. Freeman and Simoncelli showed that human observers could not discriminate synthetic stimuli from each other for scaling factors of 0.5 or below. That is, the largest region over which the texture synth statistics could be pooled to produce indiscriminable model samples was half the eccentricity. Regions of the ventral visual stream of macaque monkeys show different scaling factors. V1 receptive fields remain relatively small as retinal eccentricity increases (scalings of 0.25), V2 receptive fields increase with a scale factor of around 0.5, and V4 receptive fields increase with a scale factor of around 0.8 (J. Freeman & Simoncelli, 2011). Because V2 has a scaling factor of 0.5, and the psychophysically estimated critical scaling factor was 0.5, J. Freeman and Simoncelli suggested that the texture synth features may be good approximations to the computations performed in V2. The prediction of a relationship between texture synth features and V2 has since received physiological support (J. Freeman et al., 2013), and the computations used in the model have been argued to support the idea that each ventral visual area forms part of a cascade in which the same operations are performed on the inputs from the previous layer, leading eventually to the abstracted representations of IT cortex (Movshon & Simoncelli, 2014). 
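The definition of scaling (pooling-region diameter divided by the eccentricity at its center) can be made concrete with a small sketch. The scaling values are those quoted above; the function name and the example eccentricities are hypothetical choices for illustration only.

```python
# Approximate scaling factors quoted in the text (J. Freeman & Simoncelli, 2011).
SCALING_BY_AREA = {"V1": 0.25, "V2": 0.5, "V4": 0.8}


def pooling_diameter_deg(scaling, eccentricity_deg):
    """Pooling-region diameter implied by scaling = diameter / eccentricity."""
    return scaling * eccentricity_deg


# Hypothetical eccentricities in degrees of visual angle.
for ecc in (2.0, 10.0, 20.0):
    diameters = {
        area: pooling_diameter_deg(s, ecc) for area, s in SCALING_BY_AREA.items()
    }
    print(ecc, diameters)
```

With a scaling of 0.5, for example, a pooling region centered at 10° eccentricity would be about 5° in diameter, which is the sense in which the largest indiscriminable pooling region was "half the eccentricity".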
In Experiment 2, we examined the standard texture synth features (as used in J. Freeman et al., 2013), not the stimuli produced by the complex stimulus generation method used in J. Freeman and Simoncelli (2011). The crucial difference between these stimuli is that the standard texture synthesis algorithm generates shift-invariant textures whereas the J. Freeman and Simoncelli algorithm does not. That is, the standard algorithm matches the texture synth features of an image but discards the absolute position of features within the synthesis region. In contrast, the synthesis method of J. Freeman and Simoncelli computes the texture synth features in overlapping pooling regions, and the gradient projection procedure iteratively descends over all parameters across the entire image. The overlap and this form of gradient descent place strong constraints on the possible content of the pooling regions, breaking shift invariance and yielding images that are physically closer to the original images.⁹ 
Our stimuli are therefore not interpretable in terms of the scaling factors studied in J. Freeman and Simoncelli (2011). Although our isolated patches could be argued to have a “scaling” of the patch size in degrees divided by 10 (the retinal eccentricity of the patches in our experiments), the scaling in our stimuli is matched only at 10°. Receptive fields falling to either side of the stimulus center would receive texture synth statistics matched for the wrong scale, potentially facilitating discriminability. Nevertheless, there are several relationships between our findings and theirs that can be discussed. 
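The nominal "scaling" described above (patch size in degrees divided by the 10° eccentricity) can be computed as in the following sketch. The function and constant names are hypothetical, and the two patch sizes are simply sizes mentioned elsewhere in the text. As cautioned above, this value is matched only at the patch center, so it is a rough point of comparison rather than an equivalent of the J. Freeman and Simoncelli (2011) scaling.

```python
ECCENTRICITY_DEG = 10.0  # retinal eccentricity of the patches (from the text)
CRITICAL_SCALING = 0.5   # critical scaling reported by J. Freeman & Simoncelli (2011)


def nominal_scaling(patch_size_deg, eccentricity_deg=ECCENTRICITY_DEG):
    """Patch size divided by eccentricity; valid only at the patch center."""
    return patch_size_deg / eccentricity_deg


# Two patch sizes mentioned in the text, in degrees of visual angle.
for size_deg in (0.74, 5.95):
    s = nominal_scaling(size_deg)
    print(size_deg, round(s, 3), "below critical" if s < CRITICAL_SCALING else "above critical")
```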
We find that standard texture synth stimuli are discriminable from each other and from their original source images for even the smallest pooling region we could test (see leftmost data points in the “no surround” condition, Figure 8). The discrepancy between these results and those of J. Freeman and Simoncelli (2011) suggests that the amount of pooling region overlap is an important parameter in the J. Freeman and Simoncelli model. For our stimuli, the pooling regions did not overlap, and if our stimulus sizes are interpreted as “scalings” (see above), we find sensitivity well below 0.5 (divide the critical size estimates in Figure 9A by 10). J. Freeman and Simoncelli used overlapping pooling regions and found critical scalings of 0.5. These results together suggest that metamerism at a scale of 0.5 depends on the pooling regions and global gradient descent used in that model rather than the texture synth statistics per se. 
According to a texture model, any two images with the same set of shift-invariant features should be metameric. Studies of phase discrimination in grating stimuli show that sensitivity to absolute position is reduced in the periphery (Banks, Bennett, & Gubrud, 1986; Bennett & Banks, 1987, 1991). If our experiments had shown that observers could not discriminate texture syntheses in the periphery, then a simple conclusion would have been possible: Absolute position information up to that critical pooling region is lost. This would have extended the results from grating studies to natural scenes. 
An important open question is whether the critical scale factors estimated by J. Freeman and Simoncelli (2011) using their more complex overlapping pooling region model would differ if synthesized images were compared to their original source images. As outlined in the Introduction, a strong test of a model of peripheral appearance is to compare the model-matched synthesized images to the natural scenes to which they are matched. For patch sizes above the critical size, we find it is easier to discriminate synthesized images from their natural source images than synthesized images from each other. This shows that perceptually important features not encoded by the simpler texture synth model are discriminable at the pooling region sizes we tested. 
Influence of surrounding image content on performance
We investigated the influence of the broader spatial context of the scene by presenting surrounding image patches cropped from either side of the source image. In general, including surrounds slightly reduced asymptotic sensitivity (see Figures 5 and 10). It was plausible that adding surrounding context might have facilitated performance, perhaps by providing a prediction for the true spatial structure of the scene (Bex, 2010; Neri, 2014). This is particularly clear for Experiment 2, in which the natural surrounding structure (middle column of Figures 8 and 9) could have facilitated discrimination of synthesized patches because they would not necessarily match the position of any contour present in the scene. Instead, our results from both experiments are consistent with a detrimental influence of spatially distributed contrast gain control (Bex et al., 2007; Geisler & Albrecht, 1992; Heeger, 1992) or crowding-like effects (Levi, 2008; Wallis & Bex, 2012) and inconsistent with any facilitatory mechanism, such as contour completion or contextual guidance (Larson & Loschky, 2009; Neri, 2011, 2014; Oliva & Torralba, 2006). Indeed, if anything, performance was worse with natural surrounds than synthetic surrounds in Experiment 2 (Figure 10). This is not to say that facilitatory effects do not occur in peripheral scene perception, only that detrimental effects dominate under our conditions and relative to performance on an isolated image patch. 
Conclusion
An oddity paradigm comparing original and modified scenes probes information loss (instantiated by image perturbations) with high sensitivity. Perceptually important features not captured by a candidate model will result in images that do not look the same as the original scenes. Relative to the general idea that the periphery loses a lot of information, we find that observers are highly sensitive to the two types of image perturbations we tested: blur and the Portilla and Simoncelli (2000) texture features. The experimental design used here provides a high-sensitivity paradigm for testing hypotheses about peripheral information loss. 
Acknowledgments
Designed the experiments: TSAW, FAW, MB. Programmed the experiments: TSAW. Collected the data: TSAW. Analyzed the data: TSAW. Wrote the paper: TSAW, FAW, MB. The authors thank Annelie Muehler and Britta Lewke for assistance with data collection. We are grateful to Eero Simoncelli for clarifying the distinctions between the texture synthesis of Portilla and Simoncelli (2000) and the model explored by J. Freeman and Simoncelli (2011). We would like to credit an anonymous reviewer for suggesting elements of the final two sentences of our abstract. Elements of this work were presented at the Vision Sciences Society meeting in 2015. TSAW was supported by an Alexander von Humboldt Postdoctoral Fellowship. Funded, in part, by the German Federal Ministry of Education and Research (BMBF) through the Bernstein Computational Neuroscience Program Tübingen (FKZ: 01GQ1002), the German Excellency Initiative through the Centre for Integrative Neuroscience Tuebingen (EXC307), and the German Science Foundation (DFG; priority program 1527, BE 3848/2-1). 
Commercial relationships: none. 
Corresponding author: Thomas S. A. Wallis. 
Email: thomas.wallis@uni-tuebingen.de. 
Address: Eberhard Karls Universität Tübingen, Tübingen, Germany. 
References
Alam M. M., Vilankar K. P., Field D. J., Chandler D. M. (2014). Local masking in natural images: A database and analysis. Journal of Vision, 14 (8): 22, 1–38, doi:10.1167/14.8.22.
Andriessen J. J., Bouma H. (1975). Eccentric vision: Adverse interactions between line segments. Vision Research, 16 (1), 71–78.
Arnow T. L., Geisler W. S. (1996). Visual detection following retinal damage: Predictions of an inhomogeneous retino-cortical model. In B. E. Stuck & M. Belkin (Eds.), Proceedings of the SPIE, 2674, laser-inflicted eye injuries: epidemiology, prevention, and treatment (pp. 119–130). San Jose, CA: SPIE.
Artal P. (2014). Optics of the eye and its impact in vision: A tutorial. Advances in Optics and Photonics, 6 (3), 340.
Balas B. J. (2006). Texture synthesis and perception: Using computational models to study texture representations in the human visual system. Vision Research, 46 (3), 299–309.
Balas B. J. (2008). Attentive texture similarity as a categorization task: Comparing texture synthesis models. Pattern Recognition, 41 (3), 972–982.
Balas B. J. (2012). Contrast negation and texture synthesis differentially disrupt natural texture appearance. Frontiers in Psychology, 3, 515.
Balas B. J., Conlin C. (2015). Invariant texture perception is harder with synthetic textures: Implications for models of texture processing. Vision Research, 115, 271–279.
Balas B. J., Nakano L., Rosenholtz R. (2009). A summary-statistic representation in peripheral vision explains visual crowding. Journal of Vision, 9 (12): 13, 1–18, doi:10.1167/9.12.13.
Baldwin A., Meese T., Baker D. (2012). The attenuation surface for contrast sensitivity has the form of a witch's hat within the central visual field. Journal of Vision, 12 (11): 23, 1–17, doi:10.1167/12.11.23.
Banks M. S., Bennett P. J., Gubrud G. A. (1986). Phase discrimination in the normal and amblyopic fovea (A). Journal of the Optical Society of America, A, 3, 56.
Banks M. S., Sekuler A. B., Anderson S. J. (1991). Peripheral spatial vision: Limits imposed by optics, photoreceptors, and receptor pooling. Journal of the Optical Society of America, A, 8 (11), 1775–1787.
Bennett P. J., Banks M. S. (1987, April 30). Sensitivity loss in odd-symmetric mechanisms and phase anomalies in peripheral vision. Nature, 326 (6116), 873–876.
Bennett P. J., Banks M. S. (1991). The effects of contrast, spatial scale, and orientation on foveal and peripheral phase discrimination. Vision Research, 31 (10), 1759–1786.
Bex P. J. (2010). (In) sensitivity to spatial distortion in natural scenes. Journal of Vision, 10 (2): 23, 1–15, doi:10.1167/10.2.23.
Bex P. J., Mareschal I., Dakin S. C. (2007). Contrast gain control in natural scenes. Journal of Vision, 7 (11): 12, 1–12, doi:10.1167/7.11.12.
Bex P. J., Solomon S. G., Dakin S. C. (2009). Contrast sensitivity in natural scenes depends on edge as well as spatial frequency structure. Journal of Vision, 9 (10): 1, 1–19, doi:10.1167/9.10.1.
Blackwell H. R. (1952). Studies of psychophysical methods for measuring visual thresholds. JOSA, 42 (9), 606–614.
Blakemore C., Campbell F. W. (1970, April 11). On the existence of neurones in the human visual system selectively sensitive to the orientation and size of retinal images. The Journal of Physiology, 203, 237–260.
Bouma H. (1970). Interaction effects in parafoveal letter recognition. Nature, 226 (5241), 177–178.
Bradley C., Abrams J., Geisler W. S. (2014). Retina-V1 model of detectability across the visual field. Journal of Vision, 14 (12): 22, 1–22, doi:10.1167/14.12.22.
Campbell F. W., Robson J. G. (1968). Application of Fourier analysis to the visibility of gratings. The Journal of Physiology, 197 (3), 551–566.
Cowey A., Rolls E. (1974). Human cortical magnification factor and its relation to visual acuity. Experimental Brain Research, 21 (5), 447–454.
Craven B. J. (1992). A table of d' for M-alternative odd-man-out forced-choice procedures. Perception & Psychophysics, 51 (4), 379–385.
Crouzet S. M., Serre T. (2011). What are the visual features underlying rapid object recognition? Frontiers in Psychology, 2, 326.
Curcio C. A., Allen K. A. (1990). Topography of ganglion cells in human retina. The Journal of Comparative Neurology, 300 (1), 5–25.
Curcio C. A., Sloan K. R., Kalina R. E., Hendrickson A. E. (2004). Human photoreceptor topography. The Journal of Comparative Neurology, 292 (4), 497–523.
De Valois R., De Valois K. (1990). Vernier acuity with stationary moving Gabors. Vision Research, 31 (9), 1619–1626.
Dorr M., Bex P. J. (2013). Peri-saccadic natural vision. Journal of Neuroscience, 33 (3), 1211–1217.
Einhäuser W., Spain M., Perona P. (2008). Objects predict fixations better than early saliency. Journal of Vision, 8 (14): 18, 1–26, doi:10.1167/8.14.18.
Emrith K., Chantler M. J., Green P. R., Maloney L. T., Clarke A. D. F. (2010). Measuring perceived differences in surface texture due to changes in higher order statistics. Journal of the Optical Society of America, A, Optics, Image Science, and Vision, 27 (5), 1232–1244.
Freeman J., Simoncelli E. P. (2011). Metamers of the ventral stream. Nature Neuroscience, 14 (9), 1195–1201.
Freeman J., Ziemba C., Heeger D. J., Simoncelli E. P., Movshon J. A. (2013). A functional and perceptual signature of the second visual area in primates. Nature Neuroscience, 16 (7), 974–981.
Freeman W. T., Adelson E. H. (1991). The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13 (9), 891–906.
Fründ I., Haenel N. V., Wichmann F. A. (2011). Inference for psychometric functions in the presence of nonstationary behavior. Journal of Vision, 11 (6): 16, 1–19, doi:10.1167/11.6.16.
Gattass R., Sousa A. P., Gross C. G. (1988). Visuotopic organization and extent of V3 and V4 of the macaque. Journal of Neuroscience, 8 (6), 1831–1845.
Geisler W. S., Albrecht D. G. (1992). Cortical neurons: Isolation of contrast gain control. Vision Research, 32 (8), 1409–1410.
Geisler W. S., Perry J. S. (1998). Real-time foveated multiresolution system for low-bandwidth video communication. In B. E. Rogowitz & T. N. Pappas (Eds.), Proceedings of the SPIE, 3299, human vision and electronic imaging III (pp. 294–305). San Jose, CA: SPIE.
Gelman A. (2006). Multilevel (hierarchical) modeling: What it can and cannot do. Technometrics, 48 (3), 432–435.
Gelman A., Hill J. (2007). Data analysis using regression and multilevel/hierarchical models. New York: Cambridge University Press.
Gelman A., Rubin D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7 (4), 457–472.
Gelman A., Shalizi C. R. (2012). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66 (1), 8–38.
Gerhard H. E., Wichmann F. A., Bethge M. (2013). How sensitive is the human visual system to the local statistics of natural images? PLoS Computational Biology, 9 (1), e1002873.
Graham N., Nachmias J. (1971). Detection of grating patterns containing two spatial frequencies: A comparison of single-channel and multiple-channels models. Vision Research, 11 (3), 251–259.
Haun A. M., Peli E. (2013). Perceived contrast in complex images. Journal of Vision, 13 (13): 3, 1–21, doi:10.1167/13.13.3.
Heeger D. J., Bergen J. (1995). Pyramid-based texture analysis/synthesis. In S. G. Mair & R. Cook (Eds.), Proceedings of the 22nd annual conference on computer graphics and interactive techniques (pp. 229–238). New York: ACM.
Heeger D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9 (2), 181–197.
Herrmann K., Montaser-Kouhsari L., Carrasco M., Heeger D. J. (2010). When size matters: Attention affects performance by contrast or response gain. Nature Neuroscience, 13 (12), 1554–1559.
Hocke J., Dorr M., Barth E. (2012). A compressed sensing model of crowding in peripheral vision. In B. E. Rogowitz, T. N. Pappas, & H. de Ridder (Eds.), Proceedings of the SPIE, 8291, human vision and electronic imaging XVII, doi:10.1117/12.908664. Burlingame, CA: SPIE.
Jäkel F., Wichmann F. A. (2006). Spatial four-alternative forced-choice method is the preferred psychophysical method for naive observers. Journal of Vision, 6 (11): 13, 1307–1322, doi:10.1167/6.11.13.
Judd T., Durand F., Torralba A. (2012). A benchmark of computational models of saliency to predict human fixations. CSAIL Technical Reports.
Judd T., Ehinger K., Durand F., Torralba A. (2009). Learning to predict where humans look. In Computer Vision, 2009 IEEE, 12th International Conference on (pp. 2106–2113). Kyoto, Japan: IEEE.
Kelly D. H. (1984). Retinal inhomogeneity. I. Spatiotemporal contrast sensitivity. Journal of the Optical Society of America A, 1 (1), 107–113.
Kienzle W., Franz M. O., Schölkopf B., Wichmann F. A. (2009). Center-surround patterns emerge as optimal predictors for human saccade targets. Journal of Vision, 9 (5): 7, 1–15, doi:10.1167/9.5.7.
Kleiner M., Brainard D. H., Pelli D. G. (2007). What's new in Psychtoolbox-3? Perception, 36, Abstract.
Kruschke J. K. (2011). Doing Bayesian data analysis. Burlington, MA: Academic Press/Elsevier.
Kuss M., Jäkel F., Wichmann F. A. (2004). Bayesian inference for psychometric functions. Journal of Vision, 5 (5): 8, 478–492, doi:10.1167/5.5.8.
Larson A. M., Loschky L. C. (2009). The contributions of central versus peripheral vision to scene gist recognition. Journal of Vision, 9 (10): 6, 1–16, doi:10.1167/9.10.6.
Lee M. D., Wagenmakers E.-J. (2014). Bayesian cognitive modeling: A practical course. Cambridge, UK: Cambridge University Press.
Levi D. M. (2008). Crowding–An essential bottleneck for object recognition: A mini-review. Vision Research, 48 (5), 635–654.
Loschky L., McConkie G., Yang J., Miller M. (2005). The limits of visual resolution in natural scene viewing. Visual Cognition, 12 (6), 1057–1092.
Love J., Selker R., Marsman M., Jamil T., Dropmann D., Verhagen A. J., Wagenmakers E.-J. (2015). JASP (Version 0.7). Computer software.
Macmillan N. A., Creelman C. D. (2005). Detection theory: A user's guide. Mahwah, NJ: Lawrence Erlbaum.
McDonald J. S., Tadmor Y. (2006). The perceived contrast of texture patches embedded in natural images. Vision Research, 46 (19), 3098–3104.
Morey R. D., Rouder J. N. (2015). BayesFactor. Computer software.
Motter B. C. (2009). Central V4 receptive fields are scaled by the V1 cortical magnification and correspond to a constant-sized sampling of the V1 surface. Journal of Neuroscience, 29 (18), 5749.
Movshon J. A., Simoncelli E. P. (2014). Representation of naturalistic image structure in the primate visual cortex. Cold Spring Harbor symposia on quantitative biology (pp. 115–122). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Neri P. (2011). Global properties of natural scenes shape local properties of human edge detectors. Frontiers in Psychology, 2, 172.
Neri P. (2014). Semantic control of feature extraction from natural scenes. Journal of Neuroscience, 34 (6), 2374–2388.
Nuthmann A., Einhäuser W. (2015). A new approach to modeling the influence of image features on fixation selection in scenes: Modeling fixation selection in scenes. Annals of the New York Academy of Sciences, 1339 (1), 82–96.
Okazawa G., Tajima S., Komatsu H. (2015). Image statistics underlying natural texture selectivity of neurons in macaque V4. Proceedings of the National Academy of Sciences, USA, 112 (4), E351–E360.
Oliva A., Torralba A. (2006). Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155, 23–36.
Peli E., Geri G. A. (2001). Discrimination of wide-field images as a test of a peripheral-vision model. JOSA A, 18 (2), 294–301.
Peli E., Yang J., Goldstein R. B. (1991). Image invariance with changes in size: The role of peripheral contrast thresholds. JOSA A, 8 (11), 1762–1774.
Pelli D. G., Tillman K. A. (2008). The uncrowded window of object recognition. Nature Neuroscience, 11 (10), 1129–1135.
Perry J. S., Geisler W. S. (2002). Gaze-contingent real-time simulation of arbitrary visual fields. In B. E. Rogowitz & T. N. Pappas (Eds.), Proceedings of the SPIE 4662, human vision and electronic imaging VII (pp. 57–69). San Jose, CA: SPIE.
Portilla J., Simoncelli E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40 (1), 49–70.
Robson J. G., Graham N. (1981). Probability summation and regional variation in contrast sensitivity across the visual field. Vision Research, 21 (3), 409–418.
Rosenholtz R., Huang J., Ehinger K. A. (2012). Rethinking the role of top-down attention in vision: Effects attributable to a lossy representation in peripheral vision. Frontiers in Psychology, 3, 13.
Rouder J. N., Morey R. D., Speckman P. L., Province J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56 (5), 356–374.
Rovamo J., Mustonen J., Näsänen R. (1994). Modelling contrast sensitivity as a function of retinal illuminance and grating area. Vision Research, 34 (10), 1301–1314.
Rovamo J., Virsu V., Näsänen R. (1978, Jan 5). Cortical magnification factor predicts the photopic contrast sensitivity of peripheral vision. Nature, 271 (5640), 54–56.
Russell B. C., Torralba A., Murphy K. P., Freeman W. T. (2008). LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77 (1), 157–173.
Sebastian S., Burge J., Geisler W. S. (2015). Defocus blur discrimination in natural images with natural optics. Journal of Vision, 15 (5): 16, 1–17, doi:10.1167/15.5.16.
Séré B., Marendaz C., Hérault J. (2000). Nonhomogeneous resolution of images of natural scenes. Perception, 29 (12), 1403–1412.
Sorensen T., Hohenstein S., Vasishth S. (2015). Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists. arXiv:1506.06201. Available at http://arxiv.org/abs/1506.06201
Stan Development Team. (2015). Stan: A C++ library for probability and sampling, version 2.6.0.
Tadmor Y., Tolhurst D. (1994). Discrimination of changes in the second-order statistics of natural and synthetic images. Vision Research, 34 (4), 541–554.
Thaler L., Schütz A. C., Goodale M. A., Gegenfurtner K. R. (2013). What is the best fixation target? The effect of target shape on stability of fixational eye movements. Vision Research, 76, 31–42.
Thomson M. G. A., Foster D. H., Summers R. J. (2000). Human sensitivity to phase perturbations in natural images: A statistical framework. Perception, 29 (9), 1057–1070.
To M. P. S., Gilchrist I., Troscianko T., Tolhurst D. J. (2011). Discrimination of natural scenes in central and peripheral vision. Vision Research, 51 (14), 1686–1698.
Toet A., Levi D. M. (1992). The two-dimensional shape of spatial interaction zones in the parafovea. Vision Research, 32 (7), 1349–1357.
Tolhurst D. J., Tadmor Y. (1997). Band-limited contrast in natural images explains the detectability of changes in the amplitude spectra. Vision Research, 37 (23), 3203–3215.
van der Walt S., Schönberger J. L., Nunez-Iglesias J., Boulogne F., Warner J. D., Yager N., … the scikit-image contributors. (2014). scikit-image: Image processing in Python. PeerJ, 2, e453.
Vilankar K. P., Golden J. R., Chandler D. M., Field D. J. (2014). Local edge statistics provide information regarding occlusion and nonocclusion edges in natural scenes. Journal of Vision, 14 (9): 13, 1–21, doi:10.1167/14.9.13.
Vilankar K. P., Vasu L., Chandler D. M. (2011). On the perception of band-limited phase distortion in natural scenes. In B. E. Rogowitz & T. N. Pappas (Eds.), Proceedings of the SPIE 7865, human vision and electronic imaging XVI, 78650C, doi:10.1117/12.872657.
Vincent B. T. (2015). A tutorial on Bayesian models of perception. Journal of Mathematical Psychology, 66, 103–114.
Wallis T. S. A., Bex P. J. (2012). Image correlates of crowding in natural scenes. Journal of Vision, 12 (7): 6, 1–19, doi:10.1167/12.7.6.
Wallis T. S. A., Dorr M., Bex P. J. (2015). Sensitivity to gaze-contingent contrast increments in naturalistic movies: An exploratory report and model comparison. Journal of Vision, 15 (8): 3, 1–33, doi:10.1167/15.8.3.
Wallis T. S. A., Taylor C. P., Wallis J., Jackson M. L., Bex P. J. (2014). Characterization of field loss based on microperimetry is predictive of face recognition difficulties. Investigative Ophthalmology & Visual Science, 55 (1), 142–153.
Watson A. B. (2014). A formula for human retinal ganglion cell receptive field density as a function of visual field location. Journal of Vision, 14 (7): 15, 1–17, doi:10.1167/14.7.15.
Watson A. B., Ahumada A. J., Farrell J. E. (1986). Window of visibility: A psychophysical theory of fidelity in time-sampled visual motion displays. Journal of the Optical Society of America A, 3 (3), 300–307.
Wichmann F., Drewes J., Rosas P., Gegenfurtner K. R. (2010). Animal detection in natural scenes: Critical features revisited. Journal of Vision, 10 (4): 6, 1–27, doi:10.1167/10.4.6.
Wichmann F. A., Braun D. I., Gegenfurtner K. R. (2006). Phase noise and the classification of natural images. Vision Research, 46 (8), 1520–1529.
Footnotes
1  Note also that our choice to use temporal intervals means that working memory may significantly influence performance. The memory component of our temporal oddity task is arguably smaller than in ABX, also called delayed match-to-sample, which is often employed as a test of working memory rather than perception.
2  The experiments reported in this paper were exploratory. The number of participants was determined based on availability rather than set a priori, and some analysis decisions were made having seen the data. This is one reason that we do not rely on p values for inference, instead using a Bayesian quantification of uncertainty. In addition, we conducted a number of related pilot experiments not reported in this paper, in which we investigated different stimulus and task configurations (yielding qualitatively similar results to those reported here).
3  The distribution of prestandardized RMS contrasts in the final set of 200 unique images had a mean of 0.55 (SD = 0.20).
4  To ensure that each inner, middle, and outer patch contained visible image content, we excluded images with an RMS contrast of less than 0.1 in any patch.
5  Observe that the violin plots of the prior distributions have a bow tie shape. The heavier prior density approaching the allowable bounds of the parameters (0 and 16 for critical scale, 0 and 5 for maximum sensitivity, and 0 and 10 for slope) is due to the parameterization of the model. Because we bound the parameters using an inverse logit nonlinearity, the priors we place over the linear predictor scale are distorted when examined on the scale of the parameter (see Appendix 1 for details). Given that the posterior densities can move well away from areas with less prior density and the model fits (posterior predictive check) appear reasonable (Figure 8), this parameterization has little effect on our core conclusions. The same comment applies to Experiment 2.
6  The code for Geisler and Perry's Space Variant Imaging System Toolbox (SVIS) is available at http://github.com/jeffsp/svis. The size of the resolution map was set to match our experimental conditions; other settings remained at their defaults, which correspond to the parameters reported to match grating detection experiments in Geisler and Perry (1998). We thank Jeff Perry for providing the code and troubleshooting help.
7  This highlights one advantage of using highly experienced observers in psychophysical experiments. Despite some training (at least 150 trials) and the experimenter's pleas, our inexperienced subjects were quite inefficient in their eye movement control over such a long trial duration (1.6 s). Compare these numbers to the relatively low rates of fixation breaking in Experiment 1 for experienced observers under the same temporal stimulus conditions.
8  The two comparison types (natural vs. synth and synth vs. synth) had similar near-zero critical patch size estimates, but because of the obvious floor effect, this equality is difficult to interpret.
9  Balas et al. (2009) achieved a similar effect for single pooling regions by seeding the texture synth gradient descent procedure with a low-pass, noisy version of the original image rather than with white noise (as is the default and what we used in this paper). This encourages the gradient descent to maintain the global position of coarse-scale features in the original scene (see Balas et al., 2009, their figure 7).
10  The offset parameter was coded as the weighted sum of main effect and interaction terms as in a generalized linear model. We created a design matrix for the conditions in the experiment with sum-to-zero orthogonal contrasts (using Python's Patsy module), including all main effect and interaction terms. The sum of these terms weighted by their coefficients gives an offset for each condition.
11  Similar results were found by taking the mean.
Appendix 1: Modeling
Our experiments sought to measure the stimulus level (blur for Experiment 1, stimulus size for Experiment 2) at which the images begin to appear different. To quantify this stimulus level, we adapted the model fitting technique employed by J. Freeman and Simoncelli (2011) by having one parameter determine the stimulus level below which d′ is zero. 
Discriminability at stimulus level x was determined by a three-parameter function:
$$d'(x) = \begin{cases} \alpha \tanh\left(\beta\,(x - \gamma)\right) & \text{if } x > \gamma \\ 0 & \text{otherwise,} \end{cases}$$
consisting of the critical stimulus level γ, a maximum sensitivity parameter α that controlled the asymptotic performance level (in units of d′), and a slope parameter β that controlled the rate of performance improvement for stimulus levels between the critical stimulus value and the maximum. Intermediate performance values had a sigmoidal shape given by the hyperbolic tangent function. We chose the hyperbolic tangent for mathematical convenience because its value is zero when its argument is zero (unlike, e.g., the inverse logit).
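The shape of this function can be made concrete with a short Python sketch. It assumes the form implied by the description (zero up to γ, then α·tanh(β(x − γ))); the function name `d_prime` and the parameter values are illustrative and are not taken from the authors' code.

```python
import math

def d_prime(x, alpha, beta, gamma):
    """Predicted sensitivity at stimulus level x.

    Assumed form: zero at or below the critical level gamma, then
    alpha * tanh(beta * (x - gamma)), which saturates at alpha.
    """
    if x <= gamma:
        return 0.0
    return alpha * math.tanh(beta * (x - gamma))

# Hypothetical parameter values chosen only for illustration.
for level in (1.0, 2.0, 4.0, 16.0):
    print(level, round(d_prime(level, alpha=3.0, beta=0.5, gamma=2.0), 3))
```

Because tanh(0) = 0, the function is continuous at the critical level, which is the convenience the text refers to.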
We now have a prediction for d′ given some parameters, but the data we wish to fit are proportions correct. To our knowledge, there is no published analytical link function between d′ and hit rate in a three-alternative oddity task. For an observer assumed to find the stimulus pair with the smallest difference and then select the stimulus not in that pair as the oddball, Macmillan and Creelman (2005) recommend the tables of Craven (1992), which were generated via numerical simulation of an observer model. Because such a lookup table will not work with gradient-based optimization methods (see below), we instead approximated this link function using a Weibull function:
$$p(\text{correct}) = \frac{1}{m} + \left(1 - \frac{1}{m}\right)\left(1 - e^{-(d'/\lambda)^{k}}\right)$$
for scale λ, shape k, and m set to three (the number of alternatives). We minimized the squared difference between the Weibull and simulated results as in Craven (1992). Interestingly, this function fit the table values essentially perfectly with scale and shape parameters of 2.84268839 and 1.86565947, respectively. There is perhaps a theoretical result to be developed here, but for the present paper, this function serves to link our prediction of d′ to performance.
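The link function can be checked against a simple simulation. The sketch below assumes the usual Weibull psychometric form with guess rate 1/m, and it assumes a one-dimensional, equal-variance Gaussian observer who applies the smallest-difference rule described above. The details of Craven's (1992) simulation are not given in the text, so this is an illustration rather than a replication; the function names are hypothetical.

```python
import math
import random

LAMBDA = 2.84268839  # scale reported in the text
K = 1.86565947       # shape reported in the text

def weibull_pc(d, lam=LAMBDA, k=K, m=3):
    """Proportion correct for sensitivity d >= 0 (assumed Weibull form)."""
    return 1.0 / m + (1.0 - 1.0 / m) * (1.0 - math.exp(-((d / lam) ** k)))

def simulate_oddity_pc(d, n_trials=100_000, seed=1):
    """Monte Carlo estimate for an assumed 1-D Gaussian oddity observer."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        odd = rng.gauss(d, 1.0)   # the odd stimulus, shifted by d
        s1 = rng.gauss(0.0, 1.0)  # first standard
        s2 = rng.gauss(0.0, 1.0)  # second standard
        # The observer picks the stimulus left out of the most similar pair,
        # so the response is correct when the two standards form that pair.
        if abs(s1 - s2) < min(abs(odd - s1), abs(odd - s2)):
            correct += 1
    return correct / n_trials

for d in (0.0, 1.0, 2.0, 3.0):
    print(d, round(weibull_pc(d), 3), round(simulate_oddity_pc(d), 3))
```

At d′ = 0 the Weibull form returns the chance rate of 1/3 by construction, and the two columns printed above can be compared directly at the other levels.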
We now wish to determine plausible values for the three parameters of the d′ function for each experimental condition and observer. To do this, we estimate the posterior distribution of the model simultaneously for all observers and conditions using a multilevel model. The data (the number of correct responses from some number of trials for each stimulus level, subject, and condition) are assumed to be binomially distributed with expected proportion of successes given by the link function above. 
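The chain from parameters to likelihood can be sketched as follows. The functional forms of `d_prime` and `weibull_pc` are assumptions consistent with the descriptions above, the counts are made up for illustration, and the authors' actual model (fit in Stan) adds the multilevel structure described next.

```python
import math

def d_prime(x, alpha, beta, gamma):
    """Assumed sensitivity function: zero up to gamma, then a tanh rise."""
    return 0.0 if x <= gamma else alpha * math.tanh(beta * (x - gamma))

def weibull_pc(d, lam=2.84268839, k=1.86565947, m=3):
    """Assumed Weibull link from d' to proportion correct."""
    return 1.0 / m + (1.0 - 1.0 / m) * (1.0 - math.exp(-((d / lam) ** k)))

def binomial_log_lik(n_correct, n_trials, p):
    """Log probability of n_correct successes in n_trials at success rate p."""
    log_choose = (math.lgamma(n_trials + 1) - math.lgamma(n_correct + 1)
                  - math.lgamma(n_trials - n_correct + 1))
    return (log_choose + n_correct * math.log(p)
            + (n_trials - n_correct) * math.log(1.0 - p))

def log_lik(data, alpha, beta, gamma):
    """Sum over (stimulus_level, n_correct, n_trials) triples."""
    return sum(
        binomial_log_lik(k, n, weibull_pc(d_prime(x, alpha, beta, gamma)))
        for x, k, n in data
    )

# Made-up counts for one hypothetical subject and condition.
data = [(1.0, 12, 36), (4.0, 25, 36), (8.0, 33, 36)]
print(log_lik(data, alpha=3.0, beta=0.5, gamma=2.0))
print(log_lik(data, alpha=3.0, beta=0.5, gamma=7.0))
```

With these invented counts, a critical level below the stimulus values that produced above-chance performance yields the higher log-likelihood.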
We treat the three parameters of the d′ function for each subject in each condition as arising from a linear combination of a mean parameter for the subject (grand mean across all conditions) plus an offset caused by the condition. To describe this model structure in more detail, we give a worked example for the critical stimulus level parameter $\gamma_{i,j}$, corresponding to the critical stimulus level for subject i in condition j; the same structure applies to the other two parameters.
First, we consider each subject i to have an unknown grand mean $a_i$, which could take any real value. This represents the subject's mean critical stimulus level (on the linear predictor scale) across all conditions. Second, for each experimental condition j an offset parameter $b_{i,j}$ was estimated (see Footnote 10), also on the linear predictor scale. If $b_{i,j}$ were zero, it would mean that condition j is not credibly different from the grand mean over all of subject i's data. The linear predictor $a_i + b_{i,j}$ was then passed through the inverse logit function, giving a sigmoid in the [0, 1] range. Finally, the result was multiplied by a fixed upper bound c determined for each experiment according to the meaningful range of the parameter, so that $\gamma_{i,j} = c \cdot \mathrm{logit}^{-1}(a_i + b_{i,j})$. In all experiments, the upper bound of the maximum sensitivity parameter α was set to five. A d′ larger than five is hard to measure in a discrimination paradigm because performance asymptotes. The slope parameter β was given a maximum of 10. Although quite arbitrary, this maximum allowed reasonable fits to the data where the slope parameter could be meaningfully constrained and ruled out step functions for conditions in which the slope could not be measured. The critical stimulus level parameter γ was given an upper bound of 16 in the blur experiment and 12 in the texture synth experiment. These bounds correspond to the largest stimulus value we measured in each experiment and so represent an extremely large possible range for the critical stimulus value (essentially allowing performance to never exceed chance).
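The bounding step can be illustrated in a few lines of Python. This is a minimal sketch of the transformation as described (inverse logit of the subject mean plus condition offset, scaled by the upper bound); the function names and the numerical values are hypothetical.

```python
import math

def inv_logit(z):
    """Numerically stable inverse logit mapping any real z into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bounded_parameter(a_i, b_ij, upper):
    """Map the linear predictor a_i + b_ij onto the range from 0 to upper."""
    return upper * inv_logit(a_i + b_ij)

# Hypothetical values: one subject's grand mean and offsets for two conditions.
a_subject = -0.4
offsets = {"condition_1": 0.0, "condition_2": 0.8}
for name, b in offsets.items():
    print(name, round(bounded_parameter(a_subject, b, upper=16.0), 2))
```

A linear predictor of zero maps to the middle of the allowed range, and large positive or negative values approach the bounds without crossing them.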
The model becomes “multilevel” when we assume that the parameters of individual subjects arise from some unknown population distribution with some mean and variance. This assumption can lead to more robust inference: Subjects or conditions with fewer data points are estimated to be closer to the population mean unless the data strongly suggest otherwise, meaning that parameter estimates are conservatively constrained relative to independent maximum-likelihood estimates (see Gelman, 2006; Gelman & Hill, 2007; Gelman & Shalizi, 2012; Kruschke, 2011, for further discussion). We estimated population mean and variance parameters of the subject mean and offset parameters a and b. We denote these population parameters as, for example, $\mu_a$ and $\sigma_a$.
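The pull toward the population mean can be seen in a deliberately simplified case. The sketch below uses a normal-normal model with known variances, which is not the model fit in this paper but has a closed-form posterior mean; all values are hypothetical.

```python
def shrunken_estimate(subject_mean, subject_se, pop_mean, pop_sd):
    """Posterior mean in a normal-normal model with known variances.

    The result is a precision-weighted average of the subject's own
    estimate and the population mean.
    """
    w_subject = 1.0 / subject_se ** 2
    w_pop = 1.0 / pop_sd ** 2
    return (w_subject * subject_mean + w_pop * pop_mean) / (w_subject + w_pop)

# Two hypothetical subjects with the same raw estimate but different precision.
well_measured = shrunken_estimate(2.0, subject_se=0.2, pop_mean=1.0, pop_sd=0.5)
poorly_measured = shrunken_estimate(2.0, subject_se=1.0, pop_mean=1.0, pop_sd=0.5)
print(round(well_measured, 3), round(poorly_measured, 3))
```

The subject with the larger standard error (fewer data) ends up closer to the population mean, which is the qualitative behavior described above.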
We imposed weakly informed priors based on the problem domain. This serves to make very implausible parameter values unlikely while leaving the data free to speak. The large differences between prior and posterior distributions evident in our results show this to be the case. Specifically, the population-level means of parameters on the linear predictor scale ($\mu_a$ and $\mu_b$ for each parameter) were assumed to be normal with mean 0 and SD 0.5, thus placing the a priori estimate of the parameter in the middle of our assumed range. The spreads of the parameter values across subjects ($\sigma_a$ and $\sigma_b$ for each parameter) were assumed to follow a truncated Cauchy distribution with location 0 and scale 1 (and a lower bound of 0). We examined the dependence of our conclusions on these prior variance terms by fitting models in which prior variances were halved and doubled (not shown). The central tendency of the posterior model parameters was essentially unchanged; halving (doubling) prior uncertainty reduced (increased) the width of the 95% credible intervals. Therefore, our core conclusions do not depend strongly on the prior specifications. For other model-fitting and parameterization details, interested readers are encouraged to consult the code (available at http://doi.org/10.5281/zenodo.34218).
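To see what these priors imply on the scale of a bounded parameter, one can draw from them and push the draws through the inverse logit. The sketch below is a simplification: it samples only a subject-level mean term (no condition offsets) and assumes that subject values are normally distributed around the population mean, which the text does not state explicitly. Function names are hypothetical.

```python
import math
import random

def inv_logit(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample_prior_parameter(upper, rng):
    """One prior draw of a bounded parameter for a hypothetical subject."""
    mu = rng.gauss(0.0, 0.5)                               # population mean
    sigma = abs(math.tan(math.pi * (rng.random() - 0.5)))  # half-Cauchy(0, 1)
    a = rng.gauss(mu, sigma)                               # subject value (normal assumed)
    return upper * inv_logit(a)

rng = random.Random(0)
draws = sorted(sample_prior_parameter(16.0, rng) for _ in range(20_000))
median = draws[len(draws) // 2]
print("median:", round(median, 2))
print("range:", round(draws[0], 4), round(draws[-1], 4))
```

Plotting a histogram of such draws is one way to inspect how priors specified on the linear predictor scale appear on the parameter scale (compare Footnote 5).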
Because the posterior distribution of this model is not analytically derivable, we estimate it using an MCMC technique. Specifically, we use the Stan package via its PyStan interface (version 2.6.0; Stan Development Team, 2015). MCMC methods sample from unknown probability distributions; when the algorithm has converged, the probability of drawing a sample is equivalent to its probability under the distribution. In this way, properties of the target distribution can be estimated by computing the relevant statistic (e.g., the mean) over the samples. In this case, the probability distribution we are estimating is the posterior of the model parameters given the data, prior, and model.
Samples were derived from four independent chains of 10,000 iterations each, the first half of which was treated as a “warm-up” for the purpose of tuning sampling parameters (these samples were discarded from analysis). To reduce file size and autocorrelation in the final samples, we saved only every fifth sample. This procedure resulted in a final count of 4,000 samples (1,000 per chain). Convergence was assessed via the $\hat{R}$ statistic (Gelman & Rubin, 1992).
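The sample count and the convergence check can be sketched in plain Python. The `potential_scale_reduction` function implements the basic between-chain versus within-chain variance comparison of Gelman and Rubin (1992) for a single parameter; Stan's reported diagnostic is a refinement of this basic form, so the sketch illustrates the idea rather than the exact computation used for the paper. Function names are hypothetical.

```python
import random
import statistics

def retained_samples(n_chains, n_iter, warmup_fraction, thin):
    """Number of saved draws after discarding warm-up and thinning."""
    post_warmup = int(n_iter * (1 - warmup_fraction))
    return n_chains * (post_warmup // thin)

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

print(retained_samples(n_chains=4, n_iter=10_000, warmup_fraction=0.5, thin=5))

rng = random.Random(0)
# Four chains sampling the same distribution: statistic should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four chains stuck at different locations: statistic should be well above 1.
stuck = [[rng.gauss(float(j), 1.0) for _ in range(1000)] for j in range(4)]
print(round(potential_scale_reduction(mixed), 3))
print(round(potential_scale_reduction(stuck), 3))
```

Values near 1 indicate that the chains agree with one another; values clearly above 1 indicate that they have not mixed.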
For good introductions to Bayesian inference using MCMC with a focus on cognitive science applications, the reader is referred to the tutorial papers by Sorensen, Hohenstein, and Vasishth (2015) and Vincent (2015) and textbooks by Kruschke (2011) and Lee and Wagenmakers (2014). For other examples of MCMC techniques applied to psychophysical data, see Fründ, Haenel, and Wichmann (2011); Kuss, Jäkel, and Wichmann (2004); Wallis et al. (2015); and Wallis, Taylor, Wallis, Jackson, and Bex (2014).
Appendix 2: Analysis of gain parameter changes using Bayes factors
To provide a complementary analysis of the experimental manipulations on the gain parameter, we reduced the model posteriors to point estimates by taking the median (see Footnote 11) for each subject in each condition, then conducted a Bayesian repeated-measures ANOVA using the free statistics software JASP (Love et al., 2015; Morey & Rouder, 2015; Rouder, Morey, Speckman, & Province, 2012). Note that because these results are based on point estimates from our multilevel model, they depend on the assumptions made by that model. We used the default prior settings in JASP (version 0.7.1.12). The JASP analysis files are included in the online code repository (http://doi.org/10.5281/zenodo.34218).
For Experiment 1, a 2 (context; surround vs. no surround) × 2 (diameter; 0.74° and 5.95°) ANOVA was conducted (JASP output is summarized in Table A1). The model with the highest posterior probability is the full model, containing both main effects and the interaction term. The log Bayes factor for the interaction model over the additive model is 9.875 − 7.687 = 2.188, indicating that the data were about nine-to-one in favor of the inclusion of the interaction term ($e^{2.188} \approx 8.9$). There was strong evidence in favor of a main effect of diameter: Compared to the null model, modeling the effect of patch size was favored 244-to-one. Marginalizing over the influence of surround strengthened this evidence: $e^{7.687 - (-0.368)} \approx 3{,}150$-to-one in favor of including patch size. The main effect of context was equivocal: Compared to the null model, there was weak evidence (approximately 1.4-to-one) for ignoring context. When the influence of patch size is also modeled, the simple effects of context become evident. The Bayes factor for the additive model (containing both effects) compared to modeling diameter only (ignoring context) is $e^{7.687 - 5.498} \approx 8.9$, indicating evidence for including context effects in a model that accounts for the differences in patch diameter. Thus, the relatively weak influence of context depends on marginalizing out the variance associated with patch diameter, consistent with the interaction effect above. To summarize, these results suggest that asymptotic sensitivity is slightly higher in the no surround condition compared to the surround condition, higher for large patch diameters, and that the difference between context conditions depends on the patch diameter such that for small patches the surround has a relatively large influence (Figure 5).
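The odds quoted in this paragraph follow from differences of log Bayes factors. The short Python check below uses only the log Bayes factors against the null model that appear in the text for Experiment 1; the dictionary keys and the function name are illustrative labels.

```python
import math

# Log Bayes factors against the null model, as quoted in the text.
log_bf = {
    "context": -0.368,
    "diameter": 5.498,
    "context + diameter": 7.687,
    "full": 9.875,
}

def bayes_factor(model_a, model_b):
    """Odds in favor of model_a over model_b from their log Bayes factors."""
    return math.exp(log_bf[model_a] - log_bf[model_b])

print(round(bayes_factor("full", "context + diameter"), 1))      # interaction term
print(round(math.exp(log_bf["diameter"])))                       # diameter vs. null
print(round(bayes_factor("context + diameter", "context")))      # diameter, given context
print(round(1.0 / math.exp(log_bf["context"]), 2))               # null vs. context
print(round(bayes_factor("context + diameter", "diameter"), 1))  # context, given diameter
```

Because every model is compared against the same null model, the Bayes factor between any two models is the exponential of the difference between their log Bayes factors.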
Table A1
 
Bayesian repeated-measures ANOVA for the median gain parameter, Experiment 1. Notes: For each model, columns give the prior probability P(M), the posterior P(M|data), and the log of the model Bayes factor with respect to the null model. All models include subject as an additional factor. Errors in estimating the BFs numerically were below 2%.
Table A2
 
Bayesian repeated-measures ANOVA for the median gain parameter, Experiment 2. Notes: For each model, columns give the prior probability P(M), the posterior P(M|data), and the log of the model Bayes factor with respect to the null model. All models include subject as an additional factor. Errors in estimating the BFs numerically were below 2.1%.
For Experiment 2, a 3 (context) × 2 (comparison type) Bayesian repeated-measures ANOVA (Table A2) revealed weak evidence against a main effect of context condition ignoring comparison (the Bayes factor against the null model is ≈0.43, or 2.33-to-one in favor of the null model), strong evidence for a main effect of comparison type ($e^{13.612}$ ≈ 815,000-to-one in favor of including comparison type compared to the null), and weak evidence against an interaction between comparison type and context ($e^{16.036 - 15.02}$ ≈ 2.76-to-one in favor of the additive model). Observers were more sensitive when comparing natural images to synthesized images than synthesized images to each other, but there was little evidence for an influence of context on the results (Figure 9).
Appendix 3: Scene scale versus patch size
One noticeable feature of the data in Experiment 2 is the drop in performance at large patch sizes in the synth versus synth condition. That is, performance is poor for small sizes, rises as size increases, and then deteriorates once more for the largest sizes but only in the synth versus synth case.
We speculated that this peak in performance was caused by an interaction between the size of the texture synth pooling regions (the patch size) and the scale of image structure in the scenes. A small image patch cropped from a natural image will tend to contain little structure due to the sparseness of edges in many scenes. Such patches are often noisy and “texture-like,” which is well suited to appearance matching by the texture synth model. Patches at middle scales tend to contain sparse edge structure. Although the texture synth representation can match the appearance of an isolated edge, it does not precisely match the position or orientation of such features. Such synthesized patches are easily discriminable from natural patches and from each other. The largest patch sizes used in Experiment 2 often contain many edges and features because they encompass nearly a whole scene. The large amount of image structure in the source patches can be jumbled together into texture-like syntheses by the texture synth representation. These textures are hard to tell apart from each other, but they are easily discriminated from the original scenes.
To test this hypothesis, we created smaller patches by cropping and synthesizing large image patches and then down-sampling them. This keeps the average scale of scene content constant for each patch size to the extent that such scene content can be rendered in fewer pixels. If performance depends on the scale of image structure in the scenes, performance should no longer peak at a middle scale factor in the synth versus synth condition. 
As in Experiment 2, we randomly selected 700 images from the MIT1003 database with a minimum side length of 512 pixels and cropped a 512 × 512 patch from each image's center. These image patches were gray scaled and normalized as above. Each image was randomly assigned to one of the seven patch sizes from Experiment 2. The images in each patch size condition were not all the same as in Experiment 2. We generated five unique syntheses from each image patch, all at 512 × 512 pixels. The original and synthesized images were then down-sampled using the transform.resize function from scikit-image with default settings (linear interpolation for down-sampling) to produce patches of the assigned size. This procedure dissociates scene content and image features from the size at which they are displayed. 
We interleaved synth versus synth and nat versus synth trials as in Experiment 2, and the no surround condition was identical. Rather than cropping image patches from neighboring regions of the scene as in Experiment 2, here we used the two additional synthesized images as the inner and outer surrounds. That is, here the surround patches were always synthesized from the same source patch as the target, always abutted the borders of the target patch, and were always shown at the same size as the target (recall that in Experiment 2 the surround patches were always 256 pixels square). Because this procedure caused the inner surround to overlap the fixation spot at the two largest patch sizes, we only measured the smallest five patch sizes in the surround condition.
We quantified the critical size by fitting the data as in Experiment 2. Fixation breaks and blinks resulted in discarding 0.8% of trials from S1 and 24.8% of trials from S2. Eleven trials were discarded for timing errors. The final data set consisted of 2,380 trials from S1 and 1,803 trials from S2.
The results (see Figure A1) show no evidence of the deterioration in performance at large patch sizes observed in Figure 8. Note that performance at the largest patch size in this experiment is higher for both subjects than in Figure 8; we believe this is likely a practice effect. 
Figure A1
 
Discrimination performance for stimuli created by down-sampling the largest patch size (see text), for two observers (rows) in two surround conditions (columns). Blue data points and curves show the natural versus synthetic comparison; green denotes the synth versus synth comparisons. Descriptions of all plot elements as in Figure 8. Performance asymptotes at large patch sizes rather than deteriorating as in Figure 8, suggesting that there is an interaction between patch size and the scale of objects in the scene.
We suggest that the reason for the patch size tuning apparent in Figure 8 is that the intermediate patch sizes contain sparse features (single edges) that the texture synth algorithm reproduces but can place anywhere in the image patch (i.e., the features are shift invariant). Two synthesized images are easy to tell apart on average because they tend not to place these features at the same location. Down-sampling larger patches as we do here ensures that the texture synth algorithm has many features to jumble into texture-like images, which are equally difficult to discriminate from one another at any size. Discriminating these images from their source patches remains easier at all patch sizes.
Appendix 4: Influence of surrounding image information on stimulus generation
One consideration regarding our blur experiments is that information from outside the crop region will influence the physical content of the blurred image patch. Blurring the image with a Gaussian kernel and then cropping a subsection, as we do in Experiment 1, means that pixel values in the cropped region depend on pixel values outside the cropped region to an extent that depends on the standard deviation of the blur kernel. This creates an additional cue at the patch borders that observers could use to discriminate the blurred images from the unmodified images. Although this factor could account for the high sensitivity of humans in our experiment, we believe this is unlikely. Consider that the images are discriminable with d′ ≈ 3 for standard deviations of about 2 pixels. If we assume that border effects extend no more than 3 SD from the edge of the patch, then the influence of external image structure on the filtering in these images is at most 6 pixels, and many of these pixels were presented at low contrast due to the spatial cosine window of our stimuli (extending for 10 pixels). We assert that these low-contrast pixels at the patch borders are unlikely to support a d′ ≈ 3 boost in discrimination performance; rather, observers are indeed sensitive to blurring of the patch content. 
We believe this effect is even more negligible for Experiment 2. In this experiment, all synthesized image patches (except for the points with black borders in Figure 8) were generated from precropped patches. That is, they contained the same features as the original image patches with the exception that features outside the cosine window in the original images could appear in the texture images. That cropping out a large proportion of the patch center (points with black borders in Figure 8) does not consistently alter performance suggests that the influence of our spatial window on discriminability will be minimal; this was confirmed in a control experiment in which we found similar performance for unwindowed square patches. 
Appendix 5: Performance for individual images
Our data analysis in Experiments 1 and 2 relies on performance averages across a number of unique images. To what degree does performance depend on the individual image? Because we have not attempted to match images in terms of low-level features, such as edge density, it is likely that some images are consistently easier than others due to their structure. 
We examine this variation for the data from Experiment 1 by plotting the performance for each image, averaged across observers and some conditions (Figure A2). In this plot, image number is assigned in each facet separately by sorting images by average performance in that plot facet. Note that this means the image with a given number in one plot column is not necessarily the same image as in the other plot column. Because different patch sizes used different images, the images are all different from the first row to the second row. Note also that we apply a rule-of-succession correction to the data to allow an assessment of uncertainty for all images. This is why there are no data points at zero or one (because the rule-of-succession adds one success and one failure to each binomial data point). 
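The correction and its effect on uncertainty estimates can be sketched as follows. The corrected proportion adds one success and one failure, as described above. The interval is computed here from a Beta(k + 1, n − k + 1) distribution by Monte Carlo, which is an assumption about how the beta distribution intervals in Figure A2 were obtained; the function names and counts are hypothetical.

```python
import random

def rule_of_succession(n_correct, n_trials):
    """Proportion correct after adding one success and one failure."""
    return (n_correct + 1) / (n_trials + 2)

def beta_interval(n_correct, n_trials, level=0.95, n_draws=20_000, seed=0):
    """Approximate central interval of Beta(k + 1, n - k + 1) by sampling."""
    rng = random.Random(seed)
    a = n_correct + 1
    b = n_trials - n_correct + 1
    draws = sorted(rng.betavariate(a, b) for _ in range(n_draws))
    tail = (1 - level) / 2
    lower = draws[int(tail * n_draws)]
    upper = draws[int((1 - tail) * n_draws) - 1]
    return lower, upper

# Hypothetical images: never correct, sometimes correct, always correct.
for k, n in ((0, 10), (6, 10), (10, 10)):
    print(k, n, round(rule_of_succession(k, n), 3), beta_interval(k, n))
```

Even an image with zero or all correct responses receives a corrected proportion strictly between zero and one, together with an interval of nonzero width.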
Figure A2
 
Performance on individual images in Experiment 1, averaged across observers. Points show mean proportion correct and 95% beta distribution confidence intervals after a rule-of-succession correction. Blur levels have been split into two categories: SDs below 8 pixels (“low”) and 8 pixels and above (“high”). Data are faceted by the patch size (0.74° or 5.95°, “small” and “large,” respectively) and surround condition. The image number (0–99) is assigned by average performance in each plot facet (i.e., averaging over blur level).
In general, some images produce performance near chance whereas for others performance is reliably above chance. Consider, for example, the large patch condition with no surround. For a few images, performance was not distinguishable from chance even for the highest blur levels whereas other images yielded good performance even for low blur levels. Example images that were easy and hard, respectively, are shown in Figure A3. Easy images tended to contain edges, text, faces, and other image structure that was often in sharp focus. Hard images are relatively featureless, consisting of coarse gradients or blurry textures. Conversely, consider the small patch condition with surrounds. Although on average this condition was essentially impossible at all blur levels, there appear to be a small number of images for which these discriminations were in fact possible. Thus, our results (e.g., Figure 3) and conclusions must be considered with the caveat that what we find on average will not necessarily hold for individual images (see also Sebastian et al., 2015). 
Figure A3
 
Example images from Experiment 1 that were easy (A) and hard (B) to discriminate from blurred versions in the largest patch size condition. The number below each image gives its rule-of-succession corrected proportion correct performance. Easy images tend to contain edges and other structure whereas hard images tend to consist of coarse gradients or blurry textures.
The same caveat holds true for Experiment 2. In Figure A4, we show the rule-of-succession-corrected performance for images in the largest and smallest patch sizes in the nat versus synth comparison condition, averaged across observers and context conditions. Although average performance at the smallest patch size was near chance, some images were easier than this. Similarly, although performance in the largest patch size is near ceiling, several images show near-chance performance. In Figure A5, we present examples of images that are easy and hard to tell apart from texture-synth-matched synthetic images. Easy images tend to be inhomogeneous, and consequently hard for the texture synth features to capture, whereas texture-like natural images are hard to discriminate from synthesized images. 
Figure A4
 
Performance on individual images in Experiment 2, for two patch sizes in the nat versus synth comparison condition, averaged across observers and context conditions. Points show mean proportion correct and 95% beta distribution confidence intervals after a rule-of-succession correction. Data are faceted by the patch size (0.74° or 11.91°). The image number is assigned by average performance in each plot facet.
In future work, it would be useful to expand the model fitting presented here to include random effects not only for different observers (as we do here) but also for different images. In this way, the variance in performance attributable to different images can be accounted for in the model fits (Nuthmann & Einhäuser, 2015; see Sorensen et al., 2015, for a tutorial example of this approach applied to data in linguistics). 
Appendix 6: Empirical saliency
The experiments in this paper used images from Judd et al. (2009), a study of fixation placement in scene viewing. As noted above, performance can depend considerably on the particular image, and difficult images tended to be those containing little edge structure (Experiment 1, Figure A3) or textures (Experiment 2, Figure A5). On the assumption that people look at “things” rather than “stuff” in images (e.g., Einhäuser, Spain, & Perona, 2008), we thought it would be interesting to consider the correlation between performance and the empirical saliency (kernel smoothed eye fixation density) of the image patches. 
Figure A5
 
Example images from Experiment 2 that are easy (A) and hard (B) to discriminate from texture-synth-matched examples in the largest patch size condition. The number below each image gives its rule-of-succession corrected proportion correct performance. Easy images are inhomogeneous whereas hard images tend to be texture-like.
We computed the mean of the kernel-smoothed fixation map (provided online by Judd et al., 2009) for each image patch, then transformed them into z scores for each patch size (because the absolute values depend on the patch size). Higher z scores correspond to image patches that were fixated more frequently by observers in the experiment of Judd et al. (2009). There was no evidence for a relationship between fixation density and the discriminability of individual images, averaging performance over all observers and conditions in Experiment 1 (Figure A6; similar results were found for Experiment 2). Although this coarse analysis does not preclude an influence of empirical saliency once other effects are statistically controlled, it suggests that any influence is unlikely to be strong. 
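The normalization step can be illustrated as follows. This is a sketch under stated assumptions rather than the original analysis: the function names and the toy numbers are hypothetical, z scores use the population standard deviation, and a plain Pearson correlation stands in for whatever measure of association was inspected in Figure A6.

```python
import math
from collections import defaultdict


def zscore_within_groups(values, groups):
    """Standardize each value against the mean and SD of its own group."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)

    stats = {}
    for group, vals in by_group.items():
        mean = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        if sd == 0:
            raise ValueError(f"group {group!r} has zero variance")
        stats[group] = (mean, sd)

    return [(v - stats[g][0]) / stats[g][1] for v, g in zip(values, groups)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (not data from the paper): mean fixation
# density per patch is on very different scales for the two patch sizes.
density = [0.1, 0.2, 0.3, 10.0, 20.0, 30.0]
patch_size = ["small", "small", "small", "large", "large", "large"]
proportion_correct = [0.40, 0.35, 0.45, 0.80, 0.90, 0.85]

saliency_z = zscore_within_groups(density, patch_size)
r = pearson_r(saliency_z, proportion_correct)
```

Standardizing within each patch size puts the two groups on a common scale, so the pooled analysis is not dominated by the difference in absolute density between patch sizes.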
Figure A6
 
The relationship between normalized fixation density (“empirical saliency”) and performance for the 200 individual images in Experiment 1, averaged over observers and task conditions. Fixation densities were normalized within patch size conditions because the absolute density values depended strongly on patch size. There is no evidence for a relationship between saliency and the discriminability of image patches; similar results were found for each patch size individually and for Experiment 2.
Figure 1
 
Model predictions and metamerism. (A) Eight physically different images lie at different points in pixel space (in which each dimension corresponds to the intensity of a pixel; two pixels are shown). (B) When expressed in model space (in which dimensions are model parameters), the images lie on two points; that is, they have identical model responses. The images linked by the yellow curve in pixel space have one model response; the images linked by the blue curve have another. If the model corresponds to perception, then images with identical model responses should appear the same despite being physically different. Two of the images above are natural scene patches, and the rest are synthesized textures generated to match those patches.
Figure 2
 
Experiment 1 stimuli and task. (A) Three patches were cropped from a source image. The middle patch (blue circle) was always the target to be discriminated. This image depicts a target subtending 0.74° in diameter. The surround patches (inner and outer; orange circles) abutted the target patch and always subtended ≈6°. The middle patch was always centered at 10° to the right of the fixation spot. The angle between the patches and the fixation spot was randomized in each interval to make motion cues uninformative (white dashed arrows; see text). (B) Illustration of the second target size condition (subtending ≈6°). (C) Different blur kernels (standard deviation in pixels) applied to the target patch from (B). (D) Procedure. Gray rectangle depicts the monitor; central white point is the fixation spot. The stimulus was presented with the middle (target) patch centered 10° in the right visual field. One interval contained a physically different target patch to the other two; observers indicated which interval contained the oddball. In this example, the blurred image is the oddball and occurs in interval 3. (E) Depiction of the surround condition. Only the right half of the monitor and the three stimulus intervals are shown to improve visibility. The surround patches are the same image in each interval. In this example, the oddball image is the unmodified patch and occurs in the second interval.
Figure 3
 
Experiment 1 results and model fits for three observers. The panels are arranged with observers in rows and surround conditions in columns (see labels above panels). Each panel shows discrimination performance as a function of the standard deviation of the blur kernel (pixels). Points show the proportion correct; error bars show 95% beta distribution confidence intervals. Patch size (diameter in degrees) is coded in green for the large patch and blue for the small patch condition. Dashed gray line shows chance performance (0.33). Curves show 100 predictions drawn from the posterior distribution over model parameters. Fainter regions indicate areas of higher uncertainty. Observers can discriminate images with little high spatial frequency loss from natural images with above chance accuracy, and large patches are easier than small patches.
Figure 4
 
Parameter estimates for Experiment 1 model fits. (A) Estimates of the critical blur parameter for each observer in each experimental condition. Violin plots show the 95% credible distributions of the posterior (darker foreground colors) and prior (lighter background colors). Horizontal black lines show the medians of prior and posterior distributions. Note that the prior intervals extend to 16; the y-axis has been truncated to show the posteriors more clearly. Critical blur estimates are difficult to interpret due to a floor effect (see Figure 11 for a more precise measurement of metamerism in the no surround large patch condition). (B) As for (A) but for the maximum sensitivity parameter (in d′ units). Larger patches have higher asymptotic performance than small patches, and the addition of surrounds slightly reduces gain. (C) As for (A) but for the slope parameter.
Figure 5
 
Difference in gain (maximum performance in d′ units) between the no surround and surround conditions. Computed by subtracting the surround from the no surround condition; positive scores mean higher sensitivity in the no surround condition. Patch sizes are colored as in Figure 3. Violin plots show the 95% credible distributions of the posterior difference score in the gain parameters. Solid gray line shows zero difference. Asymptotic performance is reliably higher in the no surround condition than the surround condition for small patches but not for large patches.
Figure 6
 
Change in spectral power caused by blurring in Experiment 1 for 256-pixel square images. Log amplitude (base 10, arbitrary units) as a function of spatial frequency (cycles per degree) for unmodified images and three blur levels chosen according to perceptual discriminability (see legend). In addition, we examined the attenuation caused by processing our stimuli using the Geisler and Perry (1998) foveated blurring model (red curve). Amplitudes are averaged over images and orientations. The light gray shaded region shows the frequency range corresponding to Geisler and Perry's critical frequency cutoff for the inner and outer borders of our stimuli. The darker gray shaded region denotes frequencies that should be undetectable in our experiment. These comparisons suggest the Geisler and Perry foveated stimulus blur would be readily discriminable (d′ > 3) by observers in our experiment.
Figure 7
 
Experiment 2 stimuli (procedure as in Experiment 1). (A) Three patches were cropped from a source image. The middle patch was always the target to be discriminated and varied in size (concentric blue/green/yellow colored circles); white labels show diameter in degrees. The surround patches (inner and outer; orange circles) shifted position so that they always abutted the middle patch (orange arrows). Here they are shown at their largest distance, such that the largest two target patches were partially occluded by the surrounds. The middle patch was always centered at 10° to the right of the fixation spot. (B) Natural inner, middle, and outer patches cropped from the example image in (A). (C) Synthetic patches matching those shown in (B) under the Portilla and Simoncelli (2000) texture representation. (D) A depiction of a nat versus synth trial with no surround. (D–G) Depictions of four experimental conditions (six were run in total; see text). The correct response in all examples in this figure is “interval 3.”
Figure 8
 
Experiment 2 results and model fits for four observers. The panels are arranged with observers in rows and surround conditions in columns (see labels above panels). Each panel shows discrimination performance as a function of patch size (diameter in degrees). Points show the proportion correct; error bars show 95% beta distribution confidence intervals. Points with black borders show patch size levels created by down-sampling, which do not have precisely matched statistics (see text). Dashed gray line shows chance performance (0.33). The comparisons between natural and synthesized images are in blue (with squares) and the synthesized versus synthesized comparisons are shown in green (with circles). Curves show 100 predictions drawn from the posterior distribution over model parameters. Fainter regions indicate areas of higher uncertainty. Discriminating natural from synthesized images is easier than discriminating synthesized images from other synthesized images for large patch sizes; performance in the two conditions was similar (but still largely greater than chance) for the smallest patch sizes we tested. Adding surrounds reduced asymptotic performance.
Figure 9
 
Parameter estimates for Experiment 2. (A) Estimates of the critical patch size parameter for each observer in each experimental condition (surround condition in panel columns). Violin plots show the 95% credible distributions of the posterior (darker foreground colors) and prior (lighter background colors). Horizontal black lines show the median. Note that the prior intervals extend to 12; the y-axis has been truncated to show the posteriors more clearly. Critical size estimates are difficult to interpret due to a floor effect. (B) As for (A) but for the maximum sensitivity parameter (in d′ units). Nat versus synth yields higher asymptotic performance than synth versus synth. (C) As for (A) but for the slope parameter.
Figure 10
 
Difference in gain (maximum performance in d′ units) between surround conditions. Comparisons shown in facet titles. Violin plots show the 95% credible distributions of the posterior difference score in the gain parameters. Solid gray line shows zero difference. Asymptotic performance is generally higher in the no surround condition than the natural surround condition for both comparison types; there is weaker evidence for any difference between the no surround and synthetic surround condition or the surround conditions from each other.
Figure 11
 
Gaussian blur can produce convincing metamers. (A) Changes in spectral content produced by six blur levels (Gaussian standard deviation in pixels) and the average amplitude spectrum of the unmodified image patches (100 images, 256 pixels square). All blur levels cause physical modifications to image content. (B) Proportion correct as a function of the blur levels in (A) (Gaussian standard deviation in pixels; note logarithmic x-axis) for two observers. Error bars show 95% beta distribution confidence intervals, rule-of-succession corrected. Each data point represents at least 100 trials (S1 has done 200 for the four lowest blurs). The smallest blur levels are indiscriminable from unmodified images despite being physically different (i.e., they are metamers).
Figure A1
 
Discrimination performance for stimuli created by down-sampling the largest patch size (see text), for two observers (rows) in two surround conditions (columns). Blue data points and curves show the natural versus synthetic comparison; green denotes the synth versus synth comparisons. Descriptions of all plot elements as in Figure 8. Performance asymptotes at large patch sizes rather than deteriorating as in Figure 8, suggesting that there is an interaction between patch size and the scale of objects in the scene.
Table A1
 
Bayesian repeated-measures ANOVA for the median gain parameter, Experiment 1. Notes: For each model, columns give the prior probability P(M), the posterior P(M|data), and the log of the model Bayes factor with respect to the null model. All models include subject as an additional factor. Errors in estimating the BFs numerically were below 2%.
Table A2
 
Bayesian repeated-measures ANOVA for the median gain parameter, Experiment 2. Notes: For each model, columns give the prior probability P(M), the posterior P(M|data), and the log of the model Bayes factor with respect to the null model. All models include subject as an additional factor. Errors in estimating the BFs numerically were below 2.1%.