Article | February 2015
The use of higher-order statistics in rapid object categorization in natural scenes
Journal of Vision February 2015, Vol.15, 4. doi:https://doi.org/10.1167/15.2.4
Hayaki Banno, Jun Saiki; The use of higher-order statistics in rapid object categorization in natural scenes. Journal of Vision 2015;15(2):4. https://doi.org/10.1167/15.2.4.

Abstract

We can rapidly and efficiently recognize many types of objects embedded in complex scenes. What information supports this object recognition is a fundamental question for understanding our visual processing. We investigated the eccentricity-dependent roles of shape and statistical information in ultrarapid object categorization, using the higher-order statistics proposed by Portilla and Simoncelli (2000). Textures synthesized by their algorithm have the same higher-order statistics as the originals, while global shape is destroyed. We used the synthesized textures to manipulate the availability of shape information separately from the statistics. We hypothesized that shape makes a greater contribution to central vision than to peripheral vision and that statistics show the opposite pattern. The results did not show contributions clearly biased by eccentricity. Statistical information made a robust contribution not only in peripheral but also in central vision. For shape, the results supported a contribution in both central and peripheral vision. Further experiments revealed some interesting properties of the statistics: they are available within a limited processing time, they can signal the presence or absence of animals without shape clues, and they predict how easily humans detect animals in original images. Our data suggest that, under the time constraint of categorical processing, higher-order statistics underlie our strong rapid-categorization performance, irrespective of eccentricity.

Introduction
Recent studies have clearly shown that humans can detect some types of objects, such as animals, in novel natural scenes with high accuracy and short response latency (Thorpe, Fize, & Marlot, 1996; Thorpe, Gegenfurtner, Fabre-Thorpe, & Bülthoff, 2001; VanRullen & Thorpe, 2001a, 2001b). Psychophysiological research has also revealed that brain activation arising about 150 milliseconds after image onset correlates with the behavioral judgment (Bacon-Macé, Macé, Fabre-Thorpe, & Thorpe, 2005; Thorpe et al., 1996), suggesting that enough visual processing to perceive the presence or absence of an animal is achieved by then, presumably in a purely feed-forward manner (Thorpe et al., 1996; VanRullen & Koch, 2003). What features underlie such efficient categorical judgment, and how different types of features are used in different parts of the visual field, are questions that immediately arise. Past studies have debated the roles of shape and statistical information in visual perception (Balas, Nakano, & Rosenholtz, 2009; Biederman, 1987; Oliva & Torralba, 2001). We tested their relative contributions within a single experimental paradigm. We used the higher-order statistics proposed by Portilla and Simoncelli (2000; hereafter referred to as P-S statistics) and dissociated this information from global shape. First, we briefly review the literature on visual categorization and then define the goal of the study. 
The global shape of an object embedded in a scene is a well-known clue for categorization. In the current work, we operationally define the term “shape” as the closed contours possessed by meaningful and familiar objects in the real world. The contours of real-world objects are formed by a long-range association of local features; under this definition, shape is inherently global. Our visual system is designed to filter the visual input and reconstruct the world from the building blocks obtained by different filters (the hierarchical approach). Since Hubel and Wiesel (1959, 1968) discovered neuronal selectivity for oriented bars and edges in the primary visual cortex, vision researchers have invested much energy in modeling the stage through which filtered features are assembled into highly complicated object representations (Biederman, 1987; Marr, 1982). The theory of recognition-by-components (RBC) is an excellent example of a hierarchical model. RBC encodes objects as clusters of three-dimensional parts, called geons, and their spatial arrangements. It should be noted that extracting geons from complex natural scenes requires contour integration and figure–ground segmentation. However, a number of models of these processes rely on recurrent processing provided by feedback signals from higher to lower visual areas (e.g., Bullier, Hupe, James, & Girard, 2001, figure–ground segmentation; Grossberg & Mingolla, 1985, contour integration). In short, they could be too time consuming to be completed within a feed-forward sweep (Roelfsema, Lamme, Spekreijse, & Bosch, 2002; Serre, Oliva, & Poggio, 2007). Although these processes can, in theory, be achieved without recurrent processing (May & Hess, 2008; Rosenholtz, Twarog, Schinkel-Bielefeld, & Wattenberg, 2009; Supèr, Romeo, & Keil, 2010), neurophysiological studies support its involvement (Hupé et al., 1998; Scholte, Jolij, Fahrenfort, & Lamme, 2008; Supèr & Lamme, 2007). The finding that image parsing based on perceptual grouping in natural scenes is actually slower than animal/vehicle categorization (Korjoukov et al., 2012) also challenges the involvement of hierarchical processing. 
The use of statistics hidden in natural scenes could be an alternative way of object processing (statistical approach). Several works suggest that, based on a collection of simple visual statistics, we could perform a rapid visual analysis bypassing complex computation (Bergen & Julesz, 1983; Fei-Fei & Perona, 2005; Malik & Perona, 1990; Motoyoshi, Nishida, Sharan, & Adelson, 2007; Oliva & Torralba, 2001; Renninger & Malik, 2004; Torralba & Oliva, 2003). The approach inspires the proposal that humans rapidly recognize objects using some sort of statistics, without the time cost associated with image segmentation and contour grouping. 
Although the use of the Fourier power spectrum in natural scene categorization has been discussed for a long time (Guyader, Chauvin, Peyrin, Hérault, & Marendaz, 2004; Kaping, Tzvetanov, & Treue, 2007; Torralba & Oliva, 2003), it might be insufficient to explain human object recognition in natural scenes. The Fourier transform converts an image into a sum of sinusoidal luminance waveforms that differ in frequency, orientation, amplitude, and phase. The squared waveform amplitude as a function of frequency and orientation is referred to as the power spectrum and represents the global energy in the image. The power spectrum is also a measure of second-order statistics, in that it evaluates the correlation between the luminances at all possible pairs of positions (the power spectrum is the Fourier transform of the autocorrelation; Field, 1987, 1989; van der Schaaf & van Hateren, 1996). However, subsequent experimental data questioned the human use of this information (Gaspar & Rousselet, 2009; Wichmann, Braun, & Gegenfurtner, 2006; Wichmann, Drewes, Rosas, & Gegenfurtner, 2010). 
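As an illustration (not part of the original studies), the power spectrum can be computed directly from the discrete Fourier transform of an image, and, by the Wiener–Khinchin relation, inverting the power spectrum recovers the image's autocorrelation. A minimal MATLAB sketch, assuming a grayscale image already stored in the variable im:

    im = double(im);                 % grayscale image as a double matrix (assumed input)
    F  = fft2(im - mean(im(:)));     % remove the mean (DC) term before transforming
    P  = abs(F).^2;                  % power spectrum: squared amplitude at each frequency and orientation
    ac = real(ifft2(P));             % Wiener-Khinchin: inverting P gives the circular autocorrelation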
Higher-order statistics could be another candidate. They are represented by the correlations among the outputs of filters tuned to different positions, orientations, and scales, and they also evaluate the mutual relations between the luminances at three or more positions (Field, 1989; Reinhard, Pouli, & Cunningham, 2010; Thomson, 2001; van der Schaaf, 1998). In addition to the power spectrum, these statistics reflect part of the information encoded in the phase spectrum (i.e., the phase angle as a function of frequency and orientation). The phase spectrum embodies how different frequencies in the image are spatially localized. Recent studies have demonstrated that higher-order statistics are closely related to our degraded perception in peripheral vision. Freeman and Simoncelli (2011) showed that participants found it difficult to discriminate a natural scene from a textural patch with jumbled features synthesized to match its local statistics in multiple visual areas, consistent with the earlier idea that the visual representation extracted from the peripheral image consists of summary statistics over local pooling regions (Balas, Nakano, & Rosenholtz, 2009). In this model, the pooling regions tile the entire visual field and grow approximately with distance from the fixation point. Balas and colleagues demonstrated the same idea with different tasks, finding that the discriminability of textural image patches capturing only the statistics available in the peripheral visual field correlates with performance when recognizing crowded letters (Balas et al., 2009), searching for targets (Rosenholtz, Huang, Raj, Balas, & Ilie, 2012b), and categorizing animals versus vehicles (Rosenholtz, Huang, & Ehinger, 2012a). 
The statistical models used in these studies (Freeman & Simoncelli, 2011; Rosenholtz, Huang, & Ehinger, 2012a; Rosenholtz, Huang, Raj et al., 2012b) are modified versions of what Portilla and Simoncelli (2000) originally proposed (the P-S statistics). In the original model, the set of statistics was computed from a single pooling region corresponding to the whole image. In contrast, the recent models employ multiple local pooling regions: the P-S statistics are computed within each small region rather than over the whole image. These models therefore implicitly represent where the visual structures encoded by the statistics are located, and in that sense they include shape information. 
It is important to differentiate statistics from shape; the two terms refer to two different types of visual representation. Extracting P-S statistics involves a simple pooling of local structures within a region, whereas extracting shape information requires associating local features over long ranges. These kinds of information are not strictly orthogonal, because the spatially organized connection of local structures ultimately forms shape information (i.e., manipulating local structures obviously affects their long-range association and alters the appearance of shape). Nevertheless, an important question is how a collection of local structures, by itself, contributes to natural scene categorization (Fei-Fei & Perona, 2005; Renninger & Malik, 2004; Shotton, Blake, & Cipolla, 2008). The P-S statistics pooled from a single region (i.e., with shape disregarded) successfully encode the information needed for texture perception, in both human behavioral tasks (Balas, 2006) and neuronal measurements (Freeman, Ziemba, Heeger, Simoncelli, & Movshon, 2013). Other research suggests that these statistics are important for scene categorization (Loschky, Hansen, Sethi, & Pydimarri, 2010). 
Although previous studies on degraded perception in peripheral vision have revealed that its perceptual properties are closely related to a statistical representation of visual inputs, they did not directly answer how shape and statistical information contribute to rapid object categorization. Notably, there is no evidence about how the relative contributions of these different types of information vary with eccentricity, for two reasons. First, the recent models proposed by Freeman and Simoncelli (2011), Rosenholtz, Huang, and Ehinger (2012a), and Rosenholtz, Huang, Raj et al. (2012b) compute a coarse localization of statistical information using spatially arranged pooling regions. The localized patches represent the global structure of a visual input; thus, shape clues are not isolated from statistics. Second, these models focus on an information bottleneck in early vision. They specify the information available, but not how our visual system uses it for high-level cognitive tasks such as categorization. 
We hypothesized that the relative contribution of statistical information depends on the availability of shape clues. Past studies used higher-order statistics exclusively to describe the pooled representation in the periphery, where it is difficult to obtain shape information owing to the lack of spatial alignment of elements. By extension, the visual system might be “obliged to” adopt a statistical representation because receptive field size increases with eccentricity (Balas et al., 2009; Lettvin, 1976; Parkes, Lund, Angelucci, Solomon, & Morgan, 2001). We have already noted that the models by Freeman and Simoncelli (2011), Rosenholtz, Huang, and Ehinger (2012a), and Rosenholtz, Huang, Raj, et al. (2012b) tile the visual field with pooling regions that grow with distance from the fixation point. In these models, the availability of shape information in a given part of the visual field depends on the size of the pooling regions: where the regions are small, local features are barely jumbled and their long-range associations are largely preserved. It is therefore possible that statistical information is less valuable in central than in peripheral vision; in the extreme case of our hypothesis, statistics might not be used at all in the fovea, with its fine-scaled receptive fields. Alternatively, our hypothesis might not hold: statistical computation may be advantageous even in central vision when categorization is under time constraints. 
In the current study, we investigated the roles of shape and higher-order statistical information in object categorization and examined how their relative contributions vary with eccentricity. Our hypothesis yields two predictions: (a) shape information makes a greater contribution to central than to peripheral vision, and (b) statistical information makes a smaller contribution to central than to peripheral vision. We applied the statistics defined by Portilla and Simoncelli (2000) and used an animal detection task, a typical paradigm for assessing ultrarapid categorization (Thorpe et al., 1996). To test the two predictions in Experiments 1, 2, and 3, we manipulated the usefulness of each type of information by changing the content of the distracters. In addition, we examined how the statistics themselves contribute to task performance (Experiment 4) and how well they predict human performance, by comparing human performance with that of a machine classifier (Experiment 5). 
The algorithm by Portilla and Simoncelli extracts the responses of multiscale oriented wavelets from an original image and computes the correlations of the responses across different orientations, frequencies, and positions. It then iteratively adjusts the statistics of random white noise toward the parameters of the original and finally synthesizes textural images having approximately the same statistics (Footnote 1 explains what “the same” means).1 The correlations between wavelet filter responses capture the local structure of the original, such as edges, contours, and repeated texture patterns, but not the global configuration, such as the overall shape of real objects. Therefore, the synthesized images seem to have the features of the original image, but are spatially disorganized (see Figure 1C, D). Although the synthesized images have shapes of their own, these differ too much from real-object shapes for us to be likely to match them to our knowledge in long-term memory. The synthesis procedure in our study differs from that of previous models (Freeman & Simoncelli, 2011; Rosenholtz, Huang, & Ehinger, 2012a; Rosenholtz, Huang, Raj et al., 2012b). In those studies, small pooling regions tiled the visual field, and shape information was retained to some extent in the synthesized image. In contrast, we applied a single region corresponding to the whole image. 
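For illustration, the synthesis step can be run with the publicly released textureSynth MATLAB toolbox accompanying Portilla and Simoncelli (2000). The sketch below is our reconstruction, not the authors' script; the function names (textureAnalysis, textureSynthesis), their argument order, and the parameter values are assumptions based on that toolbox and should be checked against the released code.

    im0 = double(imread('original.png'));          % hypothetical original image file
    Nsc = 4;                                       % number of pyramid scales
    Nor = 4;                                       % number of orientations
    Na  = 9;                                       % neighborhood size for spatial correlations (odd)
    params = textureAnalysis(im0, Nsc, Nor, Na);   % measure the P-S statistics of the original
    tex = textureSynthesis(params, size(im0), 50); % iteratively adjust white noise toward those statistics
    imagesc(tex); colormap gray; axis image;       % local structure preserved, global shape destroyed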
Figure 1
 
Original images and textures synthesized from them using the texture synthesis algorithm (Portilla & Simoncelli, 2000). Textures are aligned in the same order as the corresponding original images. All groups appeared in Experiments 1, 3, 4, and 5. (A) Original images containing animals are referred to as Animal images in this paper, and (B) original images without an animal are referred to as Non-Animal. (C) Textures synthesized from Animal images are referred to as Texture(Animal), and (D) textures from Non-Animal images are referred to as Texture(Non-Animal).
Experiment 1
Experiment 1 was a first step to test how shape and higher-order statistics are used for rapid animal detection. We paired Animal images containing one or more animals (as targets) with three types of images (as distracters): (a) textures having the same statistics as targets, referred to as Texture(Animal); (b) textures having the same statistics as images without animals, referred to as Texture(Non-Animal); and (c) original images without animals, referred to as Non-Animal. Comparing performances in these conditions enables us to estimate the degree of information contribution. We again note that, in the current study, the term shape refers to the form of well-known objects in the real world. 
The condition in which Non-Animal images were used as distracters was a typical animal detection task. We named it the Statistics+shape condition because both clues remained intact relative to the other conditions. Next, we called the condition in which Texture(Non-Animal) images were used the Statistics-only condition. The synthesis procedure renders object shapes in the distracter images unavailable because it breaks down the long-range associations of local features; participants cannot rely on their internal knowledge about object shapes to reject distracters. Thus, the usefulness of shape information can be measured by comparing these two conditions. If shape provides information necessary for detection, performance in the Statistics+shape condition will be higher than in the Statistics-only condition. Our first prediction was that shape information makes a greater contribution to central than to peripheral vision; accordingly, the performance difference associated with the availability of the shape clue should be larger in central vision. Finally, the condition in which Texture(Animal) images were used was named the None condition. This name indicates that, compared with the Statistics-only condition, statistical information could no longer be used to discriminate target from distracter images: Texture(Animal) images shared their statistics with the target Animal images, whereas Texture(Non-Animal) images did not. If the statistics are critical, the None condition should be more difficult than the Statistics-only condition. Our second prediction was that statistical information makes a smaller contribution to central than to peripheral vision; the difference associated with the usefulness of statistics should therefore be larger in peripheral vision. 
The usefulness of statistics was tested differently from that of shape, because a Shape-only condition, in which the usefulness of statistical information is selectively removed, cannot be constructed. There is an asymmetrical relationship between shape and statistical information: the P-S synthesis algorithm simply encodes a pooling of local structures, whereas shape information depends critically on their spatial alignment. Shape information therefore inevitably changes when the statistical information is manipulated. 
One might assume that contrasting performance in the typical animal detection task (Statistics+shape condition) with performance when discriminating Texture(Animal) from Texture(Non-Animal) images would be a better way to isolate the shape contribution. However, these conditions might not be comparable. We want to measure the effects of shape and statistics while participants engage in animal detection, a form of categorical decision. When participants judge the presence or absence of animals, they need to access their internal knowledge about animals in the real world and match it to the visual input. Using Animal images as the common target ensures that participants always refer to category knowledge about animals when performing the task, whereas texture discrimination cannot guarantee that participants make categorical judgments about animals as in the typical animal detection task. At a conscious level, we seem to have difficulty matching a textural appearance to our knowledge about animals (see Figure 1). Because of this difficulty, participants performing texture discrimination might rely on a makeshift strategy, such as searching for a face-like contrast pattern, rather than accessing knowledge in memory, even when such a strategy is worthless. This would complicate the interpretation of performance differences between the conditions. This is why we assigned original images containing animals as targets in all conditions. 
Methods
Participants
Fifteen students from Kyoto University participated in the study (11 females, 4 males). All received 1,000 yen/hr for their participation. The experiment took about 2 hr. They gave informed consent prior to the start of the experiment and had normal or corrected-to-normal vision. 
Apparatus
Participants were seated in a dark room. Stimuli were displayed on a 17-inch Multiscan 17Se II CRT (Sony, Tokyo, Japan) with a screen resolution of 1024 × 768 pixels and a refresh rate of 75 Hz. The viewing distance was 39.5 cm. The screen was linearized with respect to luminance. The presentation of stimuli was controlled by MATLAB using the Psychophysics Toolbox Version 3 (Brainard, 1997; Pelli, 1997). A Cedrus RB-530 response box (Cedrus Corp., San Pedro, CA) was used to record the participants' responses. Because all experiments used the same setup, the Apparatus section is omitted from the following experiments. 
Stimuli
Figure 1 shows an example of the stimuli. First, 240 images were selected from the database previously used in Serre et al. (2007; Figure 1A, B). The database was originally collected in Torralba and Oliva (2003). Half of the images contained one or more animals (Animal images). These included various species such as mammals, birds, and fish, and could appear at any distance from the camera. The remaining 120 images contained no animal (Non-Animal images). The database included a few images that could mislead participants' judgments, such as Animal images that actually contained no animal and Non-Animal images that contained animals. We avoided using them. Likewise, some Non-Animal images that contained humans were excluded. 
Next, from these images, 240 texture images were synthesized, using Portilla and Simoncelli's texture synthesis algorithm (P-S algorithm; Figure 1C, D). Each texture approximately shared the higher-order statistics with its counterpart. We refer to the set of images synthesized from Animal and Non-Animal sets as Texture(Animal) and Texture(Non-Animal) images, respectively. All original and synthesized images were grayscale, equalized with respect to the mean luminance and RMS contrast of luminance, preventing low-level cues from affecting the task performance. The images subtended 5.44 × 5.44 degrees of visual angle. 
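A minimal sketch of the equalization step follows (our reconstruction, not the authors' code; the file name and the target mean and contrast values are assumptions, not the values used in the experiment):

    img = double(imread('stimulus.png')) / 255;    % hypothetical stimulus file, luminance in [0, 1]
    targetMean = 0.5;                              % assumed common mean luminance
    targetRMS  = 0.2;                              % assumed common RMS contrast (SD of luminance)
    img = (img - mean(img(:))) / std(img(:));      % zero mean, unit SD
    img = img * targetRMS + targetMean;            % impose the common contrast and mean
    img = min(max(img, 0), 1);                     % clip to the displayable range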
Additionally, 12 Animal and 12 Non-Animal images were used for practice sessions and 24 textures were generated from these images. These were never presented in the experimental sessions. 
Experimental design
The center of each stimulus was positioned either at the fixation point at the center of the screen or 14° to the right of it. Stimuli were not presented on the left side of the peripheral field because our goal was to equalize spatial uncertainty between the eccentricity conditions. 
In the experiment, three different groups of cues were assigned in different sessions: Statistics+shape, Statistics-only, and None. Target pictures were always Animal. Participants were instructed to discriminate target images from distracters. Each session included both eccentricity conditions. 
The experiment was designed to compare two pairs of usable cue types. The Statistics+shape and Statistics-only conditions were compared to estimate how much the loss of shape information lowered performance: Non-Animal and Texture(Non-Animal) distracters shared the same statistics, but in the latter the local features were jumbled so that object shapes were destroyed. This paired comparison tested our first prediction. Additionally, the two conditions that used textures as distracters were compared. Texture(Animal) in the None condition is dissociated from Texture(Non-Animal) in the Statistics-only condition in terms of the higher-order statistics; participants could not use statistical information as a cue when Texture(Animal) images served as distracters, since these shared their statistics with the Animal group. This paired comparison tested our second prediction. Hence, the experiment combined two independent 2 (eccentricity) × 2 (cue type) within-participant factorial designs. 
Performance was analyzed using A′, a measure of sensitivity based on signal detection theory (Pollack & Norman, 1964; Stanislaw & Todorov, 1999). A score of .5 indicates chance-level performance, and the maximum value is 1.0. Mean reaction times (mean RTs) were also examined to check for a speed–accuracy tradeoff. RTs from correct responses (hits and correct rejections) were analyzed. 
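For reference, A′ can be computed from the hit rate h and false-alarm rate f with the nonparametric formula given by Stanislaw and Todorov (1999), following Pollack and Norman (1964). The MATLAB function below (saved as aprime.m) is a straightforward transcription of that formula:

    function a = aprime(h, f)
    % Nonparametric sensitivity A' from hit rate h and false-alarm rate f (both in [0, 1]).
    if h == f
        a = 0.5;                                               % chance-level sensitivity
    elseif h > f
        a = 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f));
    else
        a = 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h));
    end
    end

For example, aprime(0.90, 0.20) returns approximately .91.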
In the experimental session, an image was presented once at each eccentricity. Participants saw a target image six times and a distracter twice throughout the experiment. The order of three sessions was counterbalanced across participants. In a session, target and distracter images appeared with equal probabilities. In addition, 50% of the targets and distracters were presented at the center of the screen, and the rest were presented on the right side. Trials with targets and distracters in the two eccentricity conditions were randomly intermixed in each session. 
Procedure
Participants experienced three sessions, each of which was composed of 10 blocks of 48 trials. Before each session, a brief practice of 48 trials was performed. 
Figure 2A shows the schematic illustration of the sequence of a trial. The trial began with a green central fixation dot for 600 ms, followed by an image at the center or right side of the screen for 40 ms. Participants were asked to decide whether a briefly presented image was a target or a distracter and respond as quickly and accurately as possible by pressing the left of two buttons for a target and the right for a distracter. A response time deadline was not imposed. They were also instructed not to move their eyes from the fixation dot while it was presented. In practice sessions, sound feedback was given after each trial when an incorrect response occurred. They were required to take a break between sessions for at least 3 min. 
Figure 2
 
Trial schematics for experiments. (A) Schematic illustration for Experiments 1, 2, 4, and 5. (B) Schematic for Experiment 3, in which targets were masked and their exposure durations varied.
Results
The sensitivity scores of Experiment 1 are presented in Figure 3A. The results were analyzed with a 2 × 2 analysis of variance (ANOVA) for each pair of cue types. For repeated-measures analyses in which the assumption of sphericity was violated, we used the Greenhouse-Geisser correction to adjust the degrees of freedom. We also report generalized eta squared (ηG²) as an index of effect size (Olejnik & Algina, 2003), expressing the proportion of variance accounted for by specific factors or their interactions. 
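As a point of reference (our summary following the framework of Olejnik & Algina, 2003, rather than a formula reported by the authors), for designs such as this one in which all factors are manipulated within participants, generalized eta squared can be computed as

ηG² = SS_effect / (SS_effect + SS_subjects + Σ SS_error),

where the sum runs over every error (participant × factor) term in the design.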
Figure 3
 
Experiment 1 data. The average performances are shown with standard errors of the mean. (A) Sensitivities measured by A′. (B) Mean RTs calculated from hit responses. (C) Mean RTs from correct rejection responses.
We first compared the conditions relevant to shape information. Sensitivity when Texture(Non-Animal) images were used as distracters (Statistics-only condition) was lower than when Non-Animal images were used (Statistics+shape condition). Main effects of eccentricity, F(1, 14) = 392.515, MSe = 0.0004, p < 0.0001, ηG² = 0.670, and cue type, F(1, 14) = 8.680, MSe = 0.001, p = 0.011, ηG² = 0.128, were observed. However, the performance impairment was comparable at both eccentricities, as supported by the absence of an interaction effect, F(1, 14) = 0.038, MSe = 0.001, p = 0.849, ηG² = 0.0003. Next, we compared the conditions relevant to statistical information. The impairment between the conditions in which textures were used as distracters (Statistics-only vs. None) was larger in peripheral than in central vision. An ANOVA showed significant main effects of eccentricity, F(1, 14) = 209.162, MSe = 0.001, p < 0.0001, ηG² = 0.575, and cue type, F(1, 14) = 37.070, MSe = 0.001, p < 0.0001, ηG² = 0.184, and the interaction was also significant, F(1, 14) = 21.075, MSe = 0.0004, p = 0.0004, ηG² = 0.042. Further analysis indicated a significant performance difference at each eccentricity: 0°, F(1, 14) = 10.632, MSe = 0.0006, p = 0.006, ηG² = 0.060; 14°, F(1, 14) = 48.213, MSe = 0.0009, p < 0.0001, ηG² = 0.343. 
The analysis of mean response times (mean RTs) did not indicate a speed–accuracy tradeoff at either eccentricity (Figure 3B, C). Hit and correct-rejection responses were analyzed separately. For each participant, outlier RTs more than two standard deviations from the mean RT were excluded from the analyses, removing 4.8% (hits) and 4.1% (correct rejections) of the responses on average. For hits, neither the main effect of cue type nor the interaction reached significance for any pair of distracters, ps > 0.185 (Figure 3B), whereas the main effect of eccentricity was significant, ps < 0.0001. For correct rejections, none of the effects were significant, ps > 0.082 (Figure 3C). 
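A sketch of this trimming rule (our reconstruction, with rts standing for an assumed vector of one participant's correct-response RTs):

    mu = mean(rts);
    sd = std(rts);
    keep = abs(rts - mu) <= 2 * sd;    % retain RTs within 2 SD of the participant's mean
    meanRT = mean(rts(keep));          % mean RT entering the analyses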
Discussion
The current results support the prediction that humans rely more on higher-order statistics in peripheral than in central vision for the rapid animal detection task. However, impairments were observed in both eccentricity conditions, suggesting a new finding: statistical perception is at work not only in peripheral but also in central vision. 
In contrast, we did not obtain data supporting our prediction about shape information. The drops in performance in the Statistics-only versus Statistics+shape conditions were comparable between eccentricities. This result was surprising given the recent report that, in peripheral vision, an original image and a texture synthesized from its P-S statistics are indistinguishable (Freeman & Simoncelli, 2011); it suggests that even subjectively unextractable shape information contributes to categorization to a certain degree in peripheral vision. It should be noted that overall performance was high in the current experiment, and a higher contribution in central vision might have been masked by a ceiling effect. We cannot decide between these interpretations based solely on the current results. 
One might argue that the performance drop associated with the availability of the shape clue (Statistics+shape vs. Statistics-only) was caused by a task difference rather than by the presence or absence of shape clues: the Statistics-only condition could be regarded as a simple natural/synthesized image discrimination based on the visual coherence of the images. However, the performance difference associated with the usefulness of statistics reduces this possibility. Both Texture(Non-Animal) images in the Statistics-only condition and Texture(Animal) images in the None condition were synthesized, so the performance drop when Texture(Animal) images were used as distracters suggests that participants did more than merely detect the textured appearance of an image. Moreover, in terms of image coherence, Animal images should be better discriminated from Texture(Non-Animal) images than from Non-Animal images, which predicts higher performance in the Statistics-only than in the Statistics+shape condition. The performance difference showed the opposite pattern. 
Experiment 2
In Experiment 2, we investigated whether the performance impairment in Experiment 1 actually reflected differences in higher-order statistics. The P-S algorithm is designed to preserve second-order statistics such as the Fourier power spectrum along with the other P-S statistics (Loschky et al., 2010). Hence, the None condition differed from the other conditions in terms of the power-spectrum relationship between targets and distracters. Although recent evidence has weakened the possibility that this type of information is used directly in animal detection (Wichmann et al., 2010), we carefully equalized the amplitude spectra across target and distracter images in this experiment. 
Methods
Participants
Fourteen students from Kyoto University participated in the study (10 female, 4 male). The experiment lasted about 2 hr. All received 1,000 yen/hr for their participation. They gave informed consent prior to the start of the experiment and had normal or corrected-to-normal vision. 
Stimuli
All images used in this experiment were generated from the same 240 images as in Experiment 1, with their Fourier amplitude spectra equalized. We first Fourier transformed each image, separating it into amplitude and phase spectra in the frequency domain. Then, we averaged the amplitude spectra across the 240 original images. Finally, we combined the original phase spectrum of each image with the mean amplitude and generated the amplitude-equalized images by inverse Fourier transformation. Examples of the stimuli are shown in Figure 4A and B. In addition, the same number of texture images, Texture(Animal) and Texture(Non-Animal), were synthesized from the amplitude-equalized images (Figure 4C, D). Accordingly, the amplitude spectra of all original and synthesized images were equalized. All original and synthesized images had their mean luminance and RMS contrast of luminance equalized. The images subtended 5.44 × 5.44 degrees of visual angle. 
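The equalization procedure can be sketched as follows (our reconstruction of the steps described above, assuming the 240 grayscale images are already loaded into a cell array imgs and share a common size):

    n = numel(imgs);
    meanAmp = zeros(size(imgs{1}));
    for k = 1:n
        meanAmp = meanAmp + abs(fft2(imgs{k}));              % accumulate amplitude spectra
    end
    meanAmp = meanAmp / n;                                   % mean amplitude spectrum of the set

    equalized = cell(1, n);
    for k = 1:n
        ph = angle(fft2(imgs{k}));                           % keep each image's original phase
        equalized{k} = real(ifft2(meanAmp .* exp(1i * ph))); % recombine and invert
    end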
Figure 4
 
Original images and textures used in Experiment 2. Textures are aligned in the same order as the corresponding original images. Animal (A) and Non-Animal (B) images were obtained by equalizing the Fourier amplitude spectra across the set of images used in Experiment 1. Texture(Animal) (C) and Texture(Non-Animal) (D) were derived from the amplitude-equalized images and share approximately the same spectral pattern.
For the practice sessions, 12 Animal and 12 Non-Animal amplitude-equalized images were used, and 24 textures were generated from these images. These were never presented in the experimental sessions. 
Experimental design and procedure
The sequence of this experiment was the same as in Experiment 1, with the exception that participants saw only images with equalized Fourier amplitudes. Again, our predictions were that (a) shape information makes a greater contribution to central than peripheral vision and (b) statistical information makes a smaller contribution to central than to peripheral vision. The experiment included two independent 2 (eccentricity) × 2 (cue type) within-participant factorial designs to verify the predictions. 
Results
Figure 5A shows sensitivities across conditions. We first discuss the test of our first prediction. The performance impairments when shape information was eliminated (Statistics+shape vs. Statistics-only condition) did not differ between eccentricities. An ANOVA showed main effects of eccentricity, F(1, 13) = 70.908, MSe = 0.005, p < 0.0001, ηG² = 0.570, and cue type, F(1, 13) = 11.847, MSe = 0.002, p = 0.004, ηG² = 0.074, but no interaction effect, F(1, 13) = 0.407, MSe = 0.002, p = 0.535, ηG² = 0.003. In the test of our second prediction, impairment was also found when statistical information was additionally eliminated. For the Statistics-only versus None conditions, performance impairments were observed at both eccentricities. Both main effects were significant: eccentricity, F(1, 13) = 68.275, MSe = 0.006, p < 0.0001, ηG² = 0.509; cue type, F(1, 13) = 8.727, MSe = 0.003, p = 0.011, ηG² = 0.058. One difference from Experiment 1 was that the interaction did not reach significance, indicating that the impairment did not differ between eccentricity conditions, F(1, 13) = 0.137, MSe = 0.002, p = 0.717, ηG² = 0.001. 
Figure 5
 
Experiment 2 data. The average performances are shown with standard errors of the mean. (A) Sensitivities measured by A′. (B) Mean RTs calculated from hit responses. (C) Mean RTs from correct rejection responses.
We found no evidence of a speed–accuracy tradeoff. Mean RTs are shown in Figure 5B and C. Outlier RTs were excluded in the same manner as in Experiment 1, removing 5.1% (hits) and 4.2% (correct rejections) of responses on average. An ANOVA for hit responses revealed only main effects of eccentricity for each pair of cue types, ps < 0.0001; the effect of cue type and the interaction did not reach significance, ps > 0.224. For correct rejections, no effects were significant, ps > 0.107. 
Discussion
The current results replicate the finding that not only peripheral but also central vision uses statistical information, even though the amplitude differences were removed; it is unlikely that differences in the Fourier power spectra fully explain this finding. However, unlike in the previous experiment, the degree of impairment was the same regardless of eccentricity. One interpretation is that amplitude equalization caused performance in the periphery to approach floor level and hence eliminated the interaction effect. We discuss why overall performance dropped relative to Experiment 1 in the General discussion. 
The results contradict the prediction that shape information makes a greater contribution to central vision. The absence of an eccentricity effect on the performance impairment when the use of shape information was limited was consistent between Experiments 1 and 2. In particular, Experiment 2 yielded lower scores than Experiment 1, so a ceiling effect was unlikely. 
Experiment 3
In Experiment 3, we investigated the contribution of P-S statistics under limited processing time, using a masking paradigm to determine how statistical information accumulates over tens of milliseconds. We assumed that the interval between target onset and mask onset corresponds to the information processing time, an assumption validated by a large body of work on visual masking (for a review, see Breitmeyer & Ögmen, 2006), including psychophysiological studies with natural scenes (Bacon-Macé et al., 2005; Rieger, Braun, Bülthoff, & Gegenfurtner, 2005). Without a mask, the information in a visual image persists after the image disappears, and its availability extends beyond the physical presentation duration (Coltheart, 1980; Sperling, 1960). The results obtained so far leave open the possibility that P-S statistics become available only after the short processing time within which ultrarapid categorization is achieved; they might not be extractable if information accumulation is interrupted by a mask. In addition, we used several durations to observe how differently shape and statistical information accumulate over time. We used noise masks whose amplitude spectra were inversely related to spatial frequency but whose phases were randomized; such masks are effective at disrupting the visual processing of scenes (Loschky et al., 2007). 
Methods
Participants
Twenty-four students from Kyoto University participated in the study (14 female, 10 male). The experiment lasted about 2 hr. All received Tosho cards (prepaid cards for purchasing books in Japan) equivalent to 2,000 yen for their participation. They gave informed consent prior to the start of the experiment and had normal or corrected-to-normal vision. 
Stimuli
An additional 240 original images (120 Animal, 120 Non-Animal) were added to the image set used in Experiment 1. The Fourier amplitudes of the images were left intact. All were grayscale images with the same mean luminance and RMS contrast of luminance. Next, the same number of textures were synthesized, each having the same P-S statistics as its original image. Moreover, 500 phase-randomized masks were generated: we decomposed white noise into amplitude and phase spectra, substituted the amplitude with the average spectrum of all the original images, and produced the masks by inverse transformation. The masks had the same mean luminance and RMS contrast of luminance as the original and synthesized images. On each trial, one of the 500 masks was chosen. The images subtended 5.44 × 5.44 degrees of visual angle. 
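The masks can be generated with the same machinery as the amplitude equalization in Experiment 2 (a sketch of our reconstruction; meanAmp denotes the assumed average amplitude spectrum of the original images, computed as in the Experiment 2 sketch):

    noise = rand(size(meanAmp));                     % white noise image
    ph    = angle(fft2(noise));                      % its random phase spectrum
    mask  = real(ifft2(meanAmp .* exp(1i * ph)));    % impose the average amplitude, keep the random phase
    % The mask is then rescaled to the common mean luminance and RMS contrast.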
Experimental design
In this experiment, participants experienced all possible combinations of two eccentricities, three cue types, and five stimulus durations. The cue type Statistics-only condition was paired with the Statistics+shape condition (testing the presence or absence of shape) and None condition (testing the statistical difference). The experiment included two independent 2 (eccentricity) × 2 (cue type) × 5 (durations) within-participant factorial designs. 
The factor settings of eccentricity and cue type were the same as Experiments 1 and 2. Participants experienced three sessions across which different types of cues were presented. The duration was set to 13, 40, 67, 93, or 120 ms. The sensitivity measure A′ and mean RT for hit and correct rejection responses were estimated as behavioral measures. 
Each target image (Animal) appeared once in each session. The relationship between images and the set of conditions (eccentricity and duration) was different across sessions. For distracter images (all but Animal), each was presented once throughout the experiment. To avoid any decision bias induced by the selection of images, the order of sessions and the relationship between the images and set of conditions were counterbalanced. In addition, half of the images were presented at the center of the screen and the rest were presented on the right side. Trials for the two eccentricity conditions were intermixed in a session with the same probability. 
Procedure
The only difference between Experiment 3 and Experiments 1 and 2 was that the stimulus was presented for one of five durations, after which a masking stimulus was displayed for 93 ms (Figure 2B). 
Results
The sensitivities were compared across conditions (Figure 6A). In contrast to the results of Experiments 1 and 2, a contribution of shape information was observed in peripheral rather than central vision. We first analyzed how shape affected performance using a three-way within-participants ANOVA, testing our first prediction that shape information makes a greater contribution to central than to peripheral vision; the None condition was not included. The analysis revealed main effects of all factors: eccentricity, F(1, 23) = 194.799, MSe = 0.013, p < 0.0001, ηG² = 0.326; cue type, F(1, 23) = 8.382, MSe = 0.016, p = 0.008, ηG² = 0.024; duration, F(2.65, 60.88) = 163.259, MSe = 0.012, p < 0.0001, ηG² = 0.490; as well as interaction effects of eccentricity × cue type, F(1, 23) = 4.364, MSe = 0.007, p = 0.048, ηG² = 0.006, and eccentricity × duration, F(2.15, 49.36) = 8.945, MSe = 0.019, p = 0.0004, ηG² = 0.063. The cue type × duration interaction and the second-order interaction were not significant, ps > 0.065. Since our concern was how the relationship between target and distracter affected discriminability, we reanalyzed the data pooled across duration conditions. We found a detrimental effect of shape information loss in the periphery, F(1, 23) = 9.193, MSe = 0.016, p = 0.006, ηG² = 0.038, but not in the fovea, F(1, 23) = 2.497, MSe = 0.007, p = 0.128, ηG² = 0.010. Likewise, we tested our second prediction that statistical information makes a smaller contribution to central than to peripheral vision; the Statistics+shape condition was not included in this analysis. We found a contribution of statistical information irrespective of eccentricity and stimulus duration. The main effects of eccentricity, F(1, 23) = 221.119, MSe = 0.016, p < 0.0001, ηG² = 0.359; cue type, F(1, 23) = 10.890, MSe = 0.020, p = 0.003, ηG² = 0.034; and duration, F(2.67, 61.46) = 87.204, MSe = 0.018, p < 0.0001, ηG² = 0.400, were significant. No interactions were significant, ps > 0.472, with the exception of eccentricity × duration, F(2.72, 62.51) = 8.8107, MSe = 0.014, p = 0.0001, ηG² = 0.052, which is unrelated to the difference in cue type. 
Figure 6
 
Experiment 3 data separated into two groups based on eccentricity condition. Left graphs show results when a stimulus was presented at 0°; right graphs, at 14°. The error bars indicate the standard errors of the mean. (A) Sensitivities. (B) Mean RTs calculated from hit responses. Although some participants made no hit response at 13 and 40 ms, mean values and standard errors calculated from the remaining participants are shown for reference. (C) Mean RTs from correct rejection responses.
We examined mean RTs for hit and correct rejection responses; the results did not indicate a speed–accuracy tradeoff. Owing to missing responses from some participants, hit responses at 13 and 40 ms were not statistically analyzed. The definition of outlier RTs was the same as in Experiment 1 and excluded 5.5% (hits) and 5.2% (correct rejections) of responses on average. Figure 6B and C shows mean RTs for hit responses and correct rejections, respectively. For hit responses, no main or interaction effects associated with cue type were significant, irrespective of the pair of cue types compared, ps > 0.109. For correct rejection responses, the only significant effect involving cue type was the cue type × duration interaction in the Statistics+shape versus Statistics-only comparison, p = 0.017. When we split the data into the five duration groups, the analysis revealed that Non-Animal distracters in the Statistics+shape condition were rejected faster than Texture(Non-Animal) distracters in the Statistics-only condition at the 120-ms duration, p = 0.015. 
Discussion
As in Experiments 1 and 2, the current results contradicted the prediction that shape information is more crucial for central than for peripheral vision; rather, they indicated the opposite pattern. 
For statistical information, the results suggest that humans use it in both central and peripheral vision, with comparable effects, even when the target was masked within tens of milliseconds. Previous masking studies have shown that periods of undistorted processing of about 40–60 ms (Bacon-Macé et al., 2005) or 90 ms (Rieger et al., 2005) are sufficient for asymptotic performance; our results suggest that P-S statistics are extracted and used at this early stage of visual processing across a wide range of the visual field. There is room to discuss whether the impairments were truly comparable: similar to Experiment 2, scores in the current experiment were lower than in Experiment 1, which might have produced a floor effect in the peripheral condition. 
Experiment 4
In Experiment 4, we investigated whether humans are sensitive to differences in the statistics attributable to the presence or absence of animals, without global shape clues. In the previous experiments, Animal images served as targets to establish an “animal detection” task in which participants expected the clear presence of animal objects in pictures. Unlike the synthesized textures, however, Animal images offer global shape information as a clue for animal detection in addition to the P-S statistics. Hence, whether the statistics by themselves contribute to the task remains unanswered; in the extreme case, the statistics might need to coexist with global shape clues to be usable. If humans can judge animal presence or absence from statistical information alone, performance when discriminating Texture(Animal) from Texture(Non-Animal) images should exceed chance level. 
Methods
Participants
Thirteen students from Kyoto University participated in the study (2 female, 11 male). The experiments lasted about 1 hr and all participants received 1,000 yen/hr for their participation. They gave informed consent prior to the start of the experiment and had normal or corrected-to-normal vision. 
Stimuli
All the images of Experiment 1 were used. Each group had 120 images, and 480 images were used overall. 
Experimental design
Two independent variables were manipulated in this experiment: the eccentricity of stimulus presentation and the image property of the targets and distracters. Participants experienced all possible combinations of the variables. The eccentricities were 0° and 14° right, as in Experiment 1. The image property was whether the stimuli to be discriminated were original pictures or synthesized textures. Although comparing Texture(Animal) with Texture(Non-Animal) would have been sufficient to achieve our goal, participants might have found it hard to stay motivated given the task difficulty, so we also included trials discriminating the nontextured groups as fillers. Target groups were Animal and Texture(Animal), and distracter groups were Non-Animal and Texture(Non-Animal). The dependent variable was A′. Each image was presented once at each eccentricity. The experiment had two sessions, and all four groups of images appeared in each session. On each trial, a target or a distracter image was presented with a probability of 50%, and images were presented equally often at each eccentricity. 
Procedure
Two sessions, each composed of 10 blocks of 48 trials, were performed (960 trials in total). The task began with a practice of 96 trials before the first session. Trial sequences were the same as in Experiment 1; the only difference was the response assignment. In this experiment, original Animal and synthesized Texture(Animal) images served as targets. Participants were told in advance that the textures were synthesized from natural images using a specific algorithm, and they were shown 24 original and 24 texture samples to see how the algorithm operated. 
Results
Figure 7 shows the sensitivities. Responses associated with originals and textures were separated to estimate A′ values. Scores for Animal versus Non-Animal are also given, although our primary interest is performance when discriminating the textures. A score greater than .5 means that participants performed the task above chance level. One-sample t tests revealed above-chance performance for texture discrimination at both eccentricities: 0°, t(12) = 11.596, p < 0.0001, d = 3.347; 14°, t(12) = 5.620, p < 0.001, d = 1.622. 
Figure 7
 
Sensitivities with standard errors of the mean in Experiment 4. Two types of discrimination, Animal versus Non-Animal and Texture(Animal) versus Texture(Non-Animal) were analyzed separately.
Discussion
Participants could discriminate the textures based on statistics that covary with the presence or absence of animals in a picture, suggesting that the P-S statistics themselves are useful information across a wide range of the visual field, even when images contain no global configuration of local features. Moreover, the mild decrease in texture-discrimination performance from central to peripheral presentation suggests that the statistical information remains robustly available regardless of eccentricity. 
We withhold interpretation of the performance difference between the Animal versus Non-Animal and Texture(Animal) versus Texture(Non-Animal) discriminations. As mentioned at the beginning of the Experiment 1 section, we should be cautious about comparing these conditions: the cognitive processing involved in texture discrimination may include some kind of makeshift (and worthless) strategy. We therefore discuss the shape contribution based primarily on the results of Experiments 1, 2, and 3. 
Experiment 5
Experiment 5 was conducted to investigate whether the P-S statistics of each image predict human performance. The images used in this study vary: some are easily discriminated and others are not. If the higher-order statistics actually reflect the information humans use, then a picture whose statistics are diagnostic of its category should be easy for a human to discriminate. To test this, we sorted pictures into easy-to-discriminate and difficult-to-discriminate groups using a supervised machine learning technique and then observed how easily human participants recognized them. 
Discrimination by machine learning
Images
The image database provided by Serre et al. (2007) contains 600 images defined as Animal and 600 as Non-Animal. We excluded misleading images: one Animal image that contains no animal, two Non-Animal images that contain an animal, and 59 Non-Animal images that include humans. There were 599 Animal and 539 Non-Animal images remaining. 
Evaluation by machine learning
We generated a machine-learned model based on the higher-order statistics. The P-S algorithm outputs many parameters measured from a wavelet-based decomposition. We obtained a feature vector of these statistics (1,229 coefficients) for each image and used the support vector machine library LIBSVM (Chang & Lin, 2001) to construct the learned model. LIBSVM is a well-known library for support vector machines (SVMs): it learns from a set of vectors belonging to two different groups and generates a model to separate them; the model can then classify an untrained data set. The procedure was as follows. First, the 1,138 images from the Animal and Non-Animal groups were separated into training and test sets; the training set included an equal number of Animal and Non-Animal images (269 each). Second, LIBSVM learned a model, and we estimated how accurately the test images were discriminated. We repeated the procedure 10,000 times, with training and test sets selected randomly each time, and obtained the proportion of correct classifications for each image. LIBSVM requires some parameter configuration: we used C-support vector classification (C-SVC), a radial basis function kernel, and set the cost parameter C to one, corresponding to the default settings of the library. Although searching for optimal parameters might improve discrimination performance, our goal was to observe the relative difficulty of discriminating different images. Consequently, each image was included in the test set and classified 5,272 times on average (SD = 255). 
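One iteration of this procedure might look as follows, assuming LIBSVM's MATLAB interface (svmtrain/svmpredict), a matrix X of P-S feature vectors (one 1,229-coefficient row per image), and a label vector y coding Animal as +1 and Non-Animal as -1; the variable names are ours, not the authors':

    animalIdx    = find(y == 1);
    nonAnimalIdx = find(y == -1);
    trainA = animalIdx(randperm(numel(animalIdx), 269));         % 269 Animal training images
    trainN = nonAnimalIdx(randperm(numel(nonAnimalIdx), 269));   % 269 Non-Animal training images
    trainIdx = [trainA; trainN];
    testIdx  = setdiff((1:numel(y))', trainIdx);                 % everything else is tested

    % C-SVC (-s 0), RBF kernel (-t 2), cost C = 1 (-c 1), matching the settings above
    model = svmtrain(y(trainIdx), X(trainIdx, :), '-s 0 -t 2 -c 1');
    pred  = svmpredict(y(testIdx), X(testIdx, :), model);
    correct = (pred == y(testIdx));   % tallied over the 10,000 repeats to give per-image accuracy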
Image selection
A high proportion of correct classifications for a picture indicates that its feature vector carries information representative of the group to which it belongs. We sorted the images by their proportion of correct classifications and divided them into subsets of 120 Easy and 120 Difficult images. Figure 8A and B shows examples, and Figure 8C gives the average proportions correct: the Easy group contained images classified with perfect accuracy, whereas the Difficult group consisted of images classified with nearly zero accuracy. 
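A simple way to express this selection step, assuming prop_correct is the per-image proportion returned by the sketch above (an assumption carried over from that example):

```python
import numpy as np

def split_easy_difficult(prop_correct, n=120):
    """Sort images by how often the classifier got them right and return
    the indices of the n worst (Difficult) and n best (Easy) images."""
    order = np.argsort(prop_correct)   # ascending classification accuracy
    return order[:n], order[-n:]       # (difficult_idx, easy_idx)
```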
Figure 8
 
(A–B) Examples of images classified as Easy (A) and Difficult (B) based on the support vector machine library, LIBSVM (Chang & Lin, 2001). Each group contained 120 images. Animal images are arranged in the top row and Non-Animal in the bottom row. (C) Proportion of correct responses by LIBSVM. Values indicated by bars were obtained for each image by averaging the proportion correctly classified over a large number of classification tests. In this figure, error bars indicate standard deviation.
Human discrimination
Participants
Fifteen students from Kyoto University (six female, nine male) participated in the experiment, which lasted about 1 hr. Thirteen received Tosho cards worth 1,000 yen for their participation, and two participated for course credit. All gave informed consent prior to the start of the experiment and had normal or corrected-to-normal vision. 
Experimental design
Two independent variables were manipulated: eccentricity (0° vs. 14° right) and image difficulty (Easy vs. Difficult). We note again that the Easy and Difficult labels are based on the proportion of correct classifications by the machine-learning procedure; participants were not informed of this grouping during the experiment. The behavioral measures were A′ and the mean RTs for hit and correct rejection responses. 
Each image appeared once at the center of the screen (0°) and once 14° to the right. In the human discrimination task, participants saw the original images; synthesized textures were not used. The experiment consisted of two sessions, each of which included all possible combinations of the variables. Trials presenting target images and trials presenting distracter images were equal in number, and all conditions appeared with equal probability. 
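For reference, the sensitivity measure A′ can be computed from hit and false-alarm rates with the standard nonparametric formula (Pollack & Norman, 1964; Stanislaw & Todorov, 1999); the sketch below is a textbook implementation of that formula, not the analysis code used in the study.

```python
def a_prime(hit_rate: float, fa_rate: float) -> float:
    """Nonparametric sensitivity A' from hit and false-alarm rates
    (standard formula; Pollack & Norman, 1964; Stanislaw & Todorov, 1999)."""
    h, f = hit_rate, fa_rate
    if h == f:
        return 0.5
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))

# Example: a_prime(0.9, 0.2) is approximately 0.91.
```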
Procedure
A session was composed of 10 blocks of 48 trials, and all participants completed two sessions, yielding 960 judgments in total. A practice session (84 trials) preceded the experimental blocks. The sequence of this experiment was almost the same as in Experiment 1, except that the exposure duration was shortened to 27 ms. Participants were asked to rest for at least 3 min between sessions. 
Results
Figure 9A shows the sensitivities, with responses to Easy and Difficult images analyzed separately. Performance for Easy images was higher than for Difficult images at both eccentricities. A 2 × 2 ANOVA revealed significant main effects of both factors, ps < 0.0001, and a significant interaction, F(1, 14) = 98.261, MSe = 0.001, p < 0.0001, η²G = 0.324. We further compared the Easy and Difficult groups at each eccentricity and found that scores for Easy images were significantly higher than those for Difficult images, paired t tests, ps < 0.0001. 
Figure 9
 
Experiment 5 data by human participants. The average performances are shown with standard errors of the mean. Easy and Difficult groups were analyzed separately. (A) Sensitivities measured by A′. (B) Mean RTs calculated from hit responses. (C) Mean RTs from correct rejection responses.
Outlier RTs, amounting to 5.1% (hits) and 4.6% (correct rejections) of trials on average, were excluded in advance. Figure 9B and C shows the mean RTs for hits and correct rejections, respectively. The RT pattern showed no sign of a speed–accuracy tradeoff. For hits, the main effects of both factors were significant, ps < 0.0001, whereas the interaction was not, p = 0.059, indicating that mean RTs for Easy images were shorter than for Difficult images at both eccentricities. For correct rejections, the main effects of both factors were significant, ps < 0.001, whereas the interaction was not, p = 0.504. 
Discussion
Images that were easy (or difficult) to discriminate by machine learning were also easy (or difficult) for human participants to discriminate at both eccentricities. This suggests that the higher-order statistics predict human performance, providing converging support for our claim that these statistics serve as critical information for animal detection in both central and peripheral vision (Rosenholtz, Huang, Raj et al., 2012b). 
General discussion
The present study addressed how eccentricity affects the contribution of shape and higher-order statistical information to rapid object categorization. We manipulated the availability of this information using textures synthesized to match the statistics of the original scenes. We tested whether shape information makes a greater contribution to central than to peripheral vision, and whether statistical information makes a smaller contribution to central than to peripheral vision. The results did not indicate a clearly biased contribution. For shape, our data did not support a larger contribution in central vision: its contribution in central vision was as large as (Experiments 1 and 2) or smaller than (Experiment 3) that in peripheral vision. For statistical information, the results indicated a robust contribution in both central and peripheral vision but were inconclusive about whether the contribution increases with eccentricity. We further confirmed in Experiment 3 that the statistical information is available within a limited processing time. In Experiment 4, we showed that the statistics themselves supported the discrimination of textures synthesized from Animal and Non-Animal images. Finally, in Experiment 5, we compared human performance with that of a machine-learning classifier and found a relationship between them. 
The contribution of statistical information argues against the possibility that the statistics are not used in central vision owing to its fine-scaled receptive fields. Central vision consists of many extremely small receptive fields, so the spatial alignment of visual elements is not disrupted within those fields (Balas et al., 2009; Freeman & Simoncelli, 2011; Rosenholtz, Huang, & Ehinger, 2012a; Rosenholtz, Huang, Raj et al., 2012b). Instead, we argue that the statistical computation is applied to overcome the time constraint of rapid categorization. Animals can be detected with only several tens of milliseconds of uninterrupted processing (Bacon-Macé et al., 2005; Rieger et al., 2005), and such categorization may even precede a perceptual grouping phase (Korjoukov et al., 2012). A hierarchical approach to object recognition that presupposes grouping and contour integration is therefore vulnerable when analyzing complex scenes under such time constraints. 
Our results extend the cognitive role of higher-order statistics in visual processing. Previous studies using these statistics have focused on how a statistical representation of visual input relates to our impoverished perception in peripheral vision (Balas et al., 2009; Freeman & Simoncelli, 2011; Rosenholtz, Huang, Raj et al., 2012b). In that literature, texture synthesis techniques were used to visualize the degraded representation in the periphery, with higher-order statistics treated as a candidate for that representation. Our findings refocus attention on the potential of these statistics for higher-level object understanding. This is also consistent with the finding of Gaspar and Rousselet (2009), who emphasized the importance of the interaction between phase and amplitude, rather than amplitude alone, in an animal detection task; they pointed out that the two codetermine the appearance of image structures such as edges and corners. We note that both phase and amplitude are partially reflected in the P-S statistics. 
Central vision did not place as much weight on shape information as predicted: its contribution in the fovea was comparable to (Experiments 1 and 2) or smaller than (Experiment 3) that in the periphery. This seems inconsistent with the argument that visual elements are spatially pooled within large peripheral receptive fields (Balas et al., 2009; Freeman & Simoncelli, 2011). One interpretation is that shape information is extracted slowly, so that its overall contribution was small given the processing time available in our experimental setting. Another interpretation is that the performance impairment caused by disrupting “shape” (Statistics-only vs. Statistics+Shape) might also reflect the unintended disruption of a type of statistical information that cannot be encoded in the textures used in our experiments. Unlike studies using similar textures (Freeman & Simoncelli, 2011; Rosenholtz, Huang, Raj et al., 2012b), the synthesis algorithm in this study did not apply multiple, overlapping pooling regions scaled with receptive field size. This could have mixed visual elements across regions more than necessary and disrupted statistical information that would otherwise have been preserved within each region, which might mask an eccentricity-dependent effect of shape information if such statistics favor peripheral vision. This interpretation could be tested with textures synthesized using such pooling regions, although generating them is technically difficult. Nevertheless, neither interpretation contradicts the view that shape information plays a minor role in central vision under rapid processing. 
We note that the current algorithm suffices for investigating the eccentricity-dependent use of shape and statistical information. Indeed, the results of Experiment 4 indicate that participants can discriminate textures synthesized from Animal and Non-Animal images on the basis of the P-S statistics, suggesting that the P-S algorithm encodes at least some of the statistics humans extract. 
An additional important aspect of the present study is that the eccentricity-dependent roles of shape and statistics were tested within the same experimental setting. Rosenholtz, Huang, Raj et al. (2012b) demonstrated that performance when images were viewed naturally correlated with performance when textural images mimicked the degraded percept that results from information pooling within receptive fields. They compared behavioral responses obtained from two different tasks, a go/no-go animal detection task and a texture classification task, and found a linear relationship between the responses: for example, when a texture was frequently classified as “animal,” the natural image from which it was synthesized also received frequent “animal” responses. What they uncovered, however, was that the information represented by a certain type of statistics (computed in overlapping pooling regions) is predictive of task performance and thus consistent with the information available to the visual system; the contribution of shape was not studied. Correlating performance for natural images and synthesized textures cannot evaluate the contribution of other information, such as global shape, whereas our study experimentally separated the effect of shape from that of the statistics. Moreover, their design was not suited to observing how the contribution of the information varies with eccentricity: because they took receptive field size into account when generating textures, different textures were synthesized to imitate perception at different eccentricities, which complicates the comparison of results from central and peripheral vision, in contrast to our experiments. 
Our experiments were analogous to those of Elder and Velisavljević (2009), whose study of animal detection explored the roles of potential cues such as boundary shape, texture, luminance, and color in central vision. Using a masking paradigm, they reported significant contributions of shape and texture cues within a limited processing time. Their study differed from ours in how the availability of information was manipulated: they limited texture information by uniformly painting the inner region of a presegmented object with the average color of the corresponding segment, thereby demonstrating the contribution of the texture pattern (of the kind encoded in P-S statistics) when its location was constrained. Conversely, our method disrupted shape information. Our findings extend theirs: local structures still contribute to the task even when they are not gathered within a closed region. Additionally, our method highlights the direct use of statistical (texture) information. The presence of an inhomogeneous visual pattern within an object's boundary could also benefit the figure–ground segmentation process; using texture synthesis, we could observe the effect of the pattern exclusively as a direct cue. 
Our study reinforces the role of local structures in human categorical processing. In computer vision, different models have proposed different types of statistics for natural object and scene categorization. Along with second-order statistics (Torralba & Oliva, 2003), there are models that propose information similar to the P-S statistics: some scene classification models consider how frequently local features such as straight lines and curves with contrast polarity occur (Fei-Fei & Perona, 2005; Renninger & Malik, 2004), and for objects, visual features of intermediate complexity have been proposed as optimal components for classification (Serre et al., 2007; Shotton et al., 2008; Ullman, Vidal-Naquet, & Sali, 2002). However, these claims relied on comparing the proposed models with human performance (Renninger & Malik, 2004; Serre et al., 2007). A mere comparison is insufficient because it might reflect a spurious correlation, as illustrated by the power-spectrum study of Wichmann et al. (2010), which identified a problem in the image database researchers had been using: images containing animals in the central region tend to have a narrow depth of field, which produces a characteristic pattern of power spectra. Wichmann et al. found that this pattern is a potentially strong, but largely irrelevant, cue for human vision, and they conjectured that the performance of a computational model (Torralba & Oliva, 2003) correlated with human performance only incidentally. Our results, by contrast, indicate that textures sharing the statistics of images containing animals were actually confused with the original images containing animals. 
The control of fixation is an important issue when investigating the eccentricity-dependent roles of shape and statistical information. We have no direct evidence that participants actually gazed at the fixation point, because we did not record eye movements during the task. However, a post hoc analysis suggests that the effect of ignoring the fixation instruction was negligible. We estimated how many participants might have ignored the instruction and reanalyzed the data with their responses excluded. If a participant's fixation was attracted to a peripheral location, this should reduce the performance difference between central and peripheral presentation; moreover, fixating farther than 7° to the right of the screen center (i.e., beyond the halfway point between the central and peripheral image locations) could reverse the performance relationship between the eccentricity conditions. We therefore identified condition pairs in which A′ was higher at 14° than at 0° and excluded from the reanalysis any participant showing such a pair. In Experiments 1 and 5, no participant met this criterion; in Experiments 2, 3, and 4, a small subset of comparison pairs did, but excluding the corresponding participants did not affect our conclusions in those experiments. 
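As an illustration of this exclusion rule, the sketch below removes any participant who has at least one condition in which A′ at 14° exceeds A′ at 0°; the long-format table and its column names are assumptions made for illustration, not the layout of our data files.

```python
import pandas as pd

def exclude_suspect_fixators(df: pd.DataFrame) -> pd.DataFrame:
    """df has columns 'participant', 'condition', 'ecc' (0 or 14), 'a_prime'
    (hypothetical layout). Drop participants with any condition in which
    A' at 14 deg exceeds A' at 0 deg."""
    wide = df.pivot_table(index=["participant", "condition"],
                          columns="ecc", values="a_prime")
    flagged = wide[wide[14] > wide[0]].index.get_level_values("participant").unique()
    return df[~df["participant"].isin(flagged)]
```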
One might ask why Experiments 1 and 2 yielded different levels of performance, with Experiment 2 scoring lower than Experiment 1. The two experiments used the same paradigm except that the power spectra (i.e., second-order statistics) were equalized across images in Experiment 2 but not in Experiment 1. This seems inconsistent with reports that the power spectrum is not a diagnostic cue for humans (Gaspar & Rousselet, 2009; Wichmann et al., 2010). One possible explanation is that higher-order statistics are not, in general, independent of second-order statistics. The P-S statistics capture correlations of responses across different orientations, frequencies, and positions using a set of wavelet filters. In terms of the Fourier transform, manipulating the power spectrum amounts to increasing the amplitude of some sinusoidal components and decreasing that of others; amplitude equalization therefore inevitably affects the outputs of the wavelet filters and the correlations between them. The study by Gaspar and Rousselet (2009) supports this idea: they reported that amplitude equalization between natural images influenced the appearance of local structures as measured by local phase congruency (Kovesi, 2000, 2003), which determines the salience of the edges and corners encoded by the P-S statistics. This manipulation might therefore have made it harder to match the statistical representation of the visual input to internal knowledge about the statistics of scenes containing animals. 
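To make this dependence concrete, the following sketch shows one generic way of equalizing amplitude spectra across a set of grayscale images: each image keeps its own phase spectrum but receives the mean amplitude spectrum of the set. It illustrates why the manipulation necessarily changes local structure; it is not the procedure used to construct the Experiment 2 stimuli.

```python
import numpy as np

def equalize_amplitude(images: np.ndarray) -> np.ndarray:
    """images: (n_images, H, W) float grayscale array. Returns images whose
    Fourier amplitude spectra are replaced by the mean spectrum of the set,
    while each image's phase spectrum is preserved."""
    spectra = np.fft.fft2(images)              # complex spectra, per image
    mean_amp = np.abs(spectra).mean(axis=0)    # shared amplitude spectrum
    phases = np.angle(spectra)                 # per-image phase
    return np.fft.ifft2(mean_amp * np.exp(1j * phases)).real
```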
The definition of shape needs to be considered carefully. We operationally defined “shape” as the closed contours possessed by meaningful, familiar objects in the real world. Under this definition, we can discuss the contributions of shape and statistical information using the texture synthesis technique, but we should be cautious about the validity of this operational definition. In particular, it does not consider the visual patterns of local features that constitute fragments of shape, and in Experiments 1, 2, and 3 such fragment-level shape may have confounded the comparison between the Statistics+Shape and Statistics-only conditions. Nevertheless, our manipulation of eliminating shape information should remain valid: if participants did not rely on the global shape cue at all, performance in the Statistics-only condition should have exceeded that in the Statistics+Shape condition, owing to the clearer difference in image coherence between Animal and Texture(Non-Animal) images. Our attempt to dissociate shape and statistics should be useful for future studies, although the dissociation itself may remain problematic. 
Future work is needed to overcome the limitations of our study. We do not know how our findings generalize beyond the animal category; a variety of categories, such as vehicles (VanRullen & Thorpe, 2001a, 2001b), foods (Delorme, Richard, & Fabre-Thorpe, 2000), and human faces (Rousselet, Macé, & Fabre-Thorpe, 2003), have been discussed in the context of rapid categorization. In addition, it is unclear at what level of abstraction categorization by statistics can be achieved: an object can be understood at multiple levels of abstraction (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976), with finer categories requiring more time to process (Macé, Joubert, Nespoulous, & Fabre-Thorpe, 2009). Finally, our data cannot reveal how the statistical information is used. Humans might activate the intrinsic shape of an object from its statistical properties, or they might directly compare those properties with statistical knowledge about the real world stored in long-term memory. 
In conclusion, our data suggest that the prediction that central vision relies on shape information while peripheral vision relies on statistics is inaccurate. When facing the time constraint imposed by rapid categorical processing, central vision might use higher-order statistical information more proactively than expected. 
Acknowledgments
This work was supported by a Grant-in-Aid for Scientific Research (21300103, 23135515, 24240041, 25135719, and 25880023), and for JSPS Fellows (22-2552) from the Japan Society for the Promotion of Science. 
Commercial relationships: none. 
Corresponding author: Hayaki Banno. 
Address: Kyoto University, Kyoto, Japan. 
References
Bacon-Macé N. Macé M. J.-M. Fabre-Thorpe M. Thorpe S. J. (2005). The time course of visual processing: Backward masking and natural scene categorisation. Vision Research, 45, 1459–1469. doi:10.1016/j.visres.2005.01.004. [CrossRef] [PubMed]
Balas B. (2006). Texture synthesis and perception: Using computational models to study texture representations in the human visual system. Vision Research, 46, 299–309. doi:10.1016/j.visres.2005.04.013. [CrossRef] [PubMed]
Balas B. Nakano L. Rosenholtz R. (2009). A summary-statistic representation in peripheral vision explains visual crowding. Journal of Vision, 9 (12): 13, 1–18, http://www.journalofvision.org/content/9/12/13, doi:10.1167/9.12.13. [PubMed] [Article]
Bergen J. R. Julesz B. (1983). Parallel versus serial processing in rapid pattern discrimination. Nature, 303, 696–698. doi:10.1038/303696a0. [CrossRef] [PubMed]
Biederman I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115–147. doi:10.1037/0033-295X.94.2.115. [CrossRef] [PubMed]
Brainard D. H. (1997). The psychophysics toolbox. Spatial Vision, 10, 433–436. doi:10.1163/156856897X00357. [CrossRef] [PubMed]
Breitmeyer B. Ögmen H. (2006). Visual masking: Time slices through conscious and unconscious vision. New York: Oxford University Press.
Bullier J. Hupe J. M. James A. C. Girard P. (2001). The role of feedback connections in shaping the responses of visual cortical neurons. Progress in Brain Research, 134, 193–204. doi:10.1016/S0079-6123(01)34014-1. [PubMed]
Chang C. C. Lin C. J. (2001). LIBSVM: A library for support vector machines [Computer software]. Retrieved from http://www.csie.ntu.edu.tw/~cjlin/libsvm
Coltheart M. (1980). Iconic memory and visible persistence. Perception & Psychophysics, 27, 183–228. doi:10.3758/BF03204258. [CrossRef] [PubMed]
Delorme A. Richard G. Fabre-Thorpe M. (2000). Ultra-rapid categorisation of natural scenes does not rely on colour cues: A study in monkeys and humans. Vision Research, 40 (16), 2187–2200. [CrossRef] [PubMed]
Elder J. H. Velisavljević L. (2009). Cue dynamics underlying rapid detection of animals in natural scenes. Journal of Vision, 9 (7): 7, 1–20, http://www.journalofvision.org/content/9/7/7, doi:10.1167/9.7.7. [PubMed] [Article]
Fei-Fei L. Perona P. (2005). A Bayesian hierarchical model for learning natural scene categories. In Schmid C. Soatto S. Tomasi C. (Eds.), Computer vision and pattern recognition (Vol. 2, pp. 524–531). Los Alamitos, CA: IEEE Computer Society Press, doi:10.1109/CVPR.2005.16.
Field D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4 (12), 2379–2394, doi:10.1364/JOSAA.4.002379. [CrossRef]
Field D. J. (1989). What the statistics of natural images tell us about visual coding. In Rogowitz B. E. (Ed.), Proceedings of SPIE, Human Vision, Visual Processing, and Digital Display (Vol. 1077, pp. 269–276). Bellingham, WA, USA. doi:10.1117/12.952724.
Freeman J. Simoncelli E. P. (2011). Metamers of the ventral stream. Nature Neuroscience, 14 (9), 1195–1201, doi:10.1038/nn.2889. [CrossRef] [PubMed]
Freeman J. Ziemba C. M. Heeger D. J. Simoncelli E. P. Movshon J. A. (2013). A functional and perceptual signature of the second visual area in primates. Nature Neuroscience, 16, 974–981, doi: 10.1038/nn.3402. [CrossRef] [PubMed]
Gaspar C. M. Rousselet G. A. (2009). How do amplitude spectra influence rapid animal detection? Vision Research, 49, 3001–3012, doi:10.1016/j.visres.2009.09.021. [CrossRef] [PubMed]
Grossberg S. Mingolla E. (1985). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychological Review, 92, 173–211, doi:10.1037/0033-295X.92.2.173. [CrossRef] [PubMed]
Guyader N. Chauvin A. Peyrin C. Hérault J. Marendaz C. (2004). Image phase or amplitude? Rapid scene categorization is an amplitude-based process. Comptes Rendus Biologies, 327 (4), 313–318, doi:10.1016/j.crvi.2004.02.006. [CrossRef] [PubMed]
Hubel D. Wiesel T. (1959). Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology (London), 148, 574–591. [CrossRef]
Hubel D. Wiesel T. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology (London), 195, 215–243. [CrossRef]
Hupé J. M. James A. C. Payne B. R. Lomber S. G. Girard P. Bullier J. (1998). Cortical feedback improves discrimination between figure and background by V1, V2, and V3 neurons. Nature, 394 (6695), 784–787, doi:10.1038/29537. [PubMed]
Kaping D. Tzvetanov T. Treue S. (2007). Adaptation to statistical properties of visual scenes biases rapid categorization. Visual Cognition, 15 (1), 12–19, doi:10.1080/13506280600856660. [CrossRef]
Korjoukov I. Jeurissen D. Kloosterman N. A. Verhoeven J. E. Scholte H. S. Roelfsema P. R. (2012). The time course of perceptual grouping in natural scenes. Psychological Science, 23 (12), 1482–1489, doi:10.1177/0956797612443832. [CrossRef] [PubMed]
Kovesi P. (2000). Phase congruency: A low-level image invariant. Psychological Research, 64 (2), 136–148, doi:10.1007/s004260000024. [CrossRef] [PubMed]
Kovesi P. (2003, December). Phase congruency detects corners and edges. In The Australian Pattern Recognition Society Conference: DICTA 2003 ( pp. 309–318). Sydney, Australia.
Lettvin J. Y. (1976). On seeing sidelong. The Sciences, 16 (4), 10–20. [CrossRef]
Loschky L. C. Hansen B. C. Sethi A. Pydimarri T. N. (2010). The role of higher order image statistics in masking scene gist recognition. Attention, Perception, & Psychophysics, 72 (2), 427–444, doi:10.3758/APP.72.2.427. [CrossRef]
Loschky L. C. Sethi A. Simons D. J. Pydimarri T. N. Ochs D. Corbeille J. L. (2007). The importance of information localization in scene gist recognition. Journal of Experimental Psychology. Human Perception and Performance, 33 (6), 1431–1450, doi:10.1037/0096-1523.33.6.1431. [CrossRef] [PubMed]
Macé M. J.-M. Joubert O. R. Nespoulous J. Fabre-Thorpe M. (2009). The time-course of visual categorizations: You spot the animal faster than the bird. PloS ONE, 4 (6), e5927, doi:10.1371/journal.pone.0005927. [CrossRef] [PubMed]
Malik J. Perona P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7 (5), 923–932, doi:10.1364/JOSAA.7.000923. [CrossRef]
Marr D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: W. H. Freeman.
May K. A. Hess R. F. (2008). Effects of element separation and carrier wavelength on detection of snakes and ladders: Implications for models of contour integration. Journal of Vision, 8 (13): 4, 1–23, http://www.journalofvision.org/content/8/13/4, doi:10.1167/8.13.4. [PubMed] [Article]
Motoyoshi I. Nishida S. Sharan L. Adelson E. H. (2007). Image statistics and the perception of surface qualities. Nature, 447, 206–209, doi:10.1038/nature05724. [CrossRef] [PubMed]
Olejnik S. Algina J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8, 434–447, doi:10.1037/1082-989X.8.4.434. [CrossRef] [PubMed]
Oliva A. Torralba A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175, doi:10.1023/A:1011139631724. [CrossRef]
Parkes L. Lund J. Angelucci A. Solomon J. Morgan M. (2001). Compulsory averaging of crowded orientation signals in human vision. Nature Neuroscience, 4, 739–744, doi:10.1038/89532. [CrossRef] [PubMed]
Pelli D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision, 10, 437–442, doi:10.1163/156856897X00366. [CrossRef] [PubMed]
Pollack I. Norman D. A. (1964). A nonparametric analysis of recognition experiments. Psychonomic Science, 1, 125–126. [CrossRef]
Portilla J. Simoncelli E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40, 49–71, doi:10.1023/A:1026553619983. [CrossRef]
Reinhard E. Pouli T. Cunningham D. (2010, July). Image statistics: From data collection to applications in graphics. In ACM SIGGRAPH 2010 Courses (No. 6). New York: Association for Computing Machinery.
Renninger L. W. Malik J. (2004). When is scene identification just texture recognition? Vision Research, 44, 2301–2311, doi:10.1016/j.visres.2004.04.006. [CrossRef] [PubMed]
Rieger J. W. Braun C. Bülthoff H. H. Gegenfurtner K. R. (2005). The dynamics of visual pattern masking in natural scene processing: A magnetoencephalography study. Journal of Vision, 5 (3): 10, 275–286, http://www.journalofvision.org/content/5/3/10, doi:10.1167/5.3.10. [PubMed] [Article] [PubMed]
Roelfsema P. R. Lamme V. A. F. Spekreijse H. Bosch H. (2002). Figure–ground segregation in a recurrent network architecture. Journal of Cognitive Neuroscience, 14, 525–537, doi:10.1162/08989290260045756. [CrossRef] [PubMed]
Rosch E. Mervis C. B. Gray W. D. Johnson D. M. Boyes-Braem P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439, doi:10.1016/0010-0285(76)90013-X. [CrossRef]
Rosenholtz R. Huang J. Ehinger K. A. (2012a). Rethinking the role of top-down attention in vision: Effects attributable to a lossy representation in peripheral vision. Frontiers in Psychology, 3, 13, doi:10.3389/fpsyg.2012.00013. [CrossRef]
Rosenholtz R. Huang J. Raj A. Balas B. J. Ilie L. (2012b). A summary statistic representation in peripheral vision explains visual search. Journal of Vision, 12 (4): 14, 1–17, http://www.journalofvision.org/content/12/4/14, doi:10.1167/12.4.14. [PubMed] [Article]
Rosenholtz R. Twarog N. R. Schinkel-Bielefeld N. Wattenberg M. (2009). An intuitive model of perceptual grouping for HCI design. In Olsen D. R. Jr. R. B. Arthur K. Hinckley M. Ringel Morris Hudson S. E. Greenberg S. (Eds.), Proceedings of the SIGCHI Conference on Human Factors in Computing Systems CHI 2009 ( pp. 1331–1340). New York: Association for Computing Machinery. doi:10.1145/1518701.1518903.
Rousselet G. A. Macé M. J.-M. Fabre-Thorpe M. (2003). Is it an animal? Is it a human face? Fast processing in upright and inverted natural scenes. Journal of Vision, 3 (6): 5, 440–455, http://www.journalofvision.org/content/3/6/5, doi:10.1167/3.6.5. [PubMed] [Article] [PubMed]
Scholte H. S. Jolij J. Fahrenfort J. J. Lamme V. A. (2008). Feedforward and recurrent processing in scene segmentation: Electroencephalography and functional magnetic resonance imaging. Journal of Cognitive Neuroscience, 20, 2097–2109, doi:10.1162/jocn.2008.20142. [CrossRef] [PubMed]
Serre T. Oliva A. Poggio T. (2007). A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, USA, 104, 6424–6429, doi:10.1073/pnas.0700622104. [CrossRef]
Shotton J. Blake A. Cipolla R. (2008). Multiscale categorical object recognition using contour fragments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1270–1281, doi:10.1109/TPAMI.2007.70772. [CrossRef] [PubMed]
Sperling G. (1960). The information available in brief visual presentations. Psychological Monographs: General and Applied, 74 (11), 1–29, doi:10.1037/h0093759. [CrossRef]
Stanislaw H. Todorov N. (1999). Calculation of signal detection theory measures. Behavior Research Methods, Instruments, & Computers, 31, 137–149, doi:10.3758/BF03207704. [CrossRef]
Supèr H. Lamme V. A. (2007). Altered figure–ground perception in monkeys with an extra-striate lesion. Neuropsychologia, 45, 3329–3334, doi:10.1016/j.neuropsychologia.2007.07.001. [CrossRef] [PubMed]
Supèr H. Romeo A. Keil M. (2010). Feed-forward segmentation of figure–ground and assignment of border-ownership. PLOS ONE, 5, e10705, doi:10.1371/journal.pone.0010705. [CrossRef] [PubMed]
Thomson M. G. A. (2001). Beats, kurtosis, and visual coding. Network: Computation in Neural Systems, 12 (3), 271–287, doi:10.1088/0954-898X/12/3/303. [CrossRef]
Thorpe S. J. Fize D. Marlot C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522, doi:10.1038/381520a0. [CrossRef] [PubMed]
Thorpe S. J. Gegenfurtner K. R. Fabre-Thorpe M. Bülthoff H. H. (2001). Detection of animals in natural images using far peripheral vision. European Journal of Neuroscience, 14, 869–876, doi:10.1046/j.0953-816x.2001.01717.x. [CrossRef] [PubMed]
Torralba A. Oliva A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14, 391–412. [CrossRef]
Ullman S. Vidal-Naquet M. Sali E. (2002). Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5, 682–687, doi:10.1038/nn870. [PubMed]
van der Schaaf A. (1998). Natural image statistics and visual processing. (Unpublished doctoral thesis). Rijksuniversiteit Groningen, The Netherlands.
van der Schaaf A. van Hateren J. H. (1996). Modelling the power spectra of natural images: Statistics and information. Vision Research, 36 (17), 2759–2770, doi:10.1016/0042-6989(96)00002-8. [CrossRef] [PubMed]
VanRullen R. Koch C. (2003). Visual selective behavior can be triggered by a feed-forward process. Journal of Cognitive Neuroscience, 15 (2), 209–217, doi:10.1162/089892903321208141. [CrossRef] [PubMed]
VanRullen R. Thorpe S. J. (2001a). Is it a bird? Is it a plane? Ultra-rapid visual categorisation of natural and artifactual objects. Perception, 30, 655–668, doi:10.1068/p3029. [CrossRef]
VanRullen R. Thorpe S. J. (2001b). The time course of visual processing: From early perception to decision-making. Journal of Cognitive Neuroscience, 13, 454–461, doi:10.1162/08989290152001880. [CrossRef]
Wichmann F. A. Braun D. I. Gegenfurtner K. R. (2006). Phase noise and the classification of natural images. Vision Research, 46, 1520–1529, doi:10.1016/j.visres.2005.11.008. [CrossRef] [PubMed]
Wichmann F. A. Drewes J. Rosas P. Gegenfurtner K. R. (2010). Animal detection in natural scenes: Critical features revisited. Journal of Vision, 10 (4): 6, 1–27, http://www.journalofvision.org/content/10/4/6, doi:10.1167/10.4.6. [PubMed] [Article] [CrossRef] [PubMed]
Footnotes
1  The statement that the statistics in the synthesized images are the same as in the original images holds if the statistics are measured on a torus, where the top and bottom of the image wrap around, as do the left and right sides. This wraparound often produces vertical or horizontal edges in the synthesized images; these arise from differences in pixel intensity between opposite sides of a natural image (e.g., when the top of the image has lighter pixels than the bottom).
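As a concrete illustration of this wrap-around assumption, the sketch below computes a simple neighbor correlation with circular shifts, so that pixels on opposite borders are treated as adjacent; this correlation is only a stand-in for the actual P-S parameters.

```python
import numpy as np

def circular_neighbor_correlation(img: np.ndarray, dx: int = 1, dy: int = 0) -> float:
    """Correlation between each pixel and its neighbor (dx, dy) away, computed
    with wrap-around (np.roll) so the image is treated as a torus. Intensity
    differences between opposite borders enter this statistic, which is one way
    the wrap-around can induce spurious vertical or horizontal edges in textures
    synthesized to match it."""
    shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return float(np.corrcoef(img.ravel(), shifted.ravel())[0, 1])
```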