Using the HMAX model (Riesenhuber & Poggio, 2002), we showed that the earliest stages of the visual system do not respond to control stimuli generated by phase, box, or texture scrambling in the same way as they do to intact images. We found differences for most measures within each layer of the model (S1, C1, C2); the magnitude of the differences was substantial with average neural activity around two to seven times higher for conventionally scrambled stimuli relative to intact stimuli. This indicates that these image sets differ in their basic visual properties and will obstruct our ability to isolate object recognition processes. We conclude conventional scrambling methods make poor controls in experiments that intend to manipulate semantic content.
In contrast, at each layer of the HMAX model, we found that the mean neural signal was the same for the intact and diffeomorphed image sets. Furthermore, a linear discriminant classifier was generally unable to differentiate between intact and diffeomorphed images based on either their pattern or distribution of neural activity in each layer. Moreover, our results are not restricted to the specific levels of scrambling assigned to each scrambling method. Across 20 equally spaced levels of distortion, we found the mean percentage deviation in neural activity relative to intact images in layer C2 remained relatively constant for diffeomorphed images even when we increased the amount of scrambling by 400%. In contrast, the percentage deviation rose sharply for clearly recognizable phase- and box-scrambled images with very little distortion. These results indicate two things: (a) There was little in the visual content of diffeomorphed images that differentiates them from intact images and (b) the basic visual properties are better preserved in diffeomorphed images than even slight amounts of phase and box scrambling in which the content of the image is easily identifiable.
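The deviation measure referred to above can be illustrated with a minimal sketch. The exact formula is not given in this section, so the code assumes one plausible reading: for each image, the absolute difference between the mean activity for the scrambled and intact versions, expressed as a percentage of the intact mean and then averaged over images. The function name `mean_percent_deviation` and the `(n_images, n_units)` array layout are hypothetical.

```python
import numpy as np

def mean_percent_deviation(intact, scrambled):
    """Mean percentage deviation of scrambled from intact activity.

    Both inputs are (n_images, n_units) arrays of model activity, with
    row i of each array belonging to the same source image.
    ASSUMPTION: unsigned deviation, normalized by the intact mean.
    """
    intact = np.asarray(intact, dtype=float)
    scrambled = np.asarray(scrambled, dtype=float)
    intact_mean = intact.mean(axis=1)        # mean activity per intact image
    scrambled_mean = scrambled.mean(axis=1)  # mean activity per scrambled image
    per_image = 100.0 * np.abs(scrambled_mean - intact_mean) / intact_mean
    return per_image.mean()
```

Computing this value once per distortion level would give a curve of the kind described above, one per scrambling method.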
A number of factors contribute to the differences in neural processing for stimuli generated using the three conventional scrambling methods. During box scrambling, images are divided into equally sized segments (sometimes as small as individual pixels), and each box is randomly shuffled to a new location in the image. It was assumed that because box-scrambled images are created from unaltered segments of the original image, they would be visually matched to it. Unfortunately, repositioning segments to different locations in the image produces artificial edges at the borders between discontinuous segments. The result is images that no longer retain their original Gestalt and, even more concerning, contain spatial frequency artifacts contingent on the scrambling resolution. These changes produce different patterns of neural activity (Vogels, 1999) relative to the intact version of the image. Even with attempts to ameliorate the effects of edges by using spatial vignetting (convolving the edges with a linear ramp of 25-pixel width), Rainer, Augath, Trinath, and Logothetis (2002) found that area V1 showed a linear relationship between activity and the amount of box scrambling; the more scrambled an image, the higher the activity (up to the second highest level of scrambling, after which activity dropped precipitously). In extrastriate cortex (V2, V3A, V4), activity for highly box-scrambled and intact images was similar until the highest level of scrambling, at which activity dropped, much like in V1. These findings are consistent with Singh, Smith, and Greenlee (2000), who found that the blood oxygenation level-dependent (BOLD) signal increased in response to gratings that increased from low to medium spatial frequencies.
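The edge artifact described above can be illustrated with a minimal sketch of box scrambling. It is not the code used to generate the stimuli in this study; the function name `box_scramble`, the square tiles, and the smooth ramp test image are assumptions made for illustration.

```python
import numpy as np

def box_scramble(image, box_size, rng=None):
    """Shuffle non-overlapping box_size x box_size tiles of a 2-D image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    if h % box_size or w % box_size:
        raise ValueError("image dimensions must be multiples of box_size")
    rows, cols = h // box_size, w // box_size
    # Cut the image into a (rows * cols, box_size, box_size) stack of tiles.
    tiles = (image.reshape(rows, box_size, cols, box_size)
                  .swapaxes(1, 2)
                  .reshape(rows * cols, box_size, box_size))
    # Assign the tiles to grid positions in random order.
    shuffled = tiles[rng.permutation(rows * cols)]
    # Reassemble the tiles into an image of the original size.
    return (shuffled.reshape(rows, cols, box_size, box_size)
                    .swapaxes(1, 2)
                    .reshape(h, w))

# A smooth luminance ramp (hypothetical test image) has no sharp edges.
ramp = np.add.outer(np.arange(64), np.arange(64)).astype(float)
scrambled = box_scramble(ramp, 8, np.random.default_rng(0))

# Mean absolute difference between horizontally adjacent pixels,
# a crude proxy for edge content.
edges_intact = np.abs(np.diff(ramp, axis=1)).mean()
edges_scrambled = np.abs(np.diff(scrambled, axis=1)).mean()
```

Every pixel value survives the shuffle, yet the adjacent-pixel differences grow because discontinuous tiles now share borders, which is the source of the artificial edges.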
The phase-scrambling method controls better for the spatial frequency content of the image. Intact images are decomposed into their constituent spatial frequencies using a Fourier transform. The phase values are then randomized, and the scrambled versions are reconstructed using an inverse Fourier transform, so that the scrambled images contain the same power spectrum as the corresponding intact versions. However, the visual system (or the HMAX model) is sensitive to image features that result from the smoothly changing, continuous phase variations with frequency that are typical of natural images (Oppenheim & Lim, 1981; Thomson, 1999). Artifacts therefore result from changes to image properties caused by randomized phase spectra, an inherent byproduct of this method. Thomson (1999) demonstrated that intact images contain higher-order statistical properties that are absent in phase-scrambled images, a difference primarily driven by distortions in local phase coherence. In fact, phase spectra have been shown to contain perceptually more important information than power spectra (Oppenheim & Lim, 1981; Thomson, 1999). That is, local phase coherence carries vital information about localized features, including lines, edges, and contours (Morrone & Burr, 1988). The loss of these structural properties can have significant perceptual consequences; the visual system is sensitive to harmonic phase relationships even at the earliest processing stages, such as V1 (Wang & Simoncelli, 2004), which is reflected in an increase in perceptual sensitivity to detecting distortions in phase-scrambled images (Bex, 2010; Kingdom, Field, & Olmos, 2007).
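The phase-scrambling procedure described above can be sketched in a few lines. This is an illustrative version, not the implementation used for the stimuli in this study. The function name `phase_scramble` is hypothetical, and drawing the random phases from the spectrum of a white-noise image (so that the result stays real valued) is one common choice that is assumed here.

```python
import numpy as np

def phase_scramble(image, rng=None):
    """Randomize the Fourier phases of a 2-D image, keeping its amplitudes."""
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.fft2(image)
    # The phase spectrum of real-valued white noise is random but has the
    # conjugate symmetry needed for the inverse transform to be real.
    noise_phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    # Leave the zero-frequency term alone so mean luminance is unchanged.
    noise_phase[0, 0] = 0.0
    # Rotating each component by a random angle changes phase only, so
    # the amplitude (and power) spectrum is identical to the original.
    scrambled = spectrum * np.exp(1j * noise_phase)
    # Discard the negligible imaginary part left by floating-point error.
    return np.fft.ifft2(scrambled).real

# Hypothetical test image: a bright square whose sharp edges depend on
# phase alignment across spatial frequencies.
square = np.zeros((32, 32))
square[8:24, 8:24] = 1.0
scrambled_square = phase_scramble(square, np.random.default_rng(0))
```

The scrambled image has the same amplitude spectrum as the square, but the edges are gone, because edges require the phase coherence that the randomization destroys.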
There have been attempts to improve the phase-scrambling method and remove some of its limitations. For instance, the approach proposed by Dakin et al. (2002) improves second- (contrast) and fourth-order (kurtosis) statistics and avoids overrepresentation of certain phases but leads to nonuniform phase angle steps (Ales et al., 2012). Ales et al. (2012) modified it to produce images with coherent phase but at the cost of randomized amplitude. Future iterations of this method might lead to further improvements, but we believe that spatial domain techniques such as diffeomorphic warping are more readily suited to scrambling without altering distinct visual features.
Of the three conventional scrambling methods tested, it was the neural activity associated with texture-scrambled images that most closely resembled that of intact images. However, even at the earliest stages (S1, C1), differences emerged that became more pronounced at the latest stage in the model (C2). Texture scrambling was originally designed to synthesize visual textures with homogeneous and consistently repeating elements that are ideal for modeling higher-order statistics (Portilla & Simoncelli, 2000) and was only later applied to natural scenes. This scrambling method was never intended for isolated objects, which have different properties from textures and scenes; this may partially explain the subtle differences in neural output. More critically, however, like the other methods, texture scrambling distorts the Gestalt of the image while creating irregularly shaped closed contours. The large differences at C2 may occur because of the grouping discrepancies between the intact and texture-scrambled images. Distorting the grouping properties of objects also produces perceptual consequences at early processing stages. Given the sensitivity of visual area V1 to spatial frequency and the varying object sizes in the image set, the spatial frequency content of larger objects (lower spatial frequency) will be differentially altered relative to smaller objects (higher spatial frequency) by long continuous contours (Rust & DiCarlo, 2010).
Unlike the other scrambling methods, the diffeomorphic transformation did not change basic visual properties. Diffeomorphic transformations are smooth, continuous, and invertible, so the topology was preserved (with no folding), and the process could be reversed to recreate the original intact image. Moreover, the range of spatial frequencies was restricted using a discrete cosine basis, ensuring that high spatial frequency artifacts were not introduced in the scrambled image. The final result is that the early visual system (as modeled by Riesenhuber & Poggio, 2002) processed diffeomorphed images in much the same way as intact images.
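The idea behind the warp can be sketched as follows: a smooth displacement field is built from a few low-frequency cosine components and applied in many small steps. This is a simplified illustration and not the published implementation. The function names, the parameter defaults, the random amplitude and phase distributions, and the hand-written bilinear interpolation are all assumptions, and the sketch does not itself verify that the resulting map is invertible.

```python
import numpy as np

def cosine_flow(size, n_comp, max_shift, rng):
    """Smooth displacement field (in pixels) from low-frequency cosines."""
    y, x = np.mgrid[0:size, 0:size] / size
    field = np.zeros((size, size))
    for ky in range(1, n_comp + 1):
        for kx in range(1, n_comp + 1):
            amp = rng.uniform(0.0, 1.0)
            py, px = rng.uniform(0.0, 2.0 * np.pi, size=2)
            field += amp * np.cos(np.pi * ky * y + py) * np.cos(np.pi * kx * x + px)
    # Rescale so the largest displacement equals max_shift pixels.
    return field * (max_shift / np.abs(field).max())

def bilinear_sample(image, ys, xs):
    """Sample a 2-D array at fractional coordinates, clamped to its borders."""
    h, w = image.shape
    ys, xs = np.clip(ys, 0, h - 1), np.clip(xs, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    fy, fx = ys - y0, xs - x0
    top = image[y0, x0] * (1 - fx) + image[y0, x1] * fx
    bottom = image[y1, x0] * (1 - fx) + image[y1, x1] * fx
    return top * (1 - fy) + bottom * fy

def diffeomorph(image, max_shift=0.5, n_comp=4, n_steps=10, rng=None):
    """Warp a square 2-D image by composing n_steps small smooth shifts."""
    rng = np.random.default_rng() if rng is None else rng
    size = image.shape[0]
    if image.shape != (size, size):
        raise ValueError("this sketch expects a square 2-D image")
    dy = cosine_flow(size, n_comp, max_shift, rng)
    dx = cosine_flow(size, n_comp, max_shift, rng)
    grid_y, grid_x = np.mgrid[0:size, 0:size].astype(float)
    map_y, map_x = grid_y.copy(), grid_x.copy()
    # Compose the coordinate map with the same small step repeatedly, so
    # the image itself is interpolated only once at the end.
    for _ in range(n_steps):
        map_y, map_x = (bilinear_sample(map_y, grid_y + dy, grid_x + dx),
                        bilinear_sample(map_x, grid_y + dy, grid_x + dx))
    return bilinear_sample(image.astype(float), map_y, map_x)
```

Keeping each step small and the cosine components low in frequency is what keeps the warp smooth; the amount of distortion is then controlled by `n_steps` and `max_shift` rather than by introducing sharper displacements.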
This method provides a first step toward overcoming the limitations inherent to the conventional scrambling methods. Although diffeomorphic images are a significant improvement in the design of appropriate control stimuli, generating fully controlled stimuli is limited by our knowledge of the visual system. Preserving all basic visual features of each image would require a complete understanding of the tuning properties of all neurons along the visual pathway, in addition to how they are influenced by connections to other neurons and behavioral objectives (e.g., top-down effects). To date, this has proven to be a considerable challenge. Attempts to find critical features driving neural activity at each processing stage often result in the need to generate idiosyncratic stimuli tailored for each neuron, which makes comparisons along the visual hierarchy untenable (Kobatake & Tanaka, 1994). In fact, scrambled images can be used to directly examine the properties of the visual system (Murray, 2011). For instance, Freeman, Ziemba, Heeger, Simoncelli, and Movshon (2013) generated a model for creating synthesized images that served as visual metamers (perceptually indistinguishable from the intact version) to outline the receptive field sizes and sensitivity at different eccentricities of visual area V2, which predicts visual degradation in the periphery associated with crowding.
It should be noted that conventional image-scrambling methods may be useful in contexts that focus on specific object properties, such as demarcating cortical regions sensitive to the presence of edges (Kovesi, 2003). Or, for example, one may be interested in brain regions that process shape (e.g., lateral occipital complex), in which case it might be helpful to contrast stimuli with a defined shape with those that do not have one. Diffeomorphic stimuli have been designed for a particular scientific question in which visual properties are not the focus of interest.
A complementary approach to examine the representations along the visual pathway is to keep the stimuli constant but change the task requirements. As outlined by Schyns, Gosselin, and Smith (2009), diagnostic features of images and reverse correlation can be used to link brain activity to functional cognitive states. We believe using diffeomorphed images might complement this approach well; the gradual warping of diffeomorphed images allows for the selection of certain diagnostic features (e.g., shape or semantics), which can be used to differentiate confounded neural activity due to correlated visual properties of images within categories (Rousselet, Pernet, Caldara, & Schyns, 2011).
The category-dependent perceptual ratings offer another instance of how diffeomorphed images can help further our understanding of object perception. Some categories were more affected by scrambling than others, suggesting that the visual system relies on differing sets of features to categorize objects. Banno and Saiki (2011) found that humans use higher-order statistics when detecting animals in scenes, suggesting certain higher-order image properties may be more telling of object structure in some categories than in others. Statistically regular features are also important for recognizing objects (Gerhard, Wichmann, & Bethge, 2013). Faces and bikes contain highly regular properties that occur in almost every exemplar (e.g., eyes above a nose above a mouth), whereas a category like fruit contains highly discernible properties that show little consistency across exemplars. Diffeomorphed images can be used to help outline these properties of object perception, which can then be used to guide the selection of better-matched control stimuli. In turn, extracting the mechanisms governing perception at earlier stages of perception can then be used to design better-matched control stimuli to help explicate the mechanisms at the highest perceptual stages.
In conclusion, we demonstrate that the simulated neural signal in response to diffeomorphed images more closely resembled the neural signal associated with intact images relative to the other scrambling methods. Moreover, the advantage of diffeomorphed images over phase and box scrambling persists across many levels of scrambling. This similarity is consistent across distinct stages of the visual hierarchy (modeled by the HMAX model). Because processing at the earliest stages is held constant, differences in neural activity at anterior visual areas cannot be influenced by the properties of the image or the nature of the information feeding in from posterior areas. We suggest that diffeomorphed images serve as better control stimuli and should be used in neuroimaging studies that aim to disentangle early from later visual processing in order to more rigorously examine the neural mechanisms underlying perception, attention, and memory of the real world in later stages of visual processing.