The decomposition of visual scenes into elements described by orientation and spatial frequency is well documented in the early cortical visual system. How such 2nd-order elements are sewn together to create perceptual objects such as corners and intersections remains relatively unexplored. The current study combines information theory with structured deterministic patterns to gain insight into how complex (*higher-order*) image features are encoded. To more fully probe these mechanisms, many subjects (*N* = 24) and stimuli were employed. The detection of complex image structure was studied under conditions of learning and attentive versus preattentive visual scrutiny. Strong correlations (*R*^{2} > 0.8, *P* < 0.0001) were found between a particular family of spatially biased measures of image information and human sensitivity to a large range of visual structures. The results point to computational and spatial limitations of such encoding. Of the extremely large set of complex spatial interactions that are possible, the small subset perceivable by humans was found to be dominated by those occurring along sets of one or more narrow parallel lines. Within such spatial domains, the number of pieces of visual information (pixel values) that may be simultaneously considered is limited to a maximum of 10 points. Learning and processes involved in attentive scrutiny do little if anything to increase the dimensionality of this system.

*ternary* (composed of 3 contrast values, −1, 0, 1) textures generated by simple arithmetic rules applied recursively to initially random patterns. Single examples from each ensemble are given in Figure 1. Importantly, even for small collections of these patterns (>10) the average third-order (and lower) correlation functions (3CFs) of each ensemble are not significantly different from zero (Maddess et al., 2007). This is also the case for uniformly distributed noise patterns (pixel values assigned randomly to −1, 0, 1 with equal probability). This means the isotrigon textures are completely isotropic when considering measures that are third-order and below (see Methods). To discriminate such ensembles from each other and from noise, one must therefore learn ensemble-specific higher-order features (Victor, 1994; Victor et al., 1995). The majority of neurons in primate V1 have been shown to be sensitive to structure defined at fourth order and above when stimuli that are able to quantify this have been used (Purpura, Victor, & Katz, 1994). In the present case, the average probability of correctly differentiating each of the ensembles used here from noise textures provides a basis for relating particular measures of image information to processes in the visual system. Of central concern is how and to what extent a large range of complex spatial structures (defined by 4th and higher spatial correlations) are encoded. As such, the present investigation lies in the domain of form perception rather than texture processing, where, for example, the spatial integration of simple element properties such as orientation (Beck, Sutter, & Ivry, 1987; Field, Hayes, & Hess, 1993; Landy & Bergen, 1991) or periodicity (von der Heydt, Peterhans, & Dürsteler, 1992) is more of interest.

^{9} = 19683, while for isotrigon ensembles there are between 3 and 81 times fewer observable cliques (Maddess et al., 2007). In this regard, these textures are more natural than noise, having similarly low dimensionality to natural images (Chandler & Field, 2007). The present ensembles also share a third-order property (evenly distributed bispectral “energy”) that is highly characteristic of regions of natural images that are fixated upon by human observers (Krieger, Rentschler, Hauske, Schill, & Zetzsche, 2000). Given a much lower number of possible combinations of pixel values, it might be expected that all ensembles should appear more structured than noise patterns. Despite such statistical structure, psychophysics reveals that many ensembles are not differentiable from random patterns (Maddess & Nagai, 2001; Maddess et al., 2007). This suggests that some higher-order features are more readily detected than others.

*glider* (Figure 1, left column) is passed over each pixel of an initially *ternary* (composed of three luminance values, −1, 0, 1) evenly distributed random pattern (Maddess & Nagai, 2001; Maddess et al., 2004, 2007; Victor & Conte, 1991). As each randomly assigned pixel falls under the output pixel of the glider, its value may be changed depending on the values of the pixels falling under the input pixels and one of a set of rules governing how inputs are combined to determine an output value. The particular rules ensure the higher order properties of the textures (Maddess et al., 2007). The present texture types, also referred to as ensembles, are generated by 5 gliders (Figure 1) and 5 isotrigon rules M_{0}, M_{1}, I_{0}, I_{1}, and I_{2} (Maddess et al., 2007). Texture ensembles generated by the same glider and different rules share some important relationships (Maddess et al., 2004).

*N* = 19683) occur within this set. An additional reason for employing the present ensembles is that they have been studied previously and therefore facilitate direct comparison between the present work and other studies.

*isotrigon* and is described by the 3rd order correlation function C_{3,f(h1,v1,h2,v2)} (3CF) for an image *I*(*x, y*) comprised of *N* pixels each with an area of 1

*x, y*) and two others at horizontally (*h*) and vertically (*v*) shifted locations. The 3CF is thus the third order analogue of the second order correlation function (2CF), the Fourier transform of which is the power spectrum of *I*(*x, y*). As explained in detail below, the mean 3CF of isotrigon ensembles is everywhere zero (as with evenly distributed noise patterns) (Maddess et al., 2004, 2007). To reiterate, this equality implies that only fourth and higher-order information can be used to identify a pattern as belonging to a particular ensemble or noise.
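The third-order correlation described above can be illustrated with a short sketch. It estimates a single 3CF value as the mean product of three pixels: one at (*x, y*) and two at shifted locations. The 1/*N* normalisation, the wraparound at image borders, and all function names are assumptions made for illustration; they are not taken from the original analysis code.

```python
import random


def third_order_correlation(img, h1, v1, h2, v2):
    """Mean product of pixels at (x, y), (x+h1, y+v1) and (x+h2, y+v2).

    Assumes periodic (wraparound) borders and unit pixel area, so the
    sum of triple products is simply divided by the pixel count N.
    """
    rows, cols = len(img), len(img[0])
    total = 0
    for y in range(rows):
        for x in range(cols):
            a = img[y][x]
            b = img[(y + v1) % rows][(x + h1) % cols]
            c = img[(y + v2) % rows][(x + h2) % cols]
            total += a * b * c
    return total / (rows * cols)


def ternary_noise(rows, cols, rng):
    """Evenly distributed ternary noise: each pixel is -1, 0 or 1."""
    return [[rng.choice((-1, 0, 1)) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
noise = ternary_noise(64, 64, rng)
# One 3CF value: right-hand neighbour and lower neighbour of each pixel.
c3 = third_order_correlation(noise, 1, 0, 0, 1)
print(f"3CF estimate for ternary noise: {c3:.4f}")
```

For evenly distributed noise the estimate should lie close to zero and shrink further when averaged over several examples, which is the property the isotrigon ensembles share.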

*t*-values computed for each of the 2560 (512 × 5) coefficients having a mean of 0.71 ± 0.16 *SD*. Our brains do not have the spatial frequency resolution of these spectra. To provide a more realistic model, power spectra were calculated using channels that covered the same central frequencies and had bandwidths of 0.8 cpd. The coefficients are presented as *t*-statistics for each region (Maddess et al., 2007) as shown in the rightmost column of Figure 2. The mean *t*-statistic for horizontal and vertical coefficients for the 80 texture examples was 0.71 ± 0.15 *SD* with a maximum value of 1.15; in other words, no coefficient was significantly different from 0. Note also that the larger *t*-values tend to originate from where the power spectrum is smallest. Even if the coefficients were significant, they do not show a preponderance of horizontal or vertical elements, i.e., they are isotropic for this second order measure.

*t*-values for sets of 3 spectra (Figure 3). We examined 300 such sets of *t*-values, the mean being presented in Figure 3. Here the *t*-values are somewhat larger (but recall that for *N* = 3 larger *t*-values are required to reach significance), the mean spectrum remaining isotropic. The same was true for amplitude spectra rather than the power spectra. Hence, even for small collections there is no mean orientation bias. One could possibly look for significant horizontal or vertical components in pairs of different frequency bands in particular examples; however, the action of comparing power_{1} AND power_{2} is formally fourth order. Hence, while by definition there may be an orientation bias at fourth order or above, there is none below that, even for these most anisotropic-looking textures.

^{−2}. Testing was conducted in a darkened room in which ambient light was provided by the display monitor. Using a chin rest, subjects were required to fixate on a small dot in the center of the screen viewed binocularly at a distance of 60 cm. All software was written in Matlab (Matlab; The MathWorks, Natick, MA).

*N* pixel configurations (*words*) is equi-probable (normal for these textures) (*p*(*x*_{i}) = 1/*N*), the equation for information entropy is simplified to

*H*(*X*∣*Y*)) (Cover & Thomas, 1991). The shapes of samplers, including their number of input pixels (Figure 4), were varied and the measured ensemble information compared with obtained psychometric functions. The sampling process is illustrated in Figure 5 where a sampler is placed on an example from the M_{0} oblong ensemble (see Figure 1). The pixel combination, or word, is then recorded and the process repeated by shifting the sampler along to adjacent pixels. This process continues until the sampler has covered the entire texture; another example from the same ensemble is then analyzed in the same fashion. Multiple examples are sampled in this manner, ending when all words within an ensemble have been identified. The final result is the number of unique words (*N*) occurring in a given ensemble and consequently the entropy of the ensemble (Equation 2).
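The sampling procedure described above can be sketched as follows. A sampler is represented as a list of pixel offsets, slid over every position of each texture example, and the distinct words are pooled across examples. With equiprobable words, Shannon entropy reduces to the logarithm of the word count; the base-2 logarithm, the representation of a sampler as non-negative `(dy, dx)` offsets, and all names here are assumptions made for illustration rather than the original Matlab implementation.

```python
import math
import random


def collect_words(texture, offsets):
    """Slide a sampler over a texture and return the set of words seen.

    `offsets` is a list of non-negative (dy, dx) pairs giving the pixels
    the sampler reads relative to its top-left anchor position.
    """
    rows, cols = len(texture), len(texture[0])
    max_dy = max(dy for dy, _ in offsets)
    max_dx = max(dx for _, dx in offsets)
    words = set()
    for y in range(rows - max_dy):
        for x in range(cols - max_dx):
            words.add(tuple(texture[y + dy][x + dx] for dy, dx in offsets))
    return words


def ensemble_entropy(textures, offsets):
    """Pool unique words over all examples; return (N, log2 N)."""
    words = set()
    for texture in textures:
        words |= collect_words(texture, offsets)
    return len(words), math.log2(len(words))


# Illustration on ternary noise with a 2 x 2 sampler (3**4 = 81 possible words).
rng = random.Random(1)
noise_examples = [
    [[rng.choice((-1, 0, 1)) for _ in range(32)] for _ in range(32)]
    for _ in range(8)
]
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
n_words, entropy_bits = ensemble_entropy(noise_examples, square)
print(n_words, round(entropy_bits, 2))
```

For noise the word count should approach the full set of possible words for the sampler, whereas a structured ensemble would be expected to yield fewer words and therefore lower entropy.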

*X* may contain every word in *Y*, although such words may constitute only a small fraction of the total observable in *X* (i.e., *X* has higher complexity). Both these factors are captured by a simple dimensionless measure (*S*(*X, Y*)) of the similarity between pairs of ensembles. The measure gives the ratio of the words shared between *X* and *Y* and the total number of unique words found in both *X* and *Y*:

$$S(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

*S*(*X, Y*) is the cardinality (number of words) of the intersection of *X* and *Y* (∣*X* ∩ *Y*∣) divided by the cardinality (∣ ∣) of the union of *X* and *Y* (∣*X* ∪ *Y*∣). Where no words are shared ∣*X* ∩ *Y*∣ = 0, and therefore *S*(*X, Y*) = 0. Conversely, where both ensembles share all their words *S*(*X, Y*) = 1. Therefore, in terms of shared words, a value close to 1 denotes high similarity between ensembles, while a value close to zero denotes little similarity. If a particular sampler perfectly captures the range over which local interactions are taken then *S*(*X, Y*) = 1, and it should be impossible to discriminate between examples of *X* and *Y*.
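The similarity measure just described is the ratio of shared words to all unique words, which can be computed directly on sets. The sketch below assumes each ensemble's words have already been collected into a Python set of tuples; the function name and the toy word sets are hypothetical and serve only to show the limiting cases discussed in the text.

```python
def word_similarity(words_x, words_y):
    """S(X, Y): size of the intersection divided by size of the union."""
    union = words_x | words_y
    if not union:
        raise ValueError("at least one ensemble must contain a word")
    return len(words_x & words_y) / len(union)


# Toy word sets for a hypothetical two-pixel sampler.
x = {(-1, -1), (-1, 0), (0, 1), (1, 1)}
y = {(-1, -1), (-1, 0)}   # every word of y also occurs in x
z = {(1, -1), (0, 0)}     # shares no word with x

print(word_similarity(x, x))  # identical word sets
print(word_similarity(x, y))  # y is a subset covering half of x
print(word_similarity(x, z))  # disjoint word sets
```

The middle case corresponds to the situation described above, in which one ensemble contains every word of the other but is more complex, so the similarity falls well below 1.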

^{25} possible words in a 5 × 5 clique) only a subset of the total number were tested. For sample domain shapes and sizes ranging from lines of 4 pixels to 6^{2} pixel matrices, this group included vertical, diagonal, and horizontal bar patterns, checker board patterns, and all 36 orthonormal matrices of 6 × 6 Walsh/Hadamard functions (e.g., Figure 4, bottom right). The four input sampler in the second row and first column corresponds to an often explored type of easily perceived higher-order structure (e.g., Beason-Held et al., 1998; Victor & Conte, 2005). As this set reflects a diverse range of different configurations, at least one should be somewhat similar to the true shape of the human higher-order spatial bias if it exists; if so, further “tuning” of the search is possible by focusing on variants of the successful samplers.

_{1}) produce more discriminable textures.

_{1} Box textures) or there is no discernible structure (e.g., I_{1} Oblong) and therefore there is nothing to learn. It is interesting that for some ensembles performance might decrease with conscious scrutiny (Figure 6B; I_{2} ZigZag, I_{0} Box), although this was not significant when multiple comparisons were taken into account. Also notable are ensembles where it appears that average performance is below chance (e.g., M_{0} Cross).

*R*^{2} = 0.81, *P* < 0.001) (Figure 7A right). Experienced (Figure 7B, gray) preattentive data are best matched by entropy measures based on pairs of short horizontal strips (*R*^{2} = 0.81, *P* < 0.001). The attentive scrutiny data (Figure 7C, gray) are most correlated (*R*^{2} = 0.78, *P* < 0.001) with measures based on samplers comprised of three strips containing a total of 10 pixels (though almost the same correlation may be reached with 9) (Figure 7C, black). Using similarly shaped samplers containing more pixels (not shown) significantly decreased the fit to psychometric data. The high similarity between the attentive and preattentive PFs meant that the best matching samplers for each were also good models of the alternative data set. In general, for a sampler to generate a function that was reasonably correlated with a PF, it needed to contain a horizontal or vertical strip-like domain (Figure 8) comprised of more than 4 pixels. Changing the pixel distance between bars of input pixels within samplers decreased the similarity between measured information and the attentive PF (not shown). A critical reason strip-like samplers performed so well (Figures 7A, 7B, 7C, and 8) is that, as with human observers, they failed to differentiate a number of ensembles from random (most notably Figure 7A). A reason for employing almost the full range of ensembles was to examine not only the type of visual structure that is detectable, but also that which lies outside detection.

*R*^{2} = 0.52, *P* = 0.07).

*S*(*X, Y*) was employed to quantify such pairwise relationships. Figure 9 presents *S*(*X, Y*) values (between 0 and 1) for all pairs of ensembles calculated using the sampler that best matched the experienced attentive data (see Figure 7C). Darker pixels reflect lower values and therefore lower degrees of measured similarity. For example, pixels along the diagonal in Figure 9 are white because they give the similarity between each ensemble and itself. Other white domains in Figure 9 mostly correspond to high entropy ensembles (e.g., I_{0} and I_{1} Cross, Oblong, and Zig Zag ensembles), that is, those containing all words that are possible (in this case 59049) (see Figure 7C). Such textures appear unstructured (Figure 6) and therefore cannot be discriminated from one another, that is, they appear similar to each other and to noise patterns.

_{1} Box ensemble is predicted to most closely resemble the M_{1} Corners and Oblong ensembles (Figure 9). Six examples from each of these 3 ensembles are given in Figure 10. It can be seen that the I_{1} Box (Figure 10A) and Oblong (10B) examples share features such as rectangular domains of a single contrast value. The M_{1} Corners ensemble is predicted (Figure 9) to be highly similar to the M_{1} Oblong ensemble. The relevant examples (B and C) in Figure 10 appear to share some features in common, and this similarity was studied psychophysically.

_{1} Oblong and Corners ensembles. It was found that while the task was difficult, subjects can still discriminate between ensembles. Although the mean probability of correctly identifying an ensemble was low (mean probability = 0.6, standard deviation = 0.13), a *t*-test revealed that discrimination was significantly above chance level (*P* = 0.03). Nonetheless, the task is significantly harder than discriminating between M_{1} Oblong examples and noise (*P* < 0.001). This suggests that some of the visual structures that differentiate the M_{1} Oblong ensembles from noise cannot be used to differentiate between M_{1} Oblong and Corners ensembles. It might therefore be that most of the perceived structure lies within horizontal parallel domains.

_{1} Zig Zag and Cross ensembles should appear similar under attentive viewing conditions. Figure 11 gives six examples from these two ensembles. Perhaps the most noticeable similarity between the two ensembles is that both contain structures orientated at 45°. It is interesting that obliquely orientated structure may be captured by a sampler comprised of horizontal segments (Figure 4iii).

*K*th-order correlations within an *N* pixel image. To have a complete *K*th-order description, such a system would have to cast its input into a feature space. A feature space has (*N* + *K* − 1)! / (*K*!(*N* − 1)!) dimensions described by a basis set comprised of *K*th-order products reflecting the order of the correlations being considered (Schölkopf, Smola, & Müller, 1998). For example, a second order description (e.g., a power spectrum) is based on all possible second-order products between each pixel in an image. For a two pixel image {a, b}, the space would be three dimensional with a basis {a^{2}, ab, b^{2}}. With increasing *N* or *K*, a “combinatorial explosion” quickly ensues; for example, a complete 4th order description of a four pixel image would require a 35 dimensional space. Although the complexity of computing such correlations may be reduced using kernel methods (Schölkopf et al., 1998), the task remains extremely complex.
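The dimensionality formula above is the number of ways of choosing *K* pixels from *N* with repetition, so the two worked examples in the text can be checked numerically. The following is a minimal sketch; the function names are hypothetical, and the basis is listed as strings of pixel labels (so `aa` stands for a^{2}).

```python
from itertools import combinations_with_replacement
from math import comb


def feature_space_dim(n_pixels, order):
    """(N + K - 1)! / (K! (N - 1)!): count of distinct K-th order products."""
    return comb(n_pixels + order - 1, order)


def monomial_basis(pixel_names, order):
    """List every K-th order product of the named pixels, without duplicates."""
    return ["".join(term) for term in combinations_with_replacement(pixel_names, order)]


print(feature_space_dim(2, 2), monomial_basis("ab", 2))  # two-pixel, 2nd order
print(feature_space_dim(4, 4))                            # four-pixel, 4th order
print(feature_space_dim(10, 4))                           # a 10-pixel sampler, 4th order
```

Even a sampler of 10 pixels, the upper limit suggested by the present data, already implies several hundred fourth-order products, which gives a sense of how quickly the space grows.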

_{1} Oblong and Corners examples are almost indistinguishable, as predicted by the number of words that are shared between them.

*R* > 0.7) functions (e.g., Figure 8). Such consistency would be highly unlikely if the match between entropy functions and data were simply the result of spurious correlations. Moreover, the present findings are supported by other empirical work and have some theoretical appeal (Victor & Conte, 1989, 1991, 1996). Successful models of V1 visual evoked potential (VEP) responses to interchanges between random and isodipole/isotrigon stimuli share critical features with the current findings (Victor & Conte, 1989, 1991). Such models are comprised of two nonlinear stages in which rectified responses of a number of linear high-pass filters (e.g., Gabor or edge) arrayed along a line are combined (accelerating non-linearity) to generate a local response to higher-order structure. Both the number of filters and the spatial extent over which their responses are combined are consistent with our information theoretic investigation.