Sensory systems exploit the statistical regularities of natural signals, and thus, a fundamental goal for understanding biological sensory systems, and creating artificial sensory systems, is to characterize the statistical structure of natural signals. Here, we use a simple conditional moment method to measure natural image statistics relevant for three fundamental visual tasks: (i) estimation of missing or occluded image points, (ii) estimation of a high-resolution image from a low-resolution image (“super resolution”), and (iii) estimation of a missing color channel. We use the conditional moment approach because it makes minimal invariance assumptions, can be applied to arbitrarily large sets of training data, and provides (given sufficient training data) the Bayes optimal estimators. The measurements reveal complex but systematic statistical regularities that can be exploited to substantially improve performance in the three tasks over what is possible with some standard image processing methods. Thus, it is likely that these statistics are exploited by the human visual system.

*r*,

*s*,

*t*, and

*u*represent observed values along a row of image pixels and

*x*represents an unobserved value to be estimated. This kind of task arises when some image locations are occluded or a sensor element is missing. A different variant of the task arises when only some of the color values are absent at a location (e.g., the demosaicing task, Brainard, Williams, & Hofer, 2008; Li, Gunturk, & Zhang, 2008, and related tasks, Zhang & Brainard, 2004). The second task is estimation of a high-resolution image from a low-resolution image (Figure 1b). Again, the symbols

*r*,

*s*,

*t*, and

*u*represent observed low spatial resolution values obtained by locally averaging and then downsampling a higher resolution image;

*x*and

*y*represent unobserved values to be estimated. This kind of task arises naturally in interpreting (decoding) retinal responses in the periphery. For example, Figure 1b corresponds approximately to the situation around 1-degree eccentricity in the human retina where the sampling by midget (P) ganglion cells is about one-half that in the fovea (one-fourth the samples per unit area). In the computer vision literature, the goal of this task is referred to as “super resolution” (Freeman, Thouis, Jones, & Pasztor, 2002; Glasner, Bagon, & Irani, 2009; Li & Adelson, 2008). The third task is estimation of a missing color channel given the other two. The symbols

*s*and

*t*represent the observed color values (e.g.,

*G*and

*B*) at a pixel location, and

*x*represents the unobserved value to be estimated (e.g.,

*R*). This is not a natural task, but it reveals the redundancy of color in natural scenes and sets an upper limit on how well a dichromat could estimate his missing color channel without knowledge of the spatial correlations in images or the objects in the scene. Furthermore, statistics measured for this task could be of use in other tasks (e.g., demosaicing). We find that the image statistics for these tasks are complex but regular (smooth), are useful for performing the tasks, and make many testable predictions for visual coding and visual performance.

**z**(e.g., image patch): subtract the mean of the signal (

**z**′ =

**z**−

**z**′ = (

**z**−

**z**′ = (

**z**−

**z**−

**z**′

^{ T }=

**W**(

**z**−

^{ T }). Usually, these normalization steps are applied to simplify the mathematics or reduce the dimensionality of the estimation problem. The assumption is that the normalization preserves the important statistical structure. For example, it is plausible that the statistical structure of image patches is largely independent of their mean luminance, if they are represented as contrast signals by subtracting and dividing by the mean. Similarly, standard image processing algorithms for some of the point prediction tasks considered here implicitly assume that subtracting off the mean luminance of the observed signal preserves the relevant statistical structure. Thus, in both cases, the assumption is effectively that

*p*(

**z**) =

*p*(

**z**′)

*p*(

*m*

_{ R }= 2.214791,

*m*

_{ G }= 1.0, and

*m*

_{ B }= 1.193155 (which are determined by the camera make and model) to obtain a 16-bit RGB image. In the case of gray scale, YPbPr color space conversion was used (Gray = 0.299

*R*+ 0.587

*G*+ 0.114

*B*). The dynamic range was increased by blacking out 2% of the darkest pixels and whiting out 1% of the brightest pixels. Blacking out pixels almost never occurred because in most images 2% of the pixels were already black. On the other hand, whiting out 1% of the brightest pixels has an effect because the image histograms have such long tails on the high end. For some measured statistics, the images were then converted to

*LMS*cone space. Finally, the images were converted to linear 8-bit gray scale or 24-bit color. The images were randomly shuffled and then divided into two groups: 700 training images that were used to generate the tables and 349 test images. All pixels in the training images were used to estimate conditional moments, and thus, the number of training samples was on the order of 10

^{10}.

*n*dimensions can be written as a product of single dimensional conditional probability density functions:

**x**

_{ i−1}= (

*x*

_{1}, …,

*x*

_{ i−1}) and

**x**

_{0}= ∅. Furthermore, the moments of a single dimensional probability distribution function defined over a finite interval (as in the case of image pixel values) are usually sufficient to uniquely specify the density function (e.g., Fisz, 1967). Thus, the full joint probability density function can be characterized by measuring the single dimensional conditional moments:

*k*is the moment (e.g.,

*k*= 1 gives the mean of the posterior probability distribution of

*x*

_{ i }given

**x**

_{ i−1}). As is standard, these moments can be estimated simply by summing the observed values of

*x*

_{ i }

^{ k }for each unique (or quantized) value of the vector

**x**

_{ i−1}and then dividing by the number of times the vector

**x**

_{ i−1}was observed:

**x**

_{ i−1}) is the set of observed values of

*x*

_{ i }for the given value of the vector

**x**

_{ i−1}, and

*N*(

**x**

_{ i−1}) is the number of those observed values.

*MMSE estimation function*for

*x*

_{ i }to be

*x*

_{ i }when the goal is to minimize the mean squared error of the estimate from the true value (e.g., Bishop, 2006). The reliability of these estimates is given by

*x*

_{ i }∣

**x**

_{ i−1}) =

*E*(

*x*

_{ i }

^{2}∣

**x**

_{ i−1}) −

*E*(

*x*

_{ i }∣

**x**

_{ i−1})

^{2}. The focus of most of this paper is on measuring and applying

*f*(

**x**

_{ i−1}) and

*ρ*(

**x**

_{ i−1}).

_{1}be the estimate of

_{2}be the estimate of

*x*given a different set of conditional variables, and let

_{1}and

_{2}be their estimated reliabilities. If the estimates are Gaussian distributed and uncorrelated, then the optimal estimate given the combined conditional variables is

*γ*=

_{1}/ (

_{1}+

_{2}) is the relative reliability of the two estimates (e.g., Oruç, Maloney, & Landy, 2003; Yuille & Bulthoff, 1996). The more general combination rules for the case where the estimates are correlated are known (e.g., see Oruç et al., 2003); however, there are not enough training images to estimate these correlations. Nonetheless, applying Equation 6 can increase the accuracy of the estimates.

*x*given the two flanking values

*s*and

*t*(see Figure 1a). The direct method of measuring the

*k*th moment,

*E*(

*x*

^{ k }∣

*s*,

*t*), is simply to compute the running sum of the values of

*x*

^{ k }for all possible values

*s*,

*t*, using every pixel in the 700 training images (see Equation 3). The result is a 256 × 256 table for each moment, where each table entry is for a particular pair of flanking values. These tables were estimated separately for horizontal and vertical flanking points; however, the tables did not differ systematically and hence were combined. This property held for the other cases described below, and hence, we show here only the combined tables.

*s*and

*t*can swap locations without affecting the statistics. Some sense of what these tables represent can be gained from Figure 2f, which plots normalized histograms of

*x*for the five pairs of flanking values indicated by the dots in Figures 2a–2d. These distributions are approximately unimodal and so are relatively well described by the first four moments.

*x*given a flanking pair of points. The table of conditional second moments can be used to combine separate MMSE estimates (e.g., estimates based on horizontal and vertical flanking points; see later). Maximum

*a posteriori*(MAP) estimates (or estimates based on other cost functions) should be possible using the four moments to estimate the parameters of a suitable family of unimodal density functions. However, the focus in the remainder of this paper is on MMSE estimates.

*x*from just the two nearest flanking values (

*s*,

*t*). The MMSE estimate function

*f*(

*s*,

*t*) is given by the table in Figure 2a. If the joint probability distribution

*p*(

*x*,

*s*,

*t*) were Gaussian, then (given two obvious constraints that apply in our case: when

*s*=

*t*, the optimal estimate is

*s,*and symmetry about the diagonal) the optimal estimate would be the average of the two flanking values:

*f*(

*s*,

*t*) = (

*s*+

*t*) / 2 (see 1). Similarly, if the joint probability distribution

*p*(

*x*,

*s*,

*t*) is Gaussian on log axes, then the optimal estimate would be the geometric average of the two flanking values:

*f*(

*s*,

*t*) =

*f*(

*s*,

*t*) =

*f*(

*t*,

*s*). In other words, the optimal estimates depend on the specific absolute values of

*s*and

*t*, not only on their relative values. The lack of obvious invariance in Figure 3b shows that less direct methods of measuring the image statistics could have missed important structure. For example, suppose each signal were normalized by subtracting the mean of

*s*and

*t*before measuring the conditional first moments, and then the mean was added back in to create a table like Figure 2a. If this procedure did not miss any information, then the difference values plotted in Figure 3b would be constant along every line parallel to the diagonal (see 1). In fact, most interpolation methods (e.g., linear, cubic, and spline methods) ignore the mean signal value and consider only the difference values. Similarly, if normalizing each signal by subtracting and dividing by the mean of

*s*and

*t*before measuring the conditional first moments did not lose information, then the difference values in Figure 3b would be constant along any line passing though the origin (see 1). In other words, these common preprocessing steps would have led us to miss some of the statistical structure shown in Figure 3b. Similar conclusions are reached if the log Gaussian prediction is used as the reference, suggesting that the complex structure and the lack of invariance are not due to the non-Gaussian shape of the marginal distribution of intensity values. As a further test, we quantized the 14-bit raw images into 8-bit images having an exactly Gaussian histogram, and then ran the above analyses. The resulting plots are qualitatively similar to those in Figure 3b.

*x*from all four neighboring values (

*r*,

*s*,

*t*,

*u*). In this case, the goal is to measure the MMSE estimation function:

*s*,

*t*). In other words, we measure 2

^{16}tables of the form

*f*

_{ s,t }(

*r*,

*u*). This set of two-dimensional tables constitutes the full four-dimensional table.

*x*for all possible flanking point values is not practical given the amount of image data we had available. One way to proceed is to lower the resolution of the table for the furthest flanking points. Reducing the resolution is justifiable for two reasons: (i) the furthest flanking points should generally have less predictive power, and (ii) reducing resolution is equivalent to assuming a smoothness constraint, which is plausible given the smoothness of the third-order statistics (plot in Figure 2a). We quantized the values of

*r*and

*u*into 16 bins each:

*q*(·) is a quantization function that maps the values of

*r*and

*u*into 16 integer values between 0 and 255.

*s*,

*t*) corresponding to the white letters in Figure 3b. As in Figure 2a, the color indicates the optimal estimate. (In the plot, gray indicates values of (

*r*,

*s*,

*t*,

*u*) that did not occur in the training set.) Again, the statistical structure is complex but relatively smooth and systematic. Consider the tables for values of (

*s*,

*t*) that are along the diagonal (

*s*=

*t*). When the values of

*r*and

*u*are below

*s*and

*t,*then (

*r*,

*s*,

*t*,

*u*) is consistent with a convex intensity profile and the estimate is above the average of

*s*and

*t*(see Figure 3g). On the other hand, when the values of

*r*and

*u*exceed

*s*and

*t*, then (

*r*,

*s*,

*t*,

*u*) is consistent with a concave intensity profile and the estimate is below the average of

*s*and

*t*(see Figure 3h). Similar behavior is seen in the tables for values of (

*s*,

*t*) that are not along the diagonal, but the tables are less symmetric. These are intuitive patterns of behavior, but again, there are no obvious forms of invariance except for the diagonal symmetry when

*s*=

*t*.

*r*,

*s*,

*t*,

*u*). There is considerable variation in the relative reliability, suggesting that this measure contains useful information. This large variation may be due to the structure of intensity gradients in natural images. As shown in Figure 2b, estimates based on image points that are near equal are more reliable than estimates based on image points that are less equal. However, image points that are relatively parallel to the local intensity gradient are closer to equal than image points that are relatively perpendicular to the local intensity gradient. In the performance demonstrations below, we combine horizontal and vertical estimates (

_{H}and

_{V}) using relative reliability (see Equation 6 in the Methods section).

*x*that are aligned with low-resolution points and those like

*y*that are centered between low-resolution points. Following similar logic as before, the Bayes optimal estimate of

*x*is given by

*E*(

*x*∣

*r*,

*s*,

*t*) and the optimal estimate of

*y*is given by

*E*(

*y*∣

*r*,

*s*,

*t*,

*u*), and thus, our aim is to measure the functions

*f*

_{ s }(

*r*,

*t*) for the aligned points (

*x*) and the functions

*f*

_{ s,t }(

*r*,

*u*) for the between points (

*y*). To measure these functions, we first filtered all of the calibrated images with a 3 × 3 Gaussian kernel and then downsampled them by a factor of two in each direction to obtain a set of low-resolution images together with their known higher resolution “ground-truth” images. As before, we quantized the more distant image points, and thus, we measured

*f*

_{ s }(

*q*(

*r*),

*q*(

*t*)) and

*f*

_{ s,t }(

*q*(

*r*),

*q*(

*u*)), where

*q*(·) is a quantization function.

*x*) are shown in Figures 5b and 5c. The plots show the optimal estimate of

*x*. As can be seen, if the values of

*r*and

*t*are greater than

*s,*then the optimal estimate of

*x*is less than

*s*(see Figure 5d), and vice versa if the values of

*r*and

*t*are less than

*s*(see Figure 5e). The tables measured for the between points (

*y*) are similar to those in Figures 3c–3f. The tables plotted in Figure 5 are intuitively reasonable and reflect both the structure of the natural images and the effect of the blur kernel.

*x*) is shown by the red curve in Figure 4a. The relative reliability is concentrated around 0.5, suggesting that this measure contains less useful information than in the case of the missing image point task. Indeed, we found that more useful information is contained in the recursive conditional first moments (see Equation 7 in the Methods section). Figure 4b plots the estimate of

*x*given

_{H}and

_{V}minus the average of the two estimates, for a relative reliability (

*γ*) of 0.5. Again, the statistics are complex but regular. If the estimates nearly agree, then the best estimate is near the average of the two estimates. If the estimates do not agree, then the best estimate is higher than the average when the average is big but lower than the average when the average is small.

*R*,

*G*, and

*B*sensors in the camera, and the other where the three channels are the

*L*,

*M*, and

*S*cones in the human retina. In other words, we measured the following statistics:

*E*(

*R*∣

*G*,

*B*),

*E*(

*G*∣

*R*,

*B*),

*E*(

*B*∣

*R*,

*G*),

*E*(

*L*∣

*M*,

*S*),

*E*(

*M*∣

*L*,

*S*), and

*E*(

*S*∣

*L*,

*M*). Recall that these first conditional moments are the Bayes optimal estimates when the goal is to minimize the mean squared error.

*f*(

*r*,

*s*,

*t*,

*u*) in both directions and combining the two estimates using the measured relative reliability. This is what we call the

*biprior*estimator. We also estimated the value of each pixel using the standard bilinear estimator. Estimation accuracy is quantified as the peak signal-to-noise ratio, PSNR = 10log(255

^{2}/MSE), where MSE is the mean squared error between the estimated and true values. Thus, a 50% decrease in MSE corresponds to a 3-db increase in PSNR. Figure 7a plots for each test image the difference in the PSNR of the two estimators as a function of the PSNR of the bilinear estimator. All blue points above zero indicate images for which the biprior method was more accurate. In general, the biprior method performs better. The increases in performance are substantial (3.7 db on average) and demonstrate the potential value of the natural image statistics in estimating missing or occluded points.

*f*(

*s*,

*t*) in both directions and combining with relative reliability), the increase in performance over bilinear is 0.9 dB. This shows that the complex statistical structure shown in Figures 2 and 3a is useful, but it also shows that the additional flanking pixels make a substantial contribution to performance.

*r*,

*s*,

*t*,

*u*). None of these performed well. Next, we explored one- and two-dimensional kernels obtained using a recent method (accuracy maximization analysis) that finds kernels that are optimized for specific identification and estimation tasks (Geisler, Najemnik, & Ing, 2009). In the one-dimensional case, we found that the AMA kernels were essentially equivalent to (covered approximately the same subspace as) the four kernels corresponding to the separate 4 pixels in the horizontal or vertical direction and did not produce a significant improvement in performance. In the two-dimensional case, we found that the four AMA kernels performed much better than the estimates based on only one direction but did not perform better than the combined vertical and horizontal estimates. Finally, we tried combining the horizontal and vertical estimates with the full recursive table (Equation 7) and found only a very slight increase (less than 1%) in performance over using Equation 6. While none of these analyses is definitive, together they suggest that the biprior estimator makes near optimal use of the local image structure in natural images for the 2D missing image point task.

*x*points having 256 levels (8 bits) each for

*r*,

*s*, and

*t*. The horizontal and vertical estimates for the

*x*points were then combined using the recursive table (see Equation 7 and Figure 4b). These estimated

*x*points were then used as the four flanking points to estimate the

*y*points. One final point in each 3 × 3 block (a

*z*point not shown) was estimated by combining vertical and horizontal estimates from the flanking estimated

*y*points, using the recursive table. The results are shown in Figure 7b. The biprior estimator performs substantially better than the standard methods on all the test images, demonstrating the potential value of the measured natural image statistics in decoding photoreceptor (or ganglion cell) responses. As in the case of estimating the missing point, all components of the biprior estimator make a substantial contribution to performance. However, unlike missing point estimation, using only the reliability to combine horizontal and vertical estimates (Equation 6) did not work as well as using the recursive table, presumably because of the smaller variation in relative reliability (red curve in Figure 4a).

*L**

*a**

*b** uniform color space. The average increase in PSNR of the optimal estimate over that of linear regression is as follows for each color channel: (

*R*,

*G*,

*B*) = (1.2 db, 0.59 db, 0.93 db) and (

*L*,

*M*,

*S*) = (1.3 db, 0.91 db, 2.3 db). Note that linear regression incorporates all the pairwise correlations between the color channels (i.e., the full covariance matrix). Also, note that there are many locations in the color spaces of Figure 6 for which there is no data (the gray areas), and thus, one might wonder if performance would be better if the spaces were filled in better. In fact, this is not an issue. For the random set of test images, there were very few occasions when an empty location was encountered (less than 1 pixel per two images).

*R*,

*G*,

*B*) = (62%, 26%, 12%) and (

*L*,

*M*,

*S*) = (78%, 22%, 0%). Figure 7d plots the difference in PSNR between the optimal estimator and the linear estimator (linear regression) as a function of the PSNR for the linear estimator, for the channel that was easiest to estimate. (Usually, the channel that was easiest to estimate was the same for the optimal and linear estimators, but in a small percentage of cases, the best channel was different.) The upper panel plots accuracy for a missing

*R*,

*G*, or

*B*channel and the lower panel for a missing

*L*,

*M*, or

*S*channel. Points above the horizontal line indicate images where the optimal estimate was more accurate than the linear regression estimate.

*r*and

*u*are below

*s*and

*t*in

*r*,

*s*,

*t*,

*u*space (see Figure 3), observers are predicted to estimate gray levels above the average of

*s*and

*t*, and the opposite when

*r*and

*u*are above

*s*and

*t*. Further, the uncertainty in the observers' estimates (e.g., psychometric function slopes) should be greater in those parts of the space where the standard deviations of the conditional moment distributions are greater (see Figure 2b).

*L*,

*M*, and

*S*cones, as well as the prior probability distribution over natural images. They assumed that the prior is Gaussian and separable over space and color. The present statistics show that the prior over color is not Gaussian and that color estimation is more accurate if the joint statistics are taken into account. Thus, the present results may lead to somewhat different predictions for estimation experiments like those of Hofer et al. (2005).

^{10}and much larger training sets are possible. However, measuring the conditional moments is not feasible once the signal dimensionality becomes sufficiently high (although the dimensionality can be extended somewhat by measuring recursive conditional moments that take into account reliability; see Equations 6 and 7 in the Methods section). In high dimensions, parametric models are the most viable approach. Nonetheless, even when the dimensionality is high, conditional moments measured in lower dimensions may be useful for constraining parameters and testing the assumed forms of invariance in the parametric models. An important two-dimensional example of this was the discovery that the unsigned magnitudes of nearby orthogonal wavelet values are positively correlated, even when their signed values are uncorrelated (Buccigrossi & Simoncelli, 1999).

*p*(

**x**

_{ a }∣

**x**

_{ b }) has a mean value that is a linear function of

**x**

_{ b }and a covariance that is independent of the value of

**x**

_{ b }. Therefore, in the case of interpolation between two points, the optimal estimate is given by

*s*=

*t*, we always find (in agreement with intuition) that

*w*

_{0}= 0. Second, we observe (in agreement with intuition) symmetry about the diagonal (see Figure 1b), which requires that

*w*

_{1}=

*w*

_{2}=

_{opt}=

_{opt}=

_{opt}=

*f*(

*s*,

*t*) =

*E*(

*x*∣

*s*,

*t*) (see Figures 3a and 3b).

*s*

_{ i },

*x*

_{ i },

*t*

_{ i }) be an arbitrary original training signal and let

*m*

_{ i }be the local mean (

*s*

_{ i }+

*t*

_{ i }) / 2 that is subtracted off. The transformed values of

*s*

_{ i },

*x*

_{ i }, and

*t*

_{ i }are given by

*s*

_{ j },

*t*

_{ j }) for which (

*s*′

_{ i },

*t*′

_{ i }) = (

*s*′

_{ j },

*t*′

_{ j })? For any given (

*s*

_{ i },

*t*

_{ i }), the (

*s*

_{ j },

*t*

_{ j }) must satisfy the following two equations:

_{opt}(

*s*,

*t*) in Figure 3b would have a constant value along any line parallel to the central diagonal, which does not hold except for the central diagonal itself. In other words, applying the preprocessing step of subtracting the mean would miss much of the important statistical structure shown in Figure 3b. The same result holds if the subtracted mean is (

*s*

_{ i }+

*x*

_{ i }+

*t*

_{ i }) / 3.

*s*

_{ j },

*t*

_{ j }) for which (

*s*′

_{ i },

*t*′

_{ i }) = (

*s*′

_{ j },

*t*′

_{ j })? For any given (

*s*

_{ i },

*t*

_{ i }), the (

*s*

_{ j },

*t*

_{ j }) must satisfy the following two equations:

_{opt}(

*s*,

*t*) in Figure 3b would have a constant value along any line passing through the origin, which does not hold except for the central diagonal itself. Thus, applying this preprocessing step would also miss much of the important statistical structure. A similar result holds if the subtracted mean is (

*s*

_{ i }+

*x*

_{ i }+

*t*

_{ i }) / 3.