**Abstract**

Visual systems learn through evolution and experience over the lifespan to exploit the statistical structure of natural images when performing visual tasks. Understanding which aspects of this statistical structure are incorporated into the human nervous system is a fundamental goal in vision science. To address this goal, we measured human ability to estimate the intensity of missing image pixels in natural images. Human estimation accuracy is compared with various simple heuristics (e.g., local mean) and with optimal observers that have nearly complete knowledge of the local statistical structure of natural images. Human estimates are more accurate than those of simple heuristics, and they match the performance of an optimal observer that knows the local statistical structure of relative intensities (contrasts). This optimal observer predicts the detailed pattern of human estimation errors, and hence the results place strong constraints on the underlying neural mechanisms. However, humans do not reach the performance of an optimal observer that knows the local statistical structure of the absolute intensities, which reflect both local relative intensities and local mean intensity. As predicted from a statistical analysis of natural images, human estimation accuracy is negligibly improved by expanding the context from a local patch to the whole image. Our results demonstrate that the human visual system efficiently exploits the statistical structure of natural images.

Let *z* represent the true (unknown) value of the missing pixel, and let **c** represent the context of surrounding pixel values. The optimal estimate is given by the standard formula from Bayesian statistical decision theory:

*ẑ*_opt = arg min_*ẑ* ∫ *γ*(*z*, *ẑ*) *p*(*z* | **c**) *dz*  (1)

where *γ*(*z*, *ẑ*) is the cost of making the estimate *ẑ* when the true value is *z*, and *p*(*z* | **c**) is the posterior probability that the true value is *z* given the observed context. For present purposes we assume the cost function is the squared error between the true value and the estimated value, *γ*(*z*, *ẑ*) = (*z* − *ẑ*)². For this cost function it is well known (e.g., Bishop, 2006) that the optimal estimate is the conditional mean of the posterior probability distribution (the so-called minimum mean squared error [MMSE] estimate):

*ẑ*_opt = *E*(*z* | **c**)  (2)
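The MMSE property can be checked numerically: for any discrete context, the conditional mean achieves a squared error no larger than any other context-dependent estimator. A minimal sketch with synthetic (not natural-image) data, comparing it against the conditional median:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: a discretized context c and a noisy, skewed center value z
# that depends on c.  (Illustrative only -- not the paper's image data.)
n = 200_000
c = rng.integers(0, 8, size=n)                   # discretized context
z = c + rng.gamma(shape=2.0, scale=1.0, size=n)  # skewed conditional dist.

# MMSE estimator: the conditional mean E(z | c).
cond_mean = np.array([z[c == k].mean() for k in range(8)])
# A competing estimator: the conditional median.
cond_median = np.array([np.median(z[c == k]) for k in range(8)])

mse_mean = np.mean((z - cond_mean[c]) ** 2)
mse_median = np.mean((z - cond_median[c]) ** 2)

# Under squared-error loss the conditional mean is optimal, so its MSE
# is never larger than the conditional median's.
print(mse_mean <= mse_median)  # True
```

Because the conditional distribution is skewed, the mean and median differ, and the squared-error advantage of the conditional mean is strict.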

The context vector in the horizontal direction for the pixel at location (*x*, *y*) was **c** = [*z*(*x* − 2, *y*), *z*(*x* − 1, *y*), *z*(*x* + 1, *y*), *z*(*x* + 2, *y*)], and the context vector in the vertical direction was **c**^⊥ = [*z*(*x*, *y* − 2), *z*(*x*, *y* − 1), *z*(*x*, *y* + 1), *z*(*x*, *y* + 2)]. The optimal estimates for these two contexts are *E*(*z* | **c**) and *E*(*z* | **c**^⊥), and the combined estimate is given by

*ẑ*_opt = [*ρ* *E*(*z* | **c**) + *ρ*^⊥ *E*(*z* | **c**^⊥) − *ρ*₀*u*] / (*ρ* + *ρ*^⊥ − *ρ*₀)  (3)

where *ρ* = 1/*Var*(*z* | **c**), *ρ*^⊥ = 1/*Var*(*z* | **c**^⊥), *ρ*₀ = 1/*Var*(*z*), and *u* = *E*(*z*). Equation 3 specifies the Bayesian optimal combination rule when the two contexts (**c** and **c**^⊥), conditioned on the true value *z*(*x*, *y*), are statistically independent and Gaussian distributed. When the variance of the prior is infinite (*ρ*₀ = 0), then Equation 3 reduces to the standard cue combination formula (Oruc, Maloney, & Landy, 2003).
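A minimal sketch of this reliability-weighted combination rule, assuming the standard Gaussian form in which reliabilities are inverse variances and the prior's reliability is subtracted once (because it is counted in both single-context posteriors):

```python
import numpy as np

def combine(z_h, var_h, z_v, var_v, prior_mean, prior_var):
    """Bayesian combination of two conditionally independent Gaussian
    cues with a Gaussian prior.  rho_0 is the prior's reliability; it is
    subtracted once because both posteriors already include the prior."""
    rho_h, rho_v = 1.0 / var_h, 1.0 / var_v
    rho_0 = 1.0 / prior_var
    return (rho_h * z_h + rho_v * z_v - rho_0 * prior_mean) / (rho_h + rho_v - rho_0)

# With an infinite-variance (flat) prior, rho_0 = 0 and the rule reduces
# to the standard reliability-weighted cue combination.
z = combine(100.0, 4.0, 120.0, 16.0, prior_mean=128.0, prior_var=np.inf)
print(z)  # (0.25*100 + 0.0625*120) / 0.3125 = 104.0
```

Note that the more reliable horizontal cue (variance 4 vs. 16) pulls the combined estimate toward 100.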

We refer to this as the *LumOpt8* observer, since it uses eight luminance (gray-level) values. We also consider the *LumOpt4* observer, which uses only the two neighboring luminance values in each direction.

Contrast images were obtained by expressing each pixel's luminance relative to the local mean, *z̄*(*x*, *y*), the average value of *z* in the 3 × 3 neighborhood centered on (*x*, *y*). The *ConOpt8* and *ConOpt4* observers are defined exactly as above, but for contrast images rather than luminance images; the context vectors now consist of the contrast-image values. We note that the local statistical structure of natural images changes with the local mean luminance, and hence estimates based on the statistics of contrast images will generally be less accurate than those based on the statistics of luminance images (see Geisler & Perry, 2011).
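A minimal sketch of the luminance-to-contrast conversion, assuming Weber-style contrast (subtract off and divide by the 3 × 3 local mean; the paper's exact normalization may differ):

```python
import numpy as np

def contrast_image(z):
    """Convert a luminance image to a contrast image: each pixel's
    deviation from the local mean, normalized by that mean.  The local
    mean is the 3 x 3 neighborhood average (edge-padded)."""
    z = z.astype(float)
    h, w = z.shape
    padded = np.pad(z, 1, mode="edge")
    local_mean = np.zeros_like(z)
    for dy in range(3):            # sum the nine shifted copies
        for dx in range(3):
            local_mean += padded[dy:dy + h, dx:dx + w]
    local_mean /= 9.0
    return (z - local_mean) / local_mean

img = np.array([[100, 100, 100],
                [100, 190, 100],
                [100, 100, 100]])
c = contrast_image(img)
print(round(c[1, 1], 3))  # center 190 vs. local mean 110 -> 0.727
```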

To convert the optimal contrast estimates into gray-level estimates, the gray level of the center pixel (*z*) was varied from 0 to 255. At each gray value of the center pixel, we calculated the contrast of the center pixel *z*_C as well as the optimal prediction *ẑ*_C. (As *z* is varied, the values of the context vectors may also vary, because the local average, *z̄*(*x*, *y*), for some context pixels includes the center pixel being estimated.) The gray level of the central pixel at which *z*_C most nearly equals *ẑ*_C was taken as the final estimate. Unlike for the *LumOpt8* observer, we find that *ConOpt8* and *ConOpt4* performance is slightly better when the two estimates are averaged rather than combined with relative reliability. Below we report the performance of the contrast observers based on averaging.
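The sweep over candidate gray levels can be sketched as follows; `predict_contrast` stands in for the optimal contrast predictor, which is replaced here by a hypothetical averaging rule, and the Weber-style contrast definition is an assumption:

```python
import numpy as np

def estimate_gray(patch, predict_contrast):
    """Sweep the unknown center gray level of a 5 x 5 patch over 0..255.
    For each candidate g, recompute the contrast of the center and of the
    four nearest context pixels (whose 3 x 3 local means include the
    center), and return the g at which the center's contrast most nearly
    equals the predicted contrast."""
    best_g, best_err = 0, np.inf
    for g in range(256):
        z = patch.astype(float)
        z[2, 2] = g

        def contrast(y, x):
            m = z[y - 1:y + 2, x - 1:x + 2].mean()  # 3 x 3 local mean
            return (z[y, x] - m) / m

        ctx = [contrast(2, 1), contrast(2, 3), contrast(1, 2), contrast(3, 2)]
        err = abs(contrast(2, 2) - predict_contrast(ctx))
        if err < best_err:
            best_g, best_err = g, err
    return best_g

# Stand-in predictor: the average of the context contrasts (hypothetical).
patch = np.full((5, 5), 120.0)
g = estimate_gray(patch, lambda ctx: float(np.mean(ctx)))
print(g)  # a uniform 120 patch is best explained by a center of 120
```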

For comparison, we also evaluated several linear model observers: *LumMlr8* and *ConMlr8* (least squares linear estimators based on the same eight pixels as *LumOpt8* and *ConOpt8*), and *LumMlr4* and *ConMlr4* (least squares linear estimators based on the same four pixels as *LumOpt4* and *ConOpt4*). These linear models were also trained on the natural images. Finally, we considered several simpler model observers: *Mean8* (the average of the surrounding eight pixels), *Mean24* (the average of the surrounding 24 pixels), *Median4* (the median of the four nearest pixels), *Median8* (the median of the surrounding eight pixels), and *Median24* (the median of the surrounding 24 pixels). We also consider a *NoContext* observer, which has no knowledge of the spatial context of the central pixel and therefore uses only the prior on gray levels in natural images to estimate the missing pixel value.
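The simpler heuristic observers can be written down directly. This sketch assumes 5 × 5 patches with the center pixel missing:

```python
import numpy as np

def simple_observers(patch):
    """Heuristic estimates of the missing center pixel of a 5 x 5 patch
    (the stored center value is ignored)."""
    z = patch.astype(float)
    ring8 = np.delete(z[1:4, 1:4].flatten(), 4)   # surrounding 8 pixels
    ring24 = np.delete(z.flatten(), 12)           # surrounding 24 pixels
    nearest4 = np.array([z[1, 2], z[3, 2], z[2, 1], z[2, 3]])
    return {
        "Mean8": ring8.mean(),
        "Mean24": ring24.mean(),
        "Median4": np.median(nearest4),
        "Median8": np.median(ring8),
        "Median24": np.median(ring24),
    }

patch = np.arange(25).reshape(5, 5)
est = simple_observers(patch)
print(est["Mean24"])  # mean of 0..24 without the center (12) -> 12.0
```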

Each image pixel in the presented patches subtended 4 arc min of visual angle (4 × 4 display pixels). This size was picked so that the individual test pixel was clearly visible, yet the image appeared relatively smooth and continuous.

The central plot of Figure 4 shows the optimal estimates, *ẑ*_opt, for the missing pixel *z*(*x*, *y*) given the two neighboring horizontal pixels, *z*(*x* − 1, *y*) and *z*(*x* + 1, *y*). In the figure, *s* and *t* represent the neighboring pixels' values. The horizontal and vertical axes give the 8-bit gray values of *s* and *t*, and the color scale gives the optimal estimate. As expected, swapping the values of *s* and *t* does not change the optimal estimate, and hence the plot is symmetric about the diagonal.

If the optimal estimate were simply the average of *s* and *t*, then the contours of constant color in this plot would be straight lines with a slope of −1.0 (Geisler & Perry, 2011). As can be seen, there are substantial systematic deviations from the simple average, and the deviations are in different directions in different regions of *s*–*t* space. Therefore, it is important to sample from the different regions of the space; at the same time, we want to sample from regions of the space that are not extremely rare. The white points in the central plot of Figure 4 show the values of *s* and *t* from which the samples were drawn. The values of *s* and *t* along the diagonal are the pairs that occur most frequently. The values of *s* and *t* off the diagonal occur less frequently than those along the diagonal, but with equal frequency to one another.

The outer plots of Figure 4 show the optimal estimates given four context pixels (*r*, *s*, *t*, *u*). These plots show that for fixed values of *s* and *t*, the optimal estimate can vary dramatically, depending on the specific values of the more distant pixels *r* and *u*. Therefore, to tile the range of natural image patches, and to test whether the visual system incorporates the statistical structure revealed in Figure 4, we selected patches whose values of *r*, *s*, *t*, and *u* fell within the open circles of the outer plots. Two patches were randomly selected from each circle.

To obtain confidence intervals on the psychometric functions, simulated responses were drawn at each test level from a binomial distribution with *p* given by the measured value (proportion of "brighter" responses) at that level and *n* equal to 30. These randomly drawn points were then refit with a cumulative Gaussian using a maximum likelihood procedure. This was repeated 10,000 times, and the resulting distributions for the mean and standard deviation of the cumulative Gaussian were used to generate 95% confidence intervals (±2*σ*).
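A scaled-down sketch of this parametric bootstrap, with far fewer repetitions and a grid-search maximum-likelihood fit standing in for the paper's fitting procedure; the test levels and measured proportions below are hypothetical:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def cum_gauss(x, mu, sigma):
    """Cumulative Gaussian psychometric function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ml_fit(levels, k, n, mus, sigmas):
    """Maximum-likelihood cumulative-Gaussian fit by grid search
    (a simple stand-in for a full ML optimizer)."""
    best, best_ll = (mus[0], sigmas[0]), -np.inf
    for mu in mus:
        for sigma in sigmas:
            p = np.clip([cum_gauss(x, mu, sigma) for x in levels], 1e-6, 1 - 1e-6)
            ll = np.sum(k * np.log(p) + (n - k) * np.log(1.0 - p))
            if ll > best_ll:
                best, best_ll = (mu, sigma), ll
    return best

levels = np.array([90.0, 100.0, 110.0, 120.0, 130.0])  # hypothetical test levels
p_obs = np.array([0.05, 0.20, 0.55, 0.85, 0.97])       # measured "brighter" proportions
n = 30                                                 # trials per level
mus, sigmas = np.arange(95, 126, 2), np.arange(4, 21, 2)

# Parametric bootstrap: resample binomial counts at each level, refit,
# and summarize the distribution of the fitted mean (100 repetitions
# here; 10,000 in the paper).
boot_mu = []
for _ in range(100):
    k = rng.binomial(n, p_obs)
    mu, sigma = ml_fit(levels, k, n, mus, sigmas)
    boot_mu.append(mu)
boot_mu = np.array(boot_mu)
ci = (boot_mu.mean() - 2 * boot_mu.std(), boot_mu.mean() + 2 * boot_mu.std())
print(ci)
```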

The prediction error between two observers is taken here as the mean squared difference between their estimates, PE = (1/*n*) Σ_*i* [*ẑ*_A(*i*) − *ẑ*_B(*i*)]², where *ẑ*_A(*i*) and *ẑ*_B(*i*) are the two observers' estimates for the *i*th patch. The average prediction error (PE) between pairs of observers is 87, which is small relative to their MSEs. In other words, the error between observers' estimates is much smaller than the error between the observers' estimates and the true values.
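Assuming PE is the mean squared difference between two observers' estimates (consistent with the units of the MSE column in Table 1), the two error measures can be computed as:

```python
import numpy as np

def mse(est, true_vals):
    """Mean squared error between an observer's estimates and the true
    missing-pixel values."""
    return np.mean((np.asarray(est) - np.asarray(true_vals)) ** 2)

def prediction_error(est_a, est_b):
    """Mean squared difference between two observers' estimates on the
    same patches (the PE measure)."""
    return np.mean((np.asarray(est_a) - np.asarray(est_b)) ** 2)

# Hypothetical estimates for three patches: the two observers agree with
# each other more closely than either agrees with the truth, so PE << MSE.
true_vals = [100, 120, 140]
human = [110, 118, 150]
model = [108, 121, 147]
print(mse(human, true_vals), prediction_error(human, model))  # 68.0 7.33...
```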

The estimates of the *LumOpt8* observer are substantially more accurate than those of the human observers (MSE = 92 vs. MSE = 215). This indicates that there is substantial statistical structure in natural images that the human visual system does not exploit efficiently. On the other hand, the *Mean24* observer performs far worse than humans (MSE = 1814 vs. MSE = 215). The multiple linear regression observer that uses the four nearest pixels (*LumMlr4*) also performs worse than humans (MSE = 363). The model that best matches overall human estimation accuracy is the *ConOpt4* observer (MSE = 203). The natural image statistics upon which the *ConOpt4* observer is based are shown in Figure 3b.

First, the MSE of the *LumOpt8* observer is much lower than that of the *ConOpt8* observer. This implies that there is considerable useful information contained in the absolute gray levels that is not contained in the relative gray levels (see also Geisler & Perry, 2011). Second, the MSE of the *LumOpt4* observer is higher than that of both the *ConOpt8* and *ConOpt4* observers, which are similar to each other. Presumably, this occurs because the *ConOpt4* observer's estimates incorporate pixel values over a larger area than the *LumOpt4* observer's. Third, the MSEs of the observers based on the local median and the local mean are similar to each other and much higher than the MSE of the human observers. The MSEs of the *Median4* and *LumMlr4* observers are similar; this is expected, since *LumMlr4* (in this case) is similar to the mean of the four nearest pixels. Fourth, the *ConMlr* observers perform better than the *ConOpt* observers. This unexpected result occurs because the model observers are optimized over the entire training set of natural image patches: on both the training set and the test set, each of which consisted of many millions of patches, the *ConOpt* observers perform substantially better than the *ConMlr* observers. Thus, the reversal holds only for the specific set of 62 patches in the experiment.

| Observer | MSE | PE |
| --- | --- | --- |
| NoContext | 8,897 | 7,717 |
| LumOpt8 | 92 | 107 |
| LumOpt4 | 297 | 95 |
| ConOpt8 | 164 | 48 |
| ConOpt4 | 203 | 34 |
| LumMlr8 | 111 | 144 |
| LumMlr4 | 363 | 87 |
| ConMlr8 | 129 | 57 |
| ConMlr4 | 160 | 41 |
| Mean24 | 1,811 | 1,024 |
| Mean8 | 590 | 186 |
| Median24 | 1,727 | 1,051 |
| Median8 | 580 | 256 |
| Median4 | 343 | 85 |
| Human | 215 | — |

The prediction error of the *ConOpt4* observer is the smallest (Table 1). Interestingly, the prediction errors of all the *ConOpt* and *ConMlr* observers are substantially lower than those of the other models. This result suggests that for any randomly chosen natural image patch, the local contrast-image statistics of natural images predict (with good accuracy) both the magnitude and sign of human estimation errors in the pixel estimation task.

These estimates were essentially as accurate as those of the *LumOpt8* observer on the 62 test patches (MSE = 91 vs. MSE = 92), but were 4% less accurate on five million randomly selected test patches (MSE = 14.51 vs. MSE = 13.97).

1/*f* amplitude spectra of natural images (Deriugin, 1956; Field, 1987).

There is a *ConMlr* model (outside the family of models in Table 1) that predicts human errors slightly more accurately than the *ConOpt4* model in Table 1.

**References**

*Annual Review of Neuroscience*, 25, 339–379.

*Current problems in animal behavior* (pp. 331–360). Cambridge, UK: Cambridge University Press.

*Proceedings of SPIE Human Vision & Electronic Imaging XII*, 6492, 1–12.

*Pattern recognition and machine learning*. New York: Springer.

*Vision Research*, 33, 105–116.

*Journal of Vision*, 8(5):15, 1–23, http://www.journalofvision.org/content/8/5/15, doi:10.1167/8.5.15.

*American Journal of Psychology*, 66, 20–32.

*Journal of Neuroscience*, 30, 7269–7280.

*Telecommunications*, 1(7), 1–12.

*Journal of the Optical Society of America*, 4, 2379–2394.

*Vision Research*, 23, 173–193.

*Journal of the Optical Society of America*, 20, 1283–1291.

*Nature Neuroscience*, 16, 974–981.

*Annual Review of Psychology*, 59, 167–192.

*Visual Neuroscience*, 26, 109–121.

*Journal of Vision*, 11(12):14, 1–7, http://www.journalofvision.org/content/11/12/14, doi:10.1167/11.12.14.

*Vision Research*, 41, 711–724.

*PLoS Computational Biology*, 9(1), e1002873, doi:10.1371/journal.pcbi.1002873.

*Journal of Vision*, 5(5):5, 444–454, http://www.journalofvision.org/content/5/5/5, doi:10.1167/5.5.5.

*Journal of Vision*, 10(4):10, 1–19, http://www.journalofvision.org/content/10/4/10, doi:10.1167/10.4.10.

*Cognitive Psychology*, 23, 141–221.

*Journal of the Optical Society of America*, 4(12), 2395–2400.

*Zeitschrift für Naturforschung C*, 36, 910–912.

*Journal of the Optical Society of America*, 3, 1673–1683.

*Vision Research*, 43, 2451–2468.

*Annual Review of Neuroscience*, 24, 1193–1216.

*Proceedings of the National Academy of Sciences, USA*, 102, 939–944.

*Vision Research*, 21, 1341–1356.