**Abstract**
A hierarchical definition of optical variability is proposed that links physical magnitudes to visual saliency and yields a more reductionist interpretation than previous approaches. This definition is shown to be grounded in the classical efficient coding hypothesis. Moreover, we propose that a major goal of contextual adaptation mechanisms is to ensure the invariance of the behavior that the contribution of an image point to optical variability elicits in the visual system. This hypothesis and the necessary assumptions are tested through comparison with human fixations and state-of-the-art approaches to saliency on three open-access eye-tracking datasets, including one devoted to images with faces, as well as in a novel experiment using hyperspectral representations of surface reflectance. The results on faces yield a significant reduction of the potential strength of semantic influences compared to previous works. The results on hyperspectral images support the assumptions made to estimate optical variability. The proposed approach also explains quantitative results related to a visual illusion observed for images of corners, which does not involve eye movements.

$\mathbf{X} = (x_{1}, \ldots, x_{M})$ the original representation (with $M$ components), $\mathbf{Y} = (y_{1}, \ldots, y_{M})$ the whitened representation, and the respective covariance matrices with, in general, $x_{ij} \neq 0$, while this whitening transformation can be expressed as a matrix product where $\mathbf{W}$ is usually referred to as the unmixing matrix.

$\mathbf{Y}$ vector associated with that sample. That is, by $\Vert \mathbf{Y} \Vert^{2} = \sum_{i}^{M} y_{i}^{2}$, $M$ being the number of components.

$M_{\lambda}$, of possible spatial frequency radii, $M_{\rho}$, and of possible spatial frequency angles, $M_{\alpha}$, with a certain bandwidth on each dimension. In addition, only a finite number of image points acting as samples can be considered. We therefore assume the corresponding approximations and replace integrals with sums in the equations derived in the appendix.

$\mathbf{X}$ original vector in Expression 3 and has $M = M_{\lambda} \times M_{\rho} \times M_{\alpha}$ components.

$\mathbf{X}$ and $\mathbf{Y}$ in Equation 3 would have this number of components, making the rank of the unmixing matrix $\mathbf{W}$ also over a thousand. Typical whitening schemes have a complexity that is cubic or higher in the number of components, while they are linear in the number of samples. That is, a feasible whitening scheme should trade off the number of components involved against the redundancy reduction achieved, so as to keep complexity low while still achieving high performance.

*window* is usually employed to refer to a given limited portion of the electromagnetic spectrum. It is also widely used to refer to spatial limits in works on optics and computer vision. Hence, it denotes limits in the transmission and reception of optical and visual information from a given domain. Here the term is extended to the reception of information from the environment by the brain, through the capture and representation of images by the visual system. It therefore refers to the limited domain of optical magnitudes that the HVS (or any other visual system) is able to sense due to different factors. The limits, discretizations, and thresholds imposed on those magnitudes will constrain any visual transfer function.

$M_{\lambda}$ the number of discrete values of spectral wavelengths, $\mathbf{W}_{c}$ the chromatic whitening unmixing matrix, and $\lambda_{i}'$ a given whitened spectral wavelength, the idea is to compute the transformation that is a coordinate transformation in the spectral domain from an original chromatic representation $\mathbf{f}' = (i_{\lambda_{1}}, \cdots, i_{\lambda_{M_{\lambda}}})$ to a whitened one $\mathbf{f}'' = (i_{\lambda_{1}'}, \cdots, i_{\lambda_{M_{\lambda}}'})$. Besides, similar to Equation 22 of the appendix, we have that the image intensity at each point is the sum of chromatic intensities

$\mathbf{f}'$ may be regarded as a spectral decomposition of the image at a given point. In turn, the squared norm in the whitened representation is the statistical distance, or Hotelling's $T^{2}$, which is in fact a multivariate measure of variance. Since the samples are the pixel values, each point has a $T^{2}$ value that gives its contribution to variance across the ensemble of samples. It is hence a measure of the pixel's contribution to the variance of chromatic spectral components on the image plane.
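As a rough illustration of the statement above, the following Python sketch computes a per-sample $T^{2}$ as the squared norm of a whitened vector. It is not the implementation used in the experiments: the function names are hypothetical, the three-component "pixels" are synthetic, and Cholesky whitening is swapped in for brevity. The swap should be harmless here, because any transform that turns the covariance into the identity yields the same squared norm.

```python
import math
import random


def mean_and_covariance(samples):
    """Return the mean vector and population covariance matrix of the samples."""
    n, m = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(m)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
            for j in range(m)] for i in range(m)]
    return mean, cov


def cholesky(cov):
    """Lower-triangular L with L * L^T = cov (cov must be positive definite)."""
    m = len(cov)
    low = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            acc = cov[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def whiten(sample, mean, low):
    """Solve L * y = (sample - mean) by forward substitution."""
    y = []
    for i in range(len(sample)):
        acc = sample[i] - mean[i] - sum(low[i][k] * y[k] for k in range(i))
        y.append(acc / low[i][i])
    return y


def t_squared(samples):
    """Squared norm of each whitened sample, i.e. its Hotelling T^2."""
    mean, cov = mean_and_covariance(samples)
    low = cholesky(cov)
    return [sum(v * v for v in whiten(s, mean, low)) for s in samples]


# Synthetic data (assumption): 200 similar "pixels" plus one chromatically distinct pixel.
random.seed(0)
pixels = [[random.gauss(0.5, 0.1), random.gauss(0.4, 0.1), random.gauss(0.3, 0.1)]
          for _ in range(200)]
pixels.append([0.9, 0.1, 0.9])

scores = t_squared(pixels)
most_distinct = scores.index(max(scores))  # index of the pixel with the largest T^2
mean_score = sum(scores) / len(scores)     # average T^2 over the ensemble
```

With the population covariance, the average $T^{2}$ over all samples equals the number of components, which is one way to read each value as a per-sample share of the total variance.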

$M_{\rho} \times M_{\alpha}$ components $\mathbf{f}_{j}''' = (i_{\lambda_{j}'; \rho_{1}, \alpha_{1}}, \cdots, i_{\lambda_{j}'; \rho_{M_{\rho}}, \alpha_{M_{\alpha}}})$. Each of these representations of whitened components can be further whitened, using as original coordinates those of the spatial frequency bands. Instead of such an approach, a simplification is adopted here. Whitening is proposed for each set of spatial frequency bands at a given spatial frequency angle, which reduces the number of components involved in whitening to $M_{\rho}$ (i.e., the number of scales). Therefore, the rank of every unmixing matrix $\mathbf{W}_{jl}$ is reduced to a maximum of $M_{\rho}$. In exchange, we have as many transformations as the product of the number of chromatic components by the number of orientations, that is, $M_{\lambda} \times M_{\alpha}$ parallel transformations.

$T^{2}$ of Hotelling of the original components. It is an approximation that arises from the summation of the $T^{2}$ values obtained for different subsets of original coordinates. It is worth noting that the approximations adopted did not reduce the effectiveness in explaining visual behavior in the experiments described below.
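To make the trade-off behind this simplification concrete, the sketch below compares the relative cost of whitening all components jointly with that of one chromatic whitening followed by $M_{\lambda} \times M_{\alpha}$ parallel whitenings over scales. Several things here are assumptions rather than values from the text: the cost model is taken as exactly cubic with constants ignored, the function names are hypothetical, and the sizes (3 chromatic components, 7 scales, 4 orientations) are purely illustrative.

```python
def cubic_cost(n_components):
    """Relative cost of whitening n components, assuming exactly cubic scaling."""
    return n_components ** 3


def compare_schemes(m_lambda, m_rho, m_alpha):
    """Compare joint whitening with chromatic whitening plus per-orientation whitening."""
    joint_components = m_lambda * m_rho * m_alpha
    parallel_transforms = m_lambda * m_alpha  # one per chromatic component and orientation
    return {
        "joint_components": joint_components,
        "joint_cost": cubic_cost(joint_components),
        "parallel_transforms": parallel_transforms,
        "max_rank_per_transform": m_rho,
        # one chromatic whitening plus the parallel whitenings over scales
        "hierarchical_cost": cubic_cost(m_lambda) + parallel_transforms * cubic_cost(m_rho),
    }


# Hypothetical sizes: 3 chromatic components, 7 scales, 4 orientations.
summary = compare_schemes(m_lambda=3, m_rho=7, m_alpha=4)
```

Under these assumptions the hierarchical scheme is cheaper by roughly two orders of magnitude, at the price of ignoring correlations across orientations and between chromatic components after the first stage.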

($r$, $g$, $b$) components instead of the narrow spectral components $(\lambda_{1}, \cdots, \lambda_{M_{\lambda}})$. We can apply exactly the same whitening schemes proposed above and take the resulting norm at each point in the image as a measure of relative variability or distinctiveness. The implications of this approximation are examined in an experiment involving hyperspectral images in the visible spectrum, where results using narrow spectral components and responses to broad detectors are compared and analyzed.

*coding catastrophe*. A correlate of this catastrophe can be found in the perceptual adaptation underlying a variety of visual illusions.

*does not address why the coding catastrophe occurs because it lacks specification as to the computational goal beyond representation; rather, it embraces it without further question* (Schwartz et al., 2007).

$\mathbf{Z}$ is a basis of features that results from transforming the input $\mathbf{X}$ through PCA, the corresponding covariance matrix is diagonal. Dividing each component of $\mathbf{Z}$ by the square root of the corresponding eigenvalue, that is, by doing $y_{i} = z_{i}/\sqrt{\lambda_{i}}$, the elements of the diagonal become unity. Consequently, the covariance matrix for the resulting $\mathbf{Y}$ coordinates also becomes the identity matrix, satisfying Expression 2. Thus, the overall variance of the ensemble of samples (i.e., all the pixels) is unity for each of the transformed components. We also tried ICA for whitening, but even in the cases most favorable to ICA the results were only equivalent. Whitened principal components were therefore chosen because of both their lower computational cost and their slightly better performance.
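The whitening step just described can be sketched in a few lines. The example below is a minimal illustration restricted to two components so that the eigendecomposition has a closed form. The function names and the toy data are assumptions, and a practical implementation would use a numerical linear algebra library for an arbitrary number of components.

```python
import math


def covariance_2d(points):
    """Population variances and covariance (var_1, cov_12, var_2) of 2-D points."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    var_1 = sum((p[0] - m1) ** 2 for p in points) / n
    cov_12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / n
    var_2 = sum((p[1] - m2) ** 2 for p in points) / n
    return var_1, cov_12, var_2


def pca_whiten_2d(points):
    """Whitened principal components of 2-D points (non-degenerate covariance assumed)."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    a, b, c = covariance_2d(points)
    # Closed-form eigendecomposition of the symmetric matrix [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2.0 * b, a - c)  # orientation of the first principal axis
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_gap = math.hypot((a - c) / 2.0, b)
    lam_1 = (a + c) / 2.0 + half_gap          # largest eigenvalue
    lam_2 = (a + c) / 2.0 - half_gap          # smallest eigenvalue
    whitened = []
    for x1, x2 in points:
        d1, d2 = x1 - m1, x2 - m2
        z1 = cos_t * d1 + sin_t * d2          # principal components (decorrelated)
        z2 = -sin_t * d1 + cos_t * d2
        # y_i = z_i / sqrt(lambda_i)
        whitened.append((z1 / math.sqrt(lam_1), z2 / math.sqrt(lam_2)))
    return whitened


# Deterministic, correlated toy data standing in for two feature responses.
points = [(t, 0.8 * t + 0.3 * math.sin(5.0 * t)) for t in (i / 20.0 for i in range(100))]
whitened = pca_whiten_2d(points)
var_1, cov_12, var_2 = covariance_2d(whitened)  # close to (1, 0, 1)
```

After the transformation, both variances are one and the covariance vanishes up to rounding, which is the identity covariance that the text refers to.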

*AUC* = 0.7156 for the dataset of Bruce and Tsotsos and *AUC* = 0.6462 for the dataset of Kootstra et al., again with a standard error of 0.0008. Therefore, the overall shift observed in the results for models appears to reflect an equivalent shift in human consistency.
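For readers unfamiliar with the measure, the sketch below shows one common way to obtain an AUC value from a saliency map and a set of fixations: the probability that a fixated location receives a higher saliency value than a non-fixated one, with ties counted as one half. This passage does not describe the exact procedure behind the reported values (for instance, how non-fixated locations are chosen), so the function names, the choice of negatives, and the toy map are all assumptions.

```python
def roc_auc(positive_scores, negative_scores):
    """Probability that a positive outranks a negative (ties count as 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def saliency_auc(saliency_map, fixations):
    """AUC of a saliency map, using every non-fixated pixel as a negative."""
    fixated = set(fixations)
    rows, cols = len(saliency_map), len(saliency_map[0])
    positives = [saliency_map[r][c] for (r, c) in fixated]
    negatives = [saliency_map[r][c]
                 for r in range(rows) for c in range(cols)
                 if (r, c) not in fixated]
    return roc_auc(positives, negatives)


# Toy 3 x 3 saliency map and two hypothetical sets of fixations (row, column).
toy_map = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.4],
    [0.1, 0.8, 0.2],
]
auc_good = saliency_auc(toy_map, [(1, 1), (2, 1)])   # fixations on the two maxima
auc_mixed = saliency_auc(toy_map, [(0, 0), (1, 1)])  # one fixation on a minimum
```

A value of 0.5 corresponds to chance and 1.0 to a map that ranks every fixated location above every non-fixated one.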

*be explained by their low-level features alone*. In other words, they suggested that faces introduced a strong influence of relevance able to drive early fixations.

| Model | AUC | SE |
|---|---|---|
| Itti | 0.6522 | 0.0007 |
| Itti + faces | 0.7051 | 0.0005 |
| AWS | 0.7188 | 0.0006 |
| AWS + faces | 0.7568 | 0.0005 |

$S = S_{\text{center}} / \sum S_{\text{image}}$. Taking raw values without normalizing did not yield better results in any case.

*attractiveness* mostly supported by the structural and chromatic singularity of faces in a natural environment, with little need of attractiveness for humans or any other kind of relevance, in agreement with previous psychophysical results that reported that *the ability to rapidly saccade to faces in natural scenes depends, at least in part, on low-level information* (Honey, Kirchner, & VanRullen, 2008). It is worth noting that those previous results do not take chromatic features into account, but only the spatial frequency content of an achromatic representation.

*Vision Research*, 33(1), 123–129.

*Psychological Review*, 61(3), 183–193.

*Possible principles underlying the transformation of sensory messages*. In *Sensory communication*. Cambridge, MA: MIT Press.

*The Computing Neuron* (pp. 54–72). Boston, MA: Addison-Wesley.

*Advances in Neural Information Processing Systems: Vol. 18. Conference on Neural Information Processing Systems* (p. 155). Cambridge, MA: MIT Press.

*Journal of Vision*, 9(3):5, 1–24, http://www.journalofvision.org/9/3/5, doi:10.1167/9.3.5.

*Journal of Vision*, 9(12):10, 1–15, http://www.journalofvision.org/9/12/10, doi:10.1167/9.12.10.

*Vision Research*, 47(25), 3125–3131.

*Science*, 327(5965), 584.

*Trends in Cognitive Sciences*, 10(8), 382–390.

*Visual Neuroscience*, 21(3), 331–336.

*Journal of Vision*, 8(2):6, 1–17, http://www.journalofvision.org/8/2/6, doi:10.1167/8.2.6.

*Journal of Vision*, 8(7):13, 1–18, http://www.journalofvision.org/content/8/7/13, doi:10.1167/8.7.13.

*Lecture Notes in Computer Science: Vol. 5807. Advanced Concepts for Intelligent Vision Systems* (pp. 343–354).

*Introduction to Fourier optics*. Greenwood Village, CO: Roberts & Company Publishers.

*Advances in Neural Information Processing Systems: Vol. 19. Conference on Neural Information Processing Systems* (p. 545). Cambridge, MA: MIT Press.

*Journal of Vision*, 8(12):9, 1–13, http://www.journalofvision.org/8/12/9, doi:10.1167/8.12.9.

*Advances in Neural Information Processing Systems: Vol. 21. Conference on Neural Information Processing Systems* (pp. 681–688). Red Hook, NY: Curran Associates, Inc.

*Network: Computation in Neural Systems*, 11(3), 191–210.

*The Journal of Physiology*, 195(1), 215.

*Vision Research*, 49(10), 1295–1306.

*Vision Research*, 40(10–12), 1489–1506.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(11), 1254–1259.

*Journal of Neurophysiology*, 97(5), 3155.

*Proceedings of the British Machine Vision Conference* (pp. 1115–1125).

*Proceedings of the 31st Annual Conference of the Cognitive Science Society (CogSci09)*, July 29-August 1, 2009. Amsterdam, the Netherlands.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 28(5), 802–817.

*Vision Research*, 42(17), 2095–2103.

*Current Biology*, 19(6), R247–R248.

*Vision Research*, 41(25–26), 3597–3611.

*Applied Optics*, 47(20), 3574–3584.

*Nature*, 381(6583), 607–609.

*Vision Research*, 42(1), 107–123.

*Neuron*, 64(5), 605–616.

*Journal of Vision*, 7(14):16, 1–20, http://www.journalofvision.org/7/14/16, doi:10.1167/7.14.16.

*Fundamentals of photonics*. Hoboken, NJ: John Wiley & Sons.

*Nature Reviews Neuroscience*, 8(7), 522–535.

*Journal of Vision*, 9(12):15, 1–27, http://www.journalofvision.org/9/12/15/, doi:10.1167/9.12.15.

*Journal of Vision*, 11(5):5, 1–23, http://www.journalofvision.org/11/5/5/, doi:10.1167/11.5.5.

*Vision Research*, 45(5), 643–659.

*Psychological Review*, 95(1), 15–48.

*Vision's first steps: Anatomy, physiology, and perception in the retina, lateral geniculate nucleus, and early visual cortical areas*. In Dagnelie (Ed.), *Visual prosthetics: Physiology, bioengineering and rehabilitation* (p. 23). New York: Springer Verlag.

*Perception*, 34, 409–420.

*Journal of Vision*, 10(6):3, 1–17, http://www.journalofvision.org/10/6/3, doi:10.1167/10.6.3.

*Science*, 287(5456), 1273.

*Vision Research*, 37(23), 3283–3298.

*Journal of Vision*, 8(7):32, 1–20, http://www.journalofvision.org/8/7/32, doi:10.1167/8.7.32.

$A$, its spectral wavelength $\lambda$, and its wave number vector $\mathbf{k}$ (i.e., its direction of propagation), with $\mathbf{k}$ being a vector of free orientation and with norm $k = 2\pi/\lambda$, and $c$ being the speed of light.

$u$ and $v$ the rectangular components of the two-dimensional spatial frequencies on an image plane parallel to the $x$-$y$ plane, they are related to the wave number vector in such a way that the spatial frequencies contributed by a given plane wave depend on the projection of its wave number vector on the $x$-$y$ plane. That means they can be derived from both the angle with the image plane and the spectral wavelength, where $\theta_{x}$ and $\theta_{y}$ are the angles that the wave number vector makes with the $y$-$z$ and $x$-$z$ planes, respectively, and the sine of each angle can be replaced by the angle itself in the paraxial approximation (for small angles).
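The relations described in this paragraph can be illustrated numerically. The sketch below assumes the standard plane-wave relations from Fourier optics, namely $k_{x} = 2\pi u$ and $k_{y} = 2\pi v$, which give $u = \sin\theta_{x}/\lambda$ and $v = \sin\theta_{y}/\lambda$, and in the paraxial case $u \approx \theta_{x}/\lambda$ and $v \approx \theta_{y}/\lambda$. These are assumed to be the expressions intended in the text; the function name and the numerical values are hypothetical.

```python
import math


def spatial_frequencies(wavelength, theta_x, theta_y, paraxial=False):
    """Spatial frequencies (u, v), in cycles per unit length, of a plane wave.

    theta_x and theta_y are the angles (in radians) that the wave number
    vector makes with the y-z and x-z planes, respectively.
    """
    if paraxial:
        # Small-angle approximation: the sine is replaced by the angle.
        return theta_x / wavelength, theta_y / wavelength
    return math.sin(theta_x) / wavelength, math.sin(theta_y) / wavelength


wavelength = 550e-9             # metres (green light, illustrative value)
theta_x, theta_y = 0.01, 0.02   # radians (illustrative values)

u, v = spatial_frequencies(wavelength, theta_x, theta_y)
u_par, v_par = spatial_frequencies(wavelength, theta_x, theta_y, paraxial=True)

k = 2.0 * math.pi / wavelength  # norm of the wave number vector
k_x = k * math.sin(theta_x)     # projection of k on the x axis
```

For angles this small the exact and paraxial values differ by well under one part in a thousand, which is why the sine can be dropped in that regime.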

($x, y$) → $p$. Using polar instead of rectangular coordinates, which is more convenient for representing spatial frequencies, an image can be formalized by expressions in which $\rho$ and $\alpha$ are the radius and the angle of the spatial frequency in polar coordinates, respectively.
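The change of coordinates mentioned here is the usual one, $\rho = \sqrt{u^{2} + v^{2}}$ and $\alpha = \operatorname{atan2}(v, u)$. The short sketch below converts in both directions. The function names are hypothetical, and the angle range $(-\pi, \pi]$ is an assumption of this example rather than a convention stated in the text.

```python
import math


def to_polar(u, v):
    """Radius and angle (rho, alpha) of a spatial frequency given as (u, v)."""
    return math.hypot(u, v), math.atan2(v, u)


def to_rectangular(rho, alpha):
    """Rectangular components (u, v) of a spatial frequency given as (rho, alpha)."""
    return rho * math.cos(alpha), rho * math.sin(alpha)


rho, alpha = to_polar(3.0, 4.0)               # illustrative values, cycles per unit length
u_back, v_back = to_rectangular(rho, alpha)   # round trip back to (u, v)
```

In this representation a bank of band-pass filters is naturally indexed by scale ($\rho$) and orientation ($\alpha$), which matches the decomposition into radii and angles used in the main text.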

$\lambda$ can be represented by the argument of the integral on the right side of the same equation.