Despite extensive study of early vision, new and unexpected mechanisms continue to be identified. We introduce a novel formal treatment of the psychophysics of image similarity, derived directly from straightforward connectivity patterns in early visual pathways. The resulting differential-geometry formulation is shown to provide accurate and explanatory accounts of human perceptual similarity judgments. The direct formal predictions are then shown to be further improved via simple regression on human behavioral reports, which in turn are used to construct more elaborate hypothesized neural connectivity patterns. The predictive approaches introduced here are shown to outperform a standard, successful published measure of perceived image fidelity; moreover, the approach provides clear explanatory principles for these similarity findings.

*strain* (as used in physics) involved in converting Euclidean image similarity into perceptual image similarity. We then derive an image-space similarity measure that matches it. We show that straightforward properties of circuitry in the early visual pathway directly give rise to derived non-Euclidean similarity measures. These similarity measures are predictive of human behavioral responses, providing a link between early visual circuitry and behavior. The results are compatible with findings in the literatures of psychophysics (e.g., Oliva et al., 2005; Yue, Biederman, Mangini, von der Malsburg, & Amir, 2012) and IQA (e.g., Pons et al., 1999). Moreover, the formulation has already been shown to account for a seemingly unrelated set of visual psychophysics phenomena (i.e., crowding) (Rodriguez & Granger, 2021). We believe that this formalism is the first to use strain to directly approach the possible causal relationships between biological connectivity and psychophysics. In summary, this formalism may be used to refine our understanding of the unseen processes that give rise to the quirks of human visual perception, and may ultimately prove useful to future applications in predicting judgments and behavior.

*D* is the number of pixels in the image. This can be rewritten, in linear algebra terms, as the dot product of the difference vector between \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) and \(\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}\) with itself. Dropping the 1/*D* term (which is constant in each dataset) yields a measure of image difference which is the (squared) Euclidean distance:
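As a numerical sketch of this identity (toy vectors, not the paper's stimuli), the squared Euclidean distance is simply the dot product of the difference vector with itself:

```python
import numpy as np

# Hypothetical 4x4 grayscale "images", flattened to vectors in pixel space.
s = np.arange(16, dtype=float)     # reference image
s_prime = s.copy()
s_prime[5] += 2.0                  # degrade one pixel by 2 luminance units

diff = s - s_prime                 # difference vector in pixel coordinates
d_E_squared = diff @ diff          # squared Euclidean distance
print(d_E_squared)                 # 4.0
```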

*displacement field*, \({\stackrel{{}_\rightharpoonup}{u}} ( {{\stackrel{{}_\rightharpoonup}{s}}})\):

*changes* along the path from \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) to \({\rm{\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}}}\). The approximation will be poor if the path from \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) to \({\rm{\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}}}\) is sufficiently nonlinear. Two conditions can guarantee an accurate approximation. First, the displacement field can have little curvature relative to the distance between images. Second, the distance between images can be sufficiently small to make any curvature irrelevant. Although the degradations that we evaluate (see Methods) are at times obvious to subjects, they are minuscule on the scale of image space. That is, degradations never transform one reference image into another, or make nonlocal changes. In the IQA task, we believe that both conditions can be taken as reasonable assumptions, at the cost of some modeling error. Per the results, even an imprecise first-order approximation appears to capture valuable patterns. An important next step will be to utilize highly nonlinear models of perceptual strain, building on important prior work (Epifanio et al., 2003; Laparra et al., 2010; Malo et al., 2000; Malo et al., 2005; Pons et al., 1999).

**I** is the identity matrix (the tensor of Cartesian coordinates in Euclidean space); \({\stackrel{{}_\rightharpoonup}{\nabla }}_s{{\rm{\stackrel{{}_\rightharpoonup}{u}}}}^{\rm T}\) is a matrix where the value in the *i*th row and *j*th column, ∂*u*_{i}/∂*s*_{j}, describes how much additional displacement the luminance change to pixel *j* contributes to the perceptual displacement of pixel *i*:

∂*u*_{i}/∂*s*_{j} suggests a strong connection between pixels *i* and *j*. The Euclidean distance metric in Cartesian coordinates of pixels (e.g., Equations 2 and 4) has the identity matrix as its tensor. Now that we understand Equation 7, we can compute perceived distance (Equation 4) without reference to \({{\rm{\stackrel{{}_\rightharpoonup}{s}}}}_\mathcal{P}\):

**P** as an operation that changes each bitmap image stimulus \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) (a point on Cartesian coordinates of pixels) to a perceived vector \({{\rm{\stackrel{{}_\rightharpoonup}{s}}}}_\mathcal{P}\):

**P** is a locally multilinear operator: a *D* × *D* matrix for each stimulus. The main diagonal represents 1:1 topographic connectivity among neurons, or an unmodified percept. Each off-diagonal element describes an additional biological connection or perceptual interaction (that may strain image space). This matrix is a simple way to quantify connectomes and local projection patterns like those in Figure 3. Here, each element of **P** is a scalar function describing how a pair of neurons, receptive fields, concepts, or brain regions relate. This concept of connectivity is believed to be equivalent to some quantifications of linear cell receptive fields (Chichilnisky, 2001). Qualitatively, the incoming connectivity to a neuron defines its receptive field.

*i* and *j* on an image (retinal distance, *d*_{ret}(*i*, *j*)) (Dacey et al., 2000; De Monasterio, 1978; Sincich & Blasdel, 2001; Young, 1987; Young & Lesperance, 2001); for electrophysiological examples, see Figure 3b:

**P** to the *gradient* of \(\stackrel{{}_\rightharpoonup}{u}\):

**I**. We make this replacement and apply a transpose, returning the left side to something more simply expressed:

**P** can be written in terms of the derivative of the displacement field. We can say that the displacement field is *generated by* the perceptual operator.

**P**. Using Equation 15, we can insert **P** into Equation 9. The perceived difference between \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) and \({{\rm{\stackrel{{}_\rightharpoonup}{s}}}}{}^{\prime}\) is simply

**P** is the Jacobian of the perceptual distortion. If there exists no perceptual strain, **P** = **I**, and \(d_\mathcal{P}^2( {{\rm{\stackrel{{}_\rightharpoonup}{s}}},{{{\rm{\stackrel{{}_\rightharpoonup}{s}}}}}{}^{\prime}} ) = d_E^2( {{\rm{\stackrel{{}_\rightharpoonup}{s}}},{{{\rm{\stackrel{{}_\rightharpoonup}{s}}}}}{}^{\prime}} )\).
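A minimal numerical sketch of this structure (NumPy; `perceived_dist_sq` is an illustrative helper, and the operators here are stand-ins rather than fitted perceptual operators): the perceived squared distance is the Euclidean quadratic form taken after applying the Jacobian **P**, and **P** = **I** recovers the plain Euclidean measure:

```python
import numpy as np

def perceived_dist_sq(s, s_prime, P):
    """Squared perceptual distance: ||P (s - s')||^2, i.e., the Euclidean
    form under the metric tensor P^T P (illustrative sketch)."""
    v = P @ (s - s_prime)
    return v @ v

rng = np.random.default_rng(0)
s = rng.random(16)
s_prime = rng.random(16)
d_E = (s - s_prime) @ (s - s_prime)

# With no perceptual strain, P = I and the measure reduces to Euclidean.
assert np.isclose(perceived_dist_sq(s, s_prime, np.eye(16)), d_E)
```

A non-identity **P** rescales or mixes pixel contributions; for example, **P** = 2**I** uniformly dilates image space and quadruples every squared distance.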

**P**, Equation 16 can be used to calculate the perceived difference without direct measurement of the displacement field. Instead, Equation 16 predicts perceived difference using a perceptual strain that has been inferred from biological projection patterns. In the next section, we will introduce two approaches for selecting **P**.

**P**. We evaluate several possible forms of **P** herein. The first is Gaussian connectivity between the cells that favor pixels *i* and *j* (as described earlier and in Figure 3b):

**P** has 1’s along the diagonal. When we subtract **I** in Equation 15, we zero the diagonal elements of \({( {{{\stackrel{{}_\rightharpoonup}{\nabla }}}_s{{{\rm{\stackrel{{}_\rightharpoonup}{u}}}}}^{\rm T}} )}^{\rm T}\) (and thus the strain tensor; see Derivation of strain tensor). Together, these components account for image space dilation, which cannot be measured using relative psychophysical distances. (Diagonal connectivity, the Euclidean component of perception, is separately represented by **I** in Equation 15.)

**P**_{Gauss} (Equation 17) and **P**_{DOG} (Equation 18) are used as examples of “approach I” herein. This simple approach produces a Jacobian from the displacement field, which lets us measure the perceived distance between two stimuli. The resulting Jacobians are of course unlikely to be perfectly accurate representations of the actual connectivity patterns in early visual pathways, which are shaped by development and learning.
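A Gaussian connectivity matrix of this kind can be sketched directly from pixel coordinates (NumPy; the exact normalization of Equation 17 is assumed here, not quoted, so `gaussian_P` is an unnormalized illustration):

```python
import numpy as np

def gaussian_P(width, height, sigma):
    """Connectivity matrix whose (i, j) element falls off as a Gaussian of
    the retinal distance between pixels i and j (sketch of P_Gauss; the
    paper's exact Equation 17 normalization is assumed, not reproduced)."""
    ys, xs = np.mgrid[0:height, 0:width]
    coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    # Pairwise Euclidean distances between pixel positions.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-d ** 2 / (2 * sigma ** 2))

P = gaussian_P(8, 8, sigma=0.6)        # sigma = 0.6 px, as in approach I
assert P.shape == (64, 64)
assert np.allclose(np.diag(P), 1.0)    # 1's along the diagonal, per the text
```

Note that the diagonal is identically 1 (zero retinal distance), matching the statement that **P** has 1’s along the diagonal; off-diagonal mass shrinks rapidly at σ = 0.6 px.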

**P** that causes maximal correlation between perceptual dissimilarity (computed between each \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) and \({{\rm{\stackrel{{}_\rightharpoonup}{s}}}}{}^{\prime}\) using Equation 16) and human difference ratings.

*plus* the degree to which \({\rm{\stackrel{{}_\rightharpoonup}{u}}}( {{\rm{\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}}}} )\) differs from \({\rm{\stackrel{{}_\rightharpoonup}{u}}}( {{\rm{\stackrel{{}_\rightharpoonup}{s}}}} )\) as we move from \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) to \({\rm{\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}}}\):

**ε** is the strain tensor and **I**, the identity matrix, was the original tensor. Distributing the terms of the previous equation yields an alternative definition of perceptual distance (not required herein):

*change in distance* caused by perceptual strain.

*N* < 35, precise count unknown) placed together on a linear scale such that pairwise distances between the images matched perceived difference. The TID2013 (Ponomarenko et al., 2015) and Toyama (Tourancheau, Autrusseau, Sazzad, & Horita, 2008) datasets were also utilized for breadth. These datasets contain similar imagery, with slightly varying image sizes and measures. Each of these datasets contains subsets with different image degradation methods (see citations). Regardless of dataset, all human ratings reported here are normalized to a range of [0, 1], where 0 is no perceived distance (perfect fidelity).

*N* across categories. We used 260 images per category, the number of images in the rarest category. Each image in the dataset was degraded into four JPEG quality levels: 30%, 20%, 10%, and 5%, using ImageJ (National Institutes of Health, Bethesda, MD) (Schneider, Rasband, & Eliceiri, 2012).
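The degradation step can be sketched with Pillow in place of ImageJ (a substitution for illustration only; the array sizes and random image here are hypothetical, not the dataset's):

```python
from io import BytesIO

import numpy as np
from PIL import Image

def jpeg_degrade(img_array, quality):
    """Re-encode an 8-bit grayscale array at a given JPEG quality level.
    (The paper used ImageJ; Pillow is substituted here as a sketch.)"""
    buf = BytesIO()
    Image.fromarray(img_array, mode="L").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf))

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # toy stand-in image
degraded = [jpeg_degrade(ref, q) for q in (30, 20, 10, 5)]  # the four levels
```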

**P** is local and uniform across the image (see Discussion for a simple relaxation of this assumption). This assumption is congruent with the low-level visual system, wherein relations between representations of topographic neighbors are one dominant component (Dacey et al., 2000; De Monasterio, 1978; Hubel & Wiesel, 1962; Martinez & Alonso, 2003; Sincich & Blasdel, 2001; Von der Malsburg, 1973; Young, 1987; Young & Lesperance, 2001). It is also approximately true for the radial basis functions explored in approach I. Images were split into 8 × 8-pixel tiles. This enabled us to optimize a single 64 × 64 Jacobian (rather than a 65,536 × 65,536 Jacobian with an untenable billion-dimensional error surface). The 64 × 64 Jacobian was used to compare each tile of an image, after which the tile distances were summed. This tiling approach is consistent with JPEG (Pennebaker & Mitchell, 1992; Wallace, 1992), related compression methods (Bowen, Felch, Granger, & Rodriguez, 2018), and other IQA measures (e.g., SSIM; Wang et al., 2004). Subdividing images greatly increased the number of data points used for training while simplifying the task.
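The tiling scheme can be sketched as follows (NumPy; `tiled_perceived_dist_sq` is an illustrative helper, and the **P** used in the check below is the identity, not an optimized Jacobian):

```python
import numpy as np

def tiled_perceived_dist_sq(s, s_prime, P, tile=8):
    """Sum per-tile perceived squared distances, reusing one
    (tile^2 x tile^2) Jacobian P for every tile of the image
    (sketch of the tiling scheme; P here is hypothetical)."""
    h, w = s.shape
    total = 0.0
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            diff = (s[y:y+tile, x:x+tile] - s_prime[y:y+tile, x:x+tile]).ravel()
            v = P @ diff
            total += v @ v
    return total

# Sanity check: with P = I, the tiled sum is the plain squared Euclidean
# distance over the whole image (16*16 unit differences -> 256).
s, s2 = np.zeros((16, 16)), np.ones((16, 16))
assert tiled_perceived_dist_sq(s, s2, np.eye(64)) == 256.0
```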

*p* value is not significant. By contrast, the paired *t*-test (on Fisher *z*-transformed correlations) can yield any *p* value but will be sensitive to outliers and variability in the results. Instead, we measured, for each of the two folds reported in Table 1, a two-tailed Fisher *r*-to-*z* transformation, then took the mean across folds.
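The two-tailed Fisher *r*-to-*z* comparison of two independent correlations takes a standard form, sketched below (the sample sizes are hypothetical; the per-fold *n* is not restated in this passage):

```python
import math

def fisher_r_to_z_test(r1, n1, r2, n2):
    """Two-tailed Fisher r-to-z test for the difference between two
    independent correlations (standard form; ns here are illustrative)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher z-transforms
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed p from the standard normal CDF, Phi(x) = (1 + erf(x/sqrt(2)))/2.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

# E.g., comparing r = 0.83 (approach I DOG) with r = 0.45 (Euclidean)
# over a hypothetical n = 500 image pairs per measure:
z, p = fisher_r_to_z_test(0.83, 500, 0.45, 500)
assert p < 0.001
```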

*N* among semantic categories), and two approach II Jacobians were independently optimized in a two-fold cross-validation. Each Jacobian was only used to predict images uninvolved in its training, and the two sets of test scores were pooled (without modification) for comparison with DMOS.

*r* = 0.45 for Euclidean (MSE); *r* = 0.63 for approach I Gaussian σ = 0.6 px (0.0310°); *r* = 0.83 for approach I DOG; and *r* = 0.76 for approach II (SceneIQ Online, linear axes, mean across two folds of data) all differ from chance (*p* << 0.001) (Table 1). Comparisons in terms of rank-order correlation and logistic regression are included in Supplementary Materials.

σ_{center} = 3.6 px (0.2228°), σ_{surround} = 5.2 px (0.3219°), α = 0.7. In comparison with neuronal response profiles (available for visualization in Figure 3), these parameters appear intermediate: broader profiles have been found in retinal ganglion cells (Dacey, 1996; Dacey, 2000; Rodieck, 1965), and narrower, more Gaussian-like profiles have been found by others (Croner & Kaplan, 1995; Dacey et al., 2000). From this DOG parameterization and the equation for DOG(*x*) in Equation 12, we can compute the contrast sensitivity function (CSF) that these parameters hypothesize (Wandell, 1995):

*z*_{i} = 10 log_{10}(1 + *z*_{i}). The shape of this CSF is similar to those computed from DOG parameters in other works. For example, Wuerger, Watson, and Ahumada (2002) fit difference-of-Gaussians parameters to human behavioral responses. The authors used these parameters as the basis of a spatial luminance CSF. In visual crowding, it was recently found that a novel measure of contrast, capable of relating DOG parameters to contrast sensitivity, accounts for a substantial amount of data (Rodriguez & Granger, 2021).
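The fitted difference-of-Gaussians profile itself can be sketched with these parameters (assuming a unit-height center Gaussian minus an α-weighted surround Gaussian; the exact form of Equation 12 may normalize differently):

```python
import math

def dog(x, sigma_c=3.6, sigma_s=5.2, alpha=0.7):
    """Difference-of-Gaussians profile at retinal distance x (pixels),
    with the fitted parameters; assumed unit-height form, since the
    paper's Equation 12 is not quoted here."""
    center = math.exp(-x ** 2 / (2 * sigma_c ** 2))
    surround = alpha * math.exp(-x ** 2 / (2 * sigma_s ** 2))
    return center - surround

# At the receptive-field center the surround subtracts from the peak:
assert abs(dog(0.0) - 0.3) < 1e-12    # 1 - alpha = 0.3
```

The Fourier transform of such a profile is band-pass, which is what gives rise to a CSF-like curve from these spatial parameters.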

*across* scene categories than *within* them.

*p* < 0.001 (SceneIQ Online, two-tailed Fisher *r*-to-*z*; see Methods). However, the DOG hypothesis outperforms the Gaussian hypotheses; for approach I, Gaussian σ = 0.6 px (0.0310°) versus DOG, *p* < 0.001 (SceneIQ Online, two-tailed Fisher *r*-to-*z*). In predicting perceived distance judgments, supplementing Euclidean with approach I DOG scores, linear fit log(DMOS) ∼ log(Euclidean distance) + log(approach I DOG score) (adjusted *R*^{2} = 0.7166), is better than the Euclidean measure alone, with linear fit log(DMOS) ∼ log(Euclidean distance) (adjusted *R*^{2} = 0.1989; SceneIQ Online). More surprisingly, both approach I DOG and approach II tensors reliably outperform several performance-driven IQA algorithms (Table 1).

∂*u*_{i}/∂*s*_{j}), rather than input–output relations, *y* = \(f( {{\rm{\stackrel{{}_\rightharpoonup}{s}}}} )\), as is more typical in artificial neural network approaches. The findings suggest the potential of such formalisms to help us understand how patterns of individual associations yield the gestalt of an image percept, which is composed of many outputs working together rather than in isolation. Such a formalism is aligned with many insights neuroscientists have acquired about connectivity. Further characterization of the types of candidate hypotheses is in progress.

*Psychological Review,* 95, 124.

*Advances in Neural Information Processing Systems,* 2017-December, 3531–3540.

*IEEE Transactions on Image Processing,* 8, 717–730.

*Perspectives on Psychological Science,* 6, 3–5.

*International Scholarly Research Notices,* 2013, 1–53.

*IEEE Transactions on Image Processing,* 16, 2284–2298.

*Network: Computation in Neural Systems,* 12, 199.

*Vision Research,* 35, 7–24.

*Neural Computation,* 28, 2628–2655.

*Neural Computation,* 30, 1612–1623.

*Proceedings of the National Academy of Sciences, USA,* 93, 582–588.

*Annual Review of Neuroscience,* 23, 743–775.

*Vision Research,* 40, 1801–1811.

*Proceedings of the National Academy of Sciences, USA,* 89, 9666–9670.

*Proceedings Volume 1666, Human Vision, Visual Processing, and Digital Display III* (pp. 2–16). Bellingham, WA: SPIE.

*IEEE Transactions on Image Processing,* 9, 636–650.

*Nature Neuroscience,* 4, 1244–1252.

*Journal of Neurophysiology,* 41, 1418–1434.

*Perception & Psychophysics,* 60, 65–81.

*Psychonomic Bulletin & Review,* 6, 239–268.

*Behavioral and Brain Sciences,* 21, 449–498.

*Frontiers in Computational Neuroscience,* 6, 45.

*Proceedings of the Second International Workshop on Video Processing and Quality Metrics for Consumer Electronics, VPQM 2006* (pp. 1–4). New York: Springer.

*Journal of Mathematical Psychology,* 56, 404–416.

*Pattern Recognition,* 36, 1799–1811.

*Electronics Letters,* 48, 631–633.

*Elemente der psychophysik*. Leipzig: Breitkopf und Härtel.

*Journal of Mathematical Psychology,* 53, 86–91.

*Biological Cybernetics,* 51, 305–312.

*Computer Vision–ECCV 2006* (pp. 56–69). Berlin: Springer.

*Cognition,* 52, 125–157.

*IEEE Transactions on Multimedia,* 18, 1098–1110.

*PLoS Biology,* 6, e187.

*2020 IEEE International Conference on Image Processing (ICIP)* (pp. 121–125). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*The Journal of Physiology,* 160, 106–154.

*ITU-T recommendation P.910: Subjective video quality assessment methods for multimedia applications*. Geneva, Switzerland: International Telecommunication Union Standardization Sector.

*ITU-T recommendation P.800.1: Mean opinion score terminology*. Geneva, Switzerland: International Telecommunication Union Standardization Sector.

*Nature Reviews Neuroscience,* 2, 194.

*Trends in Cognitive Sciences,* 17, 401–412.

*Journal of Visual Communication and Image Representation,* 40, 76–84.

*Journal of Visual Communication and Image Representation,* 11, 17–40.

*Theory of elasticity*. Oxford, UK: Butterworth.

*Journal of the Optical Society of America. A, Optics, Image Science, and Vision,* 27, 852–864.

*Journal of Electronic Imaging,* 19, 11006.

*Neural Computation,* 1, 541–551.

*Signal Processing: Image Communication,* 25, 517–526.

*Journal of Visual Communication and Image Representation,* 22, 297–312.

*2008 IEEE International Conference on Systems, Man, and Cybernetics* (pp. 2246–2251). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*IEEE Transactions on Communications,* 30, 1679–1692.

*IEEE Transactions on Image Processing,* 15, 68–80.

*Image and Vision Computing,* 18, 233–246.

*IEEE Transactions on Information Theory,* 20, 525–536.

*The Neuroscientist,* 9, 317–331.

*PLoS One,* 13, e0201326.

*Psychological Review,* 100, 254.

*IEEE Journal of Selected Topics in Signal Processing,* 3, 193–201.

*IEEE Transactions on Image Processing,* 20, 3350–3364.

*IEEE Transactions on Neural Networks,* 21, 515–519.

*Spatial Vision,* 5, 81–100.

*International Journal of Computer Vision,* 42, 145–175.

*Neural Computation,* 17, 969–990.

*The Visual Neurosciences,* 2, 1603–1615.

*JPEG: Still image data compression standard*. Berlin: Springer Science & Business Media.

*Journal of Physiology (Paris),* 97, 265–309.

*Signal Processing: Image Communication,* 30, 57–77.

*Proceedings of the Third International Workshop on Video Processing and Quality Metrics for Consumer Electronics, VPQM 07* (pp. 1–4). New York: Springer.

*Displays,* 20, 93–110.

*AMS Lectures on Mathematics in the Life Sciences,* 7, 217–232.

*Vision Research,* 5, 583–601.

*Journal of Vision,* 21(1):4, 1–19, https://doi.org/10.1167/jov.21.1.4.

*Frontiers in Computational Neuroscience,* 6, 35.

*IEEE Transactions on Computational Imaging,* 3, 110–125.

*IEEE Transactions on Image Processing,* 18, 2385–2401.

*Biological Cybernetics,* 98, 33–48.

*Nature Methods,* 9, 671.

*Science,* 290, 2268–2269.

*Journal of Mathematical Psychology,* 70, 21–34.

*IEEE Transactions on Image Processing,* 15, 430–444.

*IEEE Transactions on Image Processing,* 14, 2117–2128.

*IEEE Transactions on Image Processing,* 15, 422–429.

*Annual Review of Neuroscience,* 24, 1193–1216.

*Journal of Neuroscience,* 21, 4416–4426.

*Psychology & Neuroscience,* 4, 29–48.

*Proceedings of 1st International Conference on Image Processing* (pp. 982–986). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*IEEE Conference on Computer Vision and Pattern Recognition* (pp. 5306–5314). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*Network: Computation in Neural Systems,* 14, 391–412.

*2008 15th IEEE Conference on Image Processing* (pp. 365–368). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*Vision Research,* 38, 2289–2305.

*Kybernetik,* 14, 85–100.

*IEEE Transactions on Consumer Electronics,* 38, xviii–xxxiv.

*Progress in Neurobiology,* 51, 167–194.

*Foundations of vision*. Sunderland, MA: Sinauer Associates.

*IEEE Signal Processing Letters,* 9, 81–84.

*IEEE Signal Processing Magazine,* 98–117.

*IEEE Transactions on Image Processing,* 13, 600–612.

*IEEE Transactions on Image Processing,* 20, 1185–1198.

*2006 International Conference on Image Processing* (pp. 2945–2948). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*The Thirty-Seventh Asilomar Conference on Signals, Systems, & Computers* (pp. 1398–1402). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*IEEE Journal of Selected Topics in Signal Processing,* 6, 616–625.

*IEEE Transactions on Image Processing,* 22, 43–54.

*Human Vision and Electronic Imaging VII* (pp. 159–172). Bellingham, WA: SPIE.

*IEEE Transactions on Image Processing,* 23, 684–695.

*Spatial Vision,* 2, 273–293.

*Spatial Vision,* 14, 321–389.

*Vision Research,* 55, 41–46.

*IEEE Transactions on Image Processing,* 23, 4270–4281.

*IEEE Transactions on Image Processing,* 20, 2378–2386.