Despite extensive study of early vision, new and unexpected mechanisms continue to be identified. We introduce a novel formal treatment of the psychophysics of image similarity, derived directly from straightforward connectivity patterns in early visual pathways. The resulting differential-geometry formulation is shown to provide accurate and explanatory accounts of human perceptual similarity judgments. The direct formal predictions are then shown to be further improved via simple regression on human behavioral reports, which in turn are used to construct more elaborate hypothesized neural connectivity patterns. The predictive approaches introduced here are shown to outperform a standard, successful published measure of perceived image fidelity; moreover, the approach provides clear explanatory principles for these similarity findings.

*strain* (as used in physics) involved in converting Euclidean image similarity into perceptual image similarity. We then derive an image-space similarity measure that matches it. We show that straightforward properties of circuitry in the early visual pathway directly give rise to derived non-Euclidean similarity measures. These similarity measures are predictive of human behavioral responses, providing a link between early visual circuitry and behavior. The results are compatible with findings in the literatures of psychophysics (e.g., Oliva et al., 2005; Yue, Biederman, Mangini, von der Malsburg, & Amir, 2012) and IQA (e.g., Pons et al., 1999). Moreover, the formulation has already been shown to account for a seemingly unrelated set of visual psychophysics phenomena (i.e., crowding) (Rodriguez & Granger, 2021). We believe that this formalism is the first to use strain to directly approach the possible causal relationships between biological connectivity and psychophysics. In summary, this formalism may be used to refine our understanding of the unseen processes that give rise to the quirks of human visual perception, and may ultimately prove useful to future applications in predicting judgments and behavior.

*D* is the number of pixels in the image. This can be rewritten, in linear algebra terms, as the dot product of the difference vector between \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) and \(\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}\) with itself. Dropping the 1/*D* term (which is constant in each dataset) yields a measure of image difference which is the (squared) Euclidean distance:
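As a numerical sketch of this identity (toy vectors, not the paper's stimuli), the squared Euclidean distance is simply the dot product of the difference vector with itself:

```python
import numpy as np

# Hypothetical 4x4 grayscale "images", flattened to vectors in pixel space.
s = np.arange(16, dtype=float)     # reference image
s_prime = s.copy()
s_prime[5] += 2.0                  # degrade one pixel by 2 luminance units

diff = s - s_prime                 # difference vector in pixel coordinates
d_E_squared = diff @ diff          # squared Euclidean distance
print(d_E_squared)                 # 4.0
```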

*displacement field*, \({\stackrel{{}_\rightharpoonup}{u}} ( {{\stackrel{{}_\rightharpoonup}{s}}})\):

*changes* along the path from \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) to \({\rm{\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}}}\). The approximation will be poor if the path from \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) to \({\rm{\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}}}\) is sufficiently nonlinear. Two conditions can guarantee an accurate approximation. First, the displacement field can have little curvature relative to the distance between images. Second, the distance between images can be sufficiently small to make any curvature irrelevant. Although the degradations that we evaluate (see Methods) are at times obvious to subjects, they are minuscule on the scale of image space. That is, degradations never transform one reference image into another, or make nonlocal changes. In the IQA task, we believe that both conditions can be taken as reasonable assumptions, at the cost of some modeling error. Per the results, even an imprecise first-order approximation appears to capture valuable patterns. An important next step will be to utilize highly nonlinear models of perceptual strain, building on important prior work (Epifanio et al., 2003; Laparra et al., 2010; Malo et al., 2000; Malo et al., 2005; Pons et al., 1999).

**I** is the identity matrix (the tensor of Cartesian coordinates in Euclidean space); \({\stackrel{{}_\rightharpoonup}{\nabla }}_s{{\rm{\stackrel{{}_\rightharpoonup}{u}}}}^{\rm T}\) is a matrix where the value in the *i*th row and *j*th column, ∂*u*_{i}/∂*s*_{j}, describes how much additional displacement the luminance change to pixel *j* contributes to the perceptual displacement of pixel *i*:

∂*u*_{i}/∂*s*_{j} suggests a strong connection between pixels *i* and *j*. The Euclidean distance metric in Cartesian coordinates of pixels (e.g., Equations 2 and 4) has the identity matrix as its tensor. Now that we understand Equation 7, we can compute perceived distance (Equation 4) without reference to \({{\rm{\stackrel{{}_\rightharpoonup}{s}}}}_\mathcal{P}\):

**P** as an operation that changes each bitmap image stimulus \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) (a point on Cartesian coordinates of pixels) to a perceived vector \({{\rm{\stackrel{{}_\rightharpoonup}{s}}}}_\mathcal{P}\):

**P** is a locally multilinear operator: a *D* × *D* matrix for each stimulus. The main diagonal represents 1:1 topographic connectivity among neurons, or an unmodified percept. Each off-diagonal element describes an additional biological connection or perceptual interaction (that may strain image space). This matrix is a simple way to quantify connectomes and local projection patterns like those in Figure 3. Here, each element of **P** is a scalar function describing how a pair of neurons, receptive fields, concepts, or brain regions relate. This concept of connectivity is believed to be equivalent to some quantifications of linear cell receptive fields (Chichilnisky, 2001). Qualitatively, the incoming connectivity to a neuron defines its receptive field.

*i* and *j* on an image (retinal distance, *d*_{ret}(*i*, *j*)) (Dacey et al., 2000; De Monasterio, 1978; Sincich & Blasdel, 2001; Young, 1987; Young & Lesperance, 2001); for electrophysiological examples, see Figure 3b:

**P** to the *gradient* of \(\stackrel{{}_\rightharpoonup}{u}\):

**I**. We make this replacement and apply a transpose, returning the left side to something more simply expressed:

**P** can be written in terms of the derivative of the displacement field. We can say that the displacement field is *generated by* the perceptual operator.

**P**. Using Equation 15, we can insert **P** into Equation 9. The perceived difference between \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) and \({{\rm{\stackrel{{}_\rightharpoonup}{s}}}}{}^{\prime}\) is simply

**P** is the Jacobian of the perceptual distortion. If there exists no perceptual strain, **P** = **I**, and \(d_\mathcal{P}^2( {{\rm{\stackrel{{}_\rightharpoonup}{s}}},{{{\rm{\stackrel{{}_\rightharpoonup}{s}}}}}{}^{\prime}} ) = d_E^2( {{\rm{\stackrel{{}_\rightharpoonup}{s}}},{{{\rm{\stackrel{{}_\rightharpoonup}{s}}}}}{}^{\prime}} )\).
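A minimal numerical sketch of this structure (NumPy; `perceived_dist_sq` is an illustrative helper, and the operators here are stand-ins rather than fitted perceptual operators): the perceived squared distance is the Euclidean quadratic form taken after applying the Jacobian **P**, and **P** = **I** recovers the plain Euclidean measure:

```python
import numpy as np

def perceived_dist_sq(s, s_prime, P):
    """Squared perceptual distance: ||P (s - s')||^2, i.e., the Euclidean
    form under the metric tensor P^T P (illustrative sketch)."""
    v = P @ (s - s_prime)
    return v @ v

rng = np.random.default_rng(0)
s = rng.random(16)
s_prime = rng.random(16)
d_E = (s - s_prime) @ (s - s_prime)

# With no perceptual strain, P = I and the measure reduces to Euclidean.
assert np.isclose(perceived_dist_sq(s, s_prime, np.eye(16)), d_E)
```

A non-identity **P** rescales or mixes pixel contributions; for example, **P** = 2**I** uniformly dilates image space and quadruples every squared distance.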

**P**, Equation 16 can be used to calculate the perceived difference without direct measurement of the displacement field. Instead, Equation 16 predicts perceived difference using a perceptual strain that has been inferred from biological projection patterns. In the next section, we will introduce two approaches for selecting **P**.

**P**. We evaluate several possible forms of **P** herein. The first is Gaussian connectivity between the cells that favor pixels *i* and *j* (as described earlier and in Figure 3b):

**P** has 1’s along the diagonal. When we subtract **I** in Equation 15, we zero the diagonal elements of \({( {{{\stackrel{{}_\rightharpoonup}{\nabla }}}_s{{{\rm{\stackrel{{}_\rightharpoonup}{u}}}}}^{\rm T}} )}^{\rm T}\) (and thus the strain tensor; see Derivation of strain tensor). Together, these components account for image space dilation, which cannot be measured using relative psychophysical distances. (Diagonal connectivity, the Euclidean component of perception, is separately represented by **I** in Equation 15.)

**P**_{Gauss} (Equation 17) and **P**_{DOG} (Equation 18) are used as examples of “approach I” herein. This simple approach produces a Jacobian from the displacement field, which lets us measure the perceived distance between two stimuli. The resulting Jacobians are of course unlikely to be perfectly accurate representations of the actual connectivity patterns in early visual pathways, which are shaped by development and learning.
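A Gaussian connectivity matrix of this kind can be sketched directly from pixel coordinates (NumPy; the exact normalization of Equation 17 is assumed here, not quoted, so `gaussian_P` is an unnormalized illustration):

```python
import numpy as np

def gaussian_P(width, height, sigma):
    """Connectivity matrix whose (i, j) element falls off as a Gaussian of
    the retinal distance between pixels i and j (sketch of P_Gauss; the
    paper's exact Equation 17 normalization is assumed, not reproduced)."""
    ys, xs = np.mgrid[0:height, 0:width]
    coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    # Pairwise Euclidean distances between pixel positions.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-d ** 2 / (2 * sigma ** 2))

P = gaussian_P(8, 8, sigma=0.6)        # sigma = 0.6 px, as in approach I
assert P.shape == (64, 64)
assert np.allclose(np.diag(P), 1.0)    # 1's along the diagonal, per the text
```

Note that the diagonal is identically 1 (zero retinal distance), matching the statement that **P** has 1’s along the diagonal; off-diagonal mass shrinks rapidly at σ = 0.6 px.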

**P** that causes maximal correlation between perceptual dissimilarity (computed between each \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) and \({{\rm{\stackrel{{}_\rightharpoonup}{s}}}}{}^{\prime}\) using Equation 16) and human difference ratings.

*plus* the degree to which \({\rm{\stackrel{{}_\rightharpoonup}{u}}}( {{\rm{\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}}}} )\) differs from \({\rm{\stackrel{{}_\rightharpoonup}{u}}}( {{\rm{\stackrel{{}_\rightharpoonup}{s}}}} )\) as we move from \({\rm{\stackrel{{}_\rightharpoonup}{s}}}\) to \({\rm{\stackrel{{}_\rightharpoonup}{s}\!\!{{\rm{^\prime}}}}}\):

**ε** is the strain tensor and **I**, the identity matrix, was the original tensor. Distributing the terms of the previous equation yields an alternative definition of perceptual distance (not required herein):

*change in distance* caused by perceptual strain.

*N* < 35, precise count unknown) placed together on a linear scale such that pairwise distances between the images matched perceived difference. The TID2013 (Ponomarenko et al., 2015) and Toyama (Tourancheau, Autrusseau, Sazzad, & Horita, 2008) datasets were also utilized for breadth. These datasets contain similar imagery, with slightly varying image sizes and measures. Each of these datasets contains subsets with different image degradation methods (see citations). Regardless of dataset, all human ratings reported here are normalized to a range of [0, 1], where 0 is no perceived distance (perfect fidelity).

*N* across categories. We used 260 images per category, the number of images in the rarest category. Each image in the dataset was degraded into four JPEG quality levels: 30%, 20%, 10%, and 5%, using ImageJ (National Institutes of Health, Bethesda, MD) (Schneider, Rasband, & Eliceiri, 2012).
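The degradation step can be sketched with Pillow in place of ImageJ (a substitution for illustration only; the array sizes and random image here are hypothetical, not the dataset's):

```python
from io import BytesIO

import numpy as np
from PIL import Image

def jpeg_degrade(img_array, quality):
    """Re-encode an 8-bit grayscale array at a given JPEG quality level.
    (The paper used ImageJ; Pillow is substituted here as a sketch.)"""
    buf = BytesIO()
    Image.fromarray(img_array, mode="L").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf))

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # toy stand-in image
degraded = [jpeg_degrade(ref, q) for q in (30, 20, 10, 5)]  # the four levels
```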

**P** is local and uniform across the image (see Discussion for a simple relaxation of this assumption). This assumption is congruent with the low-level visual system, wherein relations between representations of topographic neighbors are one dominant component (Dacey et al., 2000; De Monasterio, 1978; Hubel & Wiesel, 1962; Martinez & Alonso, 2003; Sincich & Blasdel, 2001; Von der Malsburg, 1973; Young, 1987; Young & Lesperance, 2001). It is also approximately true for the radial basis functions explored in approach I. Images were split into 8 × 8-pixel tiles. This enabled us to optimize a single 64 × 64 Jacobian (rather than a 65,536 × 65,536 Jacobian with an untenable billion-dimensional error surface). The 64 × 64 Jacobian was used to compare each tile of an image, after which the tile distances were summed. This tiling approach is consistent with JPEG (Pennebaker & Mitchell, 1992; Wallace, 1992), related compression methods (Bowen, Felch, Granger, & Rodriguez, 2018), and other IQA measures (e.g., SSIM; Wang et al., 2004). Subdividing images greatly increased the number of data points used for training while simplifying the task.
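The tiling scheme can be sketched as follows (NumPy; `tiled_perceived_dist_sq` is an illustrative helper, and the **P** used in the check below is the identity, not an optimized Jacobian):

```python
import numpy as np

def tiled_perceived_dist_sq(s, s_prime, P, tile=8):
    """Sum per-tile perceived squared distances, reusing one
    (tile^2 x tile^2) Jacobian P for every tile of the image
    (sketch of the tiling scheme; P here is hypothetical)."""
    h, w = s.shape
    total = 0.0
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            diff = (s[y:y+tile, x:x+tile] - s_prime[y:y+tile, x:x+tile]).ravel()
            v = P @ diff
            total += v @ v
    return total

# Sanity check: with P = I, the tiled sum is the plain squared Euclidean
# distance over the whole image (16*16 unit differences -> 256).
s, s2 = np.zeros((16, 16)), np.ones((16, 16))
assert tiled_perceived_dist_sq(s, s2, np.eye(64)) == 256.0
```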

*p* value is not significant. By contrast, the paired *t*-test (on Fisher *z*-transformed correlations) can yield any *p* value but will be sensitive to outliers and variability in the results. Instead, we measured, for each of the two folds reported in Table 1, a two-tailed Fisher *r*-to-*z* transformation, then took the mean across folds.
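The two-tailed Fisher *r*-to-*z* comparison of two independent correlations takes a standard form, sketched below (the sample sizes are hypothetical; the per-fold *n* is not restated in this passage):

```python
import math

def fisher_r_to_z_test(r1, n1, r2, n2):
    """Two-tailed Fisher r-to-z test for the difference between two
    independent correlations (standard form; ns here are illustrative)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher z-transforms
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed p from the standard normal CDF, Phi(x) = (1 + erf(x/sqrt(2)))/2.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

# E.g., comparing r = 0.83 (approach I DOG) with r = 0.45 (Euclidean)
# over a hypothetical n = 500 image pairs per measure:
z, p = fisher_r_to_z_test(0.83, 500, 0.45, 500)
assert p < 0.001
```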

*N* among semantic categories), and two approach II Jacobians were independently optimized in a two-fold cross-validation. Each Jacobian was only used to predict images uninvolved in its training, and the two sets of test scores were pooled (without modification) for comparison with DMOS.

*r* = 0.45 for Euclidean (MSE); *r* = 0.63 for approach I Gaussian σ = 0.6 px (0.0310°); *r* = 0.83 for approach I DOG; and *r* = 0.76 for approach II (SceneIQ Online, linear axes, mean across two folds of data) all differ from chance (*p* << 0.001) (Table 1). Comparisons in terms of rank-order correlation and logistic regression are included in Supplementary Materials.

σ_{center} = 3.6 px (0.2228°), σ_{surround} = 5.2 px (0.3219°), α = 0.7. In comparison with neuronal response profiles (available for visualization in Figure 3), these parameters appear intermediate: broader profiles have been found in retinal ganglion cells (Dacey, 1996; Dacey, 2000; Rodieck, 1965), and narrower, more Gaussian-like profiles have been found by others (Croner & Kaplan, 1995; Dacey et al., 2000). From this DOG parameterization and the equation for DOG(*x*) in Equation 12, we can compute the contrast sensitivity function (CSF) that these parameters hypothesize (Wandell, 1995):

*z*_{i} = 10 log_{10}(1 + *z*_{i}). The shape of this CSF is similar to those computed from DOG parameters in other works. For example, Wuerger, Watson, and Ahumada (2002) fit difference-of-Gaussians parameters to human behavioral responses. The authors used these parameters as the basis of a spatial luminance CSF. In visual crowding, it was recently found that a novel measure of contrast, capable of relating DOG parameters to contrast sensitivity, accounts for a substantial amount of data (Rodriguez & Granger, 2021).
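The fitted difference-of-Gaussians profile itself can be sketched with these parameters (assuming a unit-height center Gaussian minus an α-weighted surround Gaussian; the exact form of Equation 12 may normalize differently):

```python
import math

def dog(x, sigma_c=3.6, sigma_s=5.2, alpha=0.7):
    """Difference-of-Gaussians profile at retinal distance x (pixels),
    with the fitted parameters; assumed unit-height form, since the
    paper's Equation 12 is not quoted here."""
    center = math.exp(-x ** 2 / (2 * sigma_c ** 2))
    surround = alpha * math.exp(-x ** 2 / (2 * sigma_s ** 2))
    return center - surround

# At the receptive-field center the surround subtracts from the peak:
assert abs(dog(0.0) - 0.3) < 1e-12    # 1 - alpha = 0.3
```

The Fourier transform of such a profile is band-pass, which is what gives rise to a CSF-like curve from these spatial parameters.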

*across* scene categories than *within* them.

*p* < 0.001 (SceneIQ Online, two-tailed Fisher *r*-to-*z*; see Methods). However, the DOG hypothesis outperforms the Gaussian hypotheses; for approach I, Gaussian σ = 0.6 px (0.0310°) versus DOG, *p* < 0.001 (SceneIQ Online, two-tailed Fisher *r*-to-*z*). In predicting perceived distance judgments, supplementing Euclidean with approach I DOG scores, linear fit log(DMOS) ∼ log(Euclidean distance) + log(approach I DOG score) (adjusted *R*^{2} = 0.7166), is better than the Euclidean measure alone, with linear fit log(DMOS) ∼ log(Euclidean distance) (adjusted *R*^{2} = 0.1989; SceneIQ Online). More surprisingly, both approach I DOG and approach II tensors reliably outperform several performance-driven IQA algorithms (Table 1).

∂*u*_{i}/∂*s*_{j}), rather than input–output relations, *y* = \(f( {{\rm{\stackrel{{}_\rightharpoonup}{s}}}} )\), as is more typical in artificial neural network approaches. The findings suggest the potential of such formalisms to help us understand how patterns of individual associations yield the gestalt of an image percept, which is composed of many outputs working together rather than in isolation. Such a formalism is aligned with many insights neuroscientists have acquired about connectivity. Further characterization of the types of candidate hypotheses is in progress.

*Psychological Review,* 95, 124.

*Advances in Neural Information Processing Systems,* 2017-December, 3531–3540.

*IEEE Transactions on Image Processing,* 8, 717–730.

*Perspectives on Psychological Science,* 6, 3–5.

*International Scholarly Research Notices,* 2013, 1–53.

*IEEE Transactions on Image Processing,* 16, 2284–2298.

*Network: Computation in Neural Systems,* 12, 199.

*Vision Research,* 35, 7–24.

*Neural Computation,* 28, 2628–2655.

*Neural Computation,* 30, 1612–1623.

*Proceedings of the National Academy of Sciences, USA,* 93, 582–588.

*Annual Review of Neuroscience,* 23, 743–775.

*Vision Research,* 40, 1801–1811.

*Proceedings of the National Academy of Sciences, USA,* 89, 9666–9670.

*Proceedings Volume 1666, Human Vision, Visual Processing, and Digital Display III* (pp. 2–16). Bellingham, WA: SPIE.

*IEEE Transactions on Image Processing,* 9, 636–650.

*Nature Neuroscience,* 4, 1244–1252.

*Journal of Neurophysiology,* 41, 1418–1434.

*Perception & Psychophysics,* 60, 65–81.

*Psychonomic Bulletin & Review,* 6, 239–268.

*Behavioral and Brain Sciences,* 21, 449–498.

*Frontiers in Computational Neuroscience,* 6, 45.

*Proceedings of the Second International Workshop on Video Processing and Quality Metrics for Consumer Electronics, VPQM 2006* (pp. 1–4). New York: Springer.

*Journal of Mathematical Psychology,* 56, 404–416.

*Pattern Recognition,* 36, 1799–1811.

*Electronics Letters,* 48, 631–633.

*Elemente der psychophysik*. Leipzig: Breitkopf und Härtel.

*Journal of Mathematical Psychology,* 53, 86–91.

*Biological Cybernetics,* 51, 305–312.

*Computer Vision–ECCV 2006* (pp. 56–69). Berlin: Springer.

*Cognition,* 52, 125–157.

*IEEE Transactions on Multimedia,* 18, 1098–1110.

*PLoS Biology,* 6, e187.

*2020 IEEE International Conference on Image Processing (ICIP)* (pp. 121–125). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*The Journal of Physiology,* 160, 106–154.

*ITU-T recommendation P.910: Subjective video quality assessment methods for multimedia applications*. Geneva, Switzerland: International Telecommunication Union Standardization Sector.

*ITU-T recommendation P.800.1: Mean opinion score terminology*. Geneva, Switzerland: International Telecommunication Union Standardization Sector.

*Nature Reviews Neuroscience,* 2, 194.

*Trends in Cognitive Sciences,* 17, 401–412.

*Journal of Visual Communication and Image Representation,* 40, 76–84.

*Journal of Visual Communication and Image Representation,* 11, 17–40.

*Theory of elasticity*. Oxford, UK: Butterworth.

*Journal of the Optical Society of America. A, Optics, Image Science, and Vision,* 27, 852–864.

*Journal of Electronic Imaging,* 19, 11006.

*Neural Computation,* 1, 541–551.

*Signal Processing: Image Communication,* 25, 517–526.

*Journal of Visual Communication and Image Representation,* 22, 297–312.

*2008 IEEE International Conference on Systems, Man, and Cybernetics* (pp. 2246–2251). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*IEEE Transactions on Communications,* 30, 1679–1692.

*IEEE Transactions on Image Processing,* 15, 68–80.

*Image and Vision Computing,* 18, 233–246.

*IEEE Transactions on Information Theory,* 20, 525–536.

*The Neuroscientist,* 9, 317–331.

*PLoS One,* 13, e0201326.

*Psychological Review,* 100, 254.

*IEEE Journal of Selected Topics in Signal Processing,* 3, 193–201.

*IEEE Transactions on Image Processing,* 20, 3350–3364.

*IEEE Transactions on Neural Networks,* 21, 515–519.

*Spatial Vision,* 5, 81–100.

*International Journal of Computer Vision,* 42, 145–175.

*Neural Computation,* 17, 969–990.

*The Visual Neurosciences,* 2, 1603–1615.

*JPEG: Still image data compression standard*. Berlin: Springer Science & Business Media.

*Journal of Physiology (Paris),* 97, 265–309.

*Signal Processing: Image Communication,* 30, 57–77.

*Proceedings of the Third International Workshop on Video Processing and Quality Metrics for Consumer Electronics, VPQM 07* (pp. 1–4). New York: Springer.

*Displays,* 20, 93–110.

*AMS Lectures on Mathematics in the Life Sciences,* 7, 217–232.

*Vision Research,* 5, 583–601.

*Journal of Vision,* 21(1):4, 1–19, https://doi.org/10.1167/jov.21.1.4.

*Frontiers in Computational Neuroscience,* 6, 35.

*IEEE Transactions on Computational Imaging,* 3, 110–125.

*IEEE Transactions on Image Processing,* 18, 2385–2401.

*Biological Cybernetics,* 98, 33–48.

*Nature Methods,* 9, 671.

*Science,* 290, 2268–2269.

*Journal of Mathematical Psychology,* 70, 21–34.

*IEEE Transactions on Image Processing,* 15, 430–444.

*IEEE Transactions on Image Processing,* 14, 2117–2128.

*IEEE Transactions on Image Processing,* 15, 422–429.

*Annual Review of Neuroscience,* 24, 1193–1216.

*Journal of Neuroscience,* 21, 4416–4426.

*Psychology & Neuroscience,* 4, 29–48.

*Proceedings of 1st International Conference on Image Processing* (pp. 982–986). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*IEEE Conference on Computer Vision and Pattern Recognition* (pp. 5306–5314). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*Network: Computation in Neural Systems,* 14, 391–412.

*2008 15th IEEE Conference on Image Processing* (pp. 365–368). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*Vision Research,* 38, 2289–2305.

*Kybernetik,* 14, 85–100.

*IEEE Transactions on Consumer Electronics,* 38, xviii–xxxiv.

*Progress in Neurobiology,* 51, 167–194.

*Foundations of vision*. Sunderland, MA: Sinauer Associates.

*IEEE Signal Processing Letters,* 9, 81–84.

*IEEE Signal Processing Magazine,* 98–117.

*IEEE Transactions on Image Processing,* 13, 600–612.

*IEEE Transactions on Image Processing,* 20, 1185–1198.

*2006 International Conference on Image Processing* (pp. 2945–2948). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*The Thirty-Seventh Asilomar Conference on Signals, Systems, & Computers* (pp. 1398–1402). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

*IEEE Journal of Selected Topics in Signal Processing,* 6, 616–625.

*IEEE Transactions on Image Processing,* 22, 43–54.

*Human Vision and Electronic Imaging VII* (pp. 159–172). Bellingham, WA: SPIE.

*IEEE Transactions on Image Processing,* 23, 684–695.

*Spatial Vision,* 2, 273–293.

*Spatial Vision,* 14, 321–389.

*Vision Research,* 55, 41–46.

*IEEE Transactions on Image Processing,* 23, 4270–4281.

*IEEE Transactions on Image Processing,* 20, 2378–2386.