**Abstract**
**We propose an image quality model based on phase and amplitude differences between a reference and a distorted image. The proposed model is motivated by the fact that polar representations can separate visual information in a more independent and efficient manner than Cartesian representations in the primary visual cortex (V1). We then estimate the model parameters from a large subjective data set using maximum likelihood methods. By comparing various hypotheses about the functional form of the phase and amplitude terms, we find that: (a) discrimination of visual orientation is important for quality assessment, yet a coarse level of such discrimination seems sufficient; and (b) a product-based amplitude–phase combination before pooling is effective, suggesting an interesting viewpoint on the functional structure of the simple cells and complex cells in V1.**

$J$-D × $T$ means that there are $T$ subspaces and the size of each subspace is $J$). Despite the lack of physiological evidence, ISA supports flexible configurations and can simulate a theoretically optimal model of V1 at various ratios of complex cells to simple cells and with various numbers of complex cells. Of course, such simulations take into account only the participating neurons, without considering the redundancy of the neural system.

In a $J$-dimensional ($J$-D) space, the polar representation with a scalar amplitude and a $(J-1)$-D phase provides an alternative to the Cartesian representation. The polar representation is more likely than the Cartesian representation to separate visual information independently. V1 might provide suitable substrates for amplitude–phase encoding (Zetzsche & Krieger, 1999). Indeed, non-Cartesian cells have been found in area V4 of the macaque monkey (Gallant, Braun, & Van Essen, 1993); they are sensitive to shape and size, but not to location (Gallant, 2000). It is therefore natural for a neural realization to perceive image quality from the amplitude difference and the phase difference between the distorted image and the original image. The amplitude difference is defined as $\rho = \bigl|\,|\mathbf{y}_r| - |\mathbf{y}_d|\,\bigr|$, where $\mathbf{y}_r$ and $\mathbf{y}_d$ are the pair of ISA response vectors corresponding to the reference and distorted images, and the operator $|\cdot|$ calculates the magnitude of a vector.
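To make the polar decomposition concrete, the following minimal Python sketch computes an amplitude difference and a phase difference for the response vectors of one subspace. It assumes that the phase difference is the angle between the two vectors, which is consistent with the cosine-style structure comparison used in the SSIM discussion later in the paper. The function name `amplitude_phase_difference`, the argument names, and the convention of returning a zero phase for a zero vector are illustrative assumptions, not taken from the paper.

```python
import math

def amplitude_phase_difference(y_ref, y_dist):
    """Return (rho, theta) for two response vectors of one subspace.

    rho   : absolute difference of the two vector magnitudes
    theta : angle in radians between the two vectors, in [0, pi]
    """
    norm_ref = math.sqrt(sum(v * v for v in y_ref))
    norm_dist = math.sqrt(sum(v * v for v in y_dist))
    rho = abs(norm_ref - norm_dist)
    if norm_ref == 0.0 or norm_dist == 0.0:
        # Assumption: the phase of a zero vector is undefined, so report 0.
        return rho, 0.0
    dot = sum(a * b for a, b in zip(y_ref, y_dist))
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_ref * norm_dist)))
    return rho, math.acos(cos_theta)

# Same direction, doubled magnitude: only the amplitude changes.
rho_scaled, theta_scaled = amplitude_phase_difference([1.0, 0.0], [2.0, 0.0])

# Same magnitude, orthogonal direction: only the phase changes.
rho_rotated, theta_rotated = amplitude_phase_difference([1.0, 0.0], [0.0, 1.0])
```

The two calls illustrate the separation that motivates the polar form: a pure scaling leaves the phase untouched, and a pure rotation leaves the amplitude untouched.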

Let $f(\rho, \theta; \alpha, \beta, \gamma)$ represent the combination of the amplitude and phase differences ($f$, with parameters $\{\alpha, \beta, \gamma\}$, will be instantiated later). The functional responses over all ISA subspaces and all image patches are pooled by summation. Because simple cells are selective for spatial frequency, we down-sample an image progressively at a ratio of 1 : 2 to form a pyramid, calculate $f(\rho, \theta)$ at each scale using the same ISA bases and the same $\{\alpha, \beta, \gamma\}$ for all subspaces and patches of that scale, and compute a weighted sum over all scales. The parameters $\{\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}\}$ are adaptive across scales so as to fit the data, and are estimated by regression (see Appendix C).
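The pooling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the local combination is taken to be the product $\rho^{\alpha}\theta^{\beta}$ (one of the forms the paper considers), the per-scale sum is normalized by the number of patches as described in Appendix C, and the function name `pooled_distortion` and the nested-list data layout are hypothetical.

```python
def pooled_distortion(scales, alpha, beta, gamma):
    """Pool local distortions over subspaces, patches, and scales.

    scales[l][k][t] is the (rho, theta) pair at scale l, patch k, subspace t.
    alpha[l], beta[l], gamma[l] are the per-scale exponents and weight.
    """
    total = 0.0
    for l, patches in enumerate(scales):
        scale_sum = 0.0
        for subspaces in patches:          # spatial pooling over patches
            for rho, theta in subspaces:   # subspace pooling
                scale_sum += (rho ** alpha[l]) * (theta ** beta[l])
        # Normalize by the number of patches, then weight the scale.
        total += gamma[l] * scale_sum / len(patches)
    return total

# Toy input: two patches at the fine scale, one patch at the coarse scale,
# and a single subspace per patch.
toy_scales = [
    [[(1.0, 1.0)], [(2.0, 1.0)]],
    [[(4.0, 1.0)]],
]
score = pooled_distortion(toy_scales, alpha=[2.0, 1.0], beta=[1.0, 1.0], gamma=[1.0, 0.5])
```

Because every subspace and patch at a given scale shares the same exponents, the only per-scale quantities are `alpha[l]`, `beta[l]`, and `gamma[l]`.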

A log-logistic function maps the distortion $f$ to a quality score $q$, where parameters $a$ and $b$ control the curve shape and thereby determine how strongly the floor and ceiling effects act on the distortion $f$. The floor and ceiling effects depend on the context of the test materials, so $a$ and $b$ should be associated with each database. In this study, our regression method uses a single set of $\{\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}\}$ for all databases and an adaptive set of $\{a, b\}$ for each database (see details in Appendix C). Using adaptive $\{a, b\}$ values can compensate for misaligned floor and ceiling effects across multiple datasets. The functional form of V1 is related only to $\{\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}\}$, not to $\{a, b\}$. In fact, $\{a, b\}$ does not change the quality ranking within a database.

| Database | LIVE | IVC | Toyama | TID | A57 | WIQ | CSIQ | LAR | BA | FourierSB | Meerwald |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of rated images | 779 | 182 | 168 | 1,700 | 54 | 80 | 866 | 120 | 120 | 210 | 120 |
| Number of distortion types | 5 | 5 | 2 | 17 | 5 | 1 | 6 | 3 | 2 | 6 | 2 |
| Image type | Color | Color | Color | Color | Gray | Gray | Color | Color | Gray | Gray | Gray |
| Resolution | ~768 × 512 | 512 × 512 | 768 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 |
| Number of subjects | 20–29 | 15 | 16 | 33 | 7 | 30 | 5–7 | 19 | 17 | 7 | 14 |
| Screen | 21″ CRT | CRT | 17″ CRT | 19″ LCD | Papers | 17″ LCD/CRT | LCD | CRT | 24″ LCD | LCD | 24″ LCD |
| Distance | 2–2.5 Hs | 6 Hs | 4 Hp | Varying | 4 Hp | 4–6 Hs | 80 cm | 4 Hp | 6 Hs | 6 Hs | 6 Hs |
| Rating method | ACR | DSIS | ACR | PC | MSCQS | DSCQS | MSCQS | DSIS | DSIS | DSIS | DSIS |
| Subjective data | DMOS | MOS | Raw | MOS | DMOS | Raw | DMOS | Raw | Raw | Raw | Raw |

$\rho_s$. The likelihood (Equation 7) measures the “agreement” of the additive log-logistic model with the subjective quality scores. The coefficient $\rho_s$ evaluates the ordinal match between the predicted and the subjective quality scores, and thus remains invariant under any monotonic mapping of the data, including the two-parameter (i.e., $a$ and $b$) log-logistic mapping in our model. For both criteria, the higher the value, the better the accuracy; $\rho_s$ has a range of [−1, 1], while the likelihood has no fixed range and its value also depends on the number and the distribution of the data. Hence, we use $\rho_s$ to quantify the accuracy of metrics, and use the likelihood to assist in the comparison.
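The invariance of a rank correlation under monotonic mappings can be checked numerically. The sketch below implements a rank correlation for tie-free data and applies two different curves of a hypothetical log-logistic form, $1 / (1 + (f/a)^b)$; this particular parameterization, the function names, and the toy numbers are assumptions made for illustration and are not the paper's fitted model or data.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def log_logistic(f, a, b):
    """Hypothetical decreasing two-parameter curve mapping distortion to quality."""
    return 1.0 / (1.0 + (f / a) ** b)

# Toy pooled distortions and toy subjective scores (not real data).
distortion = [0.2, 1.5, 0.7, 3.0, 2.2]
subjective = [0.9, 0.4, 0.8, 0.1, 0.5]

quality_1 = [log_logistic(f, a=1.0, b=2.0) for f in distortion]
quality_2 = [log_logistic(f, a=3.0, b=0.5) for f in distortion]

rho_1 = spearman(quality_1, subjective)
rho_2 = spearman(quality_2, subjective)
```

Both parameter settings give the same coefficient because they induce the same ordering of the predictions, which is why the per-database $\{a, b\}$ cannot affect this criterion.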

$f$). The ISA bases are trained offline using the FastICA toolbox from Hyvärinen's group, independently of the quality prediction and the model regression; indeed, the thirteen natural monochromatic images used for training the ISA bases are entirely different from the images in the 11 databases.

$\theta = 0$ constantly. Here, we keep the ISA bases as complete as possible. Then, we instantiate the combination $f(\rho, \theta)$ with a product:

$$f(\rho, \theta; \alpha, \beta, \gamma) = \gamma\,\rho^{\alpha}\,\theta^{\beta}$$

TID$_{1\sim15}$ with the first 15 distortion types and TID$_{16\sim17}$ with the last two types, namely “intensity shift” and “contrast change,” which globally adjust the mean and the variance of images, respectively, as shown in Figure 3. We separate TID$_{16\sim17}$ from the full set because the proposed metric performs poorly on it, as discussed later.

| Database | LIVE | IVC | Toyama | TID$_{1\sim15}$ | TID$_{16\sim17}$ | A57 | WIQ | CSIQ | LAR | BA | FourierSB | Meerwald |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proposed | 0.948 | 0.910 | 0.925 | 0.907 | 0.281 | 0.913 | 0.799 | 0.961 | 0.930 | 0.931 | 0.906 | 0.928 |
| MSE | 0.856 | 0.679 | 0.613 | 0.532 | 0.476 | 0.570 | 0.817 | 0.806 | 0.819 | 0.934 | 0.696 | 0.891 |
| CW-SSIM | 0.852 | 0.621 | 0.784 | 0.642 | 0.482 | 0.656 | 0.621 | 0.577 | 0.920 | 0.631 | 0.055 | 0.795 |
| SIQM | 0.956 | 0.894 | 0.915 | 0.831 | 0.807 | 0.894 | 0.842 | 0.924 | 0.892 | 0.952 | 0.846 | 0.940 |
| FSIM | 0.963 | 0.926 | 0.906 | 0.882 | 0.881 | 0.918 | 0.806 | 0.924 | 0.958 | 0.934 | 0.914 | 0.930 |

$\alpha$ and $\beta$ approximate the nonlinear responses to the amplitude and phase, respectively. When $\alpha$ or $\beta$ takes a value of 1, the response degenerates into a linear one, and when $\alpha$ or $\beta$ takes a small value near 0, it implies no response and the amplitude or phase term can be omitted. Note that we have assumed that subspace pooling, spatial pooling, and spatial-frequency pooling are all summations. Accordingly, the product-based function suggests that the amplitude–phase combination is inseparable and thus occurs prior to pooling, while the sum-based function implies that the phase and the amplitude may be decoupled, so that the order of combination and pooling does not matter. Such a difference not only reflects mathematical logic, but also has potential implications for neural structure.
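The distinction between the two hypotheses can be seen in a toy computation. In the sketch below, two sets of local (amplitude, phase) errors contain exactly the same amplitude values and the same phase values, but paired differently across locations. The function names and the toy numbers are illustrative assumptions; the exponents default to 1 only to keep the arithmetic readable.

```python
def pool_product(pairs, alpha=1.0, beta=1.0):
    """Combine amplitude and phase errors by a product, then sum over locations."""
    return sum((rho ** alpha) * (theta ** beta) for rho, theta in pairs)

def pool_sum(pairs, alpha=1.0, beta=1.0):
    """Combine amplitude and phase errors by a sum, then sum over locations."""
    return sum(rho ** alpha + theta ** beta for rho, theta in pairs)

# Amplitude and phase errors occur at the same location...
co_located = [(1.0, 1.0), (0.0, 0.0)]
# ...or the same error values occur at different locations.
separated = [(1.0, 0.0), (0.0, 1.0)]

sum_co, sum_sep = pool_sum(co_located), pool_sum(separated)
prod_co, prod_sep = pool_product(co_located), pool_product(separated)
```

The sum-based pooling cannot tell the two cases apart, since it equals the pooled amplitude term plus the pooled phase term. The product-based pooling responds only when both errors coincide at a location, so the combination has to happen before pooling.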

$\gamma \to \infty$ and $\gamma \to 0$, respectively. We furthermore consider the third-order polynomial approximation:

$\theta$ records how far the phase deviates within a limited range of orientations and locations, but is blind to the orientation or location from which the phase departs. Therefore, with fewer subspaces, more phase information is disregarded by the quality model.

$\{\alpha_l, \beta_l, \gamma_l \mid l = 1, 2, 3\}$ for all databases and adaptive parameters $\{a_m, b_m \mid m = 1, 2, \ldots, 11\}$ for the $m$-th database. Due to the redundancy between $\{\boldsymbol{\gamma}\}$ and $a$, we set $\gamma_1 = 1$ constantly. That is, although we introduce 31 parameters, the proposed V1 model has only 8 degrees of freedom.

$\{\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}\}$ are consistent under various configurations of ISA subspaces, except $\beta_1$ for the phase at the finest scale. The phase of the finest HF components is insignificant for IQA, as $\beta_1$ can be omitted (i.e., set to 0), especially when complete ISA bases are configured. This accords with the view that amplitude is more important than phase at fine image scales (Field & Chandler, 2012).

| Parameter at scale $l$ (fine → coarse) | Complete bases (Experiment 1), scale 1 | Complete bases (Experiment 1), scale 2 | Complete bases (Experiment 1), scale 3 | Incomplete bases (Experiment 2), scale 1 | Incomplete bases (Experiment 2), scale 2 | Incomplete bases (Experiment 2), scale 3 |
| --- | --- | --- | --- | --- | --- | --- |
| $\alpha_l$: exponent of amplitude error | 1.96 ± 0.40 | 0.62 ± 0.16 | 1.08 ± 0.51 | 2.02 ± 0.32 | 0.53 ± 0.12 | 1.14 ± 0.07 |
| $\beta_l$: exponent of phase error | $(2.2 \pm 7.6) \times 10^{-4}$ | 0.45 ± 0.20 | 0.84 ± 0.22 | 0.60 ± 0.72 | 0.45 ± 0.14 | 0.57 ± 0.09 |
| $\log\gamma_l$: weight of image scale | 0 | 7.14 ± 0.46 | 6.73 ± 0.81 | 0 | 7.11 ± 0.24 | 5.97 ± 0.85 |

Given a point $o$ and a distance $d$, the iso-distance curve consists of all the points located at distance $d$ from $o$. Let us consider a 2D polar coordinate system for simplicity. Given a reference point with a radius of 1 and a phase angle of 0, denoted by (1, 0), its iso-distance map under the metric of Equation 4 is shown in Figure 5a. A point has a distance of zero from the reference as long as either its phase or its amplitude remains the same as that of the reference. This differs from the iso-distance map under MSE shown in Figure 5b, where a point lies at a nonzero distance from the reference unless both its amplitude and its phase equal those of the reference. The difference arises because the metric of Equation 4 combines the amplitude and phase errors with a product, while MSE approximates a sum of two terms related to the amplitude and phase errors, respectively (Hsiao & Millane, 2004).
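The contrast between the two iso-distance maps can be reproduced with a small Python sketch for the reference point (1, 0). The product form with unit exponents and the function names are illustrative assumptions. The squared Euclidean error is written out in polar coordinates, where it equals $(r-1)^2 + 2r(1-\cos\varphi)$, that is, an amplitude term plus a phase term.

```python
import math

def product_distance(r, phi, alpha=1.0, beta=1.0):
    """Product of amplitude error and phase error relative to the point (1, 0)."""
    return abs(r - 1.0) ** alpha * abs(phi) ** beta

def squared_error(r, phi):
    """Squared Euclidean distance between polar point (r, phi) and (1, 0)."""
    dx = r * math.cos(phi) - 1.0
    dy = r * math.sin(phi)
    return dx * dx + dy * dy

def squared_error_as_sum(r, phi):
    """Same quantity, split into an amplitude term and a phase term."""
    return (r - 1.0) ** 2 + 2.0 * r * (1.0 - math.cos(phi))

# Amplitude changed, phase kept: zero under the product, nonzero squared error.
amp_only = (product_distance(2.0, 0.0), squared_error(2.0, 0.0))

# Phase changed, amplitude kept: again zero under the product only.
phase_only = (product_distance(1.0, math.pi / 2), squared_error(1.0, math.pi / 2))
```

Under the product form, the whole amplitude axis and the whole unit circle are at distance zero from the reference, whereas the squared error vanishes only at the reference point itself.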

$\rho_s$ in Figure 4. This is mainly because of their low accuracy on TID$_{16\sim17}$ and the WIQ database. When these are excluded, the worst accuracies (marked by red crosses) are not significantly lower than the best ones. The proposed metric measures “intensity shift” inaccurately because the image mean is overlooked: the ISA bases are obtained from whitened, zero-mean data. “Contrast change” is not measured appropriately here, because any contrast change (i.e., amplitude difference) is simply regarded as distortion, regardless of whether the contrast is enhanced or degraded. Most existing metrics fail to handle WIQ (as shown in Table 2), because the image distortion in WIQ, termed wireless channel distortion, is often uneven and localized. Hence, the simulated RF of the ISA bases, the distortion factor of absolute amplitude difference, and the pooling strategy of summation are probably oversimplified, since subjective assessment of such distortions may involve a more complex process in high-level vision.

*equal* weights (i.e., the same $\alpha$, $\beta$, and $\gamma$), since there is no evidence that any set of V1 neurons has priority or accounts for the majority.

- Both phase and amplitude are indispensable for IQA, and thus, both simple cells and complex cells contribute to IQA. Besides the amplitude detection, the phase difference provides another potential way of information reduction.
- Not all the phase information is helpful for IQA; only the phase that discriminates coarse orientations is essential.
- The product of phase and amplitude, rather than a summation or another nonlinear operator, captures their combination and thus the coaction between simple and complex cells; as a result, the human visual system tolerates amplitude-invariant and phase-invariant distortions.
- The amplitude–phase combination occurs prior to the pooling, which implies that the linkages among simple and complex cells precede the aggregation of the neurons that represent various locations of the visual field.

*Journal of the Optical Society of America A*, 2, 284–299.

*Statistics for psychology*. Upper Saddle River, NJ: Prentice Hall.

*Neurocomputing*, 69, 1301–1304.

*Journal of Vision*, 12(3):7, 1–11, http://www.journalofvision.org/content/12/3/7, doi:10.1167/12.3.7.

*Journal of Vision*, 9(1):35, 1–15, http://www.journalofvision.org/content/9/1/35, doi:10.1167/9.1.35.

*Journal of the Optical Society of America A*, 29, 55–67.

*Nature Neuroscience*, 14, 1195–1201.

*Seeing* (2nd ed., pp. 311–324). San Diego, CA: Academic Press.

*Science*, 259, 100–103.

*Nature*, 400, 65–69.

*Generalized additive models*. New York: Chapman and Hall.

*Proceedings of SPIE*, 5562, 27–37.

*Journal of Physiology*, 160, 106–154.

*Neural Computation*, 12, 1705–1720.

*Vision Research*, 41, 2413–2423.

*Journal of Neurophysiology*, 99, 2745–2754.

*Journal of Vision*, 9(1):2, 1–16, http://www.journalofivison.org/content/9/1/2, doi:10.1167/9.1.2.

*Signal Process*, 24, 1–10.

*Nature Neuroscience*, 8, 679–685.

*Nature*, 457, 83–86.

*Journal of Neurophysiology*, 103, 3465–3471.

*IEEE Transactions on Image Processing*, 21, 3364–3377.

*Nature*, 382, 63–66.

*Nature*, 381, 607–609.

*Proceedings of the IEEE*, 69, 529–541.

*Perception*, 11, 337–346.

*Advances of Modern Radioelectronics*, 10, 30–45.

*IEEE Transactions on Image Processing*, 18, 2385–2401.

*Neuron*, 51, 661–670.

*Appraisal processes in emotion: Theory, methods, research*. Cary, NC: Oxford University Press.

*Journal of Comparative Neurology*, 177, 213–235.

*Current Opinion Neurobiology*, 4, 157–165.

*IEEE Transactions on Image Processing*, 13, 600–612.

*Journal of Vision*, 8(12):8, 1–13, http://www.journalofvision.org/content/8/12/8, doi:10.1167/8.12.8.

*Vision Research*, 46, 1520–1529.

*IEEE Journal of Selected Topics Signal Processing*, 6, 616–625.

*Journal of the Optical Society of America A*, 16, 1554–1565.

*IEEE Transactions on Image Processing*, 22, 1536–1547.

*IEEE Transactions on Image Processing*, 20, 3207–3218.

*IEEE Transactions on Image Processing*, 20, 2378–2386.

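Given an input patch $\mathbf{x} \in \mathbb{R}^{I}$, the activation of each first-layer unit is the corresponding element of $\mathbf{y} = \mathbf{W}\mathbf{x}$, and the activation of each second-layer unit is the magnitude $|\mathbf{y}_t|$ of the $J$ first-layer activations $\mathbf{y}_t$ that belong to the $t$-th subspace. Here $\mathbf{W} \in \mathbb{R}^{(J \cdot T) \times I}$ is the weight matrix of the first layer and also the ISA transform matrix; $I$, $J$, and $T$ are the input dimension (number of pixels in a patch), subspace size (number of the first-layer units to be pooled for a second-layer unit), and number of the subspaces (number of the second-layer units), respectively. The row vectors of $\mathbf{W}$, as ISA bases, span a linearly transformed space and are grouped into $T$ $J$-D subspaces. ISA is trained by seeking sparse representations in the second layer under an orthonormality constraint on $\mathbf{W}$, where the dimension of the training set is reduced to $J \cdot T$ by PCA. The orthonormal constraint guarantees that the transform is invertible.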
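The two-layer structure can be illustrated with a short Python sketch. It assumes the standard ISA arrangement in which the first layer is linear and each second-layer unit takes the square root of the summed squares of the $J$ first-layer units in its subspace; it also assumes that the rows of the weight matrix are grouped consecutively. The function name `isa_responses` and the toy identity matrix are hypothetical and stand in for trained bases.

```python
import math

def isa_responses(W, x, J):
    """Return (subspace vectors, second-layer activations) for one patch.

    W : list of rows, one per first-layer unit (J * T rows in total)
    x : input patch as a flat list of pixel values
    J : subspace size; rows are assumed to be grouped consecutively
    """
    # First layer: linear projection of the patch onto every basis row.
    first = [sum(w * v for w, v in zip(row, x)) for row in W]
    # Group the first-layer units into T subspaces of size J.
    subspaces = [first[i:i + J] for i in range(0, len(first), J)]
    # Second layer: magnitude of each subspace vector.
    second = [math.sqrt(sum(s * s for s in sub)) for sub in subspaces]
    return subspaces, second

# Toy example with I = 4 pixels, J = 2, T = 2 and an identity transform.
identity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
subspace_vectors, amplitudes = isa_responses(identity, [3.0, 4.0, 0.0, 5.0], J=2)
```

The vectors in `subspace_vectors` are what the polar representation decomposes: their magnitudes give the amplitudes, and their directions carry the phase.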

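$$\mathrm{SSIM}(\mathbf{x}_r, \mathbf{x}_d) = [s(\mathbf{x}_r, \mathbf{x}_d)]^{\beta}\,[c(\mathbf{x}_r, \mathbf{x}_d)]^{\alpha}\,[l(\mathbf{x}_r, \mathbf{x}_d)]^{\gamma},$$

where, omitting the small stabilizing constants, the structure comparison function is

$$s(\mathbf{x}_r, \mathbf{x}_d) = \frac{(\mathbf{x}_r - \mu_r) \cdot (\mathbf{x}_d - \mu_d)}{|\mathbf{x}_r - \mu_r|\,|\mathbf{x}_d - \mu_d|},$$

the contrast comparison function is

$$c(\mathbf{x}_r, \mathbf{x}_d) = \frac{2\,|\mathbf{x}_r - \mu_r|\,|\mathbf{x}_d - \mu_d|}{|\mathbf{x}_r - \mu_r|^{2} + |\mathbf{x}_d - \mu_d|^{2}},$$

the luminance comparison function is

$$l(\mathbf{x}_r, \mathbf{x}_d) = \frac{2\,\mu_r\,\mu_d}{\mu_r^{2} + \mu_d^{2}},$$

and column vectors $\mathbf{x}_r$ and $\mathbf{x}_d$ consist of the pixels in the reference and distorted 8 × 8 patches from the same location, with means $\mu_r$ and $\mu_d$, respectively.

Let $\mathbf{y}_r$ and $\mathbf{y}_d$ be the ISA response vectors of $\mathbf{x}_r$ and $\mathbf{x}_d$, where the ISA transform is trained on zero-mean data and remains orthonormal. Hence, we have $\mathbf{y}_r \cdot \mathbf{y}_d = (\mathbf{x}_r - \mu_r) \cdot (\mathbf{x}_d - \mu_d)$ and $|\mathbf{y}| = |\mathbf{x} - \mu|$. The equivalence between the phase difference and the structure comparison of SSIM is then evident:

$$\cos\theta = \frac{\mathbf{y}_r \cdot \mathbf{y}_d}{|\mathbf{y}_r|\,|\mathbf{y}_d|} = s(\mathbf{x}_r, \mathbf{x}_d),$$

as is the relation between the amplitude difference and the contrast comparison:

$$\rho^{\alpha} = \bigl|\,|\mathbf{y}_r| - |\mathbf{y}_d|\,\bigr|^{\alpha} = C\,[1 - c(\mathbf{x}_r, \mathbf{x}_d)]^{\alpha/2},$$

where $C = \bigl[\,|\mathbf{x}_r - \mu_r|^{2} + |\mathbf{x}_d - \mu_d|^{2}\,\bigr]^{\alpha/2}$.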
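The relation between the amplitude difference and the contrast comparison can be verified numerically. The sketch below works directly on mean-removed patches, on the assumption that an orthonormal transform preserves vector magnitudes, and it uses the constant-free form of the SSIM contrast term, $2ab/(a^2+b^2)$ with $a$ and $b$ the magnitudes of the two mean-removed patches. The helper names and the toy four-pixel patches are illustrative only.

```python
import math

def centered(x):
    """Subtract the patch mean from every pixel."""
    mu = sum(x) / len(x)
    return [v - mu for v in x]

def magnitude(v):
    return math.sqrt(sum(e * e for e in v))

def structure(x_r, x_d):
    """Cosine of the angle between the two mean-removed patches."""
    u, v = centered(x_r), centered(x_d)
    return sum(p * q for p, q in zip(u, v)) / (magnitude(u) * magnitude(v))

def contrast(x_r, x_d):
    """Constant-free contrast comparison 2ab / (a^2 + b^2)."""
    a, b = magnitude(centered(x_r)), magnitude(centered(x_d))
    return 2.0 * a * b / (a * a + b * b)

x_r = [1.0, 2.0, 3.0, 6.0]
x_d = [2.0, 2.0, 5.0, 7.0]
a, b = magnitude(centered(x_r)), magnitude(centered(x_d))

alpha = 1.5
rho = abs(a - b)                       # amplitude difference
lhs = rho ** alpha
C = (a * a + b * b) ** (alpha / 2.0)   # scaling factor
rhs = C * (1.0 - contrast(x_r, x_d)) ** (alpha / 2.0)
s = structure(x_r, x_d)
```

The identity follows from $(a-b)^2 = (a^2+b^2)\,(1 - 2ab/(a^2+b^2))$, so the amplitude difference carries the same information as the contrast comparison up to the factor `C`.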

The quality $q$ with respect to the distortions $d$ is defined by a log-logistic function, where parameters $a$ and $b$ control the shape of the log-logistic curve. We call it the additive log-logistic model because of its link form: the distortions sum up and yield the monotonically transformed quality. Here, we use three levels of summation:

$$\sum_{l} \gamma_l \, \frac{1}{K_l} \sum_{k=1}^{K_l} \sum_{t=1}^{T} d_{l,k,t}$$

The outermost sum is over the scales $l$ (linearly weighted by the parameter $\gamma_l$ at each scale), the middle sum is over the $K_l$ patch locations across the image and normalized by $K_l$, and the innermost sum is over the $T$ subspaces; they approximate the spatial-frequency pooling, the spatial pooling, and the subspace pooling, respectively. The local distortion $d$ combines the phase difference and the amplitude difference, for instance, but not limited to, the product $d = \rho^{\alpha_l}\,\theta^{\beta_l}$.

The agreement between the predicted $\{q\}$ and the subjectively rated $\{\hat{q}\}$ is evaluated by the likelihood of $\{q\}$ given $\{\hat{q}\}$. We assume a binomial distribution as the a priori distribution of $\{\hat{q}\}$, because of the non-Gaussianity of opinion scores as well as for computational simplicity. Given $M$ independent databases, where the $m$-th database contains $N_m$ samples, the total log-likelihood is the sum of the log-likelihoods of all samples in all databases, where the parameter pair ($a_m$, $b_m$) is adaptive to the $m$-th database but does not affect the ordinal prediction. The maximum-likelihood parameter estimates are obtained by the gradient-descent method.

$\gamma_l$ and $a_m$ are solved, so as to guarantee that $\gamma_l$ and $a_m$ are always positive.
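One common way to keep a parameter positive during gradient descent is to optimize its logarithm instead of the parameter itself. The parameter table in this paper reports $\log\gamma_l$, which suggests a parameterization of this kind, but that reading is an assumption. The sketch below applies the idea to a toy quadratic objective rather than to the paper's likelihood; the function name `fit_positive`, the step size, and the iteration count are hypothetical.

```python
import math

def fit_positive(target, steps=500, lr=0.05):
    """Minimize (gamma - target)**2 over gamma > 0 by descending on u = log(gamma)."""
    u = 0.0  # gamma starts at exp(0) = 1
    for _ in range(steps):
        gamma = math.exp(u)
        # Chain rule: dL/du = dL/dgamma * dgamma/du = 2 * (gamma - target) * gamma
        grad_u = 2.0 * (gamma - target) * gamma
        u -= lr * grad_u
    return math.exp(u)

# A positive optimum is recovered as usual.
gamma_reachable = fit_positive(2.0)

# A negative unconstrained optimum cannot be reached; gamma shrinks but stays positive.
gamma_clamped = fit_positive(-1.0)
```

Since `gamma` is always computed as an exponential, no explicit constraint handling is needed in the update step.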