Article  |   January 2014
Exploring V1 by modeling the perceptual quality of images
Journal of Vision January 2014, Vol.14, 26. doi:10.1167/14.1.26
      Fan Zhang, Wenfei Jiang, Florent Autrusseau, Weisi Lin; Exploring V1 by modeling the perceptual quality of images. Journal of Vision 2014;14(1):26. doi: 10.1167/14.1.26.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

We propose an image quality model based on phase and amplitude differences between a reference and a distorted image. The model is motivated by the observation that, in the primary visual cortex (V1), polar representations can separate visual information more independently and efficiently than Cartesian representations. We estimate the model parameters from a large subjective data set using maximum likelihood methods. By comparing model hypotheses about the functional form of the phase and amplitude terms, we find that: (a) discrimination of visual orientation is important for quality assessment, yet a coarse level of such discrimination seems sufficient; and (b) a product-based amplitude–phase combination before pooling is effective, suggesting an interesting viewpoint about the functional structure of the simple cells and complex cells in V1.

Introduction
Adaptations are reflected in the structure and function of the human visual system, and many functional traits of the primary visual cortex are specialized for the environment or niche that human beings normally occupy (Essock, Haun, & Kim, 2009; Karklin & Lewicki, 2008; Olshausen & Field, 1996). When we evaluate the quality of a distorted image against a reference image, our assessment originates from the difference between the neural responses to the two images. The early stages of the human visual system are believed to abstract the critical differences and feed them to the following stages. Although inferring the underlying physiological process from overt behavioral judgments is an ill-posed problem, a computational model that mimics some properties of visual neurons and predicts image quality accurately may nevertheless offer plausible arguments on how the neurons coact.
The human ability to perceive images emerges in the ventral stream, beginning with simple and complex cells in the primary visual cortex (V1) at the first stage. Simple cells respond selectively to bars and edges at a particular location, of a particular orientation, and within particular bands of spatial frequencies. They act like edge detectors in the standard model (Hubel & Wiesel, 1962). However, since their responses saturate with intense stimuli, it has alternatively been argued that they act essentially as phase detectors within their contrast saturation ranges (Zetzsche & Krieger, 1999). Complex cells differ from simple cells in that they are relatively invariant to a phase shift of the stimulus (i.e., a small translation perpendicular to the orientation of the stimulus) and respond according to the magnitude of the responses of multiple simple cells (Adelson & Bergen, 1985; Hyvärinen & Hoyer, 2000, 2001); hence, they are often described as amplitude detectors. The second stage of the ventral system is modeled, for example, by computing products between pairs of V1 responses (both simple and complex) and averaging these products over local regions (Freeman & Simoncelli, 2011).
Images may be decomposed into phase spectra and power spectra by spatial-frequency analysis, for example, by the Fourier transform. Phase-only reconstruction (Oppenheim & Lim, 1981) and power spectra swapping (Piotrowski & Campbell, 1982) preserve the essential image identity; this is usually interpreted as the phase spectra containing more information than the global power spectra, a property termed “phase dominance.” By contrast, it is well known that phase-invariant features are important for visual recognition (Sampat, Wang, Gupta, Bovik, & Markey, 2009), since the human visual system must cope with small local deformations of stimuli. This apparent paradox raises the question we discuss in this paper: What kind of phase is essential, and how is the essential phase fused with the amplitude in the human visual system?
Previous works have investigated the human response to artifact detection (Clarke, Green, & Chandler, 2012) and image classification (Joubert, Rousselet, Fabre-Thorpe, & Fize, 2009; Wichmann, Braun, & Gegenfurtner, 2006) under either phase or amplitude degradations. Recent functional imaging may detect the neural activity directly (Issa, Rosenberg, & Husson, 2008), but its spatial resolution is not as fine as neural organization and needs to be supplemented by mathematical models, for instance, multivariate analysis (Kamitani & Tong, 2005). In our study, the question of amplitude–phase coaction is explored by a computational model for image quality assessment (IQA). We model subjective image quality as a function of amplitude–phase combinations, test this model on a large collection of subjective image quality databases, and find the essential phase and the effective amplitude–phase combination leading to the most accurate quality prediction. 
We use independent subspace analysis (ISA; Hyvärinen & Hoyer, 2000, 2001) as the functional form of simple and complex cells; thus, the amplitude and the phase are simulated by the magnitude and the phase angle of the ISA response vector, respectively. Similar to independent component analysis (ICA; Jutten & Hérault, 1991), ISA is a method for finding underlying factors or bases from multidimensional statistical data, as reviewed briefly in Appendix A. What distinguishes ISA from ICA is that it searches for statistically independent subspaces consisting of separate groups of bases. The resultant bases and subspaces learned from natural image data resemble the receptive fields (RFs) of simple and complex cells.
Another merit of ISA is that the discriminability of the locations and orientations of the ISA bases can be gradually suppressed by progressively reducing the number of ISA subspaces. We find that, for IQA, an important factor is the phase difference that discriminates coarse changes in orientation (e.g., between vertical and horizontal), whereas the phase difference that further discriminates finer orientations is less important. The phase difference thus provides compact but essential information for IQA.
We further compare the amplitude-only and the phase-only candidate models, as well as various combinations of phase and amplitude, including the sum-based, the product-based, and the polynomial-based candidates. The results show that the product-based model achieves the best accuracy, with a concise form similar to the recent ventral model (Freeman & Simoncelli, 2011). This suggests that, in the functional structure, the amplitude–phase combination occurs prior to the pooling (i.e., the accumulation of neural responses across the visual field). The optimal model parameters confirm that both amplitude and phase are indispensable for IQA. Additionally, the product-based model may help explain how humans perceive photographic negatives.
Phase and amplitude
For a 1D (temporal) signal, phase and amplitude indicate, respectively, the location and the strength of the harmonic component at a given (temporal) frequency. For 2D signals, phase may represent the location of the harmonic component given both a frequency and an orientation. When stimuli translate perpendicularly to the orientation, the phase varies but the amplitude does not. If the notion of phase is extended to cover both the location and the orientation of a harmonic component at a given frequency, the amplitude remains invariant while the phase captures a small deformation of the stimuli, regardless of whether the stimuli translate or rotate.
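The 1D case can be checked numerically. The following Python sketch is an illustration only and is not part of the original study; the test signal, its length, and the shift are arbitrary assumptions. It translates a signal circularly and compares one harmonic before and after the shift: the amplitude stays the same, while the phase changes by the amount predicted by the Fourier shift theorem.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def circular_shift(signal, shift):
    """Translate the signal by `shift` samples with wrap-around."""
    n = len(signal)
    return [signal[(t - shift) % n] for t in range(n)]

# Hypothetical test signal (an assumption): two harmonics sampled at 16 points.
n = 16
shift = 3
signal = [
    math.cos(2 * math.pi * 2 * t / n) + 0.5 * math.sin(2 * math.pi * 3 * t / n)
    for t in range(n)
]
shifted = circular_shift(signal, shift)

k = 2  # inspect the harmonic at frequency index 2
original_coeff = dft(signal)[k]
shifted_coeff = dft(shifted)[k]

# Amplitude is unaffected by the translation; phase is not.
amplitude_change = abs(abs(original_coeff) - abs(shifted_coeff))
phase_change = cmath.phase(shifted_coeff / original_coeff)
expected_phase_change = -2 * math.pi * k * shift / n  # Fourier shift theorem
```

Here `amplitude_change` is zero up to rounding error, whereas `phase_change` matches `expected_phase_change`.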
In V1, the deformation-invariant characteristic emerges in complex cells. Such functional traits can be simulated by ISA. ISA estimates the RFs of simple cells from natural images, based on the energy model of complex cells and the sparsity of complex cells' responses. ISA assumes that each complex cell receives inputs from a specific group of simple cells, which constitutes a “subspace” (refer to Appendix A). Consequently, the ISA configuration includes the subspace size (i.e., the number of simple cells linked to a complex cell) and the total number of subspaces (i.e., complex cells). For image patches of 8 × 8 pixels, ISA is typically configured as ten 4D subspaces (Hyvärinen & Hoyer, 2000, 2001), denoted 4D × 10 here (the notation JD × T means that there are T subspaces, each of size J). Despite the lack of physiological evidence, ISA supports flexible configurations and can simulate a theoretically optimal model of V1 at various ratios of complex cells to simple cells and various numbers of complex cells. Of course, such simulations account only for the participating neurons and do not consider the redundancy of the neural system.
Figure 1a shows how the ISA bases evolve as the number of subspaces decreases. The groups of ISA bases are learned from natural image patches at various numbers of subspaces; the subspaces are delimited by red frames; the number of subspaces decreases from left to right; and the subspace size multiplied by the number of subspaces is approximately equal to the pixel dimension of the image patches, so that the ISA bases are nearly complete. Having fewer subspaces implies that the modeled RF area is larger (i.e., a lower discriminability of locations) and that the RFs of each subspace subtend a wider range of orientations (i.e., a lower discriminability of orientations). For example, the RFs occupy only a small region for Case 2D × 32, while the RFs spread all over the patches for Case 32D × 2. In addition, the RFs show diversity in orientation and location across subspaces for Case 2D × 32, while the RFs are grouped into two coarse orientations (vertical and horizontal) for Case 32D × 2, in accordance with the orientation anisotropies of the human visual cortex (Essock et al., 2009). Moreover, each RF pair shares the same orientation for Case 2D × 32, while the RFs of the same subspace still have diverse orientations and locations for Case 32D × 2. In conclusion, it becomes more difficult to discriminate the orientation, the location, or both, of the ISA bases as the number of subspaces decreases.
Figure 1
 
Phase of ISA bases. (a) Complete ISA bases with decreasing numbers of subspaces, leading to decreasing discriminability on orientations and locations. (b) Incomplete ISA bases with increasing subspace sizes and four subspaces, maintaining consistent discriminability on orientations (with spectral wings) and locations (in spatial area), but gradually accumulating high spatial-frequency components (decentralizing spectrum). (c) Phase discriminating orientations (as the outer sector) and locations (as the inner sector). The bases without a red frame are obtained by linearly interpolating between the nearest two bases with a red frame. We set the size of the ISA bases to 8 × 8 pixels and follow the other default settings in the FastICA toolbox of Hyvärinen's group.
Figure 1b shows that the ISA bases maintain consistent discriminability on orientations and locations with varying subspace sizes. To obtain the ISA bases, the dimension of the image patches is reduced by principal component analysis to four times the subspace size. Hence, the attained ISA bases are incomplete and support only the principal components. To assist in the analysis, the average Fourier spatial-frequency spectrum of each subspace is also shown on the left. As the subspace size increases, three phenomena can be observed: (a) high spatial-frequency (HF) components emerge and accumulate in the RFs as the spectrum decentralizes; (b) the discriminability on orientations changes little, since the spectral wings of the RFs keep nearly the same separated orientations; and (c) the discriminability on locations remains consistent, as do the effective regions of the RFs.
Perceptual quality model
In a J-dimensional (J-D) space, the polar representation with a scalar amplitude and a (J − 1)-D phase provides an alternative to the Cartesian representation. The polar representation is more likely than the Cartesian representation to separate visual information independently. V1 might provide suitable substrates for amplitude–phase encoding (Zetzsche & Krieger, 1999). Indeed, non-Cartesian cells are found in area V4 of the macaque monkey (Gallant, Braun, & Van Essen, 1993), and they are sensitive to shape and size but not to location (Gallant, 2000). It is natural for a neural realization to perceive image quality based on the amplitude difference and the phase difference between the distorted image and the original image. The amplitude difference ρ is defined in Equation 1 from the magnitudes |r| and |d|, where r and d are the pair of ISA response vectors corresponding to the reference and the distorted image, and the operator | · | calculates the magnitude of a vector.
Moreover, the phase difference is defined as the angle between the vector pair,

$$\theta = \arccos\frac{\langle r, d \rangle}{|r|\,|d|} \qquad (2)$$

where ⟨ , ⟩ calculates the inner product. The phase difference records how different the orientations of the vector pair are, but tells nothing about the direction along which one vector deviates from the other. There is little physiological evidence for a wiring scheme adhering to Equation 2, but a suitable substrate of phase difference might have been in front of our eyes, given the existence of meridian-relative anisotropy in the human visual cortex: the cortical response is enhanced when the local pattern orientation coincides with the angular meridian, compared with when it is tangential (Mannion, McDonald, & Clifford, 2010; Sasaki et al., 2006). Phase difference is also associated with computations such as the establishment of optic flow (Geisler, 1999) and geometric perspective (Bruce & Tsotsos, 2006). Such efficient visual representation seems to be a common principle across species (O'Carroll, Bidwell, Laughlin, & Warrant, 1996).
Within each subspace, the phase difference is equal to the inverse cosine of the classic cosine similarity. One special case of the phase difference is worth noting. When ISA converges to a single 64D subspace, the phase difference is equivalent to the structure comparison in the well-known Structural SIMilarity (SSIM) metric (Wang, Bovik, Sheikh, & Simoncelli, 2004; Wang & Simoncelli, 2008). This is because the structure similarity is defined as the cosine similarity in the pixel space, and ISA, as an orthonormal transform, conserves the cosine similarity (see the proof in Appendix B).
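The per-subspace quantities can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the phase difference follows the angle-between-vectors definition given above, whereas the amplitude difference is assumed here to be the absolute difference of the two magnitudes, and the helper names (`split_subspaces`, `amplitude_difference`, `phase_difference`) are hypothetical.

```python
import math

def magnitude(v):
    """Euclidean magnitude of a response vector."""
    return math.sqrt(sum(x * x for x in v))

def phase_difference(r, d):
    """Angle in radians between r and d: arccos of their cosine similarity."""
    norm = magnitude(r) * magnitude(d)
    if norm == 0.0:
        return 0.0  # assumption: an undefined angle is treated as no difference
    cos_sim = sum(a * b for a, b in zip(r, d)) / norm
    return math.acos(max(-1.0, min(1.0, cos_sim)))  # clamp rounding error

def amplitude_difference(r, d):
    """ASSUMED form: absolute difference of the two magnitudes."""
    return abs(magnitude(r) - magnitude(d))

def split_subspaces(response, size):
    """Cut a full response vector into consecutive subspaces of equal size."""
    return [response[i:i + size] for i in range(0, len(response), size)]

# Two distorted responses that deviate from r in opposite directions
# yield the same phase difference: the direction of deviation is not recorded.
r = [1.0, 0.0]
d_left = [math.cos(0.3), math.sin(0.3)]
d_right = [math.cos(0.3), -math.sin(0.3)]
theta_left = phase_difference(r, d_left)
theta_right = phase_difference(r, d_right)
```

Both `theta_left` and `theta_right` equal 0.3 rad up to rounding, although the two distorted vectors lie on opposite sides of `r`.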
After V1, the middle stage of the ventral system combines the V1 responses, both simple and complex (Ungerleider & Haxby, 1994). Let function f (ρ,θ; α,β,γ) represent the combination of the amplitude and the phase differences (f with parameters {α,β,γ} will be instantiated later). The functional responses over all ISA subspaces and all image patches are pooled by summing. Considering that simple cells show selectivity of spatial frequency, we down-sample an image progressively in a ratio of 1 : 2 to form a pyramid, calculate f (ρ,θ) by using the same ISA bases and the identical {α,β,γ} at each scale, and compute a weighted sum over all scales. The parameters {,,} are adaptive for different scales so as to fit the data, and will be estimated by regression (see Appendix C). 
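The multiscale pooling described above can be outlined as follows. This Python sketch is a hedged illustration, not the authors' code: the 2 × 2 block averaging used for the 1 : 2 down-sampling, the scale weights, and the stand-in `scale_score` function (which takes the place of the pooled f (ρ,θ) at one scale) are all assumptions.

```python
def downsample_by_two(image):
    """Halve each dimension by averaging non-overlapping 2x2 blocks.

    The averaging filter is an assumption; the text only states a 1:2 ratio.
    """
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * i][2 * j] + image[2 * i][2 * j + 1]
             + image[2 * i + 1][2 * j] + image[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(cols)
        ]
        for i in range(rows)
    ]

def build_pyramid(image, num_scales):
    """Return the image at `num_scales` progressively coarser scales."""
    pyramid = [image]
    for _ in range(num_scales - 1):
        pyramid.append(downsample_by_two(pyramid[-1]))
    return pyramid

def multiscale_score(reference, distorted, scale_score, weights):
    """Weighted sum of per-scale scores, one weight per pyramid level."""
    ref_pyramid = build_pyramid(reference, len(weights))
    dist_pyramid = build_pyramid(distorted, len(weights))
    return sum(
        w * scale_score(ref, dist)
        for w, ref, dist in zip(weights, ref_pyramid, dist_pyramid)
    )

def absolute_difference_sum(ref, dist):
    """Hypothetical stand-in for the pooled f(rho, theta) at a single scale."""
    return sum(abs(a - b) for ref_row, dist_row in zip(ref, dist)
               for a, b in zip(ref_row, dist_row))

reference = [[0.0] * 4 for _ in range(4)]
distorted = [[1.0] * 4 for _ in range(4)]
score = multiscale_score(reference, distorted, absolute_difference_sum, [0.5, 0.25])
```

In the model itself, `scale_score` would project patches onto the ISA bases and sum f (ρ,θ) over subspaces and patches; the stand-in only shows how the scales are combined.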
After the ventral system, IQA may involve higher level vision stages, such as the appraisal process (Scherer, Shorr, & Johnstone, 2001). Finally, subjective opinions tend to “saturate” for very bad or very good image quality, which is termed the floor and ceiling effects. These effects often occur in subjective IQA protocols (ITU-R BT.500-11, ITU-T P.910) and in most psychological measurements (Aron, Aron, & Coups, 2006), with exceptions such as 2AFC experiments. They may be fitted by a sigmoid or Naka–Rushton-type curve. A log-logistic curve is adopted in the video quality assessment standard ITU-T P.1202.2 Mode 2 (Zhang, Lin, Chen, & Ngan, 2013). It monotonically maps a distortion value f to a quality score q (Equation 3), where the parameters a and b control the curve shape and thereby determine how strongly the floor and ceiling effects act on the distortion f. The floor and ceiling effects depend on the context of the test materials, so a and b should be associated with each database. In this study, our regression method supports using a single set of {α, β, γ} for all databases and an adaptive set of {a, b} for each database (see details in Appendix C). Using adaptive {a, b} values can compensate for the misaligned floor and ceiling effects across multiple datasets. The functional form of V1 is related only to {α, β, γ} and not to {a, b}. In fact, {a, b} does not change the quality ranking within a database.
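The role of the two curve parameters can be illustrated with a small Python sketch. The specific curve used below, q = 1 / (1 + (f / a)^b), is only one common log-logistic parameterization and is an assumption here, not necessarily the form adopted in the model; the distortion values and parameter settings are likewise hypothetical.

```python
def log_logistic_quality(distortion, a, b):
    """Map a non-negative distortion to a quality score in (0, 1].

    ASSUMED form q = 1 / (1 + (f / a) ** b) with a > 0 and b > 0; it
    decreases monotonically in f and flattens at both ends of the scale.
    """
    return 1.0 / (1.0 + (distortion / a) ** b)

def ranking(scores):
    """Indices of the items ordered from best to worst quality."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Hypothetical distortion values scored with two different {a, b} settings,
# standing in for two databases with different floor and ceiling effects.
distortions = [0.0, 0.5, 2.0, 8.0]
database_one = [log_logistic_quality(f, a=1.0, b=2.0) for f in distortions]
database_two = [log_logistic_quality(f, a=3.0, b=0.5) for f in distortions]

same_ranking = ranking(database_one) == ranking(database_two)
```

Because the mapping is monotonic for any admissible a and b, the two settings rescale the scores differently but leave the ranking of the images unchanged, so `same_ranking` is true.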
The proposed perceptual quality metric is depicted in Figure 2. From the psychophysical stimuli to the visual psychological responses, the metric mainly captures the amplitude–phase response of V1, their combination in the ventral system, subspace pooling, spatial pooling, and spatial-frequency pooling, as well as the appraisal process (Scherer et al., 2001). For simplicity, we model spatial pooling as a summation, spatial-frequency pooling as a linear weighted sum, and the appraisal process as a two-parameter log-logistic function; we focus on how the amplitude and phase coact in V1. The risk is an oversimplification of the unknown stages in the human visual system, but this shortcoming can be balanced and diminished by learning from a large dataset. Most importantly, the proposed metric belongs to the family of additive linear models (Hastie & Tibshirani, 1990), and thus the maximum likelihood estimation of the model parameters converges to a solution (see Appendix C).
Figure 2
 
Illustration of the perceptual quality model, where the multiscale reference image (green) and distorted image (blue) are projected onto the ISA subspaces (red) patch by patch, and then the amplitude and phase differences between patch pairs are combined into a quality score. Dashed arrows represent the iterations over image scales (black), spatial locations (yellow), or subspaces (red). Phase difference records only the strength but not the direction of the phase deviation, as illustrated in the top-right subfigure. The images are reproduced from the photo Two Macaws by Steve Kelly, with permission from Eastman Kodak Company.
Evaluate image metrics on experimental data
We train the image quality model on existing experimental data. The data are collected from publicly available, subjectively rated image quality databases (Winkler, 2012). Each database contains natural images, both original and distorted, and the corresponding subjective quality opinions. A total of 11 databases are used: LIVE, IVC, Toyama, TID2008, A57, WIQ, CSIQ, LAR, BA, FourierSB, and Meerwald. Table 1 summarizes the test materials (including the number, distortion types, and format of images) and the experiment details (such as subjects, viewing setup, and subjective testing methods). The image distortions include various kinds of additive noise, Gaussian blur, compression, channel loss, additive watermarks, and so forth. Figure 3 shows the most typical distortions. The subjective scores (i.e., mean opinion scores, or MOSs) represent the ground truth of image quality, and are used to evaluate the prediction accuracy of quality models. First, we normalize all subjective scores to the range [0, 1], as shown in Figure 3; a MOS of 1 indicates the best quality, while a MOS of 0 represents the worst quality.
Figure 3
 
Typical distortions of the TID2008 database with subjective/objective quality scores (Ponomarenko et al., 2009). For each row, the icons annotate the best and the worst score, respectively, rated by the corresponding method.
Table 1
 
Comparison of subjectively rated image quality databases, where most information is taken from Winkler (2012). Hs = screen height, Hp = picture height; ACR = absolute category rating, PC = pair comparison method (ITU-T P.910); DSIS = double stimulus impairment scale, DSCQS = double stimulus continuous quality scale; MSCQS = multiple stimuli, continuous quality scale (ITU-R BT.500-11).
| Database | LIVE | IVC | Toyama | TID | A57 | WIQ | CSIQ | LAR | BA | FourierSB | Meerwald |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of rated images | 779 | 182 | 168 | 1,700 | 54 | 80 | 866 | 120 | 120 | 210 | 120 |
| Number of distortion types | 5 | 5 | 2 | 17 | 5 | 1 | 6 | 3 | 2 | 6 | 2 |
| Image type | Color | Color | Color | Color | Gray | Gray | Color | Color | Gray | Gray | Gray |
| Resolution (pixels) | ∼768 × 512 | 512 × 512 | 768 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 |
| Number of subjects | 20∼29 | 15 | 16 | 33 | 7 | 30 | 5∼7 | 19 | 17 | 7 | 14 |
| Screen | 21″ CRT | CRT | 17″ CRT | 19″ LCD | Papers | 17″ LCD/CRT | LCD | CRT | 24″ LCD | LCD | 24″ LCD |
| Viewing distance | 2∼2.5 Hs | 6 Hs | 4 Hp | Varying | 4 Hp | 4∼6 Hs | 80 cm | 4 Hp | 6 Hs | 6 Hs | 6 Hs |
| Rating method | ACR | DSIS | ACR | PC | MSCQS | DSCQS | MSCQS | DSIS | DSIS | DSIS | DSIS |
| Subjective data | DMOS | MOS | Raw | MOS | DMOS | Raw | DMOS | Raw | Raw | Raw | Raw |
The model accuracy is evaluated using two criteria: the likelihood (refer to Equation 7 in Appendix C) and the Spearman rank order correlation coefficient ρs. The likelihood (Equation 7) measures the “agreement” of the additive log-logistic model with the subjective quality scores. The coefficient ρs evaluates the ordinal match between the predicted and the subjective quality scores, and thus remains invariant under any monotonic mapping of the data, including the two-parameter (i.e., a and b) log-logistic mapping in our model. For both criteria, the higher the value, the better the accuracy; ρs has a range of [−1, 1], while the likelihood has no such fixed range and its value also depends on the number and the distribution of the data. Hence, we use ρs to quantify the accuracy of metrics, and use the likelihood to assist in the comparison.
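The invariance of ρs can be verified with a short Python sketch. It is given for illustration only: the score values and the logistic remapping are hypothetical, and ρs is computed here as the Pearson correlation of average ranks, which is the standard definition.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predicted and subjective scores (not taken from any database).
predicted = [0.10, 0.40, 0.35, 0.80, 0.95]
subjective = [0.15, 0.30, 0.45, 0.70, 0.90]

# A monotonically increasing remapping leaves every rank unchanged.
remapped = [1.0 / (1.0 + math.exp(-6.0 * (p - 0.5))) for p in predicted]

rho_raw = spearman(predicted, subjective)
rho_remapped = spearman(remapped, subjective)
```

Both coefficients are identical because only the ranks enter the computation; a monotonically decreasing mapping would reverse the ranks and flip the sign of ρs without changing its magnitude.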
It should be noted that the test conditions were not strictly consistent among the databases, so the subjective data need to be aligned. Fortunately, the aforementioned additive log-logistic model can align the subjective data automatically and compensate for the inconsistency to some extent. More importantly, these 11 databases provide ample data and minimize the bias arising from image selection, distortion generation, subjects' preferences, and so forth.
Finally, we choose the most accurate and concise model from a group of candidates. For each candidate model, we optimize its parameters to fit the subjective data, and then compare the candidates at their best accuracy. The model candidates use various RFs (i.e., the ISA bases under different configurations) and neural coactions (i.e., the amplitude–phase combination function f). The ISA bases are trained offline using the FastICA toolbox from Hyvärinen's group, independently of the quality prediction and the model regression; the thirteen natural monochromatic images used for training the ISA bases are entirely different from the images in the 11 databases.
We conduct the following two experiments with two questions in mind: 
  • Question 1.   
    What kind of phase information is essential for IQA, or in other words, to what extent does IQA rely upon the discriminability on orientations and locations?
  • Question 2.   
    How do the amplitude and phase coact during IQA, and what is the optimal form of the combination function f?
Experiment 1
To answer Question 1, we conduct Experiment 1 in two steps. First, we find the lowest number of subspaces that suffices for accurate IQA. Second, we seek the narrowest subspaces.
In the first step, the impact of scalable phase information on IQA is evaluated. With a decreasing number of subspaces, each ISA basis subtends a wider range of orientations and covers a larger area. That is, the phase of orientations and locations gradually becomes more difficult to discriminate, and therefore more phase information is omitted by the quality model.
Specifically, the size and the number of subspaces are configured as follows: 64D × 1, 32D × 2, 21D × 3, 16D × 4, 12D × 5, 10D × 6, 9D × 7, 8D × 8, 7D × 9, 6D × 10, 5D × 12, 4D × 16, 3D × 21, and 2D × 32, respectively. Moreover, we adopt 64 ICA bases as Case “1D × 64” for comparison, where θ = 0 constantly. Here, we keep the ISA bases as complete as possible. Then, we instantiate the combination f (ρ, θ) with a product:

$$f(\rho, \theta) = \rho^{\alpha}\,\theta^{\beta} \qquad (4)$$

Equation 4 is used because it works better than the other combination candidates in Experiment 2.
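The listed configurations follow a simple pattern, which the Python sketch below makes explicit. The floor-division rule is an observation that reproduces the list above, not a rule stated in the source, and the function name `subspace_size` is hypothetical.

```python
PATCH_DIMENSION = 8 * 8  # pixel dimension of an 8 x 8 image patch

def subspace_size(num_subspaces, dimension=PATCH_DIMENSION):
    """Largest subspace size J such that J * T does not exceed the dimension.

    ASSUMPTION: the floor rule is inferred from the listed configurations.
    """
    return dimension // num_subspaces

# Numbers of subspaces T used in the first step of Experiment 1.
subspace_counts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21, 32]

configurations = [(subspace_size(t), t) for t in subspace_counts]
labels = [f"{j}D × {t}" for j, t in configurations]

# Dimensions left uncovered by each configuration ("as complete as possible").
unused_dimensions = [PATCH_DIMENSION - j * t for j, t in configurations]
```

Under this rule, at most four of the 64 dimensions remain uncovered in any configuration, which is consistent with keeping the ISA bases nearly complete.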
The results are presented in Figure 4a. As the number of subspaces grows (see Figure 1a), the metric accuracy increases steeply from a single subspace, saturates quickly at around four subspaces, stays at its peak for more subspaces, and finally falls for ICA. We thus conclude that four subspaces suffice for IQA.
Figure 4
 
Metric accuracy with various configurations. (a) Various numbers of ISA subspaces, including the ICA-based case 1D × 64 at the far right; (b) various ISA subspace sizes; and (c) various amplitude–phase combination functions. Green triangles mark the total log-likelihood, blue circles mark the average ρs over all databases, and blue vertical bars depict the range of ρs on individual databases; cyan crosses depict ρs on WIQ, red points depict ρs on TID16∼17, and the lowest ρs excluding WIQ and TID16∼17 are depicted by red crosses.
A reader might object that the first step above is not rigorous enough: although the number of subspaces decreases, each subspace expands and might thereby compensate and sustain the metric accuracy. We therefore move to the second step, to find the narrowest subspaces that suffice.
In this step, given four subspaces, the influence of the subspace size on IQA is evaluated. As the subspaces expand from a low to a high dimension, the number of ISA bases increases and the set of bases approaches completeness (see Figure 1b). Specifically, the subspace size is configured as 4D × 4, 6D × 4, 8D × 4, 10D × 4, 12D × 4, 14D × 4, and 16D × 4.
The result is shown in Figure 4b. As the size of the four subspaces expands, the metric accuracy increases slightly from 4D, peaks around 8D, and holds until the full dimension. Hence, the finding in the first step, that the case with four subspaces is as good as those with more but narrower subspaces, should be attributed not to the subspace size but to the number of subspaces. In fact, the 8D × 4 bases span only half of the dimensions, yet they guarantee efficient IQA. This is not surprising, because only the low spatial-frequency (LF) components are taken into account while the HF components are ignored. On the one hand, V1 is not sensitive to the HF at the finest image scale; on the other hand, the HF at the current image scale is already represented by the LF at the finer image scale.
Indeed, the product-based 8D × 4 case is as accurate as state-of-the-art metrics such as FSIM (Zhang, Zhang, Mou, & Zhang, 2011) and SIQM (Narwaria, Lin, McLoughlin, Emmanuel, & Chia, 2012), and outperforms CW-SSIM (Sampat et al., 2009) and MSE (mean squared error), as can be seen in Table 2. We split the TID database into two subsets (columns 4 and 5): TID1∼15, with the first 15 distortion types, and TID16∼17, with the last two types, namely “intensity shift” and “contrast change,” which globally adjust the mean and the variance of images, respectively, as shown in Figure 3. We separate TID16∼17 from the full set because the proposed metric performs poorly on it, which we will discuss later.
Table 2
 
Comparison of the metrics' performance in terms of ρs on individual databases.
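| Database | LIVE | IVC | Toyama | TID1∼15 | TID16∼17 | A57 | WIQ | CSIQ | LAR | BA | FourierSB | Meerwald |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proposed | 0.948 | 0.910 | 0.925 | 0.907 | 0.281 | 0.913 | 0.799 | 0.961 | 0.930 | 0.931 | 0.906 | 0.928 |
| MSE | 0.856 | 0.679 | 0.613 | 0.532 | 0.476 | 0.570 | 0.817 | 0.806 | 0.819 | 0.934 | 0.696 | 0.891 |
| CW-SSIM | 0.852 | 0.621 | 0.784 | 0.642 | 0.482 | 0.656 | 0.621 | 0.577 | 0.920 | 0.631 | 0.055 | 0.795 |
| SIQM | 0.956 | 0.894 | 0.915 | 0.831 | 0.807 | 0.894 | 0.842 | 0.924 | 0.892 | 0.952 | 0.846 | 0.940 |
| FSIM | 0.963 | 0.926 | 0.906 | 0.882 | 0.881 | 0.918 | 0.806 | 0.924 | 0.958 | 0.934 | 0.914 | 0.930 |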
Experiment 2
Experiment 2 evaluates the effectiveness of the amplitude–phase combinations and provides an answer to Question 2. We consider the product-based Equation 4 and the sum-based function:

$$f(\rho, \theta) = \rho^{\alpha} + \gamma\,\theta^{\beta} \qquad (5)$$
In Equations 4 and 5, α and β approximate the nonlinear responses to the amplitude and phase, respectively. When α or β takes a value of 1, the response degenerates into a linear one; when α or β takes a small value near 0, it implies no response, and the amplitude or phase term can be omitted. Note that we have assumed subspace pooling, spatial pooling, and spatial-frequency pooling all to be summation functions. Accordingly, the product-based function suggests that the amplitude–phase combination is inseparable and thus occurs prior to pooling, while the sum-based function implies that the phase and the amplitude may be decoupled, so that neither takes priority. Such a difference not only reflects mathematical logic, but also has potential implications for neural structure.
We also consider two simpler forms, the phase-only form, $f(\rho, \theta) = \theta^{\beta}$, and the amplitude-only form, $f(\rho, \theta) = \rho^{\alpha}$, which may be regarded as special cases of the sum-based Equation 5 with γ → ∞ and γ → 0, respectively. We furthermore consider a third-order polynomial approximation (Equation 6).
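The candidate combinations can be compared side by side in a minimal Python sketch. The power-law forms below are assumptions inferred from the roles described for α, β, and γ in the text; they are not a verified transcription of the original equations, and the function names and parameter values are hypothetical.

```python
def product_based(rho, theta, alpha, beta):
    """ASSUMED product form: rho ** alpha * theta ** beta."""
    return (rho ** alpha) * (theta ** beta)

def sum_based(rho, theta, alpha, beta, gamma):
    """ASSUMED sum form: rho ** alpha + gamma * theta ** beta."""
    return rho ** alpha + gamma * theta ** beta

def amplitude_only(rho, alpha):
    """Amplitude term alone; the sum form with gamma set to 0."""
    return rho ** alpha

def phase_only(theta, beta):
    """Phase term alone; dominates the sum form as gamma grows large."""
    return theta ** beta

# Hypothetical differences and parameters for one subspace of one patch.
rho, theta = 4.0, 1.0
alpha, beta = 0.5, 2.0

# With gamma = 0 the sum-based form reduces to the amplitude-only form.
reduces_to_amplitude = sum_based(rho, theta, alpha, beta, 0.0) == amplitude_only(rho, alpha)

# In the product form, the distortion vanishes when either difference is zero,
# so the two terms cannot be pooled separately.
product_without_phase_change = product_based(rho, 0.0, alpha, beta)
```

Under these assumed forms the product-based candidate needs only {α, β}, whereas the sum-based candidate also needs γ, which matches the parameter counts discussed in the text.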
As shown in Figure 4c, the product-based model outperforms the other models. Note that it employs fewer parameters than the sum-based model and the third-order polynomial model. 
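The candidate combination functions can be sketched as follows. The product form follows the expression given in the caption of Figure 5; the sum form (with the weight γ multiplying the phase term) is an assumption chosen only so that γ → ∞ and γ → 0 recover the phase-only and amplitude-only cases, and all function names are hypothetical.

```python
import math

def product_based(rho, theta, alpha, beta):
    """Product-based combination, d = |rho|^alpha * theta^beta."""
    return abs(rho) ** alpha * theta ** beta

def sum_based(rho, theta, alpha, beta, gamma):
    """ASSUMED sum-based form, d = |rho|^alpha + gamma * theta^beta."""
    return abs(rho) ** alpha + gamma * theta ** beta

def amplitude_only(rho, alpha):
    """Amplitude-only distortion, d = |rho|^alpha."""
    return abs(rho) ** alpha

def phase_only(theta, beta):
    """Phase-only distortion, d = theta^beta."""
    return theta ** beta

# Illustrative (made-up) amplitude and phase differences for one patch.
rho, theta = 0.3, math.pi / 6
# Exponents quoted in the Figure 5 caption.
alpha, beta = 0.6, 0.5

print(product_based(rho, theta, alpha, beta))
# A tiny gamma makes the assumed sum form behave like the amplitude-only form.
print(sum_based(rho, theta, alpha, beta, gamma=1e-9), amplitude_only(rho, alpha))
# A huge gamma makes it proportional to the phase-only form.
print(sum_based(rho, theta, alpha, beta, gamma=1e9) / 1e9, phase_only(theta, beta))
```

Under this assumed form, the product vanishes whenever either difference is zero, whereas the sum vanishes only when both are zero.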
Discussion
The experiments above exploit the computational models of IQA. The optimal form of the models suggests that a potentially sufficient polar representation is operative in the human visual system. 
Polar representation does not reduce the dimension of the signal, but the phase difference does. Note that the phase difference, as a scalar angle, records only the strength of the phase deviation and disregards its direction in the high-dimensional subspace, as shown by the top-left illustration in Figure 2. Here, the direction indicates how stimuli deform within a range of locations and orientations. The higher the dimension of the space, the more information about the deviation direction is discarded. 
Experiment 1 compares various ISA subspaces, which simulate diversified orientation preferences of V1. The more subspaces there are, the denser the orientation pinwheel becomes, and each simulated RF subtends a narrower range. The phase difference θ records how far the phase deviates within a limited range of orientations and locations, but is blind to the orientation or location from which the phase departs. Therefore, with fewer subspaces, more phase information is disregarded by the quality model. 
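A small numerical sketch may make this loss of direction concrete. It assumes that, within one subspace, the amplitude is the Euclidean norm of the coefficient vector, the amplitude difference is the absolute difference of the two norms, and the phase difference is the angle between the two vectors; the function name and the toy vectors are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a coefficient vector."""
    return math.sqrt(sum(x * x for x in v))

def amplitude_phase_difference(r, d):
    """Return (|rho|, theta) for two nonzero coefficient vectors in one subspace.

    ASSUMPTIONS: |rho| is the absolute difference of the norms and theta is
    the angle between the vectors.
    """
    nr, nd = norm(r), norm(d)
    rho = abs(nr - nd)
    cos_theta = sum(a * b for a, b in zip(r, d)) / (nr * nd)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return rho, math.acos(cos_theta)

r = [1.0, 0.0, 0.0]    # reference coefficients in a 3D subspace
d1 = [0.0, 1.0, 0.0]   # deviates toward the second basis
d2 = [0.0, 0.0, 1.0]   # deviates toward the third basis

print(amplitude_phase_difference(r, d1))
print(amplitude_phase_difference(r, d2))
```

Both distorted vectors yield the same pair of differences, although they deviate from the reference along different bases; the scalar angle cannot tell them apart.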
The result of Experiment 1 suggests that, overall, four subspaces are necessary and sufficient for IQA. This, on one hand, confirms the necessity of the orientation selectivity possessed by V1, and on the other hand implies that a very coarse orientation discriminability suffices for a visual task such as IQA. Nevertheless, the factual orientation preference map is much denser. For example, in the V1 of cats that represents the central visual field, the RFs subtend an angle of about 1.2° (Hubel & Wiesel, 1962; Tusa, Palmer, & Rosenquist, 1978). This is not surprising, due to the redundancy trait of neural system and the requirement of more challenging tasks such as identification. 
Experiment 1 further highlights Case 8D × 4 for its efficiency (see Table 2 for the comparison with the state-of-the-art metrics), which suggests an interesting structure of V1. Eight simple cells seem sufficient for the linkage to each complex cell, and the orientation-selective simple cells are grouped into four categories, including the vertical, horizontal, and oblique. Note that each simple cell is a phase detector (Zetzsche & Krieger, 1999), while a complex cell is an amplitude detector. Thus, although the human visual system detects the total change of the phase detectors' responses, it disregards exactly which phase detectors contribute to the change. In other words, deformations of stimuli in orientation and location are detected, but they are suppressed in the final analysis. Consequently, only an essential and compressed amount of phase information is actually encoded and forwarded. 
It is also interesting to note that in the case of a single subspace, i.e., 64D × 1, Experiment 1 links with the classic SSIM metric. Equation 4 is similar to SSIM as proved in Appendix B. In SSIM, the direction in which the phase deviates is totally neglected; in other words, equal values of phase difference across any orientations and any locations are perceived as being equal. The disadvantage of 64D × 1 over the cases with more subspaces implies that SSIM could be refined with additional consideration of orientations. 
Note again that we use a single set of parameters {αl, βl, γl | l = 1, 2, 3} for all databases and adaptive parameters {am, bm | m = 1, 2, …, 11} for the m-th database. Because of the redundancy between {γl} and am, we fix γ1 = 1. That is, although we introduce 31 parameters, the proposed V1 model has only 8 degrees of freedom. 
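The parameter count can be checked with a few lines of arithmetic; the variable names below are illustrative only.

```python
num_scales = 3       # l = 1, 2, 3
num_databases = 11   # m = 1, 2, ..., 11

shared = 3 * num_scales            # alpha_l, beta_l, gamma_l at each scale
per_database = 2 * num_databases   # a_m, b_m for each database
total = shared + per_database

# Fixing gamma_1 = 1 removes one free parameter from the shared set.
free_shared = shared - 1

print(total, free_shared)  # 31 8
```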
As listed in Table 3, most of the {αl, βl, γl} are consistent under various configurations of ISA subspaces, except β1 for the phase at the finest scale. The phase of the finest high-frequency components is insignificant for IQA, as β1 can be omitted (i.e., set to 0), especially when complete ISA bases are configured. This is in accordance with the view that amplitude is more important than phase at fine image scales (Field & Chandler, 2012). 
Table 3
 
Parameter consistency in terms of mean ± thrice standard deviation.
Completeness of ISA bases | Complete (Experiment 1) | Complete (Experiment 1) | Complete (Experiment 1) | Incomplete (Experiment 2) | Incomplete (Experiment 2) | Incomplete (Experiment 2)
Scale l (Fine → Coarse) | Scale 1 | Scale 2 | Scale 3 | Scale 1 | Scale 2 | Scale 3
αl: exponent of amplitude error | 1.96 ± 0.40 | 0.62 ± 0.16 | 1.08 ± 0.51 | 2.02 ± 0.32 | 0.53 ± 0.12 | 1.14 ± 0.07
βl: exponent of phase error | (2.2 ± 7.6) × 10^−4 | 0.45 ± 0.20 | 0.84 ± 0.22 | 0.60 ± 0.72 | 0.45 ± 0.14 | 0.57 ± 0.09
log γl: weight of image scale | 0 | 7.14 ± 0.46 | 6.73 ± 0.81 | 0 | 7.11 ± 0.24 | 5.97 ± 0.85
Experiment 2 compares various amplitude–phase combinations, which simulate different coactions between simple and complex cells. Among the candidates, including summation, product, the third order polynomial, and so forth, the product performs best. This result agrees with the product-based assumption on the ventral system (Freeman & Simoncelli, 2011). The phase-only and the amplitude-only models, as special cases of the sum-based model, are also compared, but their inferior performance confirms that both the phase and amplitude are indispensable for IQA. 
The product-based combination suggests a viewpoint of perceptual distortion that is quite different from traditional quality metrics. We can explain this via the iso-distance map of metric Equation 4. Given a reference point o and a distance d, the iso-distance curve consists of all the points that are located at distance d from o. Let us consider a 2D polar coordinate system for simplification. Given a reference point with radius of 1 and phase angle of 0, noted by (1, 0), its iso-distance map under metric Equation 4 is shown in Figure 5a. A point has a distance of zero from the reference, as long as either its phase or its amplitude remains the same as the reference. This differs from the iso-distance map under MSE as shown in Figure 5b, where a point moves farther away from the reference point unless both its amplitude and phase are equal to that of the reference. Such a difference is because the metric of Equation 4 employs a product to combine the amplitude and phase error, while MSE approximates to a sum of two items related to the amplitude and phase error, respectively (Hsiao & Millane, 2004). 
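The contrast between the two iso-distance maps can be reproduced numerically in the 2D polar setting. This is a minimal sketch, assuming the amplitude difference is the difference of radii, the phase difference is the absolute angular offset from the reference (angles kept within [−π, π]), and MSE is represented by the squared Euclidean distance between the two points; the function names are hypothetical.

```python
import math

def product_distance(amp, phase, alpha=0.6, beta=0.5, ref=(1.0, 0.0)):
    """d = |rho|^alpha * theta^beta relative to the reference point."""
    rho = abs(amp - ref[0])
    theta = abs(phase - ref[1])
    return rho ** alpha * theta ** beta

def squared_error(amp, phase, ref=(1.0, 0.0)):
    """Squared Euclidean distance between the points in Cartesian form."""
    x, y = amp * math.cos(phase), amp * math.sin(phase)
    x0, y0 = ref[0] * math.cos(ref[1]), ref[0] * math.sin(ref[1])
    return (x - x0) ** 2 + (y - y0) ** 2

points = {
    "same amplitude, rotated phase": (1.0, math.pi / 2),
    "same phase, doubled amplitude": (2.0, 0.0),
    "both changed": (2.0, math.pi / 2),
}
for label, (amp, phase) in points.items():
    print(label, product_distance(amp, phase), squared_error(amp, phase))
```

Only the last point receives a nonzero product distance, whereas the squared error is nonzero for all three points.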
Figure 5
 
Iso-distance maps of (1, 0) under distance metric (a) d = |ρ|^0.6 θ^0.5 and (b) MSE.
It is intuitive that phase-invariant distortions impair images only mildly, as human eyes often show adaptation and tolerance to contrast adjustment or to high dynamic range mapping of images. Besides, it is partly reasonable, although counterintuitive, that amplitude-invariant distortions only marginally degrade images. Amplitude-invariant distortions do not occur commonly, yet they are relevant in negative film photography. Generally, negative images (i.e., amplitude-invariant and inverse-phase images) are still recognizable. Of course, this does not mean that a negative image deserves a rating of perfect quality, but it seems more unreasonable to rate a negative image as worse than a random noise image, as most existing metrics do. 
Compared to summation, a product of terms is more common for combining two incommensurate quantities. Considering that the quality model pools the results of the combinations by summation, combining by product suggests that the combination and the pooling are not commutative. Otherwise, if combining by sum were preferred, the combination and the pooling would be commutative, and combination followed by pooling would be equivalent to pooling followed by combination. 
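The commutativity argument can be illustrated with made-up per-patch terms. The numbers below are hypothetical, and the weighted sum is only one plausible form of a sum-based combination.

```python
# Hypothetical per-patch amplitude and phase error terms (already exponentiated).
amp_err = [0.2, 0.5, 0.1]
phase_err = [0.4, 0.1, 0.3]
gamma = 2.0  # illustrative weight of the phase term in a sum-based combination

# Sum-based: combining per patch and then pooling by summation ...
combine_then_pool_sum = sum(a + gamma * p for a, p in zip(amp_err, phase_err))
# ... gives the same value as pooling each term first and then combining.
pool_then_combine_sum = sum(amp_err) + gamma * sum(phase_err)

# Product-based: the two orders give different values.
combine_then_pool_prod = sum(a * p for a, p in zip(amp_err, phase_err))
pool_then_combine_prod = sum(amp_err) * sum(phase_err)

print(combine_then_pool_sum, pool_then_combine_sum)
print(combine_then_pool_prod, pool_then_combine_prod)
```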
The proposed models appear to perform inconsistently on the 11 databases, since there are large (blue vertical) error bars of ρs in Figure 4. This is mainly because of their low accuracy on TID16∼17 and the WIQ database. If these are excluded, the worst accuracies (marked by red crosses) are not significantly lower than the best ones. The proposed metric measures “intensity shift” inaccurately because the image mean is overlooked when using the ISA bases, which are obtained from whitened data with zero mean. “Contrast change” is not measured appropriately here, because a contrast change (i.e., an amplitude difference) is simply regarded as distortion regardless of whether the contrast is enhanced or degraded. Most existing metrics fail to handle WIQ (as shown in Table 2), because the image distortion in WIQ, termed wireless channel distortion, is often uneven and localized. Hence, the simulated RF of the ISA bases, the distortion factor of absolute amplitude difference, and the pooling strategy of summation are probably oversimplified, since subjective assessment of such distortions may involve a more complex process in high-level vision. 
Although the proposed computational model is an advance in that it can mimic the nonlinear properties of neurons, it is still oversimplified; it focuses on V1 and ignores mechanisms beyond V1. Despite a good match between the model predictions and behavior judgment, our model cannot be regarded as the uniquely correct computational approximation of the underlying physiological process. Using a reasonable set of localized filters, we can devise more image quality metrics, such as a PCA (principal component analysis)-based metric (Zhang, Liu, Lin, & Ngan, 2011). The essential issue is to carefully design the weights for each filter. Using overcomplete filters is equivalent to weighting the filters, so could also promise a good result. However, a key point here is that the ISA bases at identical image scales are assigned with equal weights (i.e., the same α, β, and γ), since there is no evidence that any set of V1 neurons have priority or account for the majority. 
Conclusions
We have studied a simple and accurate image quality metric, where all parameters can be trained by a convergent algorithm and where no parameter is set empirically. The metric simulates the RFs of V1 using ISA bases, simple cells using phase detectors, complex cells using amplitude detectors, their coactions using the product, and the later stages using a log-logistic mapping. We make the fewest possible a priori assumptions for this metric. 
In the comparative study, a metric based on various ISA bases and amplitude–phase combinations stands out, and suggests the following views about V1: 
  1. Both phase and amplitude are indispensable for IQA, and thus, both simple cells and complex cells contribute to IQA. Besides the amplitude detection, the phase difference provides another potential way of information reduction.
  2. Not all the phase information is helpful for IQA; only the phase that discriminates coarse orientations is essential.
  3. The product of phase and amplitude, rather than summation or other nonlinear operators, best captures their combination and thus the coactions between simple and complex cells; consequently, the human visual system tolerates amplitude-invariant and phase-invariant distortions.
  4. The amplitude–phase combination occurs prior to the pooling, which implies that the linkages among simple and complex cells precede the aggregation of the neurons that represent various locations of the visual field.
Acknowledgments
The authors thank Imants Svalbe, Jason Wang, Longin Jan Latecki, and Zhibo Chen for insightful discussions. 
Commercial relationships: The Technicolor Company has filed a patent about image quality metrics on which F. Zhang and W. Jiang are listed as inventors. No other author has a proprietary interest in any material or method mentioned. 
Corresponding author: Fan Zhang. 
Email: fan.zhang@ieee.org. 
Address: Lincoln House, Quarry Bay, Hong Kong. 
References
Adelson E. H. Bergen J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2, 284–299. [CrossRef]
Aron A. Aron E. N. Coups E. (2006). Statistics for psychology. Upper Saddle River, NJ: Prentice Hall.
Bruce N. D. Tsotsos J. K. (2006). A statistical basis for visual field anisotropies. Neurocomputing, 69, 1301–1304. [CrossRef]
Clarke A. D. F. Green P. R. Chandler M. J. (2012). The effects of display time and eccentricity on the detection of amplitude and phase degradations in textured stimuli. Journal of Vision, 12 (3): 7, 1–11, http://www.journalofvision.org/content/12/3/7, doi:10.1167/12.3.7. [PubMed] [Article]
Essock E. A. Haun A. M. Kim Y. J. (2009). An anisotropy of orientation-tuned suppression that matches the anisotropy of typical natural scenes. Journal of Vision, 9 (1): 35, 1–15, http://www.journalofvision.org/content/9/1/35, doi:10.1167/9.1.35. [PubMed] [Article]
Field D. J. Chandler D. M. (2012). Method for estimating the relative contribution of phase and power spectra to the total information in natural-scene patches. Journal of the Optical Society of America A, 29, 55–67. [CrossRef]
Freeman J. Simoncelli E. P. (2011). Metamers of the ventral stream. Nature Neuroscience, 14, 1195–1201. [CrossRef]
Gallant J. L. (2000). The neural representation of shape. In De Valois K. K. (Ed.), Seeing ( 2nd Ed., 311–324), San Diego, CA: Academic Press.
Gallant J. L. Braun J. Van Essen D. C. (1993). Selectivity for polar, hyperbolic, and Cartesian gratings in macaque visual cortex. Science, 259, 100–103. [CrossRef] [PubMed]
Geisler W. S. (1999). Motion streaks provide a spatial code for motion direction. Nature, 400, 65–69. [CrossRef] [PubMed]
Hastie T. J. Tibshirani R. J. (1990). Generalized Additive Models. New York: Chapman and Hall.
Hsiao W. H. Millane R. P. (2004). Effects of spectral amplitude and phase errors on image reconstruction. Proceedings of SPIE, 5562, 27–37.
Hubel D. H. Wiesel T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154. [CrossRef]
Hyvärinen A. Hoyer P. (2000). Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspace. Neural Computation, 12, 1705–1720. [CrossRef] [PubMed]
Hyvärinen A. Hoyer P. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41, 2413–2423. [CrossRef] [PubMed]
Issa N. P. Rosenberg A. Husson T. R. (2008). Models and measurement of functional maps in V1. Journal of Neurophysiology, 99, 2745–2754. [CrossRef] [PubMed]
Joubert O. R. Rousselet G. A. Fabre-Thorpe M. Fize D. (2009). Rapid visual categorization of natural scene context with equalized amplitude spectrum and increasing phase noise. Journal of Vision, 9 (1): 2, 1–16, http://www.journalofvision.org/content/9/1/2, doi:10.1167/9.1.2. [PubMed] [Article]
Jutten C. Hérault J. (1991). Blind separation of source, part I: An adaptive algorithm based on neuromimetic architecture. Signal Process, 24, 1–10. [CrossRef]
Kamitani Y. Tong F. (2005). Decoding the visual and subjective contents of the human brain. Nature Neuroscience, 8, 679–685. [CrossRef] [PubMed]
Karklin Y. Lewicki M. S. (2008). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86. [CrossRef] [PubMed]
Mannion D. J. McDonald J. S. Clifford C. W. G. (2010). Orientation anisotropies in human visual cortex. Journal of Neurophysiology, 103, 3465–3471. [CrossRef] [PubMed]
Narwaria M. Lin W. McLoughlin I. Emmanuel S. Chia L. T. (2012). Fourier transform based scalable image quality measure. IEEE Transactions on Image Processing, 21, 3364–3377. [CrossRef] [PubMed]
O'Carroll D. C. Bidwell N. J. Laughlin S. B. Warrant E. J. (1996). Insect motion detectors matched to visual ecology. Nature, 382, 63–66. [CrossRef] [PubMed]
Olshausen B. A. Field D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. [CrossRef] [PubMed]
Oppenheim A. V. Lim J. S. (1981). The importance of phase in signals. Proceedings of the IEEE, 69, 529–541. [CrossRef]
Piotrowski L. N. Campbell F. W. (1982). A demonstration of the visual importance and flexibility of spatial-frequency amplitude and phase. Perception, 11, 337–346. [CrossRef] [PubMed]
Ponomarenko N. Lukin V. Zelensky A. Egiazarian K. Carli M. Battisti F. (2009). TID2008—A database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics, 10, 30–45.
Sampat M. P. Wang Z. Gupta S. Bovik A. C. Markey M. (2009). Complex wavelet structural similarity: A new image similarity index. IEEE Transactions on Image Processing, 18, 2385–2401. [CrossRef] [PubMed]
Sasaki Y. Rajimehr R. Kim B. W. Ekstrom L. B. Vanduffel W. Tootell R. B. (2006). The radial bias: A different slant on visual orientation sensitivity in human and nonhuman primates. Neuron, 51, 661–670. [CrossRef] [PubMed]
Scherer K. R. Shorr A. Johnstone T. (2001). Appraisal processes in emotion: theory, methods, research. Canary, NC: Oxford University Press.
Tusa R. J. Palmer L. A. Rosenquist A. C. (1978). The retinotopic organization of area 17 (striate cortex) in the cat. Journal of Comparative Neurology, 177, 213–235. [CrossRef] [PubMed]
Ungerleider L. G. Haxby J.V. (1994). “What” and “where” in the human brain. Current Opinion Neurobiology, 4, 157–165. [CrossRef]
Wang Z. Bovik A. C. Sheikh H.R. Simoncelli E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13, 600–612. [CrossRef] [PubMed]
Wang Z. Simoncelli E. P. (2008). Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities. Journal of Vision, 8 (12): 8, 1–13, http://www.journalofvision.org/content/8/12/8, doi:10.1167/8.12.8. [PubMed] [Article]
Wichmann F. A. Braun D. I. Gegenfurtner K. R. (2006). Phase noise and the classification of natural images. Vision Research, 46, 1520–1529. [CrossRef] [PubMed]
Winkler S. (2012). Analysis of public image and video databases for quality assessment. IEEE Journal of Selected Topics Signal Processing, 6, 616–625. [CrossRef]
Zetzsche C. Krieger G. (1999). The atoms of vision: Cartesian or polar? Journal of the Optical Society of America A, 16, 1554–1565. [CrossRef]
Zhang F. Lin W. Chen Z. Ngan K. N. (2013). Additive log-logistic model for modeling networked video quality. IEEE Transactions on Image Processing, 22, 1536–1547. [CrossRef] [PubMed]
Zhang F. Liu W. Lin W. Ngan K. N. (2011). Spread spectrum image watermarking based on perceptual quality metric. IEEE Transactions on Image Processing, 20, 3207–3218. [CrossRef] [PubMed]
Zhang L. Zhang L. Mou X. Zhang D. (2011). FSIM: a feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20, 2378–2386. [CrossRef] [PubMed]
Appendix A: Independent subspace analysis
ISA has two layers: the first-layer units simulate the RFs of simple cells, while each second-layer unit pools over a small neighborhood of adjacent first-layer units to mimic a complex cell. To be precise, given an input image patch in ℜ^I, the activation of each first-layer unit is:  and the activation of each second-layer unit is   
The weight matrix of the first layer, of size (J·T) × I, is also the ISA transform matrix; I, J, and T are the input dimension (number of pixels in a patch), the subspace size (number of first-layer units pooled by a second-layer unit), and the number of subspaces (number of second-layer units), respectively. The row vectors of this matrix, the ISA bases, span a linearly transformed space and are grouped into T subspaces of dimension J. ISA is trained for sparse representations in the second layer, by equivalently solving:  where the training samples are whitened to have zero mean and identity covariance, and their dimension is reduced to J·T by PCA. The orthonormality constraint guarantees that the transform is invertible. 
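The two-layer structure can be sketched as follows. This is a minimal illustration that assumes the commonly used ISA form, in which a first-layer unit is a linear filter response and a second-layer unit is the square root of the summed squared responses within its subspace. The function name, the toy orthonormal matrix (I = 4, J = 2, T = 2), and the input values are hypothetical, and no training is performed.

```python
import math

def isa_activations(W, x, J):
    """Two-layer ISA response for one patch.

    W : list of J*T rows (the ISA bases), each of length I
    x : input patch flattened to length I
    J : subspace size (first-layer units pooled per second-layer unit)
    """
    # First layer: linear filter responses (simple-cell-like units).
    first = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    # Second layer: root of the summed squares within each subspace
    # (complex-cell-like units).
    second = [math.sqrt(sum(s * s for s in first[t:t + J]))
              for t in range(0, len(first), J)]
    return first, second

# Toy orthonormal transform: rows of a 4 x 4 Hadamard matrix scaled by 1/2.
W = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5],
     [0.5, 0.5, -0.5, -0.5],
     [0.5, -0.5, -0.5, 0.5]]
x = [1.0, 2.0, 3.0, 4.0]

first, second = isa_activations(W, x, J=2)
print(first)   # [5.0, -1.0, -2.0, 0.0]
print(second)
```

Because the rows of this toy matrix are orthonormal, the squared second-layer activations sum to the squared norm of the input, which is one way to see that such a transform loses no signal energy.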
Appendix B: Proof of relation to SSIM
The Structural SIMilarity (SSIM) index (Wang et al., 2004; Wang & Simoncelli, 2008) uses a product-based combination of three comparisons, as SSIM(r, d) = [s(r, d)]^β [c(r, d)]^α [l(r, d)]^(γ_SSIM), where the structure comparison function is   the contrast comparison function is   the luminance comparison function is   and column vectors r and d consist of pixels in the reference and distorted 8 × 8 patch from the same location, with the means of μr and μd, respectively. 
If and only if the ISA converges to a single subspace, we have   for both r and d, where the ISA transform is trained on zero-mean data and remains orthonormal. Hence, we have   and || = | - μ|. The equivalence between the phase difference and the structure comparison of SSIM is then evident:   as is the relation between the amplitude difference and the contrast comparison:   
Consequently, Equation 4 can be rewritten as:  where C = [|r − μr|^2 + |d − μd|^2]^(α/2). 
That is, Equation 4 combines two essential components of SSIM in a slightly different manner. 
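The two relations can be checked numerically on a pair of patches. The sketch below makes several assumptions: the stabilizing constants of SSIM are omitted, the amplitude is taken as the norm of the mean-removed patch (which a complete orthonormal transform preserves), and the amplitude difference is the difference of the two norms. The function names and pixel values are hypothetical.

```python
import math

def centered(v):
    """Subtract the patch mean from every pixel."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cos_phase_difference(r, d):
    """Cosine of the angle between the mean-removed patches."""
    rc, dc = centered(r), centered(d)
    return sum(a * b for a, b in zip(rc, dc)) / (norm(rc) * norm(dc))

def ssim_structure(r, d):
    """SSIM structure comparison without its stabilizing constant."""
    n = len(r)
    rc, dc = centered(r), centered(d)
    cov = sum(a * b for a, b in zip(rc, dc)) / (n - 1)
    sigma_r = norm(rc) / math.sqrt(n - 1)
    sigma_d = norm(dc) / math.sqrt(n - 1)
    return cov / (sigma_r * sigma_d)

def ssim_contrast(r, d):
    """SSIM contrast comparison without its stabilizing constant."""
    n = len(r)
    var_r = norm(centered(r)) ** 2 / (n - 1)
    var_d = norm(centered(d)) ** 2 / (n - 1)
    return 2.0 * math.sqrt(var_r * var_d) / (var_r + var_d)

def normalized_amplitude_error(r, d):
    """Squared amplitude difference divided by the summed squared amplitudes."""
    ar, ad = norm(centered(r)), norm(centered(d))
    return (ar - ad) ** 2 / (ar ** 2 + ad ** 2)

# Made-up pixel values standing in for a reference and a distorted patch.
r = [52.0, 55.0, 61.0, 59.0, 70.0, 61.0, 76.0, 61.0]
d = [50.0, 58.0, 60.0, 57.0, 75.0, 58.0, 70.0, 66.0]

print(cos_phase_difference(r, d), ssim_structure(r, d))
print(normalized_amplitude_error(r, d), 1.0 - ssim_contrast(r, d))
```

Under these assumptions, the cosine of the phase difference coincides with the structure comparison, and the normalized squared amplitude difference equals one minus the contrast comparison, which is consistent with the normalizing factor C above.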
Appendix C: Additive log-logistic model
The function of the perceptual quality q with respect to the distortions d is defined as:  where parameters a and b control the shape of the log-logistic curve. We call it the additive log-logistic model, since it has a link form as:  that is, the distortions sum up and yield the monotonically transformed quality. Here, we use three levels of summations as:   
The outermost sum on the right-hand side is over the image scales (indexed by l and linearly weighted by the parameter γl at each scale), the middle sum is over all Kl patch locations across the image and is normalized by Kl, and the innermost sum is over all T subspaces; they approximate the spatial-frequency pooling, the spatial pooling, and the subspace pooling, respectively. The local distortion d combines the phase difference and the amplitude difference, for instance (but not limited to)   
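The three levels of summation and the mapping to quality can be sketched as below. The pooling follows the description above; the log-logistic parameterization q = 1 / (1 + (a·d)^b) is an assumption made for illustration (other equivalent parameterizations exist), and the function names and numbers are hypothetical.

```python
def pooled_distortion(local, gammas):
    """Pool local distortions over subspaces, patch locations, and scales.

    local[l][k][t] is the local distortion at scale l, patch k, subspace t;
    gammas[l] is the weight of scale l.
    """
    total = 0.0
    for scale, gamma in zip(local, gammas):
        # Innermost sum over subspaces, middle sum over patch locations.
        patch_sum = sum(sum(subspaces) for subspaces in scale)
        # Normalize by the number of patches K_l, then weight the scale.
        total += gamma * patch_sum / len(scale)
    return total

def log_logistic_quality(d, a, b):
    """ASSUMED log-logistic mapping, q = 1 / (1 + (a*d)^b) with a, b > 0."""
    return 1.0 / (1.0 + (a * d) ** b)

# Two scales: the first with two patches, the second with one; two subspaces each.
local = [[[0.1, 0.2], [0.3, 0.0]],
         [[0.05, 0.05]]]
gammas = [1.0, 2.0]

d_total = pooled_distortion(local, gammas)
print(d_total)
print(log_logistic_quality(d_total, a=2.0, b=1.0))
```

With this assumed form, a distortion of zero maps to a quality of one, quality decreases monotonically with the pooled distortion, and rescaling all γl is equivalent to rescaling a.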
The goodness-of-fit between the predicted {q} and the subjective ratings is evaluated by the likelihood of {q} given the ratings. We assume a binomial distribution as the a priori distribution of the ratings, because of the non-Gaussianity of opinion scores as well as for computational simplicity. Given M independent databases, where the m-th database contains Nm samples, the total log-likelihood is:   
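One common way to write a binomial-type log-likelihood for scores normalized to the interval (0, 1) is sketched below. The exact expression is an assumption, constant terms are dropped, and the function names and data are hypothetical.

```python
import math

def log_likelihood(predicted, rated, eps=1e-12):
    """ASSUMED binomial-type log-likelihood for one database.

    predicted, rated : sequences of quality scores normalized to (0, 1)
    """
    total = 0.0
    for q, q_hat in zip(predicted, rated):
        q = min(max(q, eps), 1.0 - eps)  # keep the logarithms finite
        total += q_hat * math.log(q) + (1.0 - q_hat) * math.log(1.0 - q)
    return total

def total_log_likelihood(databases):
    """Sum the log-likelihood over independent databases.

    databases : list of (predicted, rated) pairs, one pair per database
    """
    return sum(log_likelihood(p, r) for p, r in databases)

rated = [0.2, 0.8]
good = [0.2, 0.8]   # predictions that match the ratings
bad = [0.5, 0.5]    # uninformative predictions

print(log_likelihood(good, rated), log_likelihood(bad, rated))
print(total_log_likelihood([(good, rated), (bad, rated)]))
```

Under this form, each term is maximized when the prediction equals the rating, so a better-fitting metric attains a higher total log-likelihood.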
The metric accuracy is evaluated by the maximal total log-likelihood, with the estimated parameter sets {αl, βl, γl} and {am, bm}. Note that {αl, βl, γl} remain constant across databases, while (am, bm) is adapted to the m-th database but does not affect the ordinal prediction. With the gradient-descent method, the maximum-likelihood parameter estimation has the solution below.       
Note that the logarithms of γl and am are solved for, so as to guarantee that γl and am are always positive. 
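The effect of solving for a logarithm can be illustrated on a toy problem. The sketch below runs gradient ascent on a simple concave objective (equivalent to gradient descent on its negative) while updating the logarithm of the parameter; the objective, step size, and function name are hypothetical and unrelated to the actual likelihood.

```python
import math

def ascend_in_log_space(grad_wrt_param, param0, step=0.05, iterations=500):
    """Gradient ascent on u = log(param), so that param = exp(u) stays positive."""
    u = math.log(param0)
    for _ in range(iterations):
        param = math.exp(u)
        # Chain rule: dL/du = dL/dparam * dparam/du = dL/dparam * param.
        u += step * grad_wrt_param(param) * param
    return math.exp(u)

# Toy objective L(g) = -(g - 3)^2, whose maximum is at g = 3.
gamma = ascend_in_log_space(lambda g: -2.0 * (g - 3.0), param0=0.5)
print(gamma)
```

Whatever the step taken in u, the recovered parameter exp(u) can never become negative, which is the purpose of the reparameterization.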
Figure 1
 
Phase of ISA bases. (a) Complete ISA bases with decreasing numbers of subspaces leading to decreasing discriminability on orientations and locations. (b) Incomplete ISA bases with increasing subspace sizes and four subspaces maintaining consistent discriminability on orientations (with spectral wings) and locations (in spatial area), but gradually accumulating high spatial-frequency components (decentralizing spectrum). (c) Phase discriminating orientations (as the outer sector) and locations (as the inner sector). The bases without a red frame are obtained by linearly interpolating between the nearest two bases with a red frame. We set the size of the ISA bases to 8 × 8 pixels and follow the other default settings in the FastICA toolbox of Hyvärinen's group.
Figure 2
 
Illustration of the perceptual quality model, where the multiscale reference image (green) and distorted image (blue) are projected onto the ISA subspaces (red) patch by patch, and then the amplitude and phase differences between patch pairs are combined into a quality score. Dashed arrows represent the iterations over image scales (black), spatial locations (yellow), or subspaces (red). Phase difference records only the strength but not the direction of the phase deviation as illustrated in the top-right subfigure. The images are reproduced from the photo, Two Macaws by Steve Kelly, with permission from Eastman Kodak Company.
Figure 3
 
Typical distortions of the TID2008 database with subjective/objective quality scores (Ponomarenko et al., 2009). For each row, the icons respectively annotate the best and the worst score rated by the corresponding method.
Figure 4
 
Metric accuracy with various configurations. (a) Various numbers of ISA subspaces, including the ICA-based case, 1D × 64, on the far right; (b) various ISA subspace sizes; and (c) various amplitude–phase combination functions. Green triangles mark the total log-likelihood, blue circles mark the average ρs over all databases, and blue vertical bars depict the range of ρs over individual databases; cyan crosses depict ρs on WIQ, red points depict ρs on TID16∼17, and red crosses depict the lowest ρs excluding WIQ and TID16∼17.
Table 1
 
Comparison of subjectively rated image quality databases, where most information is cited from Winkler (2012). Hs = screen height, Hp = picture height; ACR = absolute category rating, PC = pair comparison method (ITU-T P.910); DSIS = double stimulus impairment scale, DSCQS = double stimulus continuous quality scale; MSCQS = multiple stimuli, continuous quality scale (ITU-R BT.500-11).
Database | LIVE | IVC | Toyama | TID | A57 | WIQ | CSIQ | LAR | BA | FourierSB | Meerwald
Number of rated images | 779 | 182 | 168 | 1,700 | 54 | 80 | 866 | 120 | 120 | 210 | 120
Number of distortion types | 5 | 5 | 2 | 17 | 5 | 1 | 6 | 3 | 2 | 6 | 2
Image type | Color | Color | Color | Color | Gray | Gray | Color | Color | Gray | Gray | Gray
Resolution | ∼768 × 512 | 512 × 512 | 768 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512 | 512 × 512
Number of subjects | 20 ∼ 29 | 15 | 16 | 33 | 7 | 30 | 5 ∼ 7 | 19 | 17 | 7 | 14
Screen | 21″ CRT | CRT | 17″ CRT | 19″ LCD | Papers | 17″ LCD/CRT | LCD | CRT | 24″ LCD | LCD | 24″ LCD
Distance | 2 ∼ 2.5 Hs | 6 Hs | 4 Hp | Varying | 4 Hp | 4 ∼ 6 Hs | 80 cm | 4 Hp | 6 Hs | 6 Hs | 6 Hs
Rating method | ACR | DSIS | ACR | PC | MSCQS | DSCQS | MSCQS | DSIS | DSIS | DSIS | DSIS
Subjective data | DMOS | MOS | Raw | MOS | DMOS | Raw | DMOS | Raw | Raw | Raw | Raw