We propose an efficient methodology for comparing computational models of a perceptually discriminable quantity. Rather than comparing model responses to subjective responses on a set of pre-selected stimuli, the stimuli are computer-synthesized so as to optimally distinguish the models. Specifically, given two computational models that take a stimulus as an input and predict a perceptually discriminable quantity, we first synthesize a pair of stimuli that maximize/minimize the response of one model while holding the other fixed. We then repeat this procedure, but with the roles of the two models reversed. Subjective testing on pairs of such synthesized stimuli provides a strong indication of the relative strengths and weaknesses of the two models. Specifically, the model whose extremal stimulus pairs are easier for subjects to discriminate is the better model. Moreover, careful study of the synthesized stimuli may suggest potential ways to improve a model or to combine aspects of multiple models. We demonstrate the methodology for two example perceptual quantities: contrast and image quality.

$N$-dimensional space would require a total of $2^N$ samples, an unimaginably large number for stimulus spaces with dimensionality on the order of thousands to millions.
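To give a feel for the scale of $2^N$, the short sketch below counts the decimal digits of $2^N$ for a given dimensionality. The function name `digits_in_sample_count` is a hypothetical helper introduced here for illustration only.

```python
import math


def digits_in_sample_count(n_dims: int) -> int:
    """Number of decimal digits in 2**n_dims, computed via logarithms.

    Using log10 avoids building the huge integer itself, which matters
    when n_dims is in the millions.
    """
    return math.floor(n_dims * math.log10(2)) + 1


# Illustrative dimensionalities: a tiny space, a small patch, a full image.
for n in (10, 1_000, 1_000_000):
    print(n, digits_in_sample_count(n))
```

Even for a stimulus with only a thousand dimensions, the sample count has hundreds of digits, which is why exhaustive sampling is not an option.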

$S$ and a perceptually discriminable quantity $q(s)$, defined for all elements $s$ in $S$. We also assume a subjective assessment environment, in which a human subject can compare the perceptual quantity $q(s)$ for any stimulus $s$ with the value for another stimulus $s'$. The goal is to compare two computational models, $M_1$ and $M_2$ (each of which takes any stimulus $s$ in $S$ as input and gives a prediction of $q(s)$), to determine which provides a better approximation of $q$ based on a limited number of subjective tests.

$q$. Average subjective responses are then compared with model responses, and the model that, for example, predicts a higher percentage of responses correctly is declared the winner.

$q$, then they constitute strong evidence against the model that was held constant. The same test may be performed for stimuli generated with the roles of the two models reversed, so as to generate counterexamples for the other model.

$L_2$ is placed at the center of a background of uniform luminance $L_1$. The perceptual quantity $q(L_1, L_2)$ is the perceived contrast between the foreground and the background. These stimuli live in a two-dimensional parameter space, specified by the pair $[L_1, L_2]$. This allows us to depict the problem and solution graphically. Suppose that the maximal and minimal luminance values allowed in the experiment are $L_{\max}$ and $L_{\min}$, respectively. Also assume that the foreground luminance is always higher than the background, i.e., $L_2 > L_1$. Then the entire stimulus space can be depicted as a triangular region in a two-dimensional coordinate system defined by $L_1$ and $L_2$, as shown in Figure 3B.

$q$. The first model states that the perceived contrast is determined by the difference between the foreground and the background luminances, i.e., $M_1 = L_2 - L_1$. In the second model, the perceived contrast is determined by the ratio between the luminance difference and the background luminance, i.e., $M_2 = (L_2 - L_1)/L_1$.

$M_1$ and $M_2$ are always straight lines, which are plotted in Figures 5A and 5B, respectively. The fact that the level sets of the two models are generally not parallel implies that subsets of images producing the same value in one model (i.e., lying along a contour line of that model) will produce different values in the other model.

$A = [L_1^i, L_2^i]$. A pair of stimuli with matching $M_1$ but extremal values of $M_2$ is given by

$M_2$ but extremal $M_1$ values is

($B$, $C$) and ($D$, $E$) are subject to visual inspection using a 2AFC method, i.e., the subjects are asked to pick the stimulus from each pair that appears to have higher contrast. This procedure is repeated with different initial points $A$.
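As a minimal sketch of how the two extremal pairs can be obtained for the contrast example, the code below derives them directly from the stated definitions $M_1 = L_2 - L_1$ and $M_2 = (L_2 - L_1)/L_1$ and the luminance bounds. It is a derivation under those definitions, not a transcription of the original equations; the function name `mad_pairs` is hypothetical, and the order of the stimuli within each returned pair is an assumption that need not match the original labeling of $B$, $C$, $D$, and $E$.

```python
def mad_pairs(L1_i, L2_i, L_min, L_max):
    """Extremal stimuli on the two level sets through A = (L1_i, L2_i).

    Assumes 0 < L_min <= L1_i < L2_i <= L_max.
    Each stimulus is a (background, foreground) luminance pair.
    """
    d = L2_i - L1_i   # value of M1 at A (luminance difference)
    r = L2_i / L1_i   # luminance ratio at A; M2 = r - 1

    # Fixed M1 (line L2 = L1 + d): M2 = d / L1 is largest at the smallest
    # background and smallest when the foreground reaches L_max.
    fixed_m1 = ((L_min, L_min + d), (L_max - d, L_max))

    # Fixed M2 (line L2 = r * L1): M1 = (r - 1) * L1 grows with L1, so the
    # extremes again sit at the two ends of the allowed luminance range.
    fixed_m2 = ((L_min, r * L_min), (L_max / r, L_max))

    return fixed_m1, fixed_m2


# Illustrative values only (arbitrary luminance units).
fixed_m1, fixed_m2 = mad_pairs(20.0, 50.0, 10.0, 100.0)
print(fixed_m1)  # same difference, different ratios
print(fixed_m2)  # same ratio, different differences
```

Because both level sets are straight lines in the $[L_1, L_2]$ plane, the extremal points always fall on the boundary of the triangular stimulus region.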

$M_2$ is a better model than $M_1$, then the perceived contrast between the ($D$, $E$) pairs should be harder to distinguish than the ($B$, $C$) pairs. In other words, in the 2AFC test, the percentage of choosing either $D$ or $E$ should be closer to 50%, whereas the percentage of choosing either $B$ or $C$ should be closer to 0% or 100%. In some cases, there may not be a clear winner. For example, if the perceived contrasts between $B$ and $C$ and between $D$ and $E$ are both highly distinguishable, then neither model would provide a good prediction of the visual perception of contrast. In other words, the stimuli generated to extremize one model serve to falsify the other (although their relative degrees of failure may still be different and measurable with MAD competition).

$Z$-scores. The mean opinion score and the standard deviation between subjective scores were computed for each image. The Video Quality Experts Group (www.vqeg.org) has suggested several evaluation criteria to assess the performance of objective image quality models. These criteria include the linear correlation coefficient after non-linear regression, the linear correlation coefficient after variance-weighted non-linear regression, the rank-order correlation coefficient, and the outlier ratio. Details about the evaluation procedure can be found in VQEG (2000). It has been reported in Wang et al. (2004) that the SSIM index significantly outperforms the MSE for the LIVE database, based on these criteria. However, as mentioned earlier, it may not be appropriate to draw strong conclusions from these tests, because the space of images is so vast that even a database containing thousands or millions of images will not be sufficient to adequately cover it. Specifically, the LIVE database is limited in both the number of full-quality reference images and in the number and level of distortion types.

$M_1$ (i.e., the set of all images having the same value of $M_1$) as well as a level set of $M_2$, each containing the initial image. Starting from the initial image, we iteratively move along the $M_1$ level set in the direction in which $M_2$ is maximally increasing/decreasing. The iteration continues until a maximum/minimum $M_2$ image is reached. Figure 7 also demonstrates the reverse procedure for finding the maximum/minimum $M_1$ images along the $M_2$ level set. The maximally increasing/decreasing directions may be computed from the gradients of the two image quality metrics, as described in 3. This gradient descent/ascent procedure does not guarantee that we will reach the global minimum/maximum on the level set (i.e., we may get “stuck” in a local minimum). As such, a negative result (i.e., the two images are indiscriminable) may not be meaningful. Nevertheless, a positive result may be interpreted unambiguously.

$\sigma_l^2$ determines the initial distortion level. Specifically, we let $\sigma_l^2 = 2^l$ for $l = 0, 1, 2, \ldots, 9$, respectively. For each noise level, we generate four test images (minimum/maximum MSE with the same SSIM and minimum/maximum SSIM with the same MSE) using the iterative constrained gradient ascent/descent procedure described in 3. Sample synthesized images are shown in Figure 9.

*prove* a model to be correct: it only offers an efficient means of selecting stimuli that are likely to *falsify* it. As such, it should be viewed as complementary to, rather than a replacement for, the conventional direct method for model evaluation, which typically aims to explore a much larger portion of the stimulus space. Second, depending on the specific discriminable quantity and the competing models, the computational complexity of generating the stimuli can be quite significant, possibly prohibitive. The constrained gradient ascent/descent algorithms described in 3 assume that both competing models are differentiable and that their gradients may be efficiently computed (these assumptions hold for the models used in our current experiments). Third, if the search space of the best MAD stimulus is not concave/convex, then the constrained gradient ascent/descent procedure may converge to local maxima/minima. More advanced search strategies may be used to partially overcome this problem, but they are typically more computationally costly and still do not offer guarantees of global optimality. Nevertheless, the locally optimal MAD stimuli may be sufficient to distinguish the two competing models. Specifically, if the generated stimuli are discriminable, then they will still serve to falsify the model that scores them as equivalent. Fourth, MAD-generated stimuli may be highly unnatural, and one might conclude from this that the application scope of one or both models should be restricted. Finally, there might be cases where the extremal stimuli of each model succeed in falsifying the other model. Alternatively, each of the models could be falsified by the other in a different region of the stimulus space. In such cases, we may not be able to reach a conclusion that one model is better than the other. However, such double-failure results in MAD competition are still valuable because they can reveal the weaknesses of both models and may suggest potential improvements.

$\mathbf{x}$ and $\mathbf{y}$ be column vector representations of two image patches (e.g., 8 × 8 windows) extracted from the same spatial location in images $\mathbf{X}$ and $\mathbf{Y}$, respectively. Let $\mu_x$, $\sigma_x^2$, and $\sigma_{xy}$ represent the sample mean of the components of $\mathbf{x}$, the sample variance of $\mathbf{x}$, and the sample covariance of $\mathbf{x}$ and $\mathbf{y}$, respectively:

$N_P$ is the number of pixels in the local image patch and $\mathbf{1}$ is a vector with all entries equaling 1. The SSIM index between $\mathbf{x}$ and $\mathbf{y}$ is defined as

$C_1$ and $C_2$ are small constants given by $C_1 = (K_1 R)^2$ and $C_2 = (K_2 R)^2$, respectively. Here, $R$ is the dynamic range of the pixel values (e.g., $R = 255$ for 8 bits/pixel grayscale images), and $K_1 \ll 1$ and $K_2 \ll 1$ are two scalar constants ($K_1 = 0.01$ and $K_2 = 0.03$ in the current implementation of SSIM). It can be easily shown that the SSIM index achieves its maximum value of 1 if and only if the two image patches $\mathbf{x}$ and $\mathbf{y}$ being compared are exactly the same.
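For reference, the quantities named in this passage can be written out as follows. This is the commonly cited form of the SSIM index from Wang et al. (2004), expressed with the symbols defined here; the $N_P - 1$ normalization of the variance and covariance is an assumption and may differ from the original equations.

```latex
\mu_x = \frac{1}{N_P}\,\mathbf{1}^T \mathbf{x}, \qquad
\sigma_x^2 = \frac{1}{N_P - 1}\,(\mathbf{x} - \mu_x \mathbf{1})^T (\mathbf{x} - \mu_x \mathbf{1}), \qquad
\sigma_{xy} = \frac{1}{N_P - 1}\,(\mathbf{x} - \mu_x \mathbf{1})^T (\mathbf{y} - \mu_y \mathbf{1})

\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) =
\frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
     {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
```

When $\mathbf{x} = \mathbf{y}$, the numerator and denominator coincide term by term, which gives the maximum value of 1 noted in the text.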

$\mathbf{x}_i$ and $\mathbf{y}_i$ are the $i$th sampling sliding windows in images $\mathbf{X}$ and $\mathbf{Y}$, respectively, $W(\mathbf{x}_i, \mathbf{y}_i)$ is the weight given to the $i$th sampling window, and $N_S$ is the total number of sampling windows. $N_S$ is generally smaller than the number of image pixels $N_I$, which prevents the sampling window from exceeding the boundaries of the image. The original implementation of the SSIM measure corresponds to the case of uniform pooling, where $W(\mathbf{x}, \mathbf{y}) \equiv 1$. In Wang and Shang (2006), it was shown that a local information content-weighted pooling method can lead to consistent improvement in image quality prediction on the LIVE database, where the weighting function is defined as
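The patch-level index and its pooling over sliding windows can be sketched as below. This is a simplified illustration, not the reference SSIM implementation: it assumes square, uniformly weighted windows that slide one pixel at a time, a weighted-average pooling rule, and unbiased ($N_P - 1$) variance estimates. The names `ssim_patch`, `ssim_pooled`, and `weight_fn` are hypothetical.

```python
import numpy as np


def ssim_patch(x, y, R=255.0, K1=0.01, K2=0.03):
    """SSIM index between two equally sized patches."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    C1, C2 = (K1 * R) ** 2, (K2 * R) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.sum((x - mu_x) * (y - mu_y)) / (x.size - 1)
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den


def ssim_pooled(X, Y, win=8, weight_fn=None):
    """Weighted average of patch SSIM over all windows inside the image.

    With weight_fn=None every window gets weight 1 (uniform pooling).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    rows, cols = X.shape
    num = den = 0.0
    # Windows are kept inside the image, so there are fewer windows
    # than pixels: (rows - win + 1) * (cols - win + 1).
    for r in range(rows - win + 1):
        for c in range(cols - win + 1):
            x = X[r:r + win, c:c + win]
            y = Y[r:r + win, c:c + win]
            w = 1.0 if weight_fn is None else weight_fn(x, y)
            num += w * ssim_patch(x, y)
            den += w
    return num / den


# Illustrative use on synthetic data.
rng = np.random.default_rng(1)
ref = rng.uniform(0, 255, size=(16, 16))
noisy = np.clip(ref + rng.normal(0, 10, size=ref.shape), 0, 255)
print(ssim_pooled(ref, ref), ssim_pooled(ref, noisy))
```

A content-weighted variant would only need a different `weight_fn`; the pooling loop itself stays the same.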

$\mathbf{Y}$, we have

$\mathbf{1}$ denotes a column vector with all entries equaling 1. For the case that $W(\mathbf{x}, \mathbf{y}) \equiv 1$, we have $\nabla_{\mathbf{Y}} W(\mathbf{x}, \mathbf{y}) \equiv 0$ in Equations B3 and B4. Therefore, $\nabla_{\mathbf{Y}} S(\mathbf{X}, \mathbf{Y})$ can be calculated by combining Equations A3, B2, B3, B4, B6, and B8.

$M_2$ while constrained on the $M_1$ level set. We represent images as column vectors, in which each entry represents the grayscale value of one pixel. Denote the reference image $\mathbf{X}$ and the synthesized image at the $n$th iteration $\mathbf{Y}_n$ (with $\mathbf{Y}_0$ representing the initial image). We compute the gradient of the two image quality models (see 2), evaluated at $\mathbf{Y}_n$:

$\mathbf{G}_n$, by projecting out the component of $\mathbf{G}_{2,n}$ that lies in the direction of $\mathbf{G}_{1,n}$:

$M_1$ is evaluated at $\mathbf{Y}'_n$, and an appropriate amount of this vector is added in order to guarantee that the new image has the correct value of $M_1$:

$\nu$ is straightforward, but in general it might require a one-dimensional (line) search.

$\lambda$ is used to control the speed of convergence and $\nu$ must be adjusted dynamically so that the resulting vector does not deviate from the level set of $M_1$. The iteration continues until the image satisfies a certain convergence condition, e.g., the mean squared change in the synthesized image between two consecutive iterations is less than some threshold. If metric $M_2$ is differentiable, then this procedure will converge to a local maximum/minimum of $M_2$. In general, however, we have no guaranteed means of finding the global maximum/minimum (note that the dimension of the search space is equal to the number of pixels in the image), unless the image quality model satisfies certain properties (e.g., convexity or concavity). In practice, there may be some additional constraints that need to be imposed during the iterations. For example, for 8 bits/pixel grayscale images, we may need to limit the pixel values to lie between 0 and 255.
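The iteration described above can be sketched as follows. This is a reconstruction from the prose, not the original equations: $M_1$ is taken to be the MSE, for which returning to the level set reduces to rescaling the error vector, and $M_2$ is replaced by a hypothetical weighted squared error that merely stands in for a second differentiable quality model. All function names, the weights `w`, and the default step size `lam` are assumptions.

```python
import numpy as np


def mse(X, Y):
    return np.mean((X - Y) ** 2)


def grad_mse(X, Y):
    return 2.0 * (Y - X) / X.size


def weighted_err(X, Y, w):
    # Hypothetical stand-in for a second differentiable model M2.
    return np.mean(w * (X - Y) ** 2)


def grad_weighted_err(X, Y, w):
    return 2.0 * w * (Y - X) / X.size


def mad_search(X, Y0, w, lam=10.0, sign=+1.0, n_iter=200, tol=1e-12):
    """Ascend (sign=+1) or descend (sign=-1) M2 on the M1 = MSE level set."""
    radius = np.linalg.norm(Y0 - X)  # fixed by the initial MSE
    Y = Y0.copy()
    for _ in range(n_iter):
        G1 = grad_mse(X, Y)
        G2 = grad_weighted_err(X, Y, w)
        # Remove the component of G2 along G1, so that to first order
        # the step does not change M1.
        G = G2 - (G2 @ G1) / (G1 @ G1) * G1
        Y_prime = Y + sign * lam * G
        # Return to the level set. For MSE the gradient at Y_prime is
        # parallel to (Y_prime - X), so adding a multiple of it amounts
        # to rescaling the error vector to the original length.
        D = Y_prime - X
        Y_next = X + D * (radius / np.linalg.norm(D))
        converged = np.mean((Y_next - Y) ** 2) < tol
        Y = Y_next
        if converged:
            break
    return Y


# Illustrative run on a synthetic 64-pixel "image".
rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=64)
Y0 = X + rng.normal(0, 4, size=64)
w = np.linspace(0.5, 2.0, 64)
Y_max = mad_search(X, Y0, w, sign=+1.0)
Y_min = mad_search(X, Y0, w, sign=-1.0)
print(mse(X, Y0), mse(X, Y_max), mse(X, Y_min))
print(weighted_err(X, Y_min, w), weighted_err(X, Y0, w), weighted_err(X, Y_max, w))
```

For a model whose level sets are not spheres, the rescaling step would be replaced by a one-dimensional search along the gradient of $M_1$, and a pixel-range constraint such as clipping to [0, 255] would need to be reconciled with the level-set constraint at each iteration.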