The stimuli were shown side by side against a dark gray background on an Eizo ColorEdge CG243W display (24-inch, 1,920 × 1,200 pixels, 60 Hz) connected to an Nvidia Quadro 600 graphics card. The experiment was programmed in Processing 4.0 (https://processing.org).
The fixed standard stimulus was always presented on the left side of the screen, the adjustable test stimulus on the right side. In the first subexperiment, the standard object was always rendered with the Fresnel-BRDF and the test object with the Ward-BRDF, but both had the same depth scaling. In the second subexperiment, the standard object always had a depth scaling of 1.1, whereas the depth scaling of the test object was selected from the five different levels of the depth factor.
At the start of each trial, both objects had nominally the same reflection strength (for the Ward-BRDF, the transformation from IOR to ρs described earlier was used to approximate this). The participants first rated the gloss impression of the standard object on a scale from 0 (absent) to 9 (very high). They then did the same for the test object. Next, they rated the relative depth of the standard and the test object on a seven-step scale whose center position corresponded to “equal.” For both directions (“left” vs. “right”), an unequal depth could be further specified with the labels “somewhat larger,” “larger,” and “much larger.” The current settings in all three tasks were reflected in graphical scales shown below the stimuli. After the third rating was finished, the scales disappeared and the participants had to match the gloss impression of the two objects as closely as possible by adjusting the reflection strength of the test stimulus. Possible values ranged from IOR = 1 to IOR = 3 in 100 equidistant steps (in cases where the Ward-BRDF was adjusted, the corresponding transformed values ρs were used). Because a perfect match is usually not achievable, the last task in each trial was to rate the quality of the best achievable match on a scale with the six levels “impossible,” “bad,” “somewhat bad,” “good,” “very good,” and “perfect.” The number of levels in each of the rating scales was selected based on the expected accuracy of the assessment.
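To make the response format of a single trial concrete, the following minimal sketch encodes the three rating scales, the adjustment grid, and the match-quality scale. It is written in Python for illustration only and is not the Processing code used in the experiment. All names (`ior_at`, `TrialResponse`, `DEPTH_LABELS`, and so on) are hypothetical, as is the left-to-right ordering of the depth labels. The sketch also assumes that "100 equidistant steps" means 101 settable positions with a step size of 0.02; the text would equally allow 100 positions. The transformation from IOR to ρs for the Ward-BRDF is not reproduced here.

```python
from dataclasses import dataclass

# Adjustment range for the reflection strength, as stated in the text.
IOR_MIN, IOR_MAX = 1.0, 3.0
# Assumption: 100 steps between the end points, i.e. positions 0..100.
N_STEPS = 100

def ior_at(step: int) -> float:
    """Map an integer slider position to an IOR value on an equidistant grid."""
    if not 0 <= step <= N_STEPS:
        raise ValueError(f"step must be in 0..{N_STEPS}, got {step}")
    return IOR_MIN + (IOR_MAX - IOR_MIN) * step / N_STEPS

# Seven-step relative depth scale; the ordering is an assumption.
DEPTH_LABELS = (
    "left much larger", "left larger", "left somewhat larger",
    "equal",
    "right somewhat larger", "right larger", "right much larger",
)

# Six-level scale for the quality of the best achievable match.
MATCH_QUALITY_LABELS = (
    "impossible", "bad", "somewhat bad", "good", "very good", "perfect",
)

@dataclass
class TrialResponse:
    """Hypothetical record of one participant's responses in one trial."""
    gloss_standard: int   # 0 (absent) .. 9 (very high)
    gloss_test: int       # 0 (absent) .. 9 (very high)
    relative_depth: int   # index into DEPTH_LABELS
    matched_step: int     # slider position, 0..N_STEPS
    match_quality: int    # index into MATCH_QUALITY_LABELS

    def __post_init__(self) -> None:
        # Reject responses that fall outside the scales described in the text.
        for name in ("gloss_standard", "gloss_test"):
            if not 0 <= getattr(self, name) <= 9:
                raise ValueError(f"{name} must be in 0..9")
        if not 0 <= self.relative_depth < len(DEPTH_LABELS):
            raise ValueError("relative_depth out of range")
        if not 0 <= self.matched_step <= N_STEPS:
            raise ValueError("matched_step out of range")
        if not 0 <= self.match_quality < len(MATCH_QUALITY_LABELS):
            raise ValueError("match_quality out of range")

    @property
    def matched_ior(self) -> float:
        """IOR value corresponding to the final slider position."""
        return ior_at(self.matched_step)
```

A response could then be stored as `TrialResponse(gloss_standard=6, gloss_test=5, relative_depth=3, matched_step=50, match_quality=3)`, whose `matched_ior` is the midpoint of the adjustment range under the grid assumption above.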