Human observers were asked to judge whether 720 individual probe locations contained a highlight. These probe locations (single pixels) were selected based on ground truth and a threshold model that used only an intensity threshold to predict highlights.
Figure 2 illustrates the probe location selection process.
Mean responses from Experiment 1, grouped by the categories of probe locations, are shown in Figure 4a. Results from the first experiment were later used as a target for our pruning algorithm and are referred to as the target set. Mean responses from Experiment 2 are shown in Figure 4b. The probe locations in this experiment were selected from a different set of images according to the same criteria as the target set. Results from the second experiment were used for validation and are referred to as the validation set. Note that the purpose of the validation set was not to validate human behavior in the target set, but rather to validate how well the model's imitation of human behavior generalizes to a new set of stimuli and locations. We therefore do not test for behavioral differences between the two human response sets.
Results show, as expected, that category (a) pixels, which contain highlights and are brighter than the threshold (see Methods), are most likely to be classified as highlights. Similarly, category (d) pixels, which do not contain a highlight and are darker than the threshold, are least likely to be classified as highlights. Interestingly, pixels from categories (b) and (c), which either contain a highlight but are darker than the threshold or contain no highlight but are brighter than the threshold, are on average similarly likely to be judged as highlights. This suggests that relative pixel intensity alone does influence human highlight perception, but that additional factors also play a role.
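To make the four category definitions concrete, the following sketch (in Python; the array names and the boolean ground-truth highlight mask are our own illustrative assumptions, not part of the stimulus-generation code) assigns probe pixels to categories (a) to (d) from the ground truth and a single intensity threshold:

```python
import numpy as np

def categorize_probe_pixels(intensity, highlight_mask, threshold):
    """Assign each probe pixel to one of the categories (a)-(d).

    intensity      : array of pixel intensities at the probe locations
    highlight_mask : boolean array, True where the ground truth contains a highlight
    threshold      : scalar intensity threshold of the simple threshold model
    """
    bright = intensity > threshold
    categories = np.empty(intensity.shape, dtype="<U1")
    categories[highlight_mask & bright] = "a"    # highlight, brighter than threshold
    categories[highlight_mask & ~bright] = "b"   # highlight, darker than threshold
    categories[~highlight_mask & bright] = "c"   # no highlight, brighter than threshold
    categories[~highlight_mask & ~bright] = "d"  # no highlight, darker than threshold
    return categories
```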
The pattern of results for the four pixel categories is very similar in both experiments. It shows that human observers perceived highlights in our stimuli and that they were able to interpret and respond to single-pixel probe locations. Ground-truth and threshold predictions each seem to predict mean human responses partially, and about equally well (correlations with mean human responses of r = 0.57 and r = 0.58 for the target dataset, and r = 0.51 and r = 0.49 for the validation dataset, respectively). As a comparison, we calculated the intercorrelation among human observers. Because human responses were binary, we randomly divided the observer group in two 10,000 times, correlating the mean responses of the two groups each time. The maximum correlations we observed were r = 0.82 for the target dataset and r = 0.69 for the validation dataset (mean correlations were r = 0.73 and r = 0.57, respectively). Figure 5 shows the distributions of these human-to-human correlations. A large proportion of the variance in human responses remains unexplained by other observers.
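A minimal sketch of this split-half procedure (in Python, assuming a binary observer-by-probe response matrix; variable names and the random seed are our own) yields the distribution of human-to-human correlations summarized in Figure 5:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

def split_half_correlations(responses, n_splits=10_000):
    """Distribution of human-to-human correlations via random split halves.

    responses : (n_observers, n_probes) binary array of highlight judgments
    Returns n_splits Pearson correlations between the mean responses of two
    randomly drawn halves of the observer group.
    """
    n_observers = responses.shape[0]
    correlations = np.empty(n_splits)
    for i in range(n_splits):
        order = rng.permutation(n_observers)
        half_a, half_b = order[: n_observers // 2], order[n_observers // 2:]
        correlations[i] = np.corrcoef(
            responses[half_a].mean(axis=0),
            responses[half_b].mean(axis=0),
        )[0, 1]
    return correlations
```

The maximum and mean of the returned distribution correspond to the values reported above.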
To test whether this is due to idiosyncratic response behavior or simply noisy responses, we compared inter-rater and intra-rater agreement. Because the responses of individual observers are binary, we could not calculate this as a correlation and instead used the rate of agreement as a measure of consistency. To calculate inter- and intra-rater consistency, we defined comparable pixels as those with the same pixel category (a) to (d) (Figure 2), the same image texture category, and the same surface geometry scale. We split each group of comparable pixels in half randomly, thus splitting the entire target set into two halves with a comparable counterpart for each pixel. We calculated intra-rater consistency as the rate of agreement between an individual's responses to comparable pixels in the two halves, and inter-rater consistency as the rate of agreement between an individual's responses and other individuals' responses to comparable pixels. We repeated this process 1000 times to obtain estimates of inter- and intra-rater consistency.
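The agreement computation can be sketched as follows (Python; one random split is shown, with hypothetical variable names and with comparable pixels encoded as a single group label per probe; in the analysis this would be repeated 1000 times and averaged per subject):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

def consistency_one_split(responses, groups):
    """Intra- and inter-rater agreement for one random split into comparable halves.

    responses : (n_observers, n_probes) binary array of highlight judgments
    groups    : (n_probes,) labels encoding pixel category (a)-(d), image texture
                category, and surface geometry scale; equal labels = comparable pixels
    Returns per-observer (intra, inter) agreement rates.
    """
    half_a, half_b = [], []
    for g in np.unique(groups):
        idx = rng.permutation(np.flatnonzero(groups == g))
        k = len(idx) // 2
        half_a.extend(idx[:k])        # each pixel in half_a is paired with a
        half_b.extend(idx[k:2 * k])   # comparable counterpart in half_b
    a, b = responses[:, half_a], responses[:, half_b]

    n_obs = responses.shape[0]
    # Intra-rater: each observer's agreement with their own responses across halves
    intra = (a == b).mean(axis=1)
    # Inter-rater: each observer's mean agreement with every other observer
    inter = np.array([
        np.mean([(a[i] == b[j]).mean() for j in range(n_obs) if j != i])
        for i in range(n_obs)
    ])
    return intra, inter
```

Averaging the per-observer values over repeated splits gives the per-subject means that enter the paired t-test below.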
A paired t-test of the per-subject means of these two consistency distributions showed a significant difference, t(12) = 6.62, p < 0.001, with intra-rater consistencies higher than inter-rater consistencies (mean ± SD, 0.71 ± 0.06 vs. 0.61 ± 0.07, respectively). As a measure of the effect size, Cohen's d = 1.84. The same analysis for the validation set responses also revealed a significant difference in the same direction, t(14) = 6.39, p < 0.001, 0.70 ± 0.07 vs. 0.58 ± 0.04, Cohen's d = 1.65. This indicates that variance in human responses that could not be explained by human-to-human correlations or our model predictions is not just due to noise but also to idiosyncratic behavior.
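For reference, the reported effect sizes are consistent with Cohen's d for paired samples computed from the difference scores (an inference from the reported values; the exact formula is not stated here), i.e., the mean intra-minus-inter difference divided by the standard deviation of those differences, which relates to the t statistic as

$$
d = \frac{\bar{D}}{s_D} = \frac{t}{\sqrt{n}}, \qquad
d_\text{target} = \frac{6.62}{\sqrt{13}} \approx 1.84, \qquad
d_\text{validation} = \frac{6.39}{\sqrt{15}} \approx 1.65,
$$

where $\bar{D}$ and $s_D$ denote the mean and standard deviation of the per-subject differences and $n$ is the number of observers (degrees of freedom plus one) in each experiment.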