Over the course of five experiments, we used subject responses to select a set of 500 candidate images, starting from 31,500 randomly selected images. Participants were shown images one at a time and were asked to classify each as high gloss or low gloss by pressing one of two keys. In addition, they could flag an image with the space bar if it contained no recognizable object. Every participant saw 1500 images. In the first round, we showed 15,000 images selected randomly from the overall set, divided among 10 subjects (eight female, two male; mean age ± SD, 24.8 ± 4.8 years), so each subject saw 1500 images (750 from each ground-truth material) and each image was judged by one subject. For the second round, we removed all images that had been flagged as unrecognizable. We selected all of the remaining images that had been classified incorrectly (587 low-gloss and 1817 high-gloss images), plus enough correctly judged images to bring the total to 2250 images from each ground-truth category (1663 and 433 correctly judged low- and high-gloss images, respectively). These were judged by 15 participants (12 female, three male; mean age ± SD, 23.7 ± 3.8 years). Again, every participant saw 750 images from each ground-truth material, yielding five judgments per image, which we combined with the classifications of these images from the first round. From these results (six binary judgments per image), we divided the images from each ground-truth category into seven bins according to the mean response, one bin for each possible proportion of “high gloss” judgments. For ground-truth high-gloss images, we picked 750 images: 107 from each of six bins and 108 from the most incorrectly judged bin. For ground-truth low-gloss images, some bins did not contain enough images to supply the same number. Where this was the case, we took all images in the bin and added the shortfall to the target number for the next bin, proceeding from the most incorrectly judged bin onward (as sketched below). The resulting set of 1500 images was then judged again by four participants (three female, one male; mean age ± SD, 22.5 ± 2.1 years), which resulted in 10 classifications per image after combining these results with those of the first two rounds.
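For concreteness, this bin-filling rule can be sketched in a few lines of Python. The sketch below is an illustration only, not the code we used; in particular, the function name and the random draw within each bin are assumptions, since the procedure above does not specify how images were chosen within a bin.

    import random

    def fill_bins(bins, per_bin_target):
        # Pick per_bin_target images from each bin. `bins` is ordered from
        # the most incorrectly judged bin to the most correctly judged one.
        # When a bin is smaller than the current target, take all of its
        # images and carry the shortfall forward to the next bin's target.
        picked = []
        target = per_bin_target
        for images in bins:
            if len(images) <= target:
                picked.extend(images)                        # bin exhausted
                target = per_bin_target + (target - len(images))
            else:
                picked.extend(random.sample(images, target))
                target = per_bin_target                      # quota met; reset
        return picked

Carrying the deficit forward means that a shortfall in a sparse bin is made up from the next most incorrectly judged bin, preserving the emphasis on incorrectly judged images.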
Because the number of incorrectly perceived high-gloss images was much larger than that of low-gloss images, we repeated the search process, this time in two stages. In the first stage, we showed 16,500 images (12,000 low gloss and 4500 high gloss) to 16 subjects (14 female, two male; mean age ± SD, 23.8 ± 3.2 years), with 1500 images each (750 low gloss and 750 high gloss), resulting in one classification response per low-gloss image. High-gloss images were included to balance the stimulus set, each being shown to several subjects, but their data were not used to identify candidate images; low-gloss images were each seen by only one subject. For the second stage, we again removed all images flagged as unrecognizable and from the remainder took 750 low-gloss and 750 high-gloss images (favoring incorrectly judged low-gloss images; see the sketch below) and tested nine further subjects (six female, three male; mean age ± SD, 23.9 ± 2.9 years) on these 1500 images, resulting in 10 binary judgments for each low-gloss image. We did not use the high-gloss images from this experiment because the first set of experiments had already yielded enough to fill our diagnostic set.
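One way to implement favoring incorrectly judged low-gloss images is to take all misclassified images first and fill the remaining slots at random from the correctly classified ones. The following Python sketch illustrates this reading only; the helper name and the random top-up are illustrative assumptions, as the exact rule is not spelled out above.

    import random

    def select_favoring_errors(misjudged, correct, n_total=750):
        # Take every incorrectly judged image, then top up with a random
        # draw of correctly judged ones. Assumes len(misjudged) <= n_total.
        return list(misjudged) + random.sample(correct, n_total - len(misjudged))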
The images from the final stage of the first set of experiments and the low-gloss images from the final stage of the second set of experiments were combined and divided into five bins ranging from “seen as low gloss” to “seen as high gloss” (see the sketch below). We picked 50 images from each ground-truth material per bin, except for the most “seen as high gloss” bin, which contained only one ground-truth low-gloss image; we filled the remainder of that bin with ground-truth high-gloss images. These 500 candidate images formed the set used in the crowdsourcing experiment. These images, and the images of the same scenes from the other material category, were withheld from training the classifiers, leaving 148,922 images for training and validation.
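The five-bin division can be made concrete by binning on the number of “high gloss” responses out of the 10 judgments per image. The sketch below is a minimal illustration with hypothetical names, assuming equal-width bins; the exact bin edges are not specified above.

    def bin_by_response_counts(high_counts, n_judgments=10, n_bins=5):
        # high_counts maps image id -> number of "high gloss" judgments
        # out of n_judgments. Returns n_bins lists of image ids, ordered
        # from "seen as low gloss" to "seen as high gloss".
        bins = [[] for _ in range(n_bins)]
        for image, k in high_counts.items():
            index = min(k * n_bins // n_judgments, n_bins - 1)
            bins[index].append(image)
        return bins

    # Example: an image judged high gloss 9 times out of 10 lands in the
    # top ("seen as high gloss") bin; one judged high gloss twice lands
    # in the second bin from the bottom.
    bins = bin_by_response_counts({"img_0001": 9, "img_0002": 2})

Using integer counts rather than proportions avoids floating-point edge cases at the bin boundaries.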