We compared the framework’s performance on the snag dataset, which contains general-domain images, with its performance on the derm dataset, which contains images from the domain of dermatology (Vaidyanathan et al., 2016). This dataset consists of SNAG data for 26 dermatologists inspecting 29 dermatology images. Figure 21 shows our data set-up, an example of a transcribed narrative, and gaze data for this dataset. The data collection set-up was similar to that of the snag dataset. In this case, when using the adapted Master-Apprentice method, the experimenter functioned as an “apprentice” to elicit rich descriptions from the dermatologists. The dermatologists were instructed to “examine each image while moving toward a diagnosis and describe it aloud as if tutoring the experimenter.” The descriptions in this dataset usually included a differential diagnosis, a final diagnosis, and a self-estimated certainty of the final diagnosis. Again, dermatologists share specific terminology to refer to the morphology they describe; therefore, manual annotation for this dataset was provided by an expert dermatologist using the RegionLabeler tool. More details regarding this dataset can be found in
Vaidyanathan et al. (2016). For the comparison, we considered only the results from the alignment framework that used MSFC and k-means with k = 4.
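To make this configuration concrete, the following is a minimal sketch, not the implementation used in the framework, of how k-means with k = 4 might group gaze fixations into candidate image regions; the fixation coordinates and their 2-D pixel layout are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical fixation coordinates (x, y) in image pixels; real input
# would come from the eye tracker's fixation detection.
fixations = np.array([
    [120, 80], [130, 85], [125, 90],
    [410, 300], [400, 310], [405, 295],
    [640, 120], [650, 130],
])

# Group the fixations into k = 4 candidate image regions.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
region_ids = kmeans.fit_predict(fixations)

for rid in range(4):
    center = kmeans.cluster_centers_[rid]
    count = int(np.sum(region_ids == rid))
    print(f"region {rid}: center=({center[0]:.0f}, {center[1]:.0f}), {count} fixations")
```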
Interestingly, recall values are higher for the derm dataset than for the snag dataset. Recall indicates how many of the alignment pairs in the reference alignments also appear in the framework’s output alignments.
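Concretely, if each alignment is treated as a (label, image-region) pair, recall and precision reduce to set overlaps between the framework’s output and the reference alignments; the pairs below are invented for illustration.

```python
# Hypothetical (label, region) alignment pairs for one image.
reference = {("lesion", "r1"), ("border", "r2"), ("scale", "r3"), ("erythema", "r1")}
output    = {("lesion", "r1"), ("border", "r2"), ("crust", "r4")}

true_positives = reference & output
precision = len(true_positives) / len(output)      # fraction of output pairs that are correct
recall    = len(true_positives) / len(reference)   # fraction of reference pairs recovered

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.67, recall = 0.50
```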
One possible reason for the high recall values is that, as a result of the task instructions, the derm dataset has a precise and limited vocabulary. Owing to the nature of the dermatology field, most regions in the images correspond to exactly one label. On the other hand, owing to the general-domain nature of the images in the snag dataset, many objects in the images correspond to several labels. For example, for the woman in the image in Figure 17, observers mentioned the labels lady, woman, and female. Labels that are not mentioned by a majority of the observers therefore have a low probability of being associated with the corresponding image region, which leads to low recall values.
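This dilution effect can be illustrated with a toy computation: when observers split their references to one region across synonymous labels, no single label reaches a majority, whereas a shared clinical vocabulary concentrates the counts on one label. The observer counts below are invented for illustration.

```python
from collections import Counter

n_observers = 26

# snag-style region: synonymous labels split the observer counts.
snag_labels = Counter({"woman": 10, "lady": 8, "female": 5})
# derm-style region: a shared clinical term concentrates the counts.
derm_labels = Counter({"erythema": 22})

for name, counts in [("snag", snag_labels), ("derm", derm_labels)]:
    for label, count in counts.items():
        p = count / n_observers
        status = "majority" if count > n_observers / 2 else "minority"
        print(f"{name}: P({label} | region) = {p:.2f} ({status})")
```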
Table 9 shows the average precision and recall values for both datasets and both segmentation methods.