Abstract
Deep neural network (DNN)-based face recognition (FR) models have improved greatly over the past decades, achieving, or even exceeding, human-level accuracies under certain viewing conditions, such as frontal face views. However, as we reported in last year’s meeting (XXX et al., 2023), under challenging viewing conditions (e.g., large distances, non-frontal regard) humans outperform DNNs. To shed light on potential explanations for these differences in FR accuracy between humans and DNNs, we turned to eye-tracking paradigms to discern potentially important zones of information uptake for observers, and to compare them with DNN-derived saliency maps. Despite the conceptual similarity between human eye-tracking heatmaps and DNN saliency maps, the literature is sparse in terms of strategic efforts to quantitatively compare the two and to translate human gaze and attention strategies into improved machine performance. We obtained gaze-contingent (GC) human eye-tracking heatmaps and DNN saliency maps for faces under three stimulus conditions: low-spatial-frequency-filtered, high-spatial-frequency-filtered, and full-resolution images. Human participants saw two sequentially presented faces and were asked to determine whether the individuals depicted were siblings (images from Vieira et al., 2014) or two images of the same person (Stirling face database). While human eye-tracking heatmaps were collected during each presentation of a face image (sibling/Stirling), DNN saliency maps were derived from differences in the similarity score between the DNN face embeddings of pairs of face images, using an efficient correlation-based explainable AI approach. We present the characterization and comparison of humans’ and DNNs’ usage of spatial-frequency information in faces, and propose a model-agnostic translation strategy for improved face recognition performance, using an efficient training approach to bring DNN saliency maps into closer register with human eye-tracking heatmaps.
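For illustration only, a saliency map of this general kind (importance of image regions for the pairwise embedding similarity) can be sketched by perturbing local regions of one face and measuring how the similarity to the other face changes. The snippet below is a minimal occlusion-based sketch, not the correlation-based explainable-AI method used in the study; `embed` is a toy stand-in for a face-embedding DNN, included only so the example runs end-to-end.

```python
import numpy as np

def embed(face_img: np.ndarray) -> np.ndarray:
    """Toy stand-in for a DNN face-embedding model: block-averaged pixel
    features, L2-normalized. In practice this would be a pretrained FR network."""
    g = face_img.mean(axis=-1) if face_img.ndim == 3 else face_img
    h, w = g.shape
    g = g[: h - h % 8, : w - w % 8]                     # crop to a multiple of 8
    blocks = g.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))
    v = blocks.ravel().astype(np.float64)
    return v / (np.linalg.norm(v) + 1e-12)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def occlusion_saliency(img_a, img_b, patch=16, stride=8):
    """Saliency for img_a with respect to the pair similarity: regions whose
    occlusion most reduces the embedding similarity to img_b are marked as
    most important for the model's decision."""
    base = cosine_similarity(embed(img_a), embed(img_b))
    h, w = img_a.shape[:2]
    sal = np.zeros((h, w), dtype=np.float32)
    cnt = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            occluded = img_a.copy()
            occluded[y:y + patch, x:x + patch] = img_a.mean()  # grey out the patch
            drop = base - cosine_similarity(embed(occluded), embed(img_b))
            sal[y:y + patch, x:x + patch] += drop
            cnt[y:y + patch, x:x + patch] += 1
    return sal / np.maximum(cnt, 1)

# Example usage with random images standing in for a face pair:
# img_a, img_b = np.random.rand(112, 112, 3), np.random.rand(112, 112, 3)
# saliency = occlusion_saliency(img_a, img_b)
```

A map of this form, normalized to a probability distribution, could then be compared with a fixation-derived heatmap (e.g., via correlation or KL divergence), which is the kind of quantitative human-to-machine comparison the abstract describes.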