Abstract
The rise of large-scale image databases has accelerated productivity in both the human and machine vision communities. Most extant databases were created in three phases: (1) obtaining a comprehensive list of categories to sample; (2) scraping images from the web; and (3) verifying category labels through crowdsourcing. Subtle biases can arise at each stage: offensive labels can be reified as categories; images represent what is typical of the internet rather than what is typical of daily experience; and verification depends on the knowledge and cultural competence of the annotators who provide “ground truth” labels. Here, we describe two studies that examine these biases in extant visual databases and in the deep convolutional neural networks (dCNNs) trained on them. In the first study, 66 observers took part in an experience-sampling experiment via text message. Each received 10 messages per day at random intervals for 30 days and, whenever possible, replied with a photograph of their surroundings (N = 6,280 images). Category predictions for these images were obtained from dCNNs pretrained on the Places database; the dCNNs showed poor classification performance on them. The second study investigated cultural biases. We scraped images of private homes from Airbnb listings in 219 countries. The pretrained dCNNs were less accurate and less confident in recognizing images from the Global South, and we observed significant correlations between dCNN confidence and GDP per capita (r = 0.30) and literacy rate (r = 0.29). Together, these studies show a dissociation between lived visual content and web-based content, and they suggest caution when using the internet as a proxy for visual experience.
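The sketch below illustrates (but is not the authors' code for) the kind of analysis the abstract describes: obtaining a top-1 scene category and a softmax confidence from a CNN pretrained on Places365, and correlating per-country mean confidence with a country-level covariate such as GDP per capita. The checkpoint filename, its internal "state_dict" layout, and the dictionary inputs to the correlation helper are assumptions based on the publicly released Places365 models, not details taken from the paper.

```python
# Minimal sketch (illustrative, not the authors' pipeline):
# classify an image with a Places365-pretrained ResNet-18 and
# correlate per-country mean confidence with a covariate.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image
from scipy.stats import pearsonr

# Standard ImageNet-style preprocessing, as used with the released Places365 models.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def load_places_model(checkpoint_path="resnet18_places365.pth.tar"):
    """Load a ResNet-18 with a 365-way output layer and Places365 weights.

    Assumes the checkpoint stores weights under 'state_dict' with a
    'module.' prefix (as in the publicly distributed checkpoint).
    """
    model = models.resnet18(num_classes=365)
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    state_dict = {k.replace("module.", ""): v
                  for k, v in checkpoint["state_dict"].items()}
    model.load_state_dict(state_dict)
    model.eval()
    return model

def classify(model, image_path):
    """Return (top-1 category index, softmax confidence) for one image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = F.softmax(model(img), dim=1).squeeze(0)
    conf, idx = probs.max(dim=0)
    return idx.item(), conf.item()

def confidence_covariate_correlation(mean_conf_by_country, covariate_by_country):
    """Pearson r between per-country mean confidence and a covariate
    (e.g., GDP per capita), over countries present in both inputs."""
    countries = sorted(set(mean_conf_by_country) & set(covariate_by_country))
    x = [mean_conf_by_country[c] for c in countries]
    y = [covariate_by_country[c] for c in countries]
    return pearsonr(x, y)
```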