James Hays, Alexei Efros; Where in the world? Human and computer geolocation of images. Journal of Vision 2009;9(8):969. doi: 10.1167/9.8.969.
© ARVO (1962-2015); The Authors (2016-present)
In this work we measure how accurately humans can localize arbitrary photographs on the Earth and contrast this against a baseline computational method.
Previous work has studied the placement of scenes into semantic categories (e.g. kitchen, bedroom, forest, etc.) both by humans and computers. With moderate numbers of categories, simple texture-based methods can group scenes almost as well as humans (Renninger 2004, Oliva 2005). The success of computational methods is not a result of any high-level understanding of scenes, but rather of the ease with which these hand-defined categories can be separated by low-level features.
In this study we examine human performance at organizing scenes according to geographic location on the Earth rather than hand-defined semantic category. Participants are shown novel images and asked to pick the location on a globe where the photograph was taken. This task is difficult: many scenes are geographically ambiguous, while others require high-level scene understanding and knowledge of cultural or architectural trends across the Earth. On the other hand, photographs of landmarks are easy to geolocate for both humans and computers.
We compare and contrast human performance with a data-driven computational method using 6.5 million geolocated photographs. For a novel photograph, the algorithm finds the most similar scenes according to the scene gist descriptor, texton histogram, and other features. A voting scheme produces a geolocation estimate from the locations of matching scenes.
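The matching and voting steps can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors stand in for the gist descriptor and texton histogram, the Euclidean distance is an assumed similarity measure, and the grid-cell vote is only one plausible voting scheme, since the abstract does not specify the one used. The names `l2_distance`, `estimate_location`, `k`, and `cell_deg` are hypothetical.

```python
import math
from collections import defaultdict


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_location(query, database, k=3, cell_deg=10.0):
    """Estimate (lat, lon) for a query feature vector.

    database: list of (feature_vector, (lat, lon)) pairs, a toy stand-in
    for the geolocated photo collection.
    Assumed voting scheme: the k nearest neighbors each vote for the
    cell_deg x cell_deg grid cell containing their location; the estimate
    is the mean location of the neighbors in the winning cell.
    """
    # Find the k database entries most similar to the query.
    neighbors = sorted(database, key=lambda item: l2_distance(query, item[0]))[:k]

    # Group neighbor locations by coarse latitude/longitude grid cell.
    votes = defaultdict(list)
    for _, (lat, lon) in neighbors:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        votes[cell].append((lat, lon))

    # The cell with the most votes wins; on a tie, the cell that was
    # reached first (i.e. by the closer neighbor) is kept.
    winners = max(votes.values(), key=len)
    lat = sum(p[0] for p in winners) / len(winners)
    lon = sum(p[1] for p in winners) / len(winners)
    return lat, lon


# Toy data: two visually similar photos near Paris, one near Tokyo,
# and one dissimilar photo near Sydney.
toy_db = [
    ([0.0, 0.0], (48.8, 2.3)),
    ([0.1, 0.0], (48.9, 2.4)),
    ([0.0, 0.2], (35.7, 139.7)),
    ([5.0, 5.0], (-33.9, 151.2)),
]
estimate = estimate_location([0.0, 0.05], toy_db, k=3)
```

In this toy example two of the three nearest neighbors fall in the same grid cell, so the estimate lands between them rather than at the single outlying match. A real system would need far richer features and a far larger database than this sketch suggests.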
Image geolocation is one of the few high-level visual tasks where computational methods are competitive with humans. While humans are superior at using high-level scene information (e.g. traffic direction, text language, tropical flora, etc.), our computational method has a geolocated visual memory larger than almost any human. We break down the performance of humans and computers according to scene type and analyze the situations in which their performance diverges.