Michael Frank, Avril Kenney, Noah Goodman, Joshua Tenenbaum, Antonio Torralba, Aude Oliva; Predicting object and scene descriptions with an information-theoretic model of pragmatics. Journal of Vision 2010;10(7):1241. doi: https://doi.org/10.1167/10.7.1241.
© ARVO (1962-2015); The Authors (2016-present)
A picture may be worth a thousand words, but its description will likely use far fewer. How do speakers choose which aspects of a complex image to describe? Grice's pragmatic maxims (e.g., “be relevant”, “be informative”) have served as an informal guide for understanding how speakers select which pieces of information to include in descriptions. We present a formalization of Grice's maxim of informativeness (“choose descriptions proportional to the number of bits they convey about the referent with respect to context”) and test its ability to capture human performance.
Experiment 1: Participants saw sets of four simple objects that varied on two dimensions (e.g., texture and shape) and were asked to provide the relative probabilities of using two different adjectives (e.g., polka-dot vs. square) to describe a target object relative to the distractor objects. Participants' mean probabilities were highly correlated with the information-theoretic model's predictions for the relative informativeness of the two adjectives (r = .92, p < .0001).
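The informativeness computation can be illustrated with a minimal sketch. It is one possible reading of the maxim as stated above, not the authors' implementation. It assumes that a word applying to k of the N objects in the context conveys log2(N/k) bits about the target, and that choice probabilities are those bit values normalized over the candidate words. The function names, the feature-set encoding, and the example display are all hypothetical.

```python
import math


def bits_conveyed(word, context):
    """Bits a word conveys about the target, assuming the listener starts
    with a uniform guess over the context: log2(N / k), where N is the
    number of objects and k the number of objects the word applies to."""
    matches = sum(1 for features in context if word in features)
    if matches == 0:
        raise ValueError(f"{word!r} applies to no object in the context")
    return math.log2(len(context) / matches)


def choice_probabilities(words, context):
    """Relative probability of each word, proportional to its bits
    (an assumed reading of 'proportional to the number of bits')."""
    bits = {w: bits_conveyed(w, context) for w in words}
    total = sum(bits.values())
    if total == 0:
        # No word narrows down the target; fall back to a uniform choice.
        return {w: 1 / len(words) for w in words}
    return {w: b / total for w, b in bits.items()}


# Hypothetical display: four objects varying in texture and shape.
# The first object is the target; the others are distractors.
context = [
    {"polka-dot", "square"},  # target
    {"striped", "square"},
    {"striped", "square"},
    {"polka-dot", "circle"},
]

probs = choice_probabilities(["polka-dot", "square"], context)
for word, p in probs.items():
    print(f"{word}: {p:.3f}")
```

In this made-up display, "polka-dot" applies to two of the four objects and "square" to three, so the sketch assigns "polka-dot" the higher probability because it narrows down the target more.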
Experiment 2: Participants described street scenes from a database of hand-segmented and labeled images (LabelMe). In the context condition, participants were presented with a set of six scenes, one target and five distractors, and were asked to name five objects in the target scene so that another observer could pick that scene out of the set. In the no-context condition, another group of participants performed the same task without seeing the distractors. The information-theoretic model's predictions were strongly correlated with differences in object labeling between context and no-context conditions (r = .67, p < .0001), suggesting that the model captures the effect of context on descriptor choice.
Our results suggest that speakers' image descriptions conform to optimal pragmatic norms and that information theory can define norms for the linguistic compression of visual information. This constitutes a first step towards understanding how the visual world is captured and communicated by language users.