Abstract
Thematic relationships have been defined as the grouping of objects by virtue of their complementary roles in the same scenario or event. Importantly, thematic thinking has been shown to occur implicitly and can strongly influence a variety of cognitive behaviors, including the allocation of visual attention. Here we examined the extent to which an unsupervised machine-learning algorithm trained on object co-occurrence statistics could capture thematic relationships between objects within events. We asked MTurk workers (n = 240) to list the most common objects found in 24 events (e.g., “a child’s birthday party”). We then trained several models sharing a common continuous-bag-of-words (CBOW) architecture but differing in their training corpora. The primary question of interest was whether human ratings for objects belonging to common themes were better described by models trained on visual scenes or on text descriptions. Vision-based models were trained on a database of over 22k densely segmented real-world and photorealistic scenes and captured the frequency with which objects co-occurred in visual scenes. Language-based models were trained on a database of over 2.6 billion websites, Wikipedia pages, and news articles and captured the frequency with which objects co-occurred in written descriptions. For each model, we ranked each target object in an event by the similarity between the object’s word vector and the vector representation of the event. We found that object rankings provided by language-based models were more strongly correlated with human rankings than those provided by vision-based models, though each model type captured unique variability, performing better on different subsets of scenarios. Together, these findings reveal the need to reexamine how thematic relationships are defined and point toward the importance of understanding the contributions of both visual and textual input.
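As a rough illustration of the ranking procedure described above, the sketch below trains a CBOW model on a toy co-occurrence corpus and ranks candidate objects by cosine similarity to an event vector. This is not the authors’ code: the gensim library, the toy corpus, the choice of hyperparameters, and the use of a mean of cue-word vectors to represent the event are all assumptions made for illustration; the abstract does not specify how the event vector was constructed.

```python
# Minimal sketch (assumptions: gensim CBOW, toy corpus, mean-of-cues event vector).
import numpy as np
from gensim.models import Word2Vec

# Toy "corpus": each document lists objects that co-occur in one scene or description.
corpus = [
    ["cake", "balloon", "candle", "present"],
    ["cake", "candle", "plate", "fork"],
    ["balloon", "present", "table", "chair"],
]

# CBOW architecture (sg=0); vector size and window are illustrative choices only.
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=0, seed=1)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Represent the event (e.g., "a child's birthday party") as the mean of a few
# cue-word vectors -- a hypothetical choice, not taken from the abstract.
event_cues = ["cake", "candle"]
event_vec = np.mean([model.wv[w] for w in event_cues], axis=0)

# Rank candidate target objects by similarity to the event vector.
objects = ["balloon", "present", "plate", "chair"]
ranking = sorted(objects, key=lambda w: cosine(model.wv[w], event_vec), reverse=True)
print(ranking)
```

In the study, such model-derived rankings would then be correlated with the human-derived rankings for each event.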