September 2021
Volume 21, Issue 9
Open Access
Vision Sciences Society Annual Meeting Abstract
Co-occurrence statistics from vision and language capture thematic relationships between objects
Author Affiliations & Notes
  • Elizabeth Hall
    University of California, Davis
  • Joy Geng
    University of California, Davis
  • Footnotes
    Acknowledgements  This work was supported by NIH grant NIH-RO1-MH113855-01 to JG. EHH is supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program.
Journal of Vision September 2021, Vol.21, 2779. doi:https://doi.org/10.1167/jov.21.9.2779
© ARVO (1962-2015); The Authors (2016-present)
Abstract

Thematic relationships have been defined as the grouping of objects together by virtue of their complementary roles in the same scenario or event. Importantly, thematic thinking has been shown to occur implicitly, and it can have a strong influence on a variety of cognitive behaviors, including the allocation of visual attention. Here we examined the extent to which an unsupervised machine-learning algorithm trained on object co-occurrence statistics could capture thematic relationships between objects within events. We asked MTurk workers (n=240) to list the most common objects found in 24 events (e.g., “a child’s birthday party”). We then trained several models with a common continuous-bag-of-words (CBOW) architecture, but with different training corpora. The primary question of interest was whether human ratings for objects belonging to common themes were better described by training on visual scenes or on text descriptions. Vision-based models were trained on a database of over 22k densely segmented real-world and photorealistic scenes, and captured the frequency of co-occurrence between objects in visual scenes. Language-based models were trained on a database of over 2.6 billion websites, Wikipedia pages, and news articles that captured the frequency of co-occurrence between objects in written descriptions. For each model, we ranked each target object in an event by the strength of the similarity between the object’s word vector and the vector representation of the event. We found that object rankings provided by language-based models were more strongly correlated with human rankings of objects than the rankings provided by image-based models, though there was unique variability in which scenarios language- vs. vision-based models performed better on. Together these findings reveal the need to reexamine the way in which we define thematic relationships, and point toward the importance of understanding the impact of both visual and textual inputs.
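The ranking step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the vocabulary, the random 50-dimensional embeddings, and the event vector (here a simple sum of two object vectors) are toy stand-ins for vectors that would come from trained CBOW models.

```python
# Hypothetical sketch of ranking objects in an event by the cosine
# similarity between each object's word vector and the event's vector.
# The vectors below are random stand-ins, not trained CBOW embeddings.
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and 50-dim embeddings (assumption: real vectors would
# come from CBOW models trained on scene or text co-occurrences).
vocab = ["cake", "balloon", "candle", "stapler"]
vectors = {w: rng.normal(size=50) for w in vocab}

# Stand-in event representation: a composite of member-object vectors.
event_vector = vectors["cake"] + vectors["balloon"]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank objects by similarity to the event vector, most similar first.
ranking = sorted(vocab, key=lambda w: cosine(vectors[w], event_vector),
                 reverse=True)
print(ranking)
```

In the study, a ranking like this would be produced separately by the vision-based and language-based models and then compared against the human-generated rankings (e.g., with a rank correlation).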
