October 2020
Volume 20, Issue 11
Open Access
Vision Sciences Society Annual Meeting Abstract
Semantic embeddings of verbal descriptions predict action similarity judgments
Author Affiliations & Notes
  • Leyla Tarhan
    Harvard University
  • Julian de Freitas
    Harvard University
  • George Alvarez
    Harvard University
  • Talia Konkle
    Harvard University
  • Footnotes
    Acknowledgements  Funding for this project was provided by NSF grant DGE1144152 to L.T., the Harvard University Norman Anderson Fund Grant to L.T., and the Star Family Challenge Grant to T.K.
Journal of Vision October 2020, Vol.20, 1241. doi:https://doi.org/10.1167/jov.20.11.1241
      © ARVO (1962-2015); The Authors (2016-present)

Every day, we see people perform many different actions, some of which naturally seem more similar (e.g., running and walking) while others seem quite different (e.g., running and cooking). What properties of actions capture this intuitive perception of similarity? To address this question, we asked sixteen participants to watch 60 videos depicting everyday actions, then place intuitively similar actions closer together in space (Kriegeskorte & Mur, 2012). We then tested how well a range of features predicted these action similarities, spanning low-level shape features (gist; Oliva & Torralba, 2001), intermediate-level features capturing the body parts involved in the actions and the actions’ targets in the world (Tarhan & Konkle, 2019), and high-level semantics. To operationalize high-level semantics, we used verbal descriptions to extract each video’s embedding in a neural network feature space. First, 3 observers verbally described each video. Then, transcripts of their descriptions were passed through a deep neural network model trained on natural language processing (BERT; Devlin et al., 2019), producing a 1024-dimensional feature code for each description. Gist features did not predict action similarity judgments well (mean leave-1-out tau-a = 0.07), while body part and action target features performed better (tau-a = 0.2 and 0.12, respectively). Notably, semantic BERT features performed best (tau-a = 0.25), approaching the noise ceiling. We replicated these results in a second group of participants (N = 20). These findings suggest that humans perceive action similarity primarily in terms of semantic properties that can be extracted from natural language. This may be because verbal descriptions capture actions’ larger event structures, going beyond the bodies and objects that make up their elemental components.
In addition, our use of verbal descriptions and natural language processing models introduces a tractable way to measure semantic features for real-world, complex videos.
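The description-to-feature pipeline outlined above (each video's observer descriptions are embedded, averaged, and compared pairwise) can be sketched as follows. This is a minimal illustration, not the authors' code: the study embedded transcripts with BERT (1024-dimensional features), whereas here a toy bag-of-words embedder stands in for BERT so the sketch stays self-contained, and all function and variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def embed(description, vocab):
    # Toy bag-of-words embedding; the study instead used 1024-d BERT
    # features extracted from each transcript.
    v = np.zeros(len(vocab))
    for word in description.lower().split():
        if word in vocab:
            v[vocab[word]] += 1
    return v

def video_rdm(descriptions_per_video, vocab):
    # Average each video's per-observer description embeddings into one
    # feature vector, then compute pairwise distances between videos
    # (a representational dissimilarity matrix).
    feats = np.array([
        np.mean([embed(d, vocab) for d in descriptions], axis=0)
        for descriptions in descriptions_per_video
    ])
    return squareform(pdist(feats, metric="cosine"))
```

A feature RDM built this way can then be compared against the behavioral similarity judgments; the choice of cosine distance here is an assumption, since the abstract does not specify the distance metric.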

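The model comparison reported above (mean leave-1-out tau-a between each feature space and the judgments) can be sketched as below. Note that scipy's `kendalltau` computes tau-b, not tau-a, so the sketch implements tau-a directly; the leave-one-out scheme shown (correlating a model RDM with each held-out subject's judgments and averaging) is one plausible reading of the abstract, not a confirmed detail of the authors' analysis.

```python
import itertools
import numpy as np

def tau_a(x, y):
    # Kendall's tau-a: (concordant - discordant) / total pairs,
    # with tied pairs contributing zero.
    n = len(x)
    s = sum(np.sign((x[i] - x[j]) * (y[i] - y[j]))
            for i, j in itertools.combinations(range(n), 2))
    return float(s / (n * (n - 1) / 2))

def mean_leave_one_out_tau(model_rdm_vec, subject_rdm_vecs):
    # Average tau-a between a model's (vectorized) RDM and each
    # subject's similarity judgments, one subject held out at a time.
    return float(np.mean([tau_a(model_rdm_vec, s) for s in subject_rdm_vecs]))
```

Under this reading, each feature space (gist, body parts, action targets, BERT embeddings) yields one model RDM, and the reported tau-a values are the averages of these per-subject correlations.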
