Journal of Vision
August 2016
Volume 16, Issue 12
Open Access
Vision Sciences Society Annual Meeting Abstract | September 2016
A Bayesian Model of Visual Question Answering
Author Affiliations
  • Christopher Kanan
    Chester F. Carlson Center for Imaging Science, College of Science, Rochester Institute of Technology
  • Kushal Kafle
    Chester F. Carlson Center for Imaging Science, College of Science, Rochester Institute of Technology
Journal of Vision September 2016, Vol. 16, 332. https://doi.org/10.1167/16.12.332
      Christopher Kanan, Kushal Kafle; A Bayesian Model of Visual Question Answering. Journal of Vision 2016;16(12):332. https://doi.org/10.1167/16.12.332.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

Visual Question Answering (VQA) is a new problem in computer vision and natural language processing, with the first three English-language datasets for VQA released in the past year: DAQUAR, The VQA Dataset, and COCO-QA. In VQA, a model is given an image and a text-based question about the scene, and it must answer the question. The key insight in our model for VQA is that we can predict the form of the answer from the question. We formulate our model in a Bayesian framework with three multiplicative terms. The first predicts the form of the answer expected given the question; e.g., if the question is "What color is the bear?", it models the probability that this is a color question. The second predicts the probability of the answer given the answer type and the question. The third predicts the probability of the observed visual features given the answer, the answer type, and the question. Our model shares similarities with visual attention, in that it directly models how the image features that should be attended to depend on the task (the question). We use a deep convolutional neural network to extract visual features and skip-thought vectors to encode the questions. Our model achieved state-of-the-art results on all three benchmark datasets for open-ended VQA on real images, beating the previous best by 14.9% (in relative terms) in the best case. While our results are good, there is still substantial room for improvement compared to humans given the same images and questions.
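For clarity, the three multiplicative terms described above can be written as a single factorization. The symbols below (A for the answer, T for the answer type, Q for the question, X for the image features) are labels introduced here rather than notation from the abstract, and summing over answer types is one natural reading of how the terms combine:

    P(A \mid Q, X) \;\propto\; \sum_{T} P(T \mid Q)\, P(A \mid T, Q)\, P(X \mid A, T, Q)

The third term is what ties the visual features to the question being asked, which is the sense in which the model resembles visual attention.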

Meeting abstract presented at VSS 2016
