Abstract
Visual Question Answering (VQA) is a new problem in computer vision and natural language processing, with the first three English-language datasets for VQA released in the past year: DAQUAR, The VQA Dataset, and COCO-QA. In VQA, a model is given an image and a text-based question about the scene, and it must answer the question. The key insight in our model for VQA is that the form of the answer can be predicted from the question. We formulate our model in a Bayesian framework with three multiplicative terms. The first predicts the form of the answer expected given the question; e.g., for the question "What color is the bear?" it would model the probability that it is a color question. The second predicts the probability of the answer given the answer type and the question. The third predicts the probability of the observed visual features given the answer, the answer type, and the question. Our model shares similarities with visual attention, in that it directly models that the image features to be attended to depend on the task (the question). We use a deep convolutional neural network to extract visual features and skip-thought vectors to encode the questions. Our model achieved state-of-the-art results on all three benchmark datasets for open-ended VQA on real images, beating the previous best by 14.9% (in relative terms) in the best case. While our results are good, there remains substantial room for improvement compared to humans given the same images and questions.
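In symbols, the three-term factorization described above can be sketched as follows (the notation here is illustrative rather than taken from the abstract: $T$ denotes the answer type, $A$ the answer, $Q$ the question, and $X$ the visual features; combining the terms by marginalizing over $T$ is one plausible reading):

% Illustrative sketch of the three-term Bayesian factorization.
% Assumed notation: T = answer type, A = answer, Q = question, X = visual features.
\[
  P(A, T \mid X, Q) \;\propto\;
  \underbrace{P(X \mid A, T, Q)}_{\text{visual features given answer and type}}\;
  \underbrace{P(A \mid T, Q)}_{\text{answer given type and question}}\;
  \underbrace{P(T \mid Q)}_{\text{answer type given question}}
\]
% One plausible way to produce a final answer is to marginalize over answer types:
\[
  \hat{A} \;=\; \arg\max_{A} \sum_{T} P(X \mid A, T, Q)\, P(A \mid T, Q)\, P(T \mid Q).
\]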
Meeting abstract presented at VSS 2016