August 2012
Volume 12, Issue 9
Vision Sciences Society Annual Meeting Abstract
Scene Understanding for the Visually Impaired Using Visual Sonification by Visual Feature Analysis and Auditory Signatures
Author Affiliations
  • Jason Clemons
    University of Michigan
  • Yingze Bao
    University of Michigan
  • Mohit Bagra
    University of Michigan
  • Todd Austin
    University of Michigan
  • Silvio Savarese
    University of Michigan
Journal of Vision August 2012, Vol.12, 804. doi:
The World Health Organization estimates that approximately 2.6% of the human population is visually impaired, with 0.6% being totally blind. In this research we propose to use visual sonification as a means to assist the visually impaired. Visual sonification is the process of transforming visual data into sound, a process that would non-invasively allow blind persons to distinguish different objects in their surroundings using their sense of hearing. The approach, while non-invasive, creates a number of research challenges. Foremost, the ear is a much lower-bandwidth interface than the optic nerve or a cortical interface (roughly 150 kbps vs. 10 Mbps). Rather than converting visual inputs into a list of object labels (e.g., "car", "phone") as traditional visual aid systems do, we conjecture a paradigm in which visual abstractions are directly transformed into auditory signatures. These signatures provide a rich characterization of the objects in the surroundings and can be efficiently transmitted to the user. This process leverages users' capability to learn and adapt to the auditory signatures over time. In this study we propose to obtain visual abstractions by using a popular representation in computer vision called bag-of-visual-words (BoW). In a BoW representation, an object category is modeled as a histogram of epitomic features (or visual words) that appear in the image and are created during an a-priori off-line learning phase. The histogram is then directly converted into an audio signature using a suitable modulation scheme. Our experiments demonstrate that humans are capable of successfully discriminating audio signatures associated with different visual categories (e.g., cars, phones) or object properties (front view, side view, far) after a short training procedure. Critically, our study shows that there exists a tradeoff between the complexity of the representation (the number of visual words used to form the histogram) and classification accuracy by humans.
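The abstract only states that the BoW histogram is "directly converted into an audio signature using a suitable modulation scheme," without specifying the scheme. One simple way to realize such a mapping is additive synthesis: assign each visual-word bin its own sinusoid and weight it by the bin count, so different histograms produce audibly different timbres. The sketch below is a hypothetical illustration of this idea, not the authors' actual modulation scheme; the function name, frequency assignment, and parameters are all assumptions.

```python
import numpy as np

def sonify_histogram(hist, base_freq=220.0, duration=1.0, sr=16000):
    """Map a bag-of-visual-words histogram to an audio signature.

    Hypothetical additive-synthesis scheme (assumed, not from the paper):
    bin k drives the (k+1)-th harmonic of base_freq, with amplitude
    proportional to the normalized bin count.
    """
    hist = np.asarray(hist, dtype=float)
    if hist.sum() > 0:
        hist = hist / hist.sum()              # normalize counts to weights
    t = np.linspace(0.0, duration, int(sr * duration), endpoint=False)
    signal = np.zeros_like(t)
    for k, weight in enumerate(hist):
        freq = base_freq * (k + 1)            # one harmonic per visual word
        signal += weight * np.sin(2 * np.pi * freq * t)
    peak = np.abs(signal).max()
    return signal / peak if peak > 0 else signal  # peak-normalize to [-1, 1]

# Two different BoW histograms yield distinct waveforms (signatures)
sig_a = sonify_histogram([5, 0, 1, 2])
sig_b = sonify_histogram([0, 4, 4, 0])
```

Under this scheme, the number of visual words sets the number of harmonics in the signature, which makes the complexity/discriminability tradeoff reported in the study concrete: more bins means a richer but harder-to-distinguish sound.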

Meeting abstract presented at VSS 2012

