Jason Clemons, Yingze Bao, Mohit Bagra, Todd Austin, Silvio Savarese; Scene Understanding for the Visually Impaired Using Visual Sonification by Visual Feature Analysis and Auditory Signatures. Journal of Vision 2012;12(9):804. doi: 10.1167/12.9.804.
© ARVO (1962-2015); The Authors (2016-present)
The World Health Organization estimates that approximately 2.6% of the human population is visually impaired, with 0.6% being totally blind. In this research we propose to use visual sonification as a means to assist the visually impaired. Visual sonification is the process of transforming visual data into sounds, which would allow blind persons to distinguish different objects in their surroundings non-invasively, using their sense of hearing. The approach, while non-invasive, creates a number of research challenges. Foremost, the ear is a much lower-bandwidth interface than the optic nerve or a cortical interface (roughly 150 kbps vs. 10 Mbps). Rather than converting visual inputs into a list of object labels (e.g., "car", "phone") as traditional visual aid systems do, we conjecture a paradigm in which visual abstractions are directly transformed into auditory signatures. These signatures provide a rich characterization of objects in the surroundings and can be efficiently transmitted to the user. This process leverages users' capabilities to learn and adapt to the auditory signatures over time. In this study we propose to obtain visual abstractions by using a popular representation in computer vision called bag-of-visual-words (BoW). In a BoW representation, an object category is modeled as a histogram of epitomic features (or visual words) that appear in the image; the visual words themselves are created during an a priori offline learning phase. The histogram is then directly converted into an audio signature using a suitable modulation scheme. Our experiments demonstrate that, following a short training procedure, humans are capable of successfully discriminating audio signatures associated with different visual categories (e.g., cars, phones) or object properties (front view, side view, far).
Critically, our study shows that there exists a tradeoff between the complexity of representation (number of visual words used to form the histogram) and classification accuracy by humans.
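
The pipeline described above (local features, then a visual-word histogram, then an audio signature) can be illustrated with a minimal sketch. The abstract does not specify the modulation scheme, so the one used here is an assumption chosen for simplicity: each visual word is assigned its own sine tone, and the tone's amplitude is the word's normalized histogram count. The function names `bow_histogram` and `audio_signature`, the frequency range, and the toy codebook are hypothetical and not taken from the study.

```python
import math

def bow_histogram(descriptors, codebook):
    """Build a normalized bag-of-visual-words histogram.

    Each descriptor is assigned to its nearest visual word (Euclidean
    distance); the codebook is assumed to come from an offline learning phase.
    """
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda k: math.dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

def audio_signature(hist, sample_rate=8000, duration=0.5,
                    f_low=300.0, f_high=3000.0):
    """Turn a histogram into audio samples by additive synthesis.

    Assumed scheme (not from the abstract): bin k drives a sine tone at a
    log-spaced frequency between f_low and f_high, weighted by hist[k].
    """
    n = len(hist)
    if n == 1:
        freqs = [f_low]
    else:
        freqs = [f_low * (f_high / f_low) ** (k / (n - 1)) for k in range(n)]
    num_samples = int(sample_rate * duration)
    samples = []
    for i in range(num_samples):
        t = i / sample_rate
        samples.append(sum(w * math.sin(2 * math.pi * f * t)
                           for w, f in zip(hist, freqs)))
    return samples

# Toy example: three 2-D visual words and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
hist = bow_histogram(descriptors, codebook)
signature = audio_signature(hist)
```

In this sketch the codebook size is the complexity knob the abstract refers to: more visual words yield a finer histogram but also more simultaneous tones for the listener to tell apart.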
Meeting abstract presented at VSS 2012