What it does.

BeyondSight is a web app that provides real-time audio descriptions of the user's surroundings. It currently generates two types of audio. The first is on-device detection, which produces instantaneous audio announcing identified objects and their locations (left/right/centre); these simple descriptions give the user an efficient way to quickly understand their surroundings and avoid collisions. The second is AI-generated description of the surroundings; this more detailed audio allows a vision-impaired individual to appreciate the beauty around them, for example, basking in the warmth of sunshine upon hearing that it is a beautiful sunny day outside.
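The left/right/centre announcement described above can be sketched as a simple mapping from a detection's bounding box to a third of the camera frame. This is an illustrative sketch, not the app's actual code; the function name and the equal-thirds split are assumptions.

```typescript
// Coarse horizontal position of a detected object within the camera frame.
type Position = "left" | "centre" | "right";

// boxX: left edge of the bounding box (pixels); boxWidth: box width;
// frameWidth: width of the video frame. Splits the frame into equal thirds.
function horizontalPosition(boxX: number, boxWidth: number, frameWidth: number): Position {
  const midpoint = boxX + boxWidth / 2; // horizontal centre of the box
  const ratio = midpoint / frameWidth;  // 0..1 across the frame
  if (ratio < 1 / 3) return "left";
  if (ratio > 2 / 3) return "right";
  return "centre";
}
```

For example, a 100 px-wide box at the left edge of a 640 px frame maps to "left", while one centred in the frame maps to "centre".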
Features.

- Real-time object detection: identifies and announces objects such as people, furniture, vehicles, and potential hazards.
- Contextual scene descriptions: provides rich narratives about the surrounding environment, including details such as the layout, atmosphere, and activities taking place.
- Customizable audio feedback: users can adjust the volume, voice, and frequency of audio cues to personalize their experience.
- Intuitive interface: easy-to-use gestures allow seamless interaction and control over the app's functionalities.
Inspiration.

Although vision is the most dominant of the five senses, we often take it for granted. Without sight, we would struggle with basic tasks such as navigating around our homes and crossing the road. Yet 435,000 people in Australia alone are blind, and 90% of these cases have no cure or therapy. We wanted to develop a tool that would help vision-impaired individuals perceive the world around them. BeyondSight is designed to empower the visually impaired with an intuitive and intelligent understanding of their surroundings. By leveraging AI technologies such as object detection and advanced computer vision, we aim to provide real-time audio feedback that paints a vivid picture of the environment.
Challenges we ran into.

We first had to figure out how to run YOLO stably on the user's device. This was important because we didn't want to rely on server-side inference, which can be delayed by several seconds. With inference running many times each second, it was challenging to filter and prioritise audio output from a constant stream of live data. To solve this, we developed an algorithm that prioritises objects that may present potential danger to the user and groups multiple detections of the same object type together to provide a more informative output. From a market perspective, we needed to research the features most useful to blind individuals, which we then included in our live web app.
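The prioritise-and-group idea described above can be sketched as follows. This is a minimal illustration, not the app's actual algorithm: the danger weights, the `announce` helper, and the phrasing are all assumptions.

```typescript
// One detection from a single inference pass.
interface Detection {
  label: string;
}

// Assumed danger weights: higher values are announced first.
const DANGER: Record<string, number> = { car: 3, bicycle: 2, person: 1 };

// Collapse duplicate labels into counts, then order phrases so that
// potentially dangerous objects are spoken before everything else.
function announce(detections: Detection[]): string[] {
  const counts: Record<string, number> = {};
  for (const d of detections) {
    counts[d.label] = (counts[d.label] ?? 0) + 1;
  }
  return Object.entries(counts)
    .sort(([a], [b]) => (DANGER[b] ?? 0) - (DANGER[a] ?? 0))
    .map(([label, n]) => (n > 1 ? `${n} ${label}s` : `a ${label}`));
}
```

Grouping keeps the audio stream short: three chairs become one phrase instead of three, and a car is announced before nearby pedestrians.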
Try it out! >>
What's next for BeyondSight.

Combining the web app camera with the built-in cameras of smart devices such as smart mobility canes, watches, and sunglasses. Not only does this unlock huge revenue potential by combining software with hardware, it also allows for more features, such as danger vibration, step detection, and ease of movement. We should also train our own YOLO models on objects more relevant to those with a vision impairment, for example handrails, tactile pavements, and audio-descriptive signs. Furthermore, we can fine-tune our sorting algorithm to create danger alerts and filter out repeated audio information. We also want to experiment with replacing verbal audio with shorter audio cues to prevent overstimulation.
Core Technologies.

- Object detection with YOLOv7: running on-device at over 10 inference passes per second, it accurately identifies and tracks objects in real time, providing crucial information about obstacles and points of interest.
- Computer vision with GPT-4 Vision: delivers insightful, contextual descriptions of the scene, capturing details that go beyond simple object recognition.
- Audio generation with Deepgram: converts text into natural-sounding speech using advanced text-to-speech technology.
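The text-to-speech step might be wired up roughly as below. This is a hedged sketch only: the Deepgram endpoint URL, the `model` query parameter, and the request shape are assumptions about Deepgram's REST API and should be checked against the current Deepgram documentation before use.

```typescript
// Shape of an HTTP request to a text-to-speech endpoint.
interface SpeakRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build a TTS request for a scene description. The endpoint and model
// name below are assumptions, not confirmed values from this project.
function buildSpeakRequest(apiKey: string, text: string): SpeakRequest {
  return {
    url: "https://api.deepgram.com/v1/speak?model=aura-asteria-en", // assumed endpoint/model
    method: "POST",
    headers: {
      Authorization: `Token ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text }),
  };
}
```

In the browser, the returned request would be sent with `fetch`, and the audio bytes in the response decoded and played through the Web Audio API.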