How to use Whisper, OpenAI and ChatGPT to improve inclusivity with an intelligent voice assistant in the On Wheels app

By Sophie De Coppel

With the evolutionary wave of large language models (LLMs) come a lot of new and interesting applications. And what better way to use them than to make the world a better place, one line of code at a time. In this blogpost, we'll explain how we utilised LLMs to develop a voice assistant that increases inclusivity for an accessibility application.

Let's backtrack a bit and explain the context of this project, because no great invention is ever made alone. For this project Dataroots collaborated with On Wheels and Amai! Vlaanderen.

  • On Wheels is a dedicated organisation striving for improved accessibility of cities, buildings, and mobility facilities.
Functionalities of the On Wheels app: an interactive map that is customisable to the types of buildings you want to see, wheelchair-friendly routes, etc. (Excerpt from https://www.onwheelsapp.com/nl/app/)
  • Amai! is an initiative of the Flemish government aimed at inspiring, advising, and empowering citizens with Artificial Intelligence in their daily lives, focusing on areas such as work, climate, mobility, and healthcare.
The logo of Amai!

Joining forces with On Wheels and Amai!, Dataroots launched project STOEP, a Flemish acronym for "Slimme TOegankelijkheid & Empowerment voor Personen met een beperking" — or in English, "Smart Accessibility and Empowerment for Persons with a Disability". This initiative is designed to bolster the On Wheels application using AI. In project STOEP, we worked on multiple use cases, but this blogpost focuses specifically on the voice assistant that we developed for the app.

A video promoting project STOEP. For more info, see https://amai.vlaanderen/projecten/project2-rolstoel

In some respects, the On Wheels app mirrors Google Maps: It presents users with a map of various locations like restaurants, shops, and banks. However, its unique selling point lies in its extensive accessibility information for these places, covering aspects like door width, availability of ramps, accessibility of toilets, and more.

Image showing the On Wheels app on a phone

Initially, the app relied on manual data entry by its users through the haptic interface, such as a smartphone's keyboard. Recognising a need for wider inclusivity, we introduced a voice-powered functionality to cater to users facing typing or visual impairments. This adjustment is pivotal since this group of users might not be able to use and/or contribute to vital accessibility information. By providing a new voice-based interface, we have opened the doors for them to participate more fully.

Increased inclusivity with speech-to-text, NLP with large language models, and text-to-speech

To build a great voice assistant, we need it to listen, understand, and talk back, while staying robust to changes in the listening environment, adjustments to accessibility attributes, and natural variations in speech and content. That's why we use three main parts: Speech-to-Text (STT), Natural Language Processing (NLP), and Text-to-Speech (TTS). To do all of this in a cost-friendly way, we picked great open-source solutions where possible. Let's talk about each one.

  1. Speech-to-text: For the conversion of audio to text, we selected Whisper, a respected open-source STT model. This adaptable model delivered dependable multilingual text transcription, accommodating a variety of languages including English, Dutch, French, and German. Notably, Whisper has undergone specific training on expansive datasets, enabling it to efficiently tackle different accents, handle background noise, and process technical speech.
  2. Natural language processing: Instead of relying on traditional keyword detection or text similarity techniques, we decided to leverage the immense power of large language models, more specifically, OpenAI's managed LLM services. Here's where the real magic happened, and we'll delve deeper into this aspect shortly.
  3. Text-to-speech: To complete the conversational loop, we utilised Coqui TTS, featuring a range of open-source models in multiple languages, including English, Dutch, French, and German (a minimal usage sketch of the STT and TTS pieces follows this list).
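
For prototyping, these open-source pieces take only a few lines to wire up. Below is a minimal sketch, assuming the openai-whisper and Coqui TTS Python packages; the audio file names and the English example model are illustrative choices, not necessarily the ones used in the app.

import whisper
from TTS.api import TTS

# Speech-to-text: transcribe the user's recording with Whisper.
# "medium" is one of the standard multilingual checkpoints.
stt_model = whisper.load_model("medium")
result = stt_model.transcribe("user_recording.wav")  # the language is detected automatically
user_text = result["text"]

# Text-to-speech: speak a follow-up question back to the user with Coqui TTS.
tts_model = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts_model.tts_to_file(text="Is there a toilet at the location?", file_path="question.wav")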

Our project placed a significant emphasis on the NLP component, facilitating responsible and effective use of LLMs. As for the TTS aspect, it is typically managed by the end user's mobile device. In a final application, this feature would likely not rely on a third-party service, in order to minimise response delay for the user. Instead, the mobile device would use its built-in text-to-speech engine to provide the spoken feedback, aligning more closely with the experiences that visually impaired individuals are already familiar with. This approach ensures a smoother, more intuitive interaction for all users.

Building state-of-the-art natural language interfaces with LLMs in days instead of years

Creating a conventional (i.e. pre-LLM) NLP pipeline, one that takes text input and produces an appropriate output, calls for a comprehensive approach. This typically involves stages such as data collection, labelling, curation, and model training and fine-tuning. Only after these steps can we attain a model that satisfies the minimum viable product (MVP) requirements. Yet, there's often still a considerable journey to improve the model's performance in various areas to reach a minimum likeable product (MLP), not to mention managing compliance and risk.

With LLMs, much of this process, right up to model deployment, can be greatly simplified. The crux of the task then becomes structuring the business logic consistently, integrating it into a prompt, and then testing with users. The most significant challenge with LLMs lies in their judicious and responsible implementation, meaning that we need to examine the context in which the AI will be used, and the potential impact it can have given its function.

For our specific use case, we needed to extract accessibility data about a location from the transcribed text (the output of the STT interface, Whisper, applied to the user's speech). This includes measurements, addresses, the availability of toilets, the presence of ramps to overcome steps, and more. We sought an approach that could handle different languages, various phrasing styles, and basic mathematical logic, such as converting between meters and centimeters or summing total step heights.

In our quest for the perfect solution, we tested several open-source models. However, none offered the kind of value for effort that we found with ChatGPT. This model demonstrated impressive flexibility in managing different languages and wording styles. It even managed to maintain context when multiple languages were used together. While it sometimes needed guidance, ChatGPT already performed remarkably well, paving the way for us to realise our ambitions to increase the inclusivity for the On Wheels application.
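
To give an idea of the kind of call involved, here is a minimal sketch of a direct ChatGPT extraction request, assuming the openai Python package (pre-1.0 interface). The prompt wording and field names are illustrative, and in the final pipeline this call is wrapped by Guardrails, as described in the next section.

import json
import openai

# Illustrative extraction prompt: ask the model to pull accessibility fields
# out of the transcribed utterance and answer with JSON only.
SYSTEM_PROMPT = (
    "Extract accessibility information from the user's message. "
    "Return only JSON with the keys: name, category, street, city, "
    "door_width_cm, number_of_steps, total_step_height_cm, toilet_present. "
    "Convert all measurements to centimeters and use null for missing values."
)

user_text = "The door width of the entrance is 2 meters and 50 centimeters."

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,  # keep the extraction as deterministic as possible
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ],
)
extracted = json.loads(response["choices"][0]["message"]["content"])
print(extracted)  # e.g. {"door_width_cm": 250, ...}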

Managing risk in LLM applications with Guardrails

The real challenge with LLMs lies in compliance and risk management. As a safeguard for the model and a way to instruct it, we introduced Guardrails into the equation, enabling us to restrict the model's functionality and constrain its output to a dynamic JSON schema containing all essential variables.

Example input

<rail version="0.1">    
    <output>
        <string 
                name="text" 
                description="The generated text" 
                format="two-words" 
                on-fail-two-words="reask"/>
        <float 
                name="score" 
                description="The score of the generated text" 
                format="min-val: 0" on-fail-min-val="fix"/>
        <object 
                name="metadata" 
                description="The metadata associated with the text">
                <string 
                        name="key_1" 
                        description="description of key_1" />
            ...
        </object>
    </output>
</rail>

Example output

{
    "text": "string output",
    "score": 0.0,
    "metadata": {
        "key_1": "string",
        ...
    }
}

Guardrails offers a way to instruct the model and define its behaviour regarding units of measurement, variable formats, and the handling of missing information. For each variable we can supply its name, its type (string, integer, boolean, ...) and a description. The Guardrails package also allows us to provide choice variables, which are restricted to a specific list of options. Furthermore, we can create a nested schema of variables, which allows us to only store certain information if it's relevant. For example, we only ask about the accessibility of the toilet if the user indicates that a toilet is present at the location. Finally, we can define validators and instruct the package on what to do when the LLM fills in a variable invalidly: we can have it re-ask the LLM, filter the variable out, or fix the variable itself.
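
To make this concrete, here is a minimal sketch of how such a rail specification can be connected to the LLM, assuming an early (0.2-era) version of the guardrails package and a rail file that also contains a prompt block with a {{user_input}} placeholder; the file name and prompt parameters are illustrative.

import guardrails as gd
import openai

# Load the rail specification (output schema, validators and instructions).
guard = gd.Guard.from_rail("accessibility_spec.rail")

# Guardrails wraps the LLM call: it builds the prompt from the rail spec,
# validates the returned JSON and re-asks the model when validation fails.
raw_llm_output, validated_output = guard(
    openai.ChatCompletion.create,
    prompt_params={"user_input": "There are three steps, 55 centimeters in total."},
    model="gpt-3.5-turbo",
    temperature=0,
)

print(validated_output)  # a dict that follows the schema defined in the rail file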

Guardrails is highly modular and provides logs to follow up on the interaction between Guardrails and the LLM. All validation and re-asking functionality is handled out of the box: we only need to supply the user input and define the schema and instructions, and the package handles the rest behind the scenes. This assures us that every output we get is valid according to our requirements. In conclusion, Guardrails allows us to control the output of the model and reduce the risk of undesirable output.

The conversation flow between user and AI

So, for every user input, we get a JSON output from the LLM with the help of Guardrails. Now, if we want to add to this JSON and fill in the whole questionnaire, we need to ask the user to supply all the information they missed. To do this, we created a small add-on to Guardrails that checks which information is missing from the user input and asks relevant follow-up questions. The user's answer is then processed again by Guardrails and the LLM to add the new variable to the JSON. All filled-in variables are stored in the memory of the chatbot. Eventually, we get a full list of variables that can be sent to the back-end of the On Wheels app. This loop is visualised more clearly in the figure below:

The conversation loop where a human speaks to the voice assistant. The voice assistant transcribes the speech and then processes it with the LLM and Guardrails. The filled-in variables are stored in memory while relevant follow-up questions are asked of the user with the help of Coqui TTS. The loop repeats until all variables are filled in and stored in memory, ready to be sent to the On Wheels app.
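
The bookkeeping behind this loop is simple. Below is a hypothetical, simplified sketch of the idea, with plain print and input standing in for the TTS and STT/LLM steps; the field names and question texts are illustrative and not the app's actual code.

# Follow-up questions to ask while a variable is still missing (illustrative subset).
REQUIRED_FIELDS = {
    "name": "What is the name of the location?",
    "street": "What is the street of the address?",
    "door_width_cm": "How wide is the entrance door?",
    "toilet_present": "Is there a toilet at the location?",
}

memory: dict = {}  # variables collected so far

while True:
    missing = [field for field in REQUIRED_FIELDS if field not in memory]
    if not missing:
        break
    field = missing[0]
    print(REQUIRED_FIELDS[field])   # in the app: spoken to the user via TTS
    answer = input("> ")            # in the app: the Whisper transcription of the reply
    # In the app, the answer is processed by Guardrails and the LLM; here we
    # simply store it verbatim to keep the sketch runnable.
    memory[field] = answer

print(memory)  # complete questionnaire, ready for the On Wheels back-end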

An example

To illustrate the use of the voice assistant, we show the transcription of a conversation with it (everything in this conversation is made up; it serves purely as an example):


Chatbot: Hi, welcome! Please give me some information about the location you want to add. This can be information about the function of the building, the address, the entrance or the toilet.

Person: Hi, I'm standing in front of a restaurant called Jam. Het restaurant bevindt zich op de bondgenotenlaan in Leuven. There are three steps in front of the door. In total, these steps are 55 centimeters high. The door width of the entrance is 2 meters and 50 centimeters. There is a toilet on the ground floor. The door width is 1 meter 50.

Chatbot: What is the type of food that they sell here? Choose from the options: African, Asian, dessert bakery, Belgian fries, Mediterranean, Middle Eastern or Western. Does the restaurant have vegan options? Does the restaurant have vegetarian options?

Person: The restaurant mainly sells Japanese food. It has vegan and vegetarian options.

Chatbot: What is the email address of the location? What is the street number of the location? What is the postal code of the address?

...


From this short conversation the chatbot outputs the following list of variables:

  • Name: ‘Jam’
  • Category: restaurant
  • Type of food: Asian
  • Vegan: True
  • Vegetarian: True
  • Address:
      • Street: Bondgenotenlaan
      • City: Leuven
  • Entrance:
      • Door width: 250 centimeters
      • Steps: 3
      • Total height: 55 centimeters
  • Toilet: True
      • Location: ground floor
      • Door width: 150 centimeters
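
Internally, this corresponds to a nested JSON object along the following lines (the exact field names are defined by the rail schema; the ones below are illustrative):

{
    "name": "Jam",
    "category": "restaurant",
    "type_of_food": "Asian",
    "vegan": true,
    "vegetarian": true,
    "address": {
        "street": "Bondgenotenlaan",
        "city": "Leuven"
    },
    "entrance": {
        "door_width_cm": 250,
        "steps": 3,
        "total_height_cm": 55
    },
    "toilet": {
        "present": true,
        "location": "ground floor",
        "door_width_cm": 150
    }
}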

We can notice some interesting aspects in the handling of the conversation by the voice assistant:

  1. Our approach is very user-friendly and flexible. The voice assistant instructed us to categorise the type of food from a fixed list of options. Although we didn't do that and answered 'Japanese', the LLM could successfully categorise this as Asian.
  2. Our approach is multilingual and robust. We specified the address in Dutch but the LLM had no problem understanding this switch of language and extracting the address correctly.
  3. Our approach is logical. We specified some of the measurements in meters and some in centimeters. The LLM easily converted them all to centimeters (which is what we wanted). The LLM also only asked relevant follow-up questions for restaurants, like the type of food and vegan options.

Side note: Although the creativity of the LLM is often useful for interpreting the flexible input of the user, it also poses a risk. This creativity can introduce bias, and some edge cases may affect minorities in unexpected ways if you scale this up. Therefore, caution and extensive testing are advised before opening this to the public.

Conclusion

In conclusion, we created a voice assistant that surpassed all expectations using off-the-shelf components. It proved to be multilingual, flexible, user-friendly, easily extensible, and robust. Thanks to the smart combination of an LLM with Guardrails we were able to create a voice assistant that maximises the advantages of LLMs while minimising the risk of unwanted output. With LLMs constantly evolving, the possibilities remain boundless. Our work, however impressive, is only the beginning, as advancements like GPT-4's function calling hint at even greater innovations.

As you delve into the world of intelligent agents and data querying, we invite you to explore more on the website of Amai!, discover our open-source GitHub repository, and connect with us for further insights. The future of accessibility is here, and we are thrilled to be part of the journey. Stay tuned for even more exciting developments in the realm of large language models!
