Finding Your Brand's Voice:
6 Ways to Build a Better VUI

As the world moves toward more voice-enabled products and services, it’s imperative for brands to have a voice-first strategy. Read this best practice guide for expert advice on building a better voice user interface (VUI) for your brand.

Prompting and Error Recovery

Keeping the Conversation Going

For the most part, a voice user interface (VUI) is an audio-only experience. Designers therefore need to key into the flow of two-way conversation and create ways for the assistant to help the human user navigate, find things, and, most of all, feel heard. Assistants can easily give more information than a person can absorb, yet users still need enough information to complete their task.

Though some smart speakers, like Amazon Echo and Google Home, have only a voice interface, the visual component of their companion apps, and of any VUI that has a screen, is just as important. It’s easy to forget the visual interface when you’re working on voice technology. But visual cues help users understand what’s happening and what they need to do to take action.

Since AI has limited parameters and humans have nearly infinite ways of conversing, count on some misunderstandings between the two. This doesn’t have to be the end of a budding relationship. If your VUI misunderstands or sends users down an irrelevant path, look for ways both to recover and to have the voice assistant learn from its mistakes.

Then there’s the issue of the users themselves. RAIN Producer Ben Steele explained it best: “I don’t know what to say,” he confessed. “I’m bad at talking to them. I stutter over any command I haven’t repeatedly used before to the point where even another human being wouldn’t be able to derive a coherent sentence out of it. I’m just not able to process my side of the interaction in the same way that I would with another person.”

He adds, “I have a limited window into things it can do based on what I’ve needed to get it to do in the past and maybe a few things I’ve derived from commercials. Even beyond the features, I don’t really know how I should structure my sentences and I don’t know if I mess up my sentence if I need to just give up and start again or push through.”

In the same vein, scripts for voice assistants must be carefully crafted so that human users learn how to interact with them. Steele continues, “I don’t really know what phrases are attached to functionality even if my words are understood, so I’m just sitting there trying to shout keywords at it. Me and my voice assistant, we’re not on the same wavelength.”

In this chapter, we’ll discuss how to help machines and humans communicate better.

Best practices:

Give users confirmation and a subtle roadmap

How do you know when the VUI is listening? How do users know how to summon their voice assistant if they don’t want to call its name? What cues do people need to know when to talk and when to listen? The listening screen conveys that the app is focused on getting user input through voice — not unlike when the keyboard slides up to get user input through text. This screen usually appears in the form of an overlay and can take up as much real estate as you feel is necessary. The listening screen should contain several key elements that help bring the voice and visual experience together.

The first consideration is a microphone button or other part of the speaker that lights up as an easy-to-spot indication that the user is being heard. Siri’s multi-colored sound waves show users that Siri is listening. The animation has the same purpose as Alexa’s blue ring of light or Hound’s twirling blue ring.

It’s also helpful to visually represent the audio input because it provides a stronger sense of connection between the user’s voice and the product’s understanding. Common visualizations include animated sound waves and level bars. Finally, show users the way out: provide a way to dismiss the listening screen at any time.

The audio cues are just as important as the visual ones so users can tell that the device is listening even if they’re not looking at the device.

Here’s some advice from Octavio Menocal, senior voice experience engineer at RAIN: “Include in the reprompt the type of phrases you expect from the user in the context. If users respond with an unexpected intent, catch the failure, and send a response including the type of phrases your app expects to move forward.”

Here’s an example of a taxi voice skill in Alexa:

Alexa: Welcome to taxi driver. I can contact a taxi to drive you to your destination, or give you a quote, which would you like?

User: I want to go to the airport (Unexpected intent)

Alexa: Excellent! I just need to know if you want me to contact a driver, or if you simply need a quote, what’s your interest?

User: Oh right, I want you to find a taxi for me.

Alexa: Great! You told me you wanted to go to the airport, correct?

User: Correct -> (User is pleased the app remembered her destination and didn’t ask for it again)

Alexa: Ok, now, where would you like the driver to pick you up?

(User does not remember and asks her mom next to her, while talking to Alexa)

User: Mom, what’s the name of this street?

Mom: William Street, 84404 (Yes! This exchange happened while the user was talking to Alexa, and guess what Alexa heard: “What’s the name of the street we will be at, at 4pm” (Unexpected intent))

Alexa: Sorry, I didn’t understand, where would you like the driver to pick you up? (We didn’t stop the conversation, instead we handled the FallbackIntent and returned a valid reprompt)

User: William Street, 84404

Alexa: Ok, give me a second, I’m looking for the closest drivers -> This is a progressive response while your app gets information from your external API

Alexa: Ok, I found 5 drivers available, the ETA to pick you up is 5 minutes, the price is $40, do you want me to contact the closest driver right now?

User: Yes

Alexa: Excellent, the driver confirmed he’s coming in 5 minutes. I just sent you an email with the receipt of your order. Thanks for using taxi driver, have a nice trip!
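The pattern in this dialog can be sketched in a few lines of code. What follows is a minimal, framework-free illustration, not the actual skill: the intent names (BookTaxi, GetQuote, SetDestination, SetPickup), the Session class, and the prompt wording are all assumptions made for the example. It shows two ideas from the exchange above: keep a destination the user volunteers early, and answer any unexpected intent with a reprompt that names the phrases the app expects. It leaves out the step where Alexa confirms the destination.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical prompts; each one names the kind of phrase the app expects next.
REPROMPTS = {
    "service": "I can contact a driver or give you a quote. "
               "Try saying 'find me a taxi' or 'give me a quote'.",
    "destination": "Where would you like to go? Try saying 'the airport'.",
    "pickup": "Where would you like the driver to pick you up? "
              "Try saying a street name and ZIP code.",
}
STEPS = ["service", "destination", "pickup"]
SERVICE_INTENTS = {"BookTaxi", "GetQuote"}  # hypothetical intent names


@dataclass
class Session:
    slots: dict = field(default_factory=dict)

    def next_step(self) -> Optional[str]:
        """Return the first step whose slot is still empty, or None."""
        return next((s for s in STEPS if s not in self.slots), None)


def handle_turn(session: Session, intent: str, value: Optional[str] = None) -> str:
    """Process one user turn and return what the assistant says next."""
    step = session.next_step()
    if intent == "SetDestination" and value:
        # Remember the destination even if it arrives earlier than expected.
        session.slots["destination"] = value
    elif step == "service" and intent in SERVICE_INTENTS:
        session.slots["service"] = intent
    elif step == "pickup" and intent == "SetPickup" and value:
        session.slots["pickup"] = value
    elif step:
        # Fallback: stay on the current step and say which phrases work here.
        return "Sorry, I didn't catch that. " + REPROMPTS[step]
    step = session.next_step()
    if step:
        return REPROMPTS[step]
    return "Ok, give me a second while I look for the closest drivers."
```

Replaying the dialog, an early "I want to go to the airport" stores the destination and repeats the service question, and the overheard side conversation produces a reprompt for the pickup address without ending the session.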


Create a strong strategy for error recovery

“Human conversation is naturally replete with errors,” says Lauren Golembiewski, CEO and co-founder at Voxable. “Through experience, most human brains learn how to correct conversational errors fairly seamlessly. On the other hand, VUIs need to be encoded with extensive error recovery.” Golembiewski cites an example: “What happens when a user says something unexpected to a VUI prompt? How should the system respond if a user says ‘I don’t know’ to the VUI prompt: ‘Are you sure you would like to transfer $2000 to that account?’” At its simplest, error recovery means thinking through what might happen in scenarios like these. It gets more complex as the risk involved in the interaction increases.
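One way to handle the “I don’t know” case is to treat anything other than an explicit yes as a lack of consent. The sketch below is only an illustration under that assumption; the word lists, the function name handle_confirmation, and the response wording are hypothetical and would come from your own design and analytics.

```python
from typing import Tuple

# Hypothetical word lists; a production system would rely on its NLU instead.
YES_WORDS = {"yes", "yeah", "yep", "correct", "confirm"}
NO_WORDS = {"no", "nope", "cancel", "stop"}


def handle_confirmation(reply: str, summary: str) -> Tuple[str, str]:
    """Return (action, speech); action is 'execute', 'cancel', or 'clarify'."""
    text = reply.strip().lower().rstrip(".!?")
    if text in YES_WORDS:
        return "execute", f"Okay, I will {summary}."
    if text in NO_WORDS:
        return "cancel", "No problem, I won't do that."
    # Anything else, including "I don't know", is not consent for a risky action.
    return "clarify", ("No rush. I haven't done anything yet. "
                       f"Say 'yes' to {summary}, or 'no' to cancel.")
```

For example, handle_confirmation("I don't know", "transfer $2000 to that account") returns the "clarify" action, so no money moves until the user gives a clear answer.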

At Marvee, CEO and VUI designer Heidi Culbertson depends on analytics to help her get to know her audience and understand how they think. “The better you know the actual user, the better your error management will be,” she says. “Error management is literally taking it almost word for word and designing your VUI so that you don’t end up in an endless loop.” Another pitfall with errors is the possibility of losing the audience. “You lose retention when it’s not an easy experience,” she explains.
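A simple guard against the endless loop Culbertson describes is to count failed attempts and change the response each time. This is a rough sketch, assuming a limit of two reprompts; the limit, the function name, and the wording are placeholders to be tuned from your own analytics.

```python
from typing import Tuple

MAX_REPROMPTS = 2  # assumed limit; adjust it based on real user data


def reprompt(attempt: int, question: str, examples: str) -> Tuple[str, bool]:
    """Return (speech, end_session) for the given number of failed attempts."""
    if attempt == 1:
        # First miss: apologize briefly and repeat the question.
        return f"Sorry, I didn't catch that. {question}", False
    if attempt <= MAX_REPROMPTS:
        # Later misses: add example phrases so the user knows what works.
        return f"I'm still having trouble. {examples} {question}", False
    # Past the limit: end gracefully instead of looping forever.
    return "I'm sorry I couldn't help this time. Please try again later.", True
```

Escalating from a short apology to concrete examples, and then to a graceful exit, keeps the experience from feeling stuck.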


“Don’t stop the conversation when an error occurs,” Menocal adds. “Some frameworks offer a nice handler to catch an unexpected error in the code. If your app has a syntax error or maybe an operation with an external resource fails, return a response informing the user something unexpected happened and they can come back and try again after you have fixed the error. You can also print that error in your server, or send it to your email, Slack or via SMS so you get alerts of what’s wrong and think of how you can fix it quickly.”
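The catch-all handler Menocal describes might look something like this. It is a generic sketch rather than the API of any particular voice framework: with_error_recovery, the notify callback, and the apology text are assumptions, and notify stands in for whatever alert channel you use (email, Slack, SMS).

```python
import logging
from typing import Callable

logger = logging.getLogger("voice_app")

APOLOGY = "Sorry, something unexpected happened. Please try again a little later."


def with_error_recovery(
    handler: Callable[[dict], str],
    notify: Callable[[str], None] = lambda message: None,
) -> Callable[[dict], str]:
    """Wrap a request handler so an unexpected error still yields a spoken reply."""

    def wrapped(request: dict) -> str:
        try:
            return handler(request)
        except Exception as exc:
            # Log the full traceback on the server for later debugging.
            logger.exception("Unhandled error for request %r", request)
            # Send a short alert through the hypothetical notify callback.
            notify(f"{type(exc).__name__}: {exc}")
            return APOLOGY

    return wrapped
```

A handler that fails while calling an external API then returns the apology to the user, while the team receives the error details.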


Help keep users from getting tongue-tied

SoundHound Inc. has been designing voice-enabled AI technologies for 14+ years. If there’s one thing we’ve learned from implementing voice experiences, it’s that users usually aren’t sure what to say. Sometimes users don’t know that they should say something when the listening screen appears. Even though they have initiated this screen with the tap of a microphone button, they are often caught off-guard and can feel like they are being put on the spot. In fact, our data shows that many first-time experiences are met with silence.

Prompt asking user to speak inside the Hound App

Transcribing to show active listening

When the user begins to speak, a transcription is displayed so they can confirm the accuracy of their input. It’s important to note that with our Houndify technology, transcriptions will change and update as the user speaks. This is because we gain context clues on the fly that allow us to more accurately understand what’s being said, as it’s being said.

Transcriptions end shortly after the microphone stops hearing voice input. A small pause (about 2 seconds) before completing the transcription helps to avoid cutting off a user’s input too soon.
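The pause rule can be expressed as a small end-of-speech check. This sketch is not how Houndify implements it; it assumes the audio pipeline already labels each frame as speech or silence, and the function name and frame format are invented for illustration.

```python
from typing import Iterable, Optional, Tuple

SILENCE_TIMEOUT_S = 2.0  # the "about 2 seconds" pause described above


def find_end_of_speech(
    frames: Iterable[Tuple[float, bool]],
    timeout_s: float = SILENCE_TIMEOUT_S,
) -> Optional[float]:
    """Return the time at which transcription should end, or None.

    frames is a time-ordered sequence of (timestamp_seconds, is_speech).
    """
    last_speech = None
    for timestamp, is_speech in frames:
        if is_speech:
            # Any new speech resets the silence window.
            last_speech = timestamp
        elif last_speech is not None and timestamp - last_speech >= timeout_s:
            return timestamp
    # Either no speech was heard yet or the pause is still too short.
    return None
```

Because the timer only starts after the first speech frame, a user who stays silent at first is not cut off by this check; that case needs its own prompt or timeout.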

After the transcription ends, the text is sent to the server so that a response can be provided. We recommend using a sound effect here to communicate that transcription has ended and the search has begun.

The time it takes to get a response depends on several factors, like connection speed or complexity of the query. During this process, use some type of loading indicator to visually communicate that the search is being performed. Make sure to keep the transcription visible during the searching process so the user doesn’t lose context (a surprising number of users mentioned this in our user research). Once the search is complete, the listening screen will animate out and results should be displayed.

  • Call to Action: Microphone button: some affordance of voice interaction must always be visible on screens where voice interaction is enabled — the most widely accepted form of visual affordance is the microphone icon.
  • Wake Word: The existence of a wake word is not something users can see, so it must be taught.
  • Listening Screen: The listening screen contains several key elements that help bring the experience together. A text prompt guides the user into action, and the audio input visualization (usually animated sound waves) communicates that the app is actually hearing what the user is saying.
  • Processing: Use a sound effect to communicate that transcription has ended and the search has begun. Use some type of loading indicator to visually communicate that the search is being performed. Make sure to keep the transcription visible during the searching process so the user doesn’t lose context.
  • User Feedback: Allowing users to give feedback helps you decipher similar-sounding terms. A feedback mechanism also lets users feel empowered and provides an actionable follow-up to an incorrect query.

This feedback mechanism can be as simple as a thumbs-up or thumbs-down selector, or you can let the user type a more detailed message to explain what went wrong.
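The stages in the list above can be modeled as a small state machine, which makes it easier to check that every screen shows the right cues and always offers a way out. The sketch below is illustrative only; the state names, event names, and cue labels are assumptions, not part of the Hound app or the Houndify SDK.

```python
from enum import Enum


class Screen(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    PROCESSING = "processing"
    RESULTS = "results"


# Hypothetical events that move the interface from one stage to the next.
TRANSITIONS = {
    (Screen.IDLE, "mic_tapped"): Screen.LISTENING,
    (Screen.IDLE, "wake_word"): Screen.LISTENING,
    (Screen.LISTENING, "speech_ended"): Screen.PROCESSING,
    (Screen.PROCESSING, "response_ready"): Screen.RESULTS,
    (Screen.RESULTS, "mic_tapped"): Screen.LISTENING,
}

# Cues each stage should present, following the checklist above.
CUES = {
    Screen.IDLE: ["microphone button"],
    Screen.LISTENING: ["text prompt", "sound-wave animation",
                       "live transcription", "close button"],
    Screen.PROCESSING: ["end-of-listening sound", "loading indicator",
                        "transcription", "close button"],
    Screen.RESULTS: ["results", "microphone button", "thumbs up/down"],
}


def next_screen(current: Screen, event: str) -> Screen:
    """Return the screen that follows an event; unknown events change nothing."""
    if event == "dismissed" and current in (Screen.LISTENING, Screen.PROCESSING):
        # The user can always back out of the listening overlay.
        return Screen.IDLE
    return TRANSITIONS.get((current, event), current)
```

Keeping the transcription in the cue list for both the listening and processing stages reflects the advice that users should not lose context while the search runs.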

In the next chapter, learn how to improve user onboarding, education, and discovery. With voice technology, we have to learn how to talk to inanimate objects. Teach users a new way to communicate.

©2021 SoundHound Inc.