For the most part, a voice user interface (VUI) is an audio-only experience. That being said, designers need to tune into the flow of two-way conversation and create ways for the assistant to help the human user navigate, find things, and most of all, feel heard. Assistants can often give more information than the human can absorb, but at the same time, users need enough information to complete their task.
Though smart speakers like Amazon Echo and Google Home only have voice interfaces, the visual component of their companion apps, and of any VUI with a screen, is of utmost importance too. It’s easy to forget the visual interface when you’re working on voice technology, but visual cues help users understand what’s happening and what they need to do to take action.
Since AI has limited parameters and humans have nearly infinite ways of conversing, count on some misunderstandings between the two. This doesn’t have to be the end of a budding relationship. If your VUI misunderstands or sends users down an irrelevant path, look for ways to recover and for the voice assistant to learn from its mistakes.
Then, there’s the issue of the users themselves. RAIN Producer Ben Steele explained it best: “I don’t know what to say,” he confessed. “I’m bad at talking to them. I stutter over any command I haven’t repeatedly used before to the point where even another human being wouldn’t be able to derive a coherent sentence out of it. I’m just not able to process my side of the interaction in the same way that I would with another person.” He adds, “I have a limited window into things it can do based on what I’ve needed to get it to do in the past and maybe a few things I’ve derived from commercials. Even beyond the features, I don’t really know how I should structure my sentences, and I don’t know, if I mess up my sentence, if I need to just give up and start again or push through.” In the same vein, scripting for voice assistants must be carefully crafted so human users learn how to interact with them. Steele adds, “I don’t really know what phrases are attached to functionality even if my words are understood, so I’m just sitting there trying to shout keywords at it. Me and my voice assistant, we’re not on the same wavelength.”
In this chapter, we’ll discuss how to help machines and humans communicate better.
How do you know when the VUI is listening? How do users know how to summon their voice assistant if they don’t want to call its name? What cues do people need to know when to talk and when to listen? The listening screen conveys that the app is focused on getting user input through voice — not unlike when the keyboard slides up to get user input through text. This screen usually appears in the form of an overlay and can take up as much real estate as you feel is necessary. The listening screen should contain several key elements that help bring the voice and visual experience together.
The first consideration is a microphone button or other part of the speaker that lights up as an easy-to-spot indication that the user is being heard. Siri’s multi-colored sound waves show users that Siri is listening. The animation has the same purpose as Alexa’s blue ring of light or Hound’s twirling blue ring.
It’s also helpful to visually represent the audio input, because it creates a stronger sense of connection between the user’s voice and the product’s understanding. Common visualizations include animated sound waves and level bars. Finally, show users the way out: provide a way to turn off the listening screen at any time.
Audio cues are just as important as visual ones: they let users know the device is listening even when they’re not looking at it.
Here’s some advice from Octavio Menocal, senior voice experience engineer at RAIN: “Include in the reprompt the type of phrases you expect from the user in the context. If users respond with an unexpected intent, catch the failure, and send a response including the type of phrases your app expects to move forward.”
Alexa: Welcome to taxi driver. I can contact a taxi to drive you to your destination, or give you a quote. Which would you like?
User: I want to go to the airport. (Unexpected intent)
Alexa: Excellent! I just need to know if you want me to contact a driver, or if you simply need a quote. What’s your interest?
User: Oh right, I want you to find a taxi for me.
Alexa: Great! You told me you wanted to go to the airport, correct?
User: Correct. (The user is pleased the app remembered the destination and didn’t ask for it again.)
Alexa: Ok, now, where would you like the driver to pick you up?
(User does not remember and asks her mom next to her, while talking to Alexa)
User: Mom, what’s the name of this street?
Mom: William Street, 84404. (Yes! This exchange happened while the user was talking to Alexa, and guess what: Alexa heard “What’s the name of the street we will be at, at 4 pm,” an unexpected intent.)
Alexa: Sorry, I didn’t understand. Where would you like the driver to pick you up? (We didn’t stop the conversation; instead we handled the FallbackIntent and returned a valid reprompt.)
User: William Street, 84404
Alexa: Ok, give me a second, I’m looking for the closest drivers. (This is a progressive response sent while your app gets information from your external API.)
Alexa: Ok, I found 5 drivers available, the ETA to pick you up is 5 minutes, the price is $40, do you want me to contact the closest driver right now?
User: Yes, please.
Alexa: Excellent, the driver confirmed he’s coming in 5 minutes. I just sent you an email with the receipt of your order. Thanks for using taxi driver, have a nice trip!
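The reprompt pattern Menocal describes, and the taxi dialog above, can be sketched as a small dialog state machine. This is a minimal illustration in plain Python, not tied to any real voice SDK; the intent names, slot keys, step labels, and the `handle_turn` function are all hypothetical.

```python
# Sketch of the reprompt pattern: when an utterance maps to an unexpected
# intent, repeat the options the app can act on, and keep any slot values
# the user volunteered out of turn (here, the destination).

def handle_turn(state, intent, slots=None):
    """One turn of a hypothetical taxi dialog. Returns (prompt, new_state)."""
    slots = slots or {}
    # Remember useful slots even when they arrive at the wrong step.
    if "destination" in slots:
        state["destination"] = slots["destination"]

    if state["step"] == "choose_service":
        if intent in ("ContactDriverIntent", "GetQuoteIntent"):
            state["service"] = intent
            if state.get("destination"):
                state["step"] = "confirm_destination"
                return ("Great! You told me you wanted to go to "
                        f"{state['destination']}, correct?", state)
            state["step"] = "ask_destination"
            return ("Where would you like to go?", state)
        # Unexpected intent: don't end the session; reprompt with valid options.
        return ("I just need to know if you want me to contact a driver, "
                "or if you simply need a quote. What's your interest?", state)

    if state["step"] == "confirm_destination":
        if intent == "YesIntent":
            state["step"] = "ask_pickup"
            return ("Ok, now, where would you like the driver to pick you up?",
                    state)
        return ("Should I use the destination you gave me earlier?", state)

    if state["step"] == "ask_pickup":
        if intent == "SetPickupIntent" and "pickup" in slots:
            state["pickup"] = slots["pickup"]
            state["step"] = "search_drivers"
            return ("Ok, give me a second, I'm looking for the closest "
                    "drivers.", state)
        # Overheard side conversation lands here as a fallback: reprompt.
        return ("Sorry, I didn't understand. Where would you like the driver "
                "to pick you up?", state)

    return ("Sorry, something went wrong.", state)
```

Walking the sample conversation through this function reproduces each of Alexa’s recoveries: the out-of-turn destination is saved, both unexpected intents trigger reprompts instead of ending the session, and the destination is never asked for twice.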
“Human conversation is naturally replete with errors,” says Lauren Golembiewski, CEO and co-founder at Voxable. “Through experience, most human brains learn how to correct conversational errors fairly seamlessly. On the other hand, VUIs need to be encoded with extensive error recovery.” Golembiewski cites an example: “What happens when a user says something unexpected to a VUI prompt? How should the system respond if a user says ‘I don’t know’ to the VUI prompt: ‘Are you sure you would like to transfer $2000 to that account?’” This error recovery is as simple as considering what might happen in these kinds of scenarios. But it can get more complex depending on the risk involved in the interaction.
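Golembiewski’s transfer example shows why the risk of an interaction should shape its error recovery. The sketch below is one plausible way to handle it in plain Python; the `confirm_transfer` function and the phrase lists are illustrative assumptions, not from any product or SDK.

```python
# Sketch of risk-aware confirmation handling for the "$2000 transfer" prompt.
# For a high-risk action, anything other than a clear "yes" must not be
# treated as consent; uncertain replies get an explanation and a reprompt.

def confirm_transfer(reply: str) -> str:
    answer = reply.strip().lower()
    if answer in ("yes", "yeah", "sure", "confirm"):
        return "Ok, transferring $2000 now."
    if answer in ("no", "nope", "cancel"):
        return "Ok, I've cancelled the transfer."
    # "I don't know", silence, or off-topic input lands here.
    return ("No money has moved yet. Say 'yes' to transfer $2000 "
            "to that account, or 'no' to cancel.")
```

The key design choice is the default branch: a lower-risk interaction might guess at the user’s intent, but a money transfer should fail safe and restate the stakes before reprompting.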
At Marvee, CEO and VUI designer Heidi Culbertson depends on analytics to help her get to know her audience and understand how they think. “The better you know the actual user, the better your error management will be,” she says. “Error management is literally taking it almost word for word and designing your VUI so that you don’t end up in an endless loop.” Another pitfall with errors is the possibility of losing the audience. “You lose retention when it’s not an easy experience,” she explains.
“Don’t stop the conversation when an error occurs,” Menocal adds. “Some frameworks offer a nice handler to catch an unexpected error in the code. If your app has a syntax error or maybe an operation with an external resource fails, return a response informing the user something unexpected happened and they can come back and try again after you have fixed the error. You can also print that error in your server, or send it to your email, Slack or via SMS so you get alerts of what’s wrong and think of how you can fix it quickly.”
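Menocal’s advice boils down to one pattern: wrap every handler in a catch-all that alerts the team and keeps the conversation alive. A minimal sketch, using the standard-library logger as a stand-in for the email, Slack, or SMS alert channel he mentions; `safe_handle` and the request shape are illustrative.

```python
# Sketch of a "don't stop the conversation" error handler: catch any
# exception, record the full traceback for the team, and return a
# recoverable response to the user instead of crashing the session.
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("voice-app")

def safe_handle(handler, request):
    try:
        return handler(request)
    except Exception:
        # log.exception records the traceback; in production this sink
        # could forward to email, Slack, or SMS as Menocal suggests.
        log.exception("Unhandled error for request %r", request)
        return ("Something unexpected happened on my end. "
                "Please try again in a moment.")
```

A syntax error in a handler or a failed call to an external resource then produces an apologetic, retry-friendly response for the user and an alert for the developers, rather than silence.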
SoundHound Inc. has been designing voice-enabled AI technologies for 14+ years. If there’s one thing we’ve learned from implementing voice experiences, it’s that users usually aren’t sure what to say. Sometimes users don’t know that they should say something when the listening screen appears. Even though they have initiated this screen with the tap of a microphone button, they are often caught off-guard and can feel like they are being put on the spot. In fact, our data shows that many first-time experiences are met with silence.
When the user begins to speak, a transcription is displayed so they can confirm the accuracy of their input. It’s important to note that with our Houndify technology, transcriptions will change and update as the user speaks. This is because we gain context clues on the fly that allow us to more accurately understand what’s being said, as it’s being said.
Transcriptions end shortly after the microphone stops hearing voice input. A small pause (about 2 seconds) before completing the transcription helps to avoid cutting off a user’s input too soon.
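The end-of-speech pause described above can be sketched as a simple timer over the recognizer’s output. This is an illustrative model, not Houndify’s implementation: the event format (timestamped words, with `None` for silence frames) and the `final_transcript` function are assumptions; the 2-second threshold comes from the text.

```python
# Sketch of end-of-speech detection: keep transcribing until roughly
# 2 seconds pass with no voice detected, then finalize the transcript
# and hand it off to the server.
END_OF_SPEECH_PAUSE = 2.0  # seconds, per the guideline above

def final_transcript(events):
    """events: list of (timestamp_seconds, word_or_None); None = silence.
    Returns the transcript at the moment the pause threshold is crossed."""
    words, last_voice = [], None
    for t, word in events:
        if word is not None:
            words.append(word)
            last_voice = t
        elif last_voice is not None and t - last_voice >= END_OF_SPEECH_PAUSE:
            break  # pause long enough: stop listening, send text onward
    return " ".join(words)
```

Shorter mid-sentence pauses fall under the threshold, so a user who hesitates briefly isn’t cut off, while a sustained silence ends the turn.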
After the transcription ends, the text is sent to the server so that a response can be provided. We recommend using a sound effect here to communicate that transcription has ended and the search has begun.
The time it takes to get a response depends on several factors, like connection speed or complexity of the query. During this process, use some type of loading indicator to visually communicate that the search is being performed. Make sure to keep the transcription visible during the searching process so the user doesn’t lose context (a surprising number of users mentioned this in our user research). Once the search is complete, the listening screen will animate out and results should be displayed.
Let users tell you when a response misses the mark. This feedback mechanism can be as simple as a thumbs-up or thumbs-down selector, or you can let the user type a more detailed message to explain what went wrong.
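Either style of feedback reduces to a small payload tied to the query it rates. A minimal sketch; the field names and the `build_feedback` function are illustrative, not any particular API.

```python
# Sketch of a feedback payload: a thumbs rating plus an optional free-text
# note, attached to the query it rates so the team can replay what happened.
import json
import time

def build_feedback(query_id, rating, note=None):
    if rating not in ("up", "down"):
        raise ValueError("rating must be 'up' or 'down'")
    payload = {"query_id": query_id, "rating": rating, "ts": int(time.time())}
    if note:
        payload["note"] = note  # the optional "what went wrong" message
    return json.dumps(payload)
```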
In the next chapter, learn how to improve user onboarding, education, and discovery. With voice technology, we have to learn how to talk to inanimate objects. Teach users a new way to communicate.