VUX World podcast
Dec 16, 2021

What You Need to Know About Wake Word Detection

Wake words, or wake-up words, are your users’ first interaction with your voice assistant. A custom, branded wake word can help users develop brand association and loyalty because they repeat the brand’s name over and over again, sometimes multiple times a day. As a result, users will come to associate the effectiveness of the voice assistant with the brand, creating a lasting bond. While a wake word may be just a few syllables long, it takes time, effort, and investment to get it right.

Recently, Darin Clark, SoundHound’s Director of Business Development, spoke with VUX World’s Kane Simms about wake word detection. During their talk, they discussed best practices for creating a wake word, which devices a wake word can be deployed on, why brands should have a wake word, and more.

Watch the entire interview here or read the recap below. 

Kane: How common would you say is the requirement for a wake word for voice assistants? 

Darin: It’s becoming almost essential across all environments. What we’re seeing with regard to wake words is it hasn’t been used very often in a car environment. There’s always a push-to-talk button on the steering wheel or another button to invoke voice assistants, which was deemed to be sufficient for a long time. 

Now there’s definitely a push in automobiles to add wake words. They’re becoming much more ubiquitous across the automotive industry, and more broadly as we move into a post-COVID world where people don’t want to press buttons for any reason.

In a lot of cases, it’s becoming a much more accepted interaction. Smart speakers responding to voice commands have accelerated the requirement to have that functionality across any number of different devices.

Kane: I’ve heard wake word, voice trigger, wake phrase, and wake up word. Is there a standardized term? 

Darin: It doesn’t really matter. Most people know what you’re talking about, but you’re right. People refer to it as different things. 

There are lots of different ways to describe it. There hasn’t been an accepted industry term that has been settled on. Functionally, all of these terms are interchangeable, and people understand what is being referred to.

Kane: What goes into creating a wake word for a brand?

Darin: There are a couple of components to it—the core engine and the wake word model. 

First, you need a wake word engine, the technology that resides locally on the device. Detection shouldn’t happen in the cloud; the engine should listen to the audio around the device. It should have precise phonetics of certain words or phrases that it will respond to. It also shouldn’t record or transcribe the audio that’s coming in. It should simply listen for the phonetic properties.

Then, you train the model to your phrase, whether that’s “Hey Pandora” or “Hey Mercedes.” Once the wake word model has been trained to listen for the wake phrase, it will activate the device once it hears it, so users can give a command or ask a query.

One thing you’ll definitely need in order to create the model is data: recordings of individuals speaking the wake phrase. Generally, it’s a few hundred people who speak the wake phrase, maybe 10, 15, or 20 times each. You also should gather a broad demographic swath of different genders, ages, accents, and regions across the country.

That way, the model will be trained on the phonetic properties of that phrase and understand the different acoustics that go into each of the words in the phrase. It will be able to identify how the acoustics blend together into the pronunciation of the phrase. Then, the model can accurately respond when the phrase is pulled out of background noise. 
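
To put the collection numbers above in perspective, here is a minimal Python sketch of the arithmetic. The figure of 300 speakers is an assumption standing in for "a few hundred," and the function name is made up for illustration.

```python
def total_recordings(speakers: int, repetitions: int) -> int:
    """Number of wake phrase recordings if every speaker repeats the phrase."""
    return speakers * repetitions


# Assumption: "a few hundred" speakers is taken to be 300 for this example.
SPEAKERS = 300

for repetitions in (10, 15, 20):
    print(repetitions, "repetitions each:", total_recordings(SPEAKERS, repetitions))
```

Under that assumption, the training set lands somewhere between 3,000 and 6,000 recordings of the phrase.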

It may just be a single phrase, but it is a relatively intense process that takes some time. It’s critical that you don’t have false acceptances where it thinks a different word or phrase is the wake word and starts listening. You also don’t want false rejections where people say the wake word and the device doesn’t wake up. Either of these would lead to significant customer dissatisfaction. 
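
The trade-off between false acceptances and false rejections can be sketched in a few lines of Python. This is an illustration only, not SoundHound's evaluation method: it assumes a detector that gives each audio clip a score between 0 and 1 and wakes the device when the score reaches a threshold. The function name and all of the scores below are invented for the example.

```python
def error_rates(positive_scores, negative_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    positive_scores: scores for clips that DO contain the wake phrase
    negative_scores: scores for clips that do NOT contain it
    """
    # A false rejection: the phrase was spoken but the score fell short.
    false_rejections = sum(1 for s in positive_scores if s < threshold)
    # A false acceptance: something else was said but the score was high enough.
    false_acceptances = sum(1 for s in negative_scores if s >= threshold)
    return (false_rejections / len(positive_scores),
            false_acceptances / len(negative_scores))


# Made-up scores, purely for illustration.
positives = [0.91, 0.84, 0.42, 0.77, 0.95]
negatives = [0.10, 0.35, 0.81, 0.05, 0.22]

for threshold in (0.3, 0.5, 0.9):
    frr, far = error_rates(positives, negatives, threshold)
    print(f"threshold={threshold}: false rejections={frr:.0%}, false acceptances={far:.0%}")
```

With these toy numbers, raising the threshold removes false acceptances but rejects more genuine attempts, which is why tuning takes time and plenty of data.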

Kane: What can you say to the perception that voice assistants are listening all the time?

Darin: We can use a radar gun as an analogy. It could be used by police to see the speed of cars, a baseball scout to see how fast a pitcher is throwing a ball, or a golfer to see how hard they’re hitting. The radar gun doesn’t particularly care. It uses mathematics to determine how fast things are going. It’s not recording everything that’s happening all the time. It’s just waiting until it sees that object go through the frame.

It’s the same thing with a wake word. The audio is being passed into the microphone, but it’s not being recorded. It’s only listening to the acoustic properties of the audio that’s coming into it. Once it finds one that matches, then boom, the device is activated. Before that, it’s not streaming audio up to the cloud. It’s not even recording it locally on the device. It’s just passing through the microphone, trying to determine if somebody has spoken the phrase that it’s specifically listening for.
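
One way to picture "listening without recording" is a loop that keeps only a short, fixed-size window of recent audio in memory and discards everything older. The Python sketch below is a simplified illustration under stated assumptions, not how any particular product works: `score_window` is a hypothetical stand-in for the on-device wake word model, and the integers stand in for audio frames.

```python
from collections import deque


def listen_for_wake_word(frames, score_window, window_frames=3, threshold=0.8):
    """Return the index of the frame at which the wake phrase is detected.

    Only the most recent `window_frames` frames are held in memory.
    Nothing is written to disk or sent anywhere.
    """
    window = deque(maxlen=window_frames)  # oldest frame is dropped automatically
    for index, frame in enumerate(frames):
        window.append(frame)
        if len(window) == window_frames and score_window(list(window)) >= threshold:
            window.clear()  # discard the buffered audio once the device wakes
            return index
    return None  # the stream ended without a match


def toy_score(window):
    """Hypothetical scorer: pretends the pattern 1, 2, 3 is the wake phrase."""
    return 1.0 if window == [1, 2, 3] else 0.0


stream = [0] * 10 + [1, 2, 3] + [0] * 5
print(listen_for_wake_word(stream, toy_score))  # 12
```

Because the buffer has a fixed maximum length, audio that does not match simply falls out of memory as new frames arrive.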

There are going to be words that sound similar and trigger false acceptances, so avoiding them is part of developing a good wake phrase. Brands should look for phrases or words that wouldn’t particularly match a lot of other phrases in the target language.

At SoundHound, we provide a service where we give guidance on ideas for wake words that brands would want to use and feedback on which ones would have fewer false acceptances. You want some phonetic variation in the phrase, differentiation between consonants and vowels. You don’t want the same sounds repeated. 

So, our team can look at a specific wake phrase and let the brand know whether it has enough phonetic variation or whether the phrase is too common in the language. We can help with recommendations.

Kane: So what are best practices for creating a wake word? Santa wouldn’t want a voice assistant called ho-ho, right?

Darin: Exactly right. That’s a great example. Historically, we recommend 4 to 5 syllables. The longer it is, the easier it is to differentiate between other similar-sounding words or phrases, and the fewer false acceptances you’ll have.

You can’t really get phonetically varied with only 2 syllables. If you’re a brand with only a 2 syllable name, you can add a “Hey,” “Hello,” or “Okay” at the beginning in order to get that up to 3 or 4 syllables and have more phonetic variation in the phrase.
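
The length and repetition guidance above can be turned into a rough screening check. The Python sketch below is only a crude approximation: it assumes syllables are supplied by hand (automatic syllabification is not attempted), it treats a repeated syllable as a proxy for "the same sounds repeated," and the default minimum of 3 syllables is an assumption taken from the low end of the range mentioned above. The function name is hypothetical.

```python
def check_wake_phrase(syllables, min_syllables=3):
    """Return a list of warnings for a candidate wake phrase.

    syllables: the phrase split into syllables by hand, e.g. ["hey", "pan", "do", "ra"]
    """
    warnings = []
    if len(syllables) < min_syllables:
        warnings.append(
            f"only {len(syllables)} syllables, want at least {min_syllables}"
        )
    lowered = [s.lower() for s in syllables]
    if len(set(lowered)) < len(lowered):
        warnings.append("repeats a syllable")
    return warnings


print(check_wake_phrase(["ho", "ho"]))
print(check_wake_phrase(["hey", "pan", "do", "ra"]))
```

The first call flags both problems for "ho-ho," while "Hey Pandora" returns an empty list. A real review would look at phonemes and at how common the phrase is in the target language, which this sketch does not cover.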

You want to maintain the brand, but you also want to differentiate between people talking to the device and people talking about the device. Take Google, for example. If the wake word was just “Google,” the voice assistant would be going off all the time whenever someone said, “I’ll just Google it.”

Even though Spotify is a long enough word, many people still talk about their Spotify playlist or listen on Spotify. It needed the “Hey” in front of it to differentiate it. 

Kane: That is such a good insight. I always thought that it was just because they were trying to be cool and trendy, but there’s a practical reason. So, we’ve got something that’s audibly different, something that’s not too short, and something that is probably not too long. Is there a maximum length? 

Darin: There really isn’t a maximum length, but for user experience reasons, brands wouldn’t want it to be too long. Nobody wants to say a 10 syllable phrase every time they want to activate their device, and users also may stumble over longer phrases. 

Kane: Are there any other common best practices that you would advise brands thinking about wake words to consider?

Darin: Depending on where you’re deploying your voice assistant, you may want to consider other languages, which may have different sounds. There are certain words or phrases that native Mandarin speakers can’t pronounce as easily because the sounds don’t have the same properties in their language as in English. Similarly, English doesn’t have the same properties as German or Eastern European languages. If you want to deploy your voice assistant worldwide, this should be considered.

Kane: Are there any limitations to the actual device type that a wake word can exist on?

Darin: Conceptually, it can exist on any type of device. There are environments that make it more challenging. We’re getting better at recognizing noisy environments and using less processing power. 

If you’re working on an airplane tarmac, that’s going to be more difficult for the voice assistant. For a device like a smart earbud, a voice assistant could drain the battery pretty fast just by listening for the wake word. Low power type devices have their own challenges, but generally speaking, they can be deployed. 

Children’s toys have a similar obstacle where young children speak phrases differently than adults. It can be challenging to create wake words for children that respond consistently and accurately.  

All of these challenges are an order of magnitude better than they were a decade ago or even 3 years ago. They’re all improving significantly year over year in terms of being able to recognize children’s speech, sift through noisy environments, and work well with lower power devices. 

As we continue to evolve in this space of voice assistants and voice technology, I expect that the majority of these constraints will be alleviated, and people will develop ways to address them.

Ready to learn more? Watch for part 2 of our recap of the conversation and stay tuned for more blogs on the topic of wake words.

Questions? You can email Darin directly at [email protected] or check out Darin’s LinkedIn, and he will be happy to respond. 

Interested in learning more about SoundHound’s wake word technology? Visit our website here or fill out this form to hear from our business development team. 

Watch the entire interview below and subscribe to the VUX World newsletter to keep tabs on what’s coming next. 
