It’s essential that voice assistants accurately detect wake words—even in noisy environments where music, talking, wind, and other background sounds can create either a false positive or a lack of response. As voice assistants gain popularity and voice interfaces expand from smart speakers to restaurants, drive-thrus, airports, and cars, their ability to function in noisy environments is becoming paramount to adoption. Voice assistants able to respond with accuracy and speed amidst a sea of noise are those that will continue to meet—or exceed—users’ expectations.
Recently, Darin Clark, SoundHound’s Director of Business Development, spoke with VUX World’s Kane Simms about wake word detection. As part of that conversation, they took a deep dive into how wake words can be trained to overcome noisy environments, new use cases for voice assistants, and more.
Kane: I can see how noisy environments would be a challenge. Can you talk more about that?
Darin: Yeah, there are always challenges in noisy environments.
15, 20 years ago, it had to be really quiet for a wake word to work well. Then, we began to use different technologies, like noise cancellation and echo cancellation, in order to specifically enable the wake word engine to ignore extraneous noise and focus on listening for the specific wake word or phrase.
Then, we moved into a time when we were pretty good with background noise but struggled if the device was playing music or something. Adding echo cancellation really improved that scenario.
Similarly, we would also have problems with very far-field types of environments where people are a long way away from the device. That creates a lot of echoes, but utilizing technology like beamforming has enabled us to work at pretty significant distances. 20, 30, 40 feet seem to be no problem in a lot of cases.
Now, we’re in this scenario where we’re combining all of those technologies into one. So even when there are environments with a lot of background noise, when the device is making outbound noise from talking or playing music, or somebody is speaking from a long way away, the device will still work.
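This layered front end can be pictured, very roughly, as a pipeline. The sketch below is purely illustrative (the stage names and their toy implementations are invented stand-ins, not SoundHound’s actual DSP), but it shows the order of operations: cancel the device’s own playback, suppress residual noise, fuse the microphone array, and only then score the cleaned signal for the wake word.

```python
# Illustrative wake word front end. Each stage is a toy stand-in for
# real DSP: echo cancellation, noise suppression, beamforming, then a
# detector scoring the cleaned frame.

def echo_cancel(mic, playback):
    # Subtract the device's own output (music/TTS) from the mic signal.
    return [m - p for m, p in zip(mic, playback)]

def noise_suppress(frame, noise_floor=0.05):
    # Zero out samples below an estimated noise floor.
    return [s if abs(s) > noise_floor else 0.0 for s in frame]

def beamform(channels):
    # Delay-and-sum in miniature: average the aligned mic channels.
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]

def wake_word_score(frame):
    # Stand-in for the trained model: mean energy of the cleaned frame.
    return sum(abs(s) for s in frame) / len(frame)

def detect(channels, playback, threshold=0.2):
    # Clean each mic channel, fuse them, then score for the wake word.
    cleaned = [noise_suppress(echo_cancel(ch, playback)) for ch in channels]
    focused = beamform(cleaned)
    return wake_word_score(focused) >= threshold
```

In a real system each stage is an adaptive filter or a neural model; here they are one-liners only to make the data flow concrete.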
For something like a drive-thru food ordering experience, like we’ve been doing with MasterCard, there are some definite challenges there with an external environment. There are two drive-thru lanes, each with people talking. During testing, after somebody ordered, they would turn their music way up and weren’t concerned about everybody around them trying to order.
Realistically and reasonably, we’ve reached the point where voice technology can address that. The benefit of that scenario is that the driver is pretty close to the voice-enabled kiosk. If they were 5, 10, 15 feet back, there would be a lot more background noise and other interesting challenges.
Another near-field environment example would be a mobile phone because people tend to bring the phone right up to their mouth and speak directly into the microphone. Oftentimes, they even speak too loud. So we have to address that environment where there’s a distortion based on how close they are and how loud they are speaking to the device to wake it up.
What we’ve seen over the last few years is that we’re starting to make lots of improvements, not just incremental improvements in the quality and ability to recognize wake words, but really exponential improvements in various environments.
Kane: Would restaurants be another example of noisy environments?
Darin: Yes, the big thing about restaurants is that you have people saying the same thing to multiple voice-enabled kiosks that are right next to each other. We want to make sure that the wake word activates the kiosk the person is in front of, but not the ones next to it.
Same with kiosks in airports, train stations, or retail stores. The kiosk needs to recognize the person speaking in front of it versus somebody that’s standing in front of a different kiosk.
Kane: That’s a great point. We actually spoke to Andrew Richards, your colleague from SoundHound, about speech recognition in noisy environments. If you want to know more about that, please do check out the episode with Andrew, which is absolutely immense. So, is it the model or the engine that you have to develop for this kind of differentiation?
Darin: In this case, there are multiple aspects to it. When we train the model, we can train it with specific acoustic properties for the type of device it’s going into. So that customers’ orders are recognized more accurately, we would train the model for far-field environments, or, for kiosks, to differentiate between people.
There’s definitely an aspect of training the model in such a way that it has an understanding of what type of environment the wake word is going to be in, whether that’s far-field or near-field, in a car with background noise, or a mobile app that will be used on the streets.
There are also device-specific capabilities. Noise cancellation, echo cancellation, or beamforming software or hardware technologies can be inherent to a specific device. For a smart speaker, you might have certain beamforming technologies. Similarly, at a drive-thru, you may want to incorporate noise cancellation technology in order to more accurately recognize not just the wake word but also everything that comes after it.
Kane: What’s the difference between training a model on collected data versus speech recognition training, which is trained constantly, almost indefinitely over time?
Darin: Good question. It generally comes down to a few things. Definitely, the more data, the better for training a wake word model. The more samples you have of people speaking the wake word, the better the model gets, but then you hit a law of diminishing returns. There’s a point where the incremental improvement from more data no longer justifies the additional investment to get it.
The main difference from core speech recognition models is that with a wake word, you’re only looking for a very small set of phonetic properties. You don’t have to think about recognizing everything in the dictionary, or even a specific set of commands.
With commands, there are lots of different phonetics that go into each of them, and you have to train to ensure they don’t conflict with each other. For core speech recognition, more data is almost always better; it can always be improved.
When it comes to a wake word, once you get into the core phonetics of the phrase, you understand there can be variations in the ways that people say the phrase, but you can get those variations by that demographic swath in the initial data. Once you get that, there’s not really a lot else to do. You can train a little bit more for the specific environment or the specific device to improve the model for that specific market. But generally speaking, once it’s done, you’re going to get good results.
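The contrast Darin draws (a wake word engine matches one small, fixed phonetic pattern, while core speech recognition has to decode an open vocabulary) can be sketched in a few lines. The phoneme labels and the wake phrase below are made up purely for illustration:

```python
# Illustrative only: a wake word matcher slides one fixed phoneme
# template over the decoded stream, which is why the model stays small
# once the core phonetics of the phrase are covered.

WAKE_PHONEMES = ("HH", "EY", "K", "AA", "R")  # hypothetical phrase "Hey Car"

def matches_wake_word(decoded, wake=WAKE_PHONEMES):
    # Check every alignment of the fixed template against the stream.
    n = len(wake)
    return any(
        tuple(decoded[i:i + n]) == wake
        for i in range(len(decoded) - n + 1)
    )
```

A full ASR system, by contrast, would have to score that same stream against an entire lexicon and language model, which is why its training never really stops benefiting from more data.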
Kane: Is this something that SoundHound handles for brands from the beginning to the end of the process?
Darin: Absolutely. We can work with a brand and give them recommendations on potential wake words if they’re thinking about various phrases. Then, once we agree on what the phrase will be, we go collect the data. Once we create and train the models, we test them and can do retraining based on additional data from the target device.
We have a number of customers that utilize our wake word technology, including some of our automotive customers as well as companies like Pandora and Deutsche Telekom.
Kane: Do most customers use the full SoundHound technology, or do some only use a component?
Darin: Most of them use the full suite, but we definitely have customers who are only using our wake word technology or ASR technology. The majority use everything, but it’s also an option to take components.
Kane: Do you agree that voice assistants will be extended into new environments and channels?
Darin: Yeah, that’s our expectation too. We’re seeing companies acknowledge and recognize the value of a custom branded voice assistant. Once brands understand that, they want a custom wake word that reinforces their brand and builds customer loyalty.
We definitely see these types of voice assistants, and particularly wake words, moving across different devices. We have automotive partners who have voice assistants in their vehicles and want us to enable a voice assistant on a companion app, so drivers can check from their living room the status of their vehicle or how much gas or charge it has.
Similarly, in food ordering, a QSR will have a drive-thru experience, but then they’ll want to expand that to a voice-enabled kiosk, mobile app, or phone ordering system. Brands are looking to have capabilities delivered across various services and devices. It’s going to be critical to have a voice assistant and wake word that can be across any number of different devices.
Kane: Looking at SoundHound’s big enterprise clients like Pandora, MasterCard, and Mercedes, is it feasible for mid-market companies to seriously consider wake word technology?
Darin: I think that it’s definitely possible for a smaller to mid-market-sized company to create their own wake word or work with us to create that wake word.
There is cost involved, and there’s effort involved. There are lots of investments that need to happen both in terms of getting the data to train the wake word model and also an investment of human capital to actually do the engineering work to create it.
With that being said, the advantage of it is that once you create the wake word, it can be deployed across any type of device without having to go back and retrain or invest in more development. Brands could then have the ability to reach customers with their voice assistant in any number of different ways.
That’s how we’ve developed SoundHound’s Houndify platform. It’s an open developer platform where users can create an account, sign up, log in, and deploy a custom voice assistant using a number of our online, web-based tools. We want to enable a broad swath of customers to have their own custom voice assistant for any type of device.
We’ve talked a bit about different areas that voice assistants are coming to, and others include elevators, conference rooms, office buildings, airplanes, and everywhere else you can think of where people are going to want to talk to a device to get things done.
Particularly in a post-COVID world, where people are much more reluctant to touch buttons or screens that many other people might have touched, the ability to speak to a device is going to increase in value and usage across all these different types of services.
Kane: Interesting. One final question. Where, now or in the near future, would you genuinely like to see more activity around wake word detection and custom voice assistants?
Darin: The area that seems to be one of the hottest right now is on the food ordering side. There have been a lot of restaurants making recent announcements about investing in voice technology.
The reality is that a lot of these restaurants are having trouble finding employees, so they’re looking for ways to mitigate their labor shortage. I think that’s one area where we’ll continue to see significant investment and shifts in usage patterns. Acceptance of wake words and voice assistants in quick-service restaurants will continue to increase over the near term.