Building voice AI in noisy environments with Andrew Richards
Aug 24, 2021
7 MIN READ

Building Voice AI in Noisy Environments for Great User Experiences

It’s rare to have an interaction with a voice assistant that doesn’t have some form of background noise. The user could be in a car with the windows rolled down, at a quick-service restaurant with other customers’ chatter, or on a sidewalk with street noise. Whatever the background noise is, a voice assistant needs to be able to filter through it and focus on the person asking the question. Otherwise, it will lose its accuracy, have false positives and negatives, and create frustration for the user. 

While many voice assistants reside in the home, 130 million users are currently using in-car voice assistants, according to Voicebot.ai. In-car voice assistants are a prime example of a voice user interface that must contend with a noisy environment, including engine and road noise, other passengers talking, wind noise from rolled-down windows, and music playing. With brands expanding their use of voice assistants beyond the home, it’s vital that the voice assistant is able to understand the user’s questions and commands in spite of background noise. Noisy environments aren’t limited to in-car voice assistants. Echoes, background conversations, machine noise, and music are present in many places where voice assistants are deployed.

VUX World’s Founder, Kane Simms, sat down with SoundHound’s Director of Business Development, Andrew Richards, to discuss the importance of building voice AI in noisy environments, how microphones play a role, noise cancellation technology, and training models with background noise. 

During the interview, Kane and Andrew spoke on a variety of topics on voice AI in noisy environments. The following are some of the highlights from that conversation. Want to take a deeper dive? You can view the interview in its entirety here.

Kane: Can you talk about the kind of noisy environments that you’ve had to work with outside of the home?

Andrew: Yeah, so there’s automotive, which is outside of the home, and there are some acoustic challenges there. The smartphone also leaves the home, and you use it in different environments.

With smartphones, there could be music in the background, people go on trains, or there are cars and street noise. All of that background noise goes into the microphone. So that’s where we started with our experience with managing different sources of noise. 

Where it gets complicated is definitely in the automotive industry, where there’s engine noise that changes with the speed of the car, kids screaming in the back, people honking their horns, or the windows down. There are so many different parameters that can change in terms of the noise that you get in a car environment. 

Kane: When you’re working with a car manufacturer or another device that is used away from the home, do you have any influence over the choice of the microphone? 

Andrew: It’s generally an array of microphones in the car. We don’t have much influence in terms of the microphones that they use. When you’re talking about a smartphone, we can’t influence Apple or any of the other manufacturers when it comes to the microphone.

So they have their microphones and technology, and we have to work with it. There are some projects where we get to work with them before they’ve actually designed the hardware. In those cases, we can talk about microphones and provide some advice. The main advice we can give is to have an array of microphones so that you can get several sources and send the cleanest one to the speech recognition engine.

Kane: Is it more about the placement of the microphones, or is it more about the number of microphones?

Andrew: So the advantage of microphone arrays is there’s just one chip, and you just stick it somewhere, and then it picks up the audio from various different zones in the car. I don’t think anyone would really want to plug in 12 microphones in different areas of the car. It wouldn’t be an acceptable solution. 

Kane: What is noise reduction in this environment?

Andrew: Noise reduction is really useful in the car. If you make a phone call and you use the microphone in the car, you’re going to get a lot of background noise, and it’s not pleasant to hear all of that engine noise when you’re on the phone. So, essentially, noise reduction is useful for that use case.

When it comes to the way our technology works, we train our ASR to manage the noise. There’s this expression that I really hate, which is throwing the baby out with the bathwater. But if noise reduction is applied, it can remove some of the information that our ASR needs. So it can actually do more harm than good to apply noise reduction, given our ability to manage the noise.

Kane: Is it that sometimes there may be a negative customer experience because of the elements that you can’t control, or is it that you do some post-processing that tries to make up for a weak signal?

Andrew: It’s not exactly post-processing. First of all, we all have to deal with background noise. If you have kids, unwanted background noise is part of your life. As humans, we manage that really well. 

I like to use the example of robotics companies. We’ve seen some amazing videos of these robots that can jump in and do all sorts of crazy stuff. But it took them forever to teach these robots to walk. For us as humans, it’s natural. We put one foot in front of the other without thinking about it. 

Similarly, we’re really good at distinguishing between human speech and other sources of noise. We don’t give it a second thought because it comes easily to us, at least until we start getting older. Then that algorithm in the brain starts having trouble tuning out background noise without eliminating speech along with it.

What we do at SoundHound is much like processing human speech. With our Speech-to-Meaning® technology, we’re trying to mimic how humans perceive speech. 

Essentially, there are three ways of managing background noise:

  1. Prevent the noise from getting into the microphone
  2. Noise reduction—such as removing the noise
  3. Just deal with it

Our approach is to figure out how to recognize and identify speech in a signal and do the best we can with it, regardless of how much noise it comes with. 

It’s probably worth explaining the difference between the two main components of speech recognition. 

With ASR, there are two main components:

  1. The language model 
  2. The acoustic model 

Essentially, the acoustic model is the first component, which receives a sound signal. It tries to identify speech patterns in that signal and convert those speech patterns into phonemes. Once you have the phonemes, the language model is what turns them into real words and sentences.
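To make that two-stage pipeline concrete, here is a minimal, purely illustrative sketch. The lookup-table “models” below are stand-ins invented for this example; real acoustic and language models are statistical, but the division of labor is the same: audio frames become phonemes, and phonemes become words.

```python
# Illustrative two-stage ASR pipeline: an acoustic model maps audio
# frames to phonemes, and a language model maps phonemes to words.
# All names here are hypothetical; real systems use neural networks
# and probabilistic decoders, not lookup tables.

PHONEME_LEXICON = {
    ("DH", "AH"): "the",
    ("K", "W", "IH", "K"): "quick",
    ("F", "AA", "K", "S"): "fox",
}

def acoustic_model(audio_frames):
    """Stand-in: map each frame to its most likely phoneme label."""
    # A real acoustic model scores phoneme probabilities per frame.
    return [frame["best_phoneme"] for frame in audio_frames]

def language_model(phonemes, lexicon=PHONEME_LEXICON):
    """Stand-in: greedily match phoneme runs against a word lexicon."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):
            candidate = tuple(phonemes[i:j])
            if candidate in lexicon:
                words.append(lexicon[candidate])
                i = j
                break
        else:
            i += 1  # skip unmatched phoneme (a real LM would score alternatives)
    return " ".join(words)

frames = [{"best_phoneme": p}
          for p in ["DH", "AH", "K", "W", "IH", "K", "F", "AA", "K", "S"]]
print(language_model(acoustic_model(frames)))  # -> the quick fox
```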

Kane: Then, how does the model actually differentiate between the speech and the background noise? 

Andrew: What we do is we train our acoustic models with the noise. We take the sentence of someone saying, “The quick brown fox jumped over the lazy dog,” and include all of this background noise. 

Noise like unwanted background speech is particularly difficult because you’re trying to teach this algorithm to recognize and identify human speech. At the same time, you’re trying to tell it to listen to just one specific person and ignore all those other voices.

So, we have to teach it that you can have one voice and then several voices underneath it that you want to completely ignore. We train it with all kinds of unwanted background noise that can be anything, such as dogs barking. We have to cut through that noise by including it in our models. 
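One common way to build this kind of noise-trained data, sketched here with NumPy, is to mix clean recordings with noise at a chosen signal-to-noise ratio. This is a generic augmentation recipe for illustration, not SoundHound’s actual pipeline:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix clean speech with noise at a target signal-to-noise ratio (dB)."""
    # Tile/trim the noise recording to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # stand-in "speech"
noise = rng.standard_normal(16000)                           # stand-in cabin noise
noisy = mix_at_snr(speech, noise, snr_db=5.0)                # training example
```

In practice you would sweep `snr_db` over a range and draw noise clips (engine, chatter, barking dogs) at random, so the model sees each utterance under many noise conditions.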

As a developer, you can switch between different acoustic models on the fly. So if you have a device that could be both near-field and far-field (near-field is where you’re very close to the microphone, and far-field can be several meters away), depending on the microphone that picks up the voice, you can send it to a different endpoint. You can switch acoustic models. 
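As a rough illustration of that routing idea, a device might pick an acoustic-model endpoint based on which microphone captured the voice. The endpoint names and the distance threshold below are hypothetical, not a real API:

```python
# Hypothetical sketch of routing audio to a field-specific acoustic model.
# Endpoint names and the 1-meter threshold are illustrative assumptions.

NEAR_FIELD_ENDPOINT = "asr/near-field"  # user close to the microphone
FAR_FIELD_ENDPOINT = "asr/far-field"    # user several meters away

def pick_endpoint(source_mic):
    """Choose an acoustic model based on which microphone picked up the voice."""
    # In practice the device knows which mic (or beamformer zone) fired;
    # here we key off a simple metadata field.
    if source_mic["distance_m"] < 1.0:
        return NEAR_FIELD_ENDPOINT
    return FAR_FIELD_ENDPOINT

print(pick_endpoint({"distance_m": 0.3}))  # -> asr/near-field
print(pick_endpoint({"distance_m": 3.0}))  # -> asr/far-field
```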

If we see that the acoustic model isn’t performing in an optimal way in a new use case, then it could be worth collecting data from the actual environment. But typically, we hope we would have already collected a lot of data in those environments. For cars specifically, we add things like indicator noise, AC fans, and engine noise at different speeds. 

With a car, you’re shifting gears and going from 30 miles an hour to 50 or 60, and those frequencies are changing constantly. What we do then is collect data at different speeds, so you have all the engine noise for that model of car trained at different speeds. We also use other noise like the window wipers and in different driving conditions, such as rain.

The Lombard Effect is an important part of this. If you mix a clean recording of someone with background noise, all you have is someone speaking in a quiet environment with background noise. Whereas in real life, when someone is in that environment, they will modify how they speak. 

A good example of that is if you enter a restaurant at 5 pm, and it’s really quiet, your voice will be much lower. You can even whisper and have a conversation. But as people come in, the noise gets louder, and you will then start getting a slightly higher-pitched voice, and the frequencies of your voice will change. 

We have to consider that because when you’re at a much higher speed in the car and the windows are open, the frequencies that you emit from your voice will also change. It’s not just the amplitude or that you’re speaking louder. The actual frequency range changes as well. 

Kane: So noise cancellation headphones work by inverting the signal, so the wave goes in the opposite way and cancels out. Is that right?

Andrew: There is a frequency range. Human voices are within a certain range of frequencies. Background noise, such as fan noises, AC, and the rumbling of a plane, are frequencies that don’t match up with the human voice.

Essentially, you can throw those out. As you said, it sends them back inverted to cancel them out. But it’s a lot easier to identify those frequencies and remove them without interfering with the voice signal than it is with other sources of sound.
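A simple way to “throw out” frequencies that don’t overlap the human voice, shown here as an illustrative NumPy sketch rather than any product’s actual filter, is to zero the spectral bins outside an approximate voice band:

```python
import numpy as np

def bandlimit(signal, sample_rate, low_hz=80.0, high_hz=3400.0):
    """Zero out spectral content outside an approximate voice band."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

sr = 16000
t = np.arange(sr) / sr
voice = np.sin(2 * np.pi * 300 * t)   # tone inside the voice band
rumble = np.sin(2 * np.pi * 40 * t)   # low-frequency engine/AC-style rumble
cleaned = bandlimit(voice + rumble, sr)  # rumble removed, voice tone kept
```

This only works for noise that sits outside the voice band; noise sharing the same frequencies as speech, which the interview turns to next, can’t be separated this way.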

The problem is the non-linear elements: noise that comes in and out and covers some of the same frequencies as speech. Noise cancellation headphones cut out steady background noise. But if you have a kid screaming next to you or a baby crying, those kinds of noises can’t be canceled out because it’s much more difficult to do that.

Kane: Do you have to deal with a mask effect where some of the background noise is actually at the same frequency as the speaker?

Andrew: Yeah, essentially, those types of noises interfere with the speech recognition engine because there’s no easy way of canceling and removing them. It’s like reCAPTCHA on websites: there’s a sequence of letters you’re supposed to type in, but squiggly lines have been added to make it purposely difficult. So that’s the situation we’re in.

We try to teach the acoustic model to recognize the words regardless of how much noise or the type of noise there is in the environment.

Kane: You mentioned that this is something that you can do on the fly, switching different acoustic models?

Andrew: Switching on the fly is generally for far-field and near-field because they are very different. There are challenges with near-field that you don’t get with far-field. We mentioned convolutional noise; that’s typically the noise that you’ll get with far-field.

For example, I’m speaking right now, and my voice is going directly into the microphone. So the sound waves are going in, but they’re also hitting the wall and coming back at the same time. So convolution is essentially like a reverb on the voice coming from the front.

It’s generally a separate model when it comes to near-field. The issue that you get is distortion. The reason that’s difficult is that you could distort everything and teach the acoustic model to recognize it. But when it’s just one phoneme that’s been distorted and the rest is normal, that’s where it gets difficult.

That’s what happens with near-field audio because you get too close to the microphone. So that’s why we have separate models. When it comes to all this background noise, we train the model on all of it, so it’s going to be able to perform at 65 mph, 30 mph, and in a parking garage. All of that noise has been included in the acoustic model. 

Kane: Is it possible to differentiate from different speakers, especially if there are other people talking in the background?

Andrew: There are different ways of doing that. One is an anonymous way of doing it based on spatial parameters. Whoever triggers it first will get all of the attention from the microphone array. So the microphone will then lock onto wherever that person is, and everyone else is ignored. This way is useful for multi-zone use cases.
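That “lock onto the first speaker” behavior is typically implemented with beamforming. The toy delay-and-sum sketch below (not SoundHound’s implementation) shows the principle: delaying and summing the microphone channels makes signals arriving from the locked-on direction add coherently, while sound from other directions blurs:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each mic channel by its integer-sample arrival delay, then average.

    Signals from the steered direction add coherently; signals from other
    directions are misaligned and partially cancel.
    """
    out = np.zeros(len(channels[0]))
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -d)  # advance the channel by its arrival delay
    return out / len(channels)

# A source whose wavefront reaches mic 0 first, mic 1 one sample later,
# and mic 2 two samples later (delays would be derived from the direction
# of whoever triggered the wake word).
rng = np.random.default_rng(1)
src = rng.standard_normal(1000)
channels = [np.roll(src, d) for d in (0, 1, 2)]
steered = delay_and_sum(channels, delays=[0, 1, 2])  # recovers src
```

Real arrays estimate the delays from the direction of arrival and use fractional-sample filtering, but the coherent-sum idea is the same.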

There are other ways of specifically identifying a person. That generally requires enrolling the users. If we’re thinking in terms of members of a family, you can ask each member of the family to say the wake word a few times, so when that person says the wake word, you actually identify and recognize them. The response is then appropriate for whoever is actually talking to the voice assistant.
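Enrollment-based identification is often done by comparing voice embeddings. The sketch below is a generic illustration with made-up vectors, assuming some upstream model produces an embedding for each wake-word utterance; it is not a description of SoundHound’s system:

```python
import numpy as np

def enroll(wake_word_embeddings):
    """Average several wake-word embeddings into one voiceprint per user."""
    return np.mean(wake_word_embeddings, axis=0)

def identify(embedding, voiceprints, threshold=0.7):
    """Return the enrolled user with the most similar voiceprint, or None."""
    best_user, best_score = None, threshold
    for user, print_vec in voiceprints.items():
        # Cosine similarity between the utterance and the stored voiceprint.
        score = np.dot(embedding, print_vec) / (
            np.linalg.norm(embedding) * np.linalg.norm(print_vec)
        )
        if score > best_score:
            best_user, best_score = user, score
    return best_user

# Toy 2-D embeddings standing in for real speaker-embedding vectors.
voiceprints = {
    "parent": enroll(np.array([[1.0, 0.0], [0.9, 0.1]])),
    "child": enroll(np.array([[0.0, 1.0], [0.1, 0.9]])),
}
print(identify(np.array([0.95, 0.05]), voiceprints))  # -> parent
```

An unrecognized voice falls below the similarity threshold and gets no match, which is how a system can fall back to an anonymous response.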

Kane: Voice is obviously leaving the home, as it already has on mobile. More brands are going to be looking at putting voice solutions in place in stores as they start to open up, in quick-service restaurants, and in all different kinds of environments.

Looking for more information on voice AI technology in every environment? Go to www.houndify.com or visit our blog for more information.

If you missed the podcast and video live, or if you want to see it again or share it with a colleague, you can view the interview in its entirety here.

Be sure to join Kane Simms and Darin Clark of SoundHound Inc. in October as they discuss wake word detection and conversational AI. Subscribe to the VUX World newsletter to keep tabs on what’s coming next.
