Feb 11, 2021

How to Design Voice Assistants for Noisy Environments

There are many reasons speech recognition systems in voice assistants fail under noise. The first and easiest to fix is the position of the physical microphone. When only one microphone is present, the result is similar to a person listening to a conversation with just one ear. Voice AI systems with multiple microphones mimic the human brain’s ability to separate sounds coming from different directions and can focus on the sound from a single source.

Other elements of a noisy environment aren’t as easy to correct and not all solutions work for each type of noise interference. In general, noise is very difficult for speech systems to handle and requires various methods to reduce.

Common types of noise that obstruct speech recognition systems include:

  • Additive noise
  • Convolutional noise
  • Nonlinear distortion

Additive noise

Additive noise includes sounds such as a fan, a vacuum cleaner, an air conditioner, or a baby crying in the background. These are called additive noises because they combine with the target speech signal at the microphone, where the sound waves superimpose one atop the other.

To turn sound into meaning, the speech recognition system will need to extract the target speech signal from this combination.
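As a toy illustration (using NumPy, with a sine wave standing in for speech), the "additive" part is literal: the mixture is a sample-by-sample sum at the microphone.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)   # 1 second at 16 kHz
speech = np.sin(2 * np.pi * 200 * t)           # stand-in for the target speech
fan = 0.3 * rng.standard_normal(16000)         # broadband, fan-like noise

# At the microphone the two waveforms simply superimpose:
mixture = speech + fan

# The recognizer only ever observes `mixture`; recovering `speech`
# from it is exactly the extraction problem described above.
```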

Convolutional noise

Convolutional noise—or convolutional distortions—refer to the reverberation introduced by enclosed spaces. When someone is speaking in an enclosed space, the sound waves bounce off surfaces, such as walls, before reaching the microphone. The resulting sound recorded at the microphone will be colored and echoey or reverberant. The bigger the enclosed space, the more reverberant the recorded sound will be.
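The "convolutional" in the name is also literal: a room's effect can be modeled as convolving the dry signal with a room impulse response (RIR). Here is a minimal NumPy sketch using a synthetic, assumed RIR; real systems measure or simulate these responses.

```python
import numpy as np

fs = 16000
rng = np.random.default_rng(0)
dry = rng.standard_normal(fs)                  # stand-in for 1 s of dry speech

# A toy room impulse response: a direct path plus decaying random
# reflections. Bigger rooms mean longer tails and a more reverberant result.
rir = np.zeros(int(0.3 * fs))                  # 300 ms tail
rir[0] = 1.0                                   # direct path
lags = rng.integers(1, len(rir), size=50)      # reflection arrival times
rir[lags] += rng.uniform(-0.5, 0.5, size=50) * np.exp(-3 * lags / len(rir))

wet = np.convolve(dry, rir)[:len(dry)]         # the "colored", echoey recording
```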

Nonlinear distortion

Nonlinear distortion happens when the speaker is too close to the microphone, or the input gain on the device is set too high, causing the signal to clip.
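Clipping is the classic example: when the gain is too hot, samples beyond the converter's range are flattened, which creates harmonics that were never in the original sound. A small NumPy sketch:

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 220 * t)            # a pure 220 Hz tone

hot = 3.0 * clean                              # input gain set far too high
clipped = np.clip(hot, -1.0, 1.0)              # the converter flattens the peaks

# Clipping adds odd harmonics (660 Hz, 1100 Hz, ...) that the clean tone
# never contained; a recognizer has never "heard" speech distorted this way.
spectrum = np.abs(np.fft.rfft(clipped))
```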

Training speech recognition systems on a lot of data—including data that contains all the different speech sounds and combinations of sounds in different contexts and environments—is critical to designing a voice assistant that responds accurately.

If your voice assistant will be used in an environment with a lot of reverberation and you never address this type of distortion in the training data, your voice model will not be able to handle signals for your most likely use cases. 

Knowing who your users are and where they will be using your voice assistant is key to designing a viable voice user interface—and surprisingly one of the elements often skipped or cut short by developers. Here are 6 tips and best practices to help create clean signals and improve the accuracy of your voice user interface.

1. Know your user’s environment 

When designing a speech system—such as a smart speaker or a voice activated toy—don’t underestimate the effect of the distance between the user and the microphone. 

When the user’s voice reaches the microphone, it won’t be the only sound received. Due to reflections caused by sound bouncing off obstacles or surfaces in the user’s environment, voice commands will be recorded along with these reflections. As a result, the speech recognition system will receive an echoey signal—making it difficult to process. 

As the user moves away from the microphone, the energy in the direct path decreases and the signal is harder to recognize. Speaking louder doesn’t help, as it also causes the sound reflections to be stronger—further masking the user’s intent.

2. Choose the right microphone 

To reduce adverse effects of environmental noise, begin by choosing the right microphone. Most importantly, you’ll want to select a microphone that has good directivity towards the speaker. If the microphone is pointed toward the speaker, noise sources and reverberation coming from other angles towards the microphone will be lessened.

Some traditional analog microphone capsules offer very good directivity. In addition, micro-electromechanical systems (MEMS) microphones are commonly used in smartphones, laptops, and similar devices. These microphones are fabricated as part of a silicon chip, which makes them very small, lightweight, and quite inexpensive.

MEMS microphones are omnidirectional, meaning they pick up sound arriving from any angle. While that may not sound like an advantage, combining several of them into an array lets you focus on a single direction and reduce noise coming from all the others.
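The array idea can be sketched with delay-and-sum beamforming: delay each channel so the look direction lines up, then average, so the target adds coherently while uncorrelated noise averages down. A toy NumPy version (the per-mic delays are assumed, the test tone is periodic so `np.roll` acts as a delay, and each mic gets independent noise):

```python
import numpy as np

fs = 16000
rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)  # signal from the look direction

n_mics = 8
delays = np.arange(n_mics) * 3            # arrival delay (in samples) at each mic

# Each omnidirectional mic hears a delayed target plus its own noise.
channels = [np.roll(target, d) + 0.5 * rng.standard_normal(fs) for d in delays]

# Delay-and-sum: undo each channel's delay, then average. The target adds
# coherently; uncorrelated noise power drops by roughly a factor of n_mics.
beam = np.mean([np.roll(ch, -d) for ch, d in zip(channels, delays)], axis=0)
```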

3. Choose linear noise reduction components

If you want to apply noise reduction, make sure it’s a linear component. Nonlinear noise reduction systems can undermine speech recognition systems, making it even harder to process the speech signal.

Traditionally, noise reduction algorithms were built for human perception. Because the speech recognition system is not the same as the human auditory system, these algorithms aren’t always good for voice assistant development.
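The distinction is testable: a linear component obeys superposition, so processing a speech-plus-noise mixture gives the same result as processing each part separately, while nonlinear steps such as the flooring in classic spectral subtraction break this. A sketch, where the 5-tap averaging filter and the crude spectral subtractor are both illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(256)       # stand-in for speech
b = rng.standard_normal(256)       # stand-in for noise

# Linear processing: a fixed FIR filter (here a 5-tap moving average).
def fir(x):
    return np.convolve(x, np.ones(5) / 5, mode="same")

# Nonlinear processing: crude spectral subtraction, whose max(..., 0)
# flooring is exactly the kind of step that hurts recognizers.
def spectral_subtract(x, noise_mag=5.0):
    spec = np.fft.rfft(x)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # nonlinear flooring
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(x))

print(np.allclose(fir(a + b), fir(a) + fir(b)))       # True: superposition holds
print(np.allclose(spectral_subtract(a + b),
                  spectral_subtract(a) + spectral_subtract(b)))  # False
```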

When you apply aggressive noise reduction to voice systems, two adverse effects can occur: speech deterioration and speech signal deletion.

Speech signal deletion shows up as the seemingly random disappearance of certain frequencies, or even all of them, over time. In the worst case, this is like hearing a sentence in which only parts of the words are pronounced, which confuses speech recognition systems.

In some cases, musical noise or artifacts can occur. When noise reduction pops on and off in certain frequency bands at certain points in time, isolated bits of sound remain after everything else is removed. They look like sparkles on a spectrogram and sound like random tones; because they don't occur in natural speech, speech recognition systems don't know what to do with them.

4. Add zone control with source separation 

Voice separation allows voice user interfaces to respond to voice commands from various places in a room or in a car. For example, if we put several speakers in different places in the car, the driver, the front-seat passenger, and backseat passengers would all have access to different functionalities of the voice assistant. 

When the driver speaks and the passengers in the back are making some noise, the separation algorithm is able to split up the various microphone signals so that the driver’s voice signal is cleanly recognized.

The passenger voices can also be detected as clean signals, separate from the driver. This is a huge achievement because each individual microphone picks up everything that’s going on in the car. 

Putting separation algorithms in place in conjunction with multiple microphones enables features like zone control. Anyone in the car can say, “I am cold,” or “Play some music,” and the command will be executed in the zone where that person is located. You can also apply zone restriction mechanisms, for example, preventing passengers in the back from opening the trunk by voice while permitting the driver to do so.

5. Include more data in training models

Including more data in the training and ensuring that the training system has seen the kinds of disturbances that can occur in real life are key to making speech recognition systems more robust. For example, if there are specific noises in a hospital environment or in a car environment, you want to capture these noises and use them to augment clean recorded speech. Recording speech in your users’ environments provides accurate data for training your neural networks.

Be sure you’re collecting statistically representative samples and that the noises you collect cover a wide range of disturbances that can occur in practice. If you choose only one specific type of noise, your models might be overfitted—performing very well in the presence of this noise, but failing with others.
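One common way to do this augmentation, sketched below with NumPy, is to mix each clean utterance with varied noises at randomly chosen signal-to-noise ratios. The signals here are random stand-ins; a real pipeline loads recorded speech and recorded environment noise.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add."""
    noise = np.resize(noise, speech.shape)      # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)             # stand-in for one clean utterance
noise_bank = [rng.standard_normal(16000) for _ in range(3)]  # e.g. fan, babble, traffic

# Cover a range of noise types and SNRs to avoid overfitting to one condition.
augmented = [mix_at_snr(speech, n, rng.uniform(0, 20)) for n in noise_bank]
```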

Adding reverberation to your speech recordings can be especially challenging. You'll need to reverberate a lot of speech samples covering a wide range of phonetic variations. The reverberation smears the voice sounds over time, adding long-term dependencies and resulting in blocks of speech sounds that will be trained into the model. Therefore, be sure to have many different blocks represented in your data.

Using algorithms like noise reduction, echo cancellation, dereverberation, and beamforming introduces artifacts and speech degradation into the system. To make your speech systems robust to these artifacts, apply the same signal processing algorithms to the training data that you use in production.

6. Allow for interruptions and barge-ins

One of the trickiest use cases for voice recognition systems is when the system is talking and users want to interrupt, or barge-in. When the voice assistant is talking and the user wants to say something, the synthetic speech of the system will leak back into the microphone. 

As a result, the system will always hear a small portion of its own speech. When a user interrupts the system, their speech superimposes with the residual echo of the system, confusing the voice assistant. To allow for voice barge-in, you first have to put echo cancellation in place. Echo cancellation enables voice user interfaces to listen while playing music or speaking.

The purpose of echo cancellation is to remove the signal emitted by the loudspeakers after it has been distorted (e.g., by reverberation or the loudspeaker's own nonlinear distortion) and has travelled back to the microphone. When properly tuned, echo cancellation can attenuate most of the music or speech leaking back from the loudspeaker into the microphone.

Although you can greatly reduce the echo, it’s impossible to remove it entirely using echo cancellation technology—as residual echo will always remain. If you want to use an echo canceler for speech recognition systems, make sure it’s using linear signal processing.
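A standard linear approach is an adaptive FIR filter such as NLMS (normalized least mean squares), which continuously estimates the loudspeaker-to-microphone path and subtracts the predicted echo. A minimal sketch on synthetic data; the toy echo path and parameters below are assumptions for illustration, not a production tuning:

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=64, mu=0.5, eps=1e-8):
    """Adaptive linear echo canceller (NLMS): estimate the loudspeaker-to-mic
    path and subtract the predicted echo from the microphone signal."""
    w = np.zeros(taps)                 # estimated echo-path filter
    buf = np.zeros(taps)               # most recent far-end samples
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_hat = w @ buf             # predicted echo
        e = mic[n] - echo_hat          # residual = near-end speech + leftover echo
        w += mu * e * buf / (buf @ buf + eps)
        out[n] = e
    return out

# Synthetic check: the mic hears only an echo of the far-end signal.
rng = np.random.default_rng(0)
far = rng.standard_normal(8000)
path = np.array([0.0, 0.6, 0.3, -0.1])         # assumed toy echo path
mic = np.convolve(far, path)[:len(far)]
residual = nlms_echo_cancel(far, mic)
# Once the filter has converged, the residual echo should be far weaker
# than the raw echo picked up by the microphone.
print(np.mean(residual[-1000:] ** 2) < 0.01 * np.mean(mic[-1000:] ** 2))
```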

You’ll also need to implement a kind of cutoff policy that tells the system to stop talking when it detects a barge-in by the user. The system must have a way of determining whether a barge-in is really a barge-in, or if someone is just sneezing or coughing in the background—in which case the system should not stop talking.

One option is to implement a system where the voice assistant stops talking when it detects the barge-in, but then goes back and picks up where it left off, if it finds that no one was actually speaking. 
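A toy sketch of such a cutoff policy: require several consecutive speech frames before committing to the interruption, otherwise resume playback where it paused. The frame-level `is_speech` detector here is a hypothetical stand-in for a real voice-activity detector.

```python
def handle_barge_in(frames, is_speech, confirm_frames=5):
    """Return 'handed_over' if enough consecutive speech frames confirm a
    real barge-in, else 'resumed' so the assistant picks up where it paused."""
    consecutive = 0
    for frame in frames:
        if is_speech(frame):
            consecutive += 1
            if consecutive >= confirm_frames:
                return "handed_over"   # stop talking and listen to the user
        else:
            consecutive = 0            # a cough or sneeze resets the count
    return "resumed"

# A cough: isolated "speech-like" frames, so playback resumes.
print(handle_barge_in([0, 1, 0, 0, 1, 0], bool))   # resumed
# Sustained user speech: the system stops and listens.
print(handle_barge_in([1, 1, 1, 1, 1, 1], bool))   # handed_over
```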

Why do voice assistants still mis-recognize speech so frequently?

Even if you have put all the signal processing in place and trained acoustic models with augmentation covering all kinds of distortions and disturbances, you still might not see good performance in the end.

Why is that? Speech is fuzzy, vague, and often ambiguous. What we need to make sense of speech is context. Context helps us resolve difficulties in understanding.

Imagine a child of about eight who can recognize pretty much every word that grown-ups say, but cannot grasp the meaning. At that age, children lack the context and the mental concepts for what is being discussed. If the conversation is about politics, for example, they will understand the individual words, but not the meaning.

The situation is similar with voice assistants. If the voice system lacks knowledge of a particular domain, the voice assistant will not be able to make sense of it. If it doesn’t have a good concept of the subject matter and the context of what’s being talked about, the voice assistant will make really weird recognition errors—causing user frustration and confusion. 

Make sure that—in addition to all the signal processing and acoustic model training—your system has a good representation of knowledge in the form of content domains.

Developers can explore Houndify’s independent voice AI platform at Houndify.com and register for a free account. Want to learn more? Talk to us about how we can help you bring your voice strategy to life.

Blog based on an interview with David Scheler.
