Natural Language Engineer
Oct 22, 2020

How to Teach Your Voice Assistant to be Multilingual

Until two years ago, voice assistants mostly understood only English, even though it is only the third most spoken language in the world by number of native speakers. Giving early voice assistants an English-only voice seemed natural at the time, since American companies designed and developed these products. Similarly, voice assistants targeted at specific markets or developed in other countries, such as China, were programmed to speak the most common dialects in those geographic areas, without being multilingual.

Mono-linguistic voice assistants left global companies with a few choices:

  • Hope that all their customers could speak the language of the assistant well enough to interact with their products
  • Develop individual voice assistants for each country
  • Look for voice AI technology providers that could train a voice assistant to be multilingual 

While the third option seems to be the most logical one, challenges arose because training a voice assistant to understand new languages is not as easy as it may first appear, especially when NLU and ASR are treated as separate functions.

Even when advanced voice AI technology is integrated, creating multilingual voice assistants is more than just providing translations from English to the language of choice. Localization—including targeted content and formats—is essential to ensure the voice assistant responds in a culturally-acceptable manner, and can understand speech in the context of differing rules of grammar.

Before deploying a voice assistant to your product, service, or app, developers and brands need to understand more than just the need to address a larger audience. Taking shortcuts and skipping critical steps could result in a voice assistant that doesn’t meet the needs of any audience. 

We’ve outlined the 5 key steps to successfully developing a multilingual voice assistant:

Step 1 – Make some critical voice strategy decisions

Before embarking on the journey to voice-enable a product, service, or app, brands must make critical decisions about how many languages their custom voice assistant will be able to speak fluently. 

There is a big difference between a voice assistant that speaks both English and Spanish fluently and one that speaks English with some understanding of Spanish words or names. For instance, if you live in California, your English-speaking assistant will need to understand non-English words like El Camino Real, jalapeños, Pacifica, and Vallejo. Even with those capabilities, your voice assistant would not be considered truly bilingual.

If your use cases don’t exceed a low level of language understanding that includes only names and places, you may not need a truly multilingual assistant. However, if you have customers in areas where English is not the primary language, or where many languages are spoken in close proximity to each other, you may want to consider adding more languages to your voice assistant’s library of knowledge.

The location, or locations, where your voice assistant will be used will play a major role in that decision-making process. For instance, it makes sense for an assistant deployed in North America to speak English, Spanish, and possibly French. For voice assistants that would be accessed by users in Europe, Asia, and Africa, more languages and dialects would most likely be required. No matter which direction you decide to go, making those decisions early in the voice strategy process will save a lot of time and avoid wasted resources later.

Although it’s always possible to add a language later on, it’s best to decide as early as possible to avoid confusion with your users. I like to compare the challenge of user adoption after a major shift to the time when Microsoft released Windows in local languages years after the initial introduction of the operating system. Although users adapted, the transition was difficult.

In addition, users may not initially adopt your voice interface if the assistant doesn’t speak their language. Convincing those users to switch to a voice interface later is a lot harder and requires more effort than onboarding them with a voice assistant that speaks their language from the beginning.

Step 2 – Figure out what’s important to your brand and customers

Once you’ve decided to offer a multilingual voice assistant, there are several key considerations, including:

  • What is my market?
  • What are my competitors doing? 
  • Who are my customers and what do they expect?
  • Where is the assistant going to be used? 
  • How will my assistant be used? 

Knowing your audience and understanding the context of the voice queries your customers are most likely to ask are key to every decision you make about how you will implement a custom voice assistant. Besides which languages to include, tone, personality, and style are all speech considerations that go into creating a voice assistant that accurately represents your brand and values.

For example, if your assistant is going into a car, your users may expect some degree of formality and authority. On the other hand, if your assistant is going to be used at home, you can expect people to speak less formally, and your voice assistant should mirror that more casual tone. Additionally, depending on the country where your voice assistant will be used, decisions about formality and tone may be fundamental to its success.

In Arabic-speaking countries, for instance, you can expect a more formal interaction with in-car systems and will want to teach your voice assistant to use Modern Standard Arabic. In a broader consumer application, like a voice-enabled mobile app, a local dialect, such as Egyptian Arabic in Egypt or Algerian Arabic in Algeria, is a better choice. Similarly, in smaller countries where English is widely understood, like Sweden and Norway, it would be acceptable for an in-car assistant to speak English, but consumers would expect a mobile app to speak the language of the country.
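As a rough illustration, choices like these can be written down as a small lookup table keyed by country and product surface. The sketch below is hypothetical: the function name, the surface labels, and the fallback to English are assumptions made for the example, and the language tags are only there to make the idea concrete.

```python
# Hypothetical mapping from (country, product surface) to the language
# variety the assistant should speak. The entries mirror the examples in
# the text; the tags and the English fallback are illustrative assumptions.
VARIETY_BY_MARKET = {
    ("EG", "in_car"): "arb",       # Modern Standard Arabic
    ("EG", "mobile_app"): "arz",   # Egyptian Arabic
    ("DZ", "in_car"): "arb",       # Modern Standard Arabic
    ("DZ", "mobile_app"): "arq",   # Algerian Arabic
    ("SE", "in_car"): "en",        # English is acceptable in the car
    ("SE", "mobile_app"): "sv-SE",
    ("NO", "in_car"): "en",
    ("NO", "mobile_app"): "nb-NO",
}


def select_variety(country: str, surface: str, default: str = "en") -> str:
    """Return the configured variety, or the default if the market is unlisted."""
    return VARIETY_BY_MARKET.get((country, surface), default)


print(select_variety("EG", "mobile_app"))
```

Keeping the decision in one table makes it easy to review with people who know each market before any training work starts.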

Step 3 – Follow proven best practices for training your voice assistant

Setting up your voice assistant to deliver results in multiple languages begins the same way as developing any aspect of a voice user interface. It begins with data. Data, data, data, and more data. 

For anyone developing a machine-learned model, I recommend trying different values and changing one parameter at a time. Experiment with pushing values up or down, even to levels that seemingly would not make sense. This will give you interesting data points and a good understanding of how the system behaves.
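One way to picture the "one parameter at a time" advice is a small sweep helper. This is a minimal sketch, not a description of any particular toolkit: the parameter names (learning_rate, beam_width) and the toy_error_rate function are made up for the example and stand in for a real train-and-evaluate run.

```python
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]


def sweep_one_at_a_time(
    evaluate: Callable[[Config], float],
    baseline: Config,
    candidates: Dict[str, List[Any]],
) -> List[Tuple[str, Any, float]]:
    """Vary each parameter alone, keeping every other value at the baseline."""
    rows = []
    for name, values in candidates.items():
        for value in values:
            config = dict(baseline)  # copy so the baseline is never mutated
            config[name] = value
            rows.append((name, value, evaluate(config)))
    return rows


def toy_error_rate(config: Config) -> float:
    # Hypothetical stand-in for training a model and measuring its error.
    return (abs(config["learning_rate"] - 0.01) * 10
            + abs(config["beam_width"] - 8) * 0.01)


baseline = {"learning_rate": 0.01, "beam_width": 8}
# Deliberately include values that look too small or too large.
candidates = {"learning_rate": [0.0001, 0.01, 1.0], "beam_width": [1, 8, 64]}

for name, value, error in sweep_one_at_a_time(toy_error_rate, baseline, candidates):
    print(f"{name}={value}: error={error:.3f}")
```

Because only one value differs from the baseline in each run, any change in the score can be attributed to that parameter.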

Avoid the temptation to rely on your intuition too much. Instead, let the data speak for itself. Your initial intuition may lead you to a hypothesis to explore, but in the end any assumptions must be validated with tangible results. If the results are not there, that’s probably not the best solution.

Even one small change can deliver a slightly better result. From there, another change may lead to an even better one. Combining the two may improve the result further or, counterintuitively, make it worse. Let the data lead you, stay open to unexpected results, and keep making the changes that produce better outcomes. It’s an iterative process and should continue even after the voice assistant is deployed.
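To make the point about combining changes concrete, here is a small sketch that scores a baseline, each change alone, and both together. Everything in it is hypothetical: the two changes, the toy_errors function, and its numbers are invented to show the bookkeeping, not measured results.

```python
from typing import Any, Callable, Dict

Config = Dict[str, Any]


def compare_changes(
    evaluate: Callable[[Config], int],
    baseline: Config,
    change_a: Config,
    change_b: Config,
) -> Dict[str, int]:
    """Score the baseline, each change on its own, and both changes together."""
    variants = {
        "baseline": dict(baseline),
        "A only": {**baseline, **change_a},
        "B only": {**baseline, **change_b},
        "A and B": {**baseline, **change_a, **change_b},
    }
    return {label: evaluate(config) for label, config in variants.items()}


def toy_errors(config: Config) -> int:
    # Hypothetical error count: each change helps alone, but they interact badly.
    a = int(config["extra_lexicon"])
    b = int(config["noise_augmentation"])
    return 200 - 30 * a - 20 * b + 70 * a * b


results = compare_changes(
    toy_errors,
    baseline={"extra_lexicon": False, "noise_augmentation": False},
    change_a={"extra_lexicon": True},
    change_b={"noise_augmentation": True},
)
for label, errors in results.items():
    print(f"{label}: {errors} errors")
```

Measuring all four variants, rather than assuming two improvements add up, is what exposes the interaction.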

Step 4 – Avoid common misconceptions and missteps in VUI design 

Conflicting data sources

Know your data. The more data, the better. Although this is generally true, some data sources can pull your results down. If that happens and you have several training data sources, try training with each source separately. Then combine them and check which set performs best. In most cases, adding data, even of lesser quality, will still yield better overall results. If the results remain inconsistent, consider dropping the weaker data source.
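The comparison described here can be sketched in a few lines. This is a toy setup under stated assumptions: the "model" is just a majority-vote lookup, the task is a made-up language-identification exercise, and the source names and examples are invented. The shape of the experiment is the point, not the model.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (utterance, label)


def train(examples: List[Example]) -> Dict[str, str]:
    """Toy model: remember the most frequent label seen for each utterance."""
    votes = defaultdict(Counter)
    for text, label in examples:
        votes[text][label] += 1
    return {text: counts.most_common(1)[0][0] for text, counts in votes.items()}


def accuracy(model: Dict[str, str], test_set: List[Example]) -> float:
    correct = sum(1 for text, label in test_set if model.get(text) == label)
    return correct / len(test_set)


def compare_sources(
    sources: Dict[str, List[Example]], test_set: List[Example]
) -> Dict[str, float]:
    """Train on each source alone, then on all sources combined."""
    runs = dict(sources)
    runs["combined"] = [ex for examples in sources.values() for ex in examples]
    return {name: accuracy(train(examples), test_set) for name, examples in runs.items()}


# Hypothetical data: "scraped" is larger in coverage but contains a wrong label.
sources = {
    "curated": [("hola", "es"), ("bonjour", "fr"), ("bonjour", "fr"), ("hello", "en")],
    "scraped": [("gracias", "es"), ("merci", "fr"), ("bonjour", "en")],
}
test_set = [("hola", "es"), ("bonjour", "fr"), ("hello", "en"),
            ("gracias", "es"), ("merci", "fr")]

for name, score in compare_sources(sources, test_set).items():
    print(f"{name}: {score:.2f}")
```

In this toy case the noisier source still helps once combined, because it adds coverage and its one wrong label is outvoted. With real data the outcome can go either way, which is why each configuration is measured.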


Another common error results from overfitting. To guard against it, we keep 10% to 15% of the data as a holdout set for testing. This is convenient because the data is test-ready. It is also useful to have independent test data, which can uncover gaps in the training data or overfitting in your model. To give a concrete example, if a phonetic transcription dictionary has no alternative transcriptions for Dutch words ending in -en or Swedish plurals ending in -or, we will never train the model to generate those alternatives, and we will not even know that transcriptions of valid pronunciations are missing. In both cases, independently compiled test material will help catch these omissions and will ultimately tell you more about how your model performs on “real life” data.
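Both ideas, a holdout split and a check against independently compiled material, can be sketched briefly. The helper names below are hypothetical, and the phonetic strings are illustrative placeholders rather than vetted transcriptions.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple


def split_holdout(
    items: Sequence, holdout_fraction: float = 0.1, seed: int = 0
) -> Tuple[list, list]:
    """Shuffle reproducibly, then set aside a fraction of the data for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def uncovered_pronunciations(
    lexicon: Dict[str, Set[str]], observed: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """List (word, pronunciation) pairs from independent data missing in the lexicon."""
    return [(word, pron) for word, pron in observed
            if pron not in lexicon.get(word, set())]


train_items, holdout_items = split_holdout(range(100), holdout_fraction=0.15)
print(len(train_items), len(holdout_items))

# Illustrative placeholders: the lexicon only knows the full "-en" ending.
lexicon = {"lopen": {"l o: p @ n"}, "huis": {"h 9y s"}}
independent = [("lopen", "l o: p @ n"), ("lopen", "l o: p @"), ("huis", "h 9y s")]
print(uncovered_pronunciations(lexicon, independent))
```

A holdout drawn from the same dictionary would never contain the reduced ending, so only the independent list reveals the gap.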

Eureka moments

Keeping an open mind can lead you to “Eureka” moments: those times when you realize you’ve discovered something you hadn’t anticipated or weren’t even looking for. They are like little treasures, and they often come from looking at results. When you begin testing, you may have an idea of what the results should be. But when you look at a result and it’s not what you were expecting, don’t dismiss it. Use it as a “Eureka” moment to learn something new. Look at the result, or at how different data sets evolve, and draw conclusions from what you see, not what you expected to see.

Once you’ve had a Eureka moment, use what you learned to design new experiments. They can hint at how to use the system better, point to a new direction, prompt you to revisit old data that previously led to a dead end, and give you a clearer picture overall.

Step 5 – Find the right voice AI platform partner

Solving for languages in a voice assistant is more complicated when natural language understanding (NLU) and automatic speech recognition (ASR) live as two separate components of the same system. In this traditional voice AI technology configuration, it takes a long time to build the query-understanding module for a new language, and there are many opportunities for errors. 

Advanced voice AI platforms like Houndify have eliminated many of the challenges of multiple language acquisition for voice assistants with speech engines that process speech in real-time, closely approximating how humans think and speak. By eliminating the traditional two-step process of speech recognition and response, and by providing an existing library of language support, brands can easily and quickly build customized multilingual voice assistants with the Houndify platform.

Developers interested in exploring Houndify’s independent voice AI platform can register for a free account. Want to learn more about our company or our solutions and technology? Talk to us about how we can bring your voice strategy to life.

Monika Depeyrot, SoundHound Inc.

Monika Depeyrot is Senior Natural Language Engineer at SoundHound Inc. She is passionate about discovering how the mechanics of human language can be represented in a way that computers can understand. She also loves hiking and the outdoors.
