Victor Leitman
Dec 08, 2020

Embedded, Cloud, and Hybrid Connectivity Solutions for Custom Voice Assistants

When voice AI is connected to the cloud, users have the ability to ask questions on a range of subjects including weather, local search, stock prices, sports scores, news, restaurants, parking information and so much more. But what about when a device doesn’t need to deliver such a breadth and depth of information?

Does that mean the device must be tethered to a cloud-based third-party voice assistant in order to deliver a convenient and hands-free user experience? Not at all. In fact, there are many types of branded voice assistants and voice user interfaces that can be customized for specific, limited use cases, product specs, and output requirements.

Depending on the level of functionality required from the voice assistant and the amount of memory and processing power available on the device, manufacturers can voice-enable their products with connectivity options ranging from fully embedded (no cloud connectivity at all) to hybrid (partial cloud connectivity) to cloud-only, which requires a cloud connection at all times.

For many manufacturers, adding an embedded voice assistant is a relatively easy way to unlock functionality and deliver hands-free access to users. Operating a thermostat, a smart refrigerator or TV, or a smart medical device doesn’t necessarily require cloud connectivity. These types of devices can employ an embedded voice AI solution and deliver all the functionality a user would need for the purposes of those limited use cases.

In other cases, a hybrid voice assistant can deliver an always-on, responsive voice assistant that can provide device or product control while delivering timely and accurate information from the cloud. 

From cloud-only to hybrid connectivity for voice AI

Car manufacturers usually want a hybrid solution—a voice assistant that is able to process the user’s query locally in the offline mode while simultaneously sending the query to the cloud and giving the user the best response available.

A hybrid solution is particularly desirable for in-car use cases since the vehicle may move from areas with connectivity to those without. Employing a hybrid solution requires an automatic arbitration model that can decide where the appropriate information resides—in the embedded voice technology or in the cloud—and how to return the most accurate response.

In most cases, cloud-based speech recognition is more robust and more accurate than embedded speech recognition. In a hybrid voice assistant model, if the cloud responds fast enough with an accurate result, this result is immediately returned to the user. However, if the cloud doesn’t respond within the specified time, or the response is not good enough, then the response from the embedded software is returned. 
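The arbitration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SoundHound's implementation: the `Result` type, the `arbitrate` function, the confidence score, the 0.5-second timeout, and the stub engines are all hypothetical names and values chosen for the example.

```python
import concurrent.futures
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    source: str        # "cloud" or "embedded"
    text: str          # response to return to the user
    confidence: float  # hypothetical 0..1 quality score


Recognizer = Callable[[str], Optional[Result]]


def arbitrate(query: str, cloud: Recognizer, embedded: Recognizer,
              timeout_s: float = 0.5,
              min_confidence: float = 0.6) -> Optional[Result]:
    """Prefer a timely, good-enough cloud result; otherwise use the embedded one."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    try:
        # Send the query to both engines at the same time.
        cloud_future = pool.submit(cloud, query)
        embedded_future = pool.submit(embedded, query)
        try:
            result = cloud_future.result(timeout=timeout_s)
            if result is not None and result.confidence >= min_confidence:
                return result
        except Exception:
            # Timeout, dropped connection, or any other cloud failure.
            pass
        return embedded_future.result()
    finally:
        # Do not block on a slow cloud call; let it finish in the background.
        pool.shutdown(wait=False)


# Stub engines standing in for real recognizers (assumptions for the demo).
def fast_cloud(query: str) -> Result:
    return Result("cloud", "Cloud answer for: " + query, 0.9)


def slow_cloud(query: str) -> Result:
    time.sleep(0.3)  # simulate a weak connection
    return Result("cloud", "Late answer for: " + query, 0.9)


def weak_cloud(query: str) -> Result:
    return Result("cloud", "Unsure about: " + query, 0.1)


def offline_cloud(query: str) -> Result:
    raise ConnectionError("no network")


def embedded_engine(query: str) -> Result:
    return Result("embedded", "Local answer for: " + query, 0.7)


print(arbitrate("Turn on the air conditioning", fast_cloud, embedded_engine).source)
print(arbitrate("Turn on the air conditioning", offline_cloud, embedded_engine).source)
```

In this sketch the embedded engine always runs in parallel, so its answer is already available when the cloud times out. A real system would tune the timeout and the quality threshold per product.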

For instance, if someone is driving into the mountains and the connection drops, you still want the voice assistant to recognize queries like, “Turn on the air conditioning,” or “Roll up the windows,” or “Close the sunroof.” All of these types of queries can be handled in the offline mode.

But queries like, “When does the next flight arrive in San Jose from Tokyo?” or “Find me the nearest Thai restaurant with at least a 4-star rating,” are handled exclusively in the cloud. In these cases, responses are only available when cloud connectivity is present.

The beauty of the hybrid solution is the ability to provide always-on connectivity and offer responses and product functionality even when cloud connectivity is not available. 

Ideally, we want the cloud and hybrid interfaces (APIs) to be very similar. That way, developers can build the user interface against the cloud while the embedded voice experience is being integrated. When all components are ready, switching from cloud-only connectivity to a hybrid solution is only a matter of changing the URL in the configuration.
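As a rough illustration of that idea, the sketch below keeps the request shape identical in both modes and changes only the endpoint URL. The `VoiceConfig` class, the `build_request` helper, both URLs, and the request fields are hypothetical and do not reflect a real Houndify API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceConfig:
    endpoint_url: str  # the only setting that differs between modes


# Hypothetical endpoints, for illustration only.
CLOUD_ONLY = VoiceConfig(endpoint_url="https://voice.example.com/v1/query")
HYBRID = VoiceConfig(endpoint_url="http://localhost:8080/v1/query")


def build_request(config: VoiceConfig, query: str) -> dict:
    """Build the same request shape regardless of which endpoint serves it."""
    return {
        "url": config.endpoint_url,
        "body": {"query": query},
    }


cloud_request = build_request(CLOUD_ONLY, "Roll up the windows")
hybrid_request = build_request(HYBRID, "Roll up the windows")

# The application code and request body are unchanged; only the URL differs.
print(cloud_request["body"] == hybrid_request["body"])
print(cloud_request["url"], hybrid_request["url"])
```

Because the client code never branches on the mode, a team can develop against the cloud endpoint and later point the same code at a local hybrid service.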

Embedded voice technology for on-device functionality

In addition to cloud-connected voice assistants and those assisted by embedded technology, there are embedded-only voice assistants that don’t rely on the cloud to deliver on their promise. If your product needs to recognize only a few hundred phrases or maybe a few thousand phrases, an embedded voice assistant may be your best solution. Although it can still be powerful, an embedded voice assistant requires a lot less computing power than a hybrid solution.

Unlike a hybrid solution, which requires a relatively powerful board with a GPU (e.g., the NVIDIA Jetson family), an embedded voice AI can run on less expensive boards with just enough power for offline speech recognition (e.g., a Raspberry Pi).

Since the introduction of voice assistants for the home, privacy and security concerns have been barriers to adoption. Recently, there has been a lot more conversation around privacy, especially in Europe, and people are very concerned about their data being sent to the cloud.

For applications where privacy matters, offline speech recognition can help allay these concerns because no internet connection is needed for the voice assistant to operate. Again, think of voice user interfaces for smart thermostats, smart TVs, and even cars.

The biggest trade-off for implementing embedded technology is the robustness and breadth of responses available. Although we provide a good rate of accuracy with our embedded technology without consuming too much memory, most embedded systems don’t have enough computing power to provide answers to complex and compound queries. 

Embedded voice AI in noisy environments

For an embedded solution to be adapted to noisy environments, it’s best to know ahead of time what kind of noise the voice assistant is going to experience. For example, will it be used in public spaces such as shops and train stations, in a home next to a washing machine, or in a car?

No matter the environment, we can create the custom acoustic models and work with the manufacturer to ensure their microphones know how to handle noisy environments. 

That’s not to say that a particular microphone is required for voice AI to work. While a noise-free input audio stream is beneficial for accuracy, manufacturers have often already chosen their microphone provider. We are able to handle all kinds of noise environments, including car, airplane, and cafeteria noise, and we are willing to work with any microphone provider to deliver the best speech recognition experience possible.

Challenges of implementing an embedded or hybrid voice solution

Sometimes the greatest challenges don’t lie in the choice of connectivity options. Often, we find the greatest challenges come from preconceived notions of what voice AI is and what it can do.

Surprisingly, many of the clients I meet are still stuck in the past. They still think that people can’t talk to a computer in a natural human voice, so they phrase their queries as short keywords: “deep, hot, cold, gray.” They don’t talk to a computer naturally, as in, “Please make me a hot Earl Grey tea.” Instead they say, “Tea, Earl Grey, hot,” as though Star Trek were the only language model for speaking with voice assistants.

I think people got used to the old models and speech recognition systems and they haven’t modified their expectations based on newer technology that’s currently available.

In call center applications, for example, callers are still asked to choose from a menu of voiced selections, such as “Please press one in order to be connected to a representative,” or “Please press two in order to hear the balance of your credit card,” and so on. Surprisingly, I still see a lot of designs coming from that place, breaking things into very segmented intents without taking into account that human speech can be much more complicated than that.

Instead, developers should be designing voice assistants to respond to their customers in a much more human way. For example, “Please tell me my current credit card balance and when my next payment is due,” or “Close all the windows, turn on the air conditioning, and tune the radio to 88.5 FM.”

The answer to many natural language queries does not come from a single source. Instead, the data is gathered from across domains to deliver a single response. I think it’s important for developers to start thinking about designing the system from the point of view of talking to a fellow human, instead of initiating responses based on single keywords.

After all, when someone gets into a taxi or Uber, they don’t start by telling their driver, “Navigate to New York,” and the driver doesn’t respond, “OK, and now please tell me your street name,” and then, “Now please tell me your house number.” The passenger simply wants to be taken to a specific destination and should be able to express that in one statement, since the driver already knows which city they’re in.

Once we start thinking about talking to products and devices the same way we would talk to another person, the whole system becomes more natural. My advice is to implement a natural voice experience now and forget about how the old system worked or how we thought it should work.

At SoundHound Inc., we have all the tools and expertise needed to develop a voice assistant ranging from the always-on voice control and maximum privacy of embedded voice interfaces, to a cloud-only option for the broadest capability with minimum hardware impact, to a hybrid solution that combines the control of embedded with the capabilities of the cloud.

Explore Houndify’s independent voice AI platform or register for a free account. Want to learn more? Talk to us about how we can help you bring your voice strategy to life.

Victor Leitman is Director of Embedded Engineering at SoundHound Inc. Growing up in Estonia, then moving to Israel and later to the United States, made him realize the importance of clear communication. The SoundHound embedded team allows him to combine his enthusiasm for languages with his passion for engineering.
