Mike Zagorsek on the Voicebot Podcast
Jul 02, 2020
7 MIN READ

What Does it Take to Build and Deploy a Custom Voice Assistant?

In a recent episode of The Voicebot Podcast, host Bret Kinsella, founder and CEO of Voicebot.ai, talked to Mike Zagorsek, VP of product marketing at SoundHound Inc., about what it takes to create and launch a custom, branded voice assistant. The podcast included insights into the current state of the voice AI industry, why the world’s leading brands are using custom voice assistants to add value to their brands, and advice on how to get started and what to avoid when working on a voice-first strategy.

The following is a quick recap and a few highlights of that conversation. You can hear the podcast in its entirety here: 

What it Takes to Deploy a Custom Voice Assistant – Voicebot Podcast EP 155

The opportunities and challenges of voice technology

Bret: What are the common threads or different threads that you’re seeing, having seen all of the different types of user interfaces over your career? 

Mike: If you think about everything up until voice, it tends to be pretty task-oriented. And a lot of interface designers ask, “How do I make sure that the user understands what they can and can’t do?” And then once they understand what they can do, “How do we make it as easy as possible for them to do it?” The fundamental principles are really about shortening the time between thought, action, and outcome.

Voice in some ways is easier because it’s something that’s very natural to us. You don’t have to teach anybody how to use their voice. But that quickly becomes very problematic because we don’t simply use our voices to accomplish tasks. We express ourselves and convey emotion. I sometimes say if a picture is worth a thousand words, there’s a thousand ways to say the same thing, which makes it all the more difficult because there are fewer rules. With screens, there are natural constraints of how cluttered it can be and when you go to voice, it’s unbounded. Unbounded gives you maximum flexibility, but it does create other challenges.

If you’re on a platform like Alexa or Google, you typically have a service or an offering that is accessed through that third-party platform. So, you’re just making sure that the people who are speaking to Alexa and Google can also access your stuff.


But some companies are going to want to have a direct connection to their customers. Pandora is one of our partners and they’re a great use case for that because as smart speakers proliferate, music is one of the top three use cases. Pandora knew that they wanted to have a direct connection with their customers. The result was they voice-enabled their app with their own branded wake phrase, “Hey Pandora.” As a result, they’re able to get access to direct user data, take control of the user experience, and use that knowledge to improve.

Natural language understanding changes the game

Bret: I assume a lot of the people, maybe the vast majority of folks who are listening to this podcast are somewhat familiar with SoundHound. Why don’t you lay out the components of the solution and what you’re trying to accomplish? 

Mike: I would say at the core is the way we manage speech and natural language. If you really want to build a speech engine, you want it to process speech more closely to the way humans do, which is in real time.

As people are speaking, I’m always checking the context. I’m not waiting for you to finish, pausing, and then trying to decipher what you said. And the challenge for the co-founders of SoundHound Inc. was to build a platform that processes speech the same way. We refer to it as Speech-to-Meaning®. It processes what you’re saying in real time, and that allows for speed and accuracy in two ways:

Because our technology is already processing the speech as it’s being spoken, there is no lag in response. Whether you’re talking about weather, sports, getting navigation, or even just a general question, the system knows that’s what’s being asked even before the user is done speaking. That’s where the speed comes from. The accuracy really comes from the fact that because we know you’re referencing a specific type of content, we’re more likely to get it right.
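To make the idea concrete, here is a minimal, hypothetical sketch of incremental intent detection in Python: the classifier updates its best guess after every word instead of waiting for the end of the utterance. The keyword lists and domain names are illustrative only and are not SoundHound’s actual implementation.

```python
# Illustrative keyword sets; a production system would use trained models.
DOMAIN_KEYWORDS = {
    "weather": {"weather", "temperature", "rain", "forecast"},
    "navigation": {"navigate", "directions", "route", "drive"},
    "sports": {"score", "game", "match", "team"},
}

def classify_incrementally(token_stream):
    """Yield the best domain guess after each incoming token,
    so a hypothesis exists before the utterance is finished."""
    scores = {domain: 0 for domain in DOMAIN_KEYWORDS}
    for token in token_stream:
        word = token.lower().strip("?,.")
        for domain, keywords in DOMAIN_KEYWORDS.items():
            if word in keywords:
                scores[domain] += 1
        best = max(scores, key=scores.get)
        yield best if scores[best] > 0 else None

# The domain is known as soon as a telling word arrives,
# not only at the end of the sentence:
guesses = list(classify_incrementally("what is the weather tomorrow".split()))
```

In this toy version, the guess for “what is the weather tomorrow” is already “weather” by the fourth word; the remaining speech only refines the hypothesis, which is the intuition behind processing speech as it is spoken.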

We offer a platform that people can use and build off of, because we know that if we provide the tools, you can accelerate development. In the early days, we were working primarily with large partners. But over time, we’ll be continuing to make the Houndify platform more self-service and self-sufficient, and a lot of the tools that are currently being delivered through these large partnerships will become increasingly available to independent developers.

And finally, we have a full stack solution—ASR, wake word technology, text-to-speech technology, and hundreds of domains. We can be that one-stop-shop, but not in a way that limits us. If partners want to work with elements of their own NLU and rely on our ASR, our speech recognition, that’s okay too. Because we recognize that there are going to be a variety of ways in which companies are going to want to build their own voice experiences. 

The role of content domains for voice assistants

Bret: One of the other things that I think sets SoundHound apart as a product is that you provide domain content. So maybe you could just talk about how that makes you a little different.

Mike: As a query gets made, our system will map the query to the appropriate domain automatically and we have content partnerships across multiple categories to help with that. That’s been really valuable because as we sign up with new partners who might have a narrow use case, they may also want to have access to a generalized assistant. 

Some examples: We recently announced partnerships with iHeartRadio and RADIO.com/Entercom, so we now have the vast majority of radio stations available. And we voice-enabled the Big Oven recipe domain. The more domain and content partners we sign on, the more domains and data sources our Houndify solutions partners have on hand.

We have private domains and public domains. So for something in-car, like the Mercedes voice assistant MBUX, if you have a question that is about the Mercedes car manual, that’s a custom private domain for that car model specifically. 
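The distinction between private and public domains can be sketched as a simple routing rule: check a client’s private content first, fall back to shared public domains, and finally to a general assistant. The class, matchers, and domain names below are hypothetical illustrations, not Houndify’s API.

```python
class DomainRouter:
    """Toy router: private domains take priority over public ones."""

    def __init__(self, public_domains, private_domains=None):
        self.public = public_domains            # shared by all clients
        self.private = private_domains or {}    # client-specific content

    def route(self, query):
        q = query.lower()
        # Private domains win so brand-specific content (e.g. a car
        # manual) answers before a generic public domain can.
        for name, matcher in self.private.items():
            if matcher(q):
                return name
        for name, matcher in self.public.items():
            if matcher(q):
                return name
        return "general_qa"  # fall back to a generalized assistant

router = DomainRouter(
    public_domains={"weather": lambda q: "weather" in q},
    private_domains={"car_manual": lambda q: "tire pressure" in q},
)
```

With this ordering, an in-car assistant can answer “How do I check tire pressure?” from its private manual domain while still passing “What’s the weather today?” to the shared public domain.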


We recently announced our partnership with Snapchat. They were really savvy about how they integrated the voice experience. The way it works is when you’re in the lenses section, you pop open the camera and all the various filters and lenses are in there. That’s when you say, “Hey, Snapchat…” and by only making the wake word available in that context, it helps focus the experience so that people aren’t inclined to ask general-purpose questions.

The success of Pandora’s Voice Mode

Bret: So Snapchat’s really interesting. Let’s contrast that with Pandora. 

Mike: Everything that Pandora does is voice-enabled. Pandora’s origins are radio, so the difference is you tell Pandora a song that you like and it uses the proprietary Music Genome Project technology to build a radio station based on that song and your preferences.

It tends to lend itself to things like, “I want to listen to this” or “I want to listen to that.” Then they extended it to moods. “Play me something relaxing” is a way to express a desire for music that doesn’t lend itself as much to a visual interface, because you’re not going to take every single mood and put it in a menu. You can just express yourself and the music plays accordingly. It’s a very powerful extension of music as an emotional experience.

But then it’s got the core stuff. You can say, “Thumbs up the song” and that particular station will play more songs like the one you just gave a thumbs up to.

Embedded, edge, and cloud connectivity

Bret: Do you expect to have a growing business around embedded only? 

Mike: We do have an embedded version of our technology called Houndify Edge.

Embedded allows you to get a voice response for a product, even if it doesn’t have an internet connection. We also have an optional cloud connection for a variety of things.

We’re starting to see a lot of interest in embedded voice control with an optional cloud connection. You get the benefit of immediacy for a narrow, command-and-control use case where the data stays on the device—and privacy is built in.

We do have interest and momentum in that space, and we expect most devices to become hybrid: some inherent functionality on the device, with additional functionality in the cloud.

As speech platforms have become more optimized and lighter weight, you don’t have to rely exclusively on the cloud to strike the right balance. 
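The hybrid model described above can be sketched as a simple dispatch rule: handle a small command-and-control vocabulary on-device, and fall back to the cloud only for open-ended queries when a connection exists. The command phrases, action names, and error string below are hypothetical, not Houndify Edge’s actual interface.

```python
# Illustrative on-device vocabulary; real embedded recognizers use
# compact acoustic/language models, not substring matching.
ON_DEVICE_COMMANDS = {
    "turn on": "device.power_on",
    "turn off": "device.power_off",
    "volume up": "device.volume_up",
}

def handle_query(query, cloud_available):
    """Return (where_handled, action) for a spoken query."""
    q = query.lower()
    for phrase, action in ON_DEVICE_COMMANDS.items():
        if phrase in q:
            # Narrow use case: fast, private, works offline.
            return ("on_device", action)
    if cloud_available:
        # Open-ended questions need the full cloud NLU stack.
        return ("cloud", "forward to full NLU service")
    return ("on_device", "error.unsupported_offline")
```

The design point is that the device stays useful without connectivity for its core commands, while the cloud extends it to general questions when available.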

Bret: When you think about appliances—people have these in the home for a long time—they tend to have very simple interfaces by design. Where do you see that headed?

Mike: There’s the master-servant approach, which is you have a smart speaker in every room and every device is subservient to that speaker—a hub model. A lot of smart homes are headed that way. So you can program Alexa to enable this or Google to enable that, and users can control their lights through the speaker. It was easier for products that are already internet-enabled to simply take advantage of an API on a local network. But there are some devices that are going to want to have their own interfaces—mainly because they can control a lot more, customize a lot more, and maintain their brand. 

Embedded just allows companies to lower costs so that not everything has to be an internet-connected device and it gives them more control over the experience. So we see a future where every device will have a voice interface for people who want to talk directly to it. 

Bret: Yeah, that makes sense. So one of the areas that historically has been completely embedded, but is now increasingly cloud and hybrid-embedded, is automotive. How do you look at automotive differently from some of these other surfaces we’ve just talked about, and where do you think that’s headed?

Mike: In the next five years you’re going to see a real distinction between companies who were legacy versus the ones who recognize the shift. They’re building these new IoT products that happen to be cars. Traditional automotive companies need to catch up, get themselves connected, and have software updates. Car companies are really transforming into technology companies and recognize that voice platforms are part and parcel of it. 

Best practices for developing a VUI

Bret: Share with our listeners some of the best practices or some of the recommendations you have for how to succeed. 

Mike: The first thing is to have a strategy and to carve out the time with the right people in the organization to say, what is our point of view on voice interfaces? Not having a strategy is to guarantee being left behind.

Then the question is, through what platform do you want to deliver? If you want to become purely a content provider, then Amazon is leading the pack with skills development, Google is certainly present, and we have partnerships on our platform through Houndify.

If you have a mobile app, is that the right platform for you to reach people? Or if you have a product, how are you going to move forward on it? 

Start to look at how you connect with customers and deliver value across channels to determine what your first step needs to be. 

Bret: And what about choosing the tech stack? What are the key things that I need to be thinking about today if I’m going to deploy my own assistant? 

Mike: It’s either a build-it-yourself or a build-with-a-partner strategy. Building it yourself means a lot of investment in NLU and engineering support: do I want to build my own domain using natural language understanding? And then there’s the partner strategy: what can’t I do, or what don’t I want to do, that would benefit from a partnership like one with SoundHound?

It could be for parts of the experience—like ASR or wake word—or it can be the whole experience. And obviously the benefit of the whole experience is faster time-to-market and a complete end-to-end tech stack that will vault you into a competitive position much more quickly than if you were building it yourself. 

You know, our CEO Keyvan likes to say, “It often takes companies three years to realize it’s going to take 10 years to build their own voice platform.”

Challenges of building an independent voice assistant

Bret: Obviously, you’ve had some people come to you who first tried to build their own assistant. What is it that they’re most likely to struggle with?

Mike: I think it’s really a question of expanding the use cases and providing the robustness that only a very deep, natural language model can actually support. But I think the motivation is really about controlling the end-user experience and having access to the data.


If you can work with a partner that ensures you have that ultimate control, you’re not reinventing the wheel, and you maintain control over the user experience without having to over-invest in something that has already been solved.

Partnerships are central to any successful implementation. It doesn’t matter how good your technology is or how fast you are, if you can’t work together, it’s dead before it starts—so find the right partner.

More on the latest in voice AI

During the podcast, Bret referred to a popular episode of his podcast in which he had interviewed SoundHound Inc. CEO Keyvan Mohajer (episode 41).

And more recently, Bret joined us as part of a series of interviews for our video guide, Road to Recovery: Fueling the Next Wave of Innovation, featuring timely advice for brands on how to navigate the current crisis. See what Bret had to say about “How Voice AI Can Help Brands Respond to New Consumer Priorities” and “How the Pandemic is Driving the Shift from Touch to Talk.”


Developers interested in exploring Houndify’s independent voice AI platform can visit Houndify.com to register for a free account. Want to learn more about our company or our solutions and technology? Talk to us about how we can bring your voice strategy to life.

Karen Scates is a storyteller with a passion for helping others through content. Argentine tango, good books and great wine round out Karen’s interests.
