HERE Directions virtual event
Nov 09, 2021

How Voice AI Evolved into the Superior User Experience It is Today

Voice assistants may be an efficient, fast, and convenient way of interacting with a variety of devices today, but they didn’t always have these capabilities. Early attempts at voice assistants were the subject of endless jokes about their poor accuracy and understanding. So how did voice AI become the conversational, intelligent technology it is today?

While most modern voice assistants began to be released around 2011, attempts at voice recognition technology actually go back much further, with IBM releasing the first digital speech recognition tool in 1961. Since then, voice AI technology has evolved to sound natural, understand context, access thousands of domains, and voice-enable a variety of devices. According to Adobe, 94% of consumers believe voice technology even improves quality of life.

Voice AI has become an integral part of how consumers interact with technology and go about their daily lives, but it took many steps to achieve such a technological evolution. As part of the HERE Directions Event, SoundHound’s COO, Mike Zagorsek, spoke with HERE Technologies about the evolution, successes, and future of voice assistants. 

Below is a summary of their discussion. View the entire HERE Directions video here, which is available until November 21st, and learn more about the HERE Directions event here.

How voice assistants began

The first versions of voice assistants all ran on local machines, whether small personal computers or larger systems with servers. Then, two innovations launched the technology into the realm of voice assistants: speech recognition and the evolution of voice as a way to interact with devices.

One important innovation was speech recognition, where what the user says is captured as text. This is also called ASR (Automatic Speech Recognition) or speech-to-text. Its counterpart is text-to-speech, in which text that shows up on a screen is read out loud. A world of possibilities opened up, predominantly with dictation, along with the need to develop accuracy models.

However, these early attempts were all manual, labor-intensive, and error-prone. The accuracy wasn’t particularly strong, but it was a breakthrough because, suddenly, voice became an input device into the computer. 

In the nineties, we started to see phone systems using IVR (Interactive Voice Response). Other dictation tools and related products were released into the early 2000s, until we started to get the mobile and cloud-based voice recognition systems we’re seeing today.

Breakthroughs in voice AI technology 

After these attempts at automatic speech recognition and IVR, three key breakthroughs propelled the technology into the voice AI we know today:

  1. Cloud processing
  2. Natural language understanding
  3. Machine learning

With the shift to cloud processing, voice assistants no longer required the local device to process the speech and provide responses. This evolution greatly reduced the computing power needed on the device itself, which made it practical to voice-enable small connected (IoT) devices.

Then, with the addition of natural language understanding, the voice assistant no longer simply transcribes speech into text but also tries to understand what the user is saying and provide an accurate response. If it’s cloud-based, users can also get a response in real time, based on data.
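To make the difference between transcription and understanding concrete, here is a minimal sketch in Python. It assumes the speech has already been transcribed, and it uses two hand-written patterns (setting a timer and playing music) to turn the transcript into an intent plus its details. The names `Interpretation`, `PATTERNS`, and `understand` are hypothetical and chosen for illustration only; real natural language understanding systems rely on learned models rather than rules like these.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    """What the assistant thinks the user wants, beyond the raw words."""
    intent: str
    slots: dict = field(default_factory=dict)


# Hypothetical hand-written patterns, for illustration only.
PATTERNS = [
    ("set_timer", re.compile(
        r"\bset (?:a )?timer for (?P<amount>\d+) (?P<unit>second|minute|hour)s?\b")),
    ("play_music", re.compile(r"\bplay (?P<query>.+)")),
]


def understand(transcript: str) -> Interpretation:
    """Map a transcript to an intent and its details (slots)."""
    text = transcript.lower().strip()
    for intent, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return Interpretation(intent, match.groupdict())
    # Nothing matched: the words were transcribed but not understood.
    return Interpretation("unknown")


if __name__ == "__main__":
    print(understand("Set a timer for 10 minutes"))
    print(understand("Play some jazz"))
```

The transcript alone is just a string; the `Interpretation` is what lets the assistant act on the request, for example by starting a ten-minute timer.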

The third breakthrough is machine learning. It is especially powerful for speech recognition because machine learning-based algorithms greatly facilitate the pattern matching needed to determine the correct interpretation of what a user is saying.

If the user says, “Ice cream” or “I scream,” understanding how the phrase fits in the context of the surrounding words is essential so the device can transcribe it correctly. This process benefits from accumulating large amounts of data, which improves pattern recognition and leads to more effective transcription.
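The “ice cream” versus “I scream” example can be sketched with a toy word-pair (bigram) model in Python. This is only an illustration of how surrounding words help choose between two transcripts that sound alike; the five-sentence corpus and the function names are made up for this sketch, and production systems learn from far more data with far more capable models.

```python
import math
from collections import Counter

# Tiny made-up corpus standing in for the large amounts of text a real system learns from.
CORPUS = [
    "i want some ice cream",
    "she bought ice cream for dessert",
    "we eat ice cream in summer",
    "i scream when i see a spider",
    "i scream at scary movies",
]


def train_bigrams(sentences):
    """Count single words and adjacent word pairs, with a sentence-start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_score(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-score of a candidate transcript."""
    words = ["<s>"] + sentence.split()
    vocab_size = len(unigrams)
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def pick_transcript(candidates, unigrams, bigrams):
    """Return the candidate whose word sequence the model finds most likely."""
    return max(candidates, key=lambda c: log_score(c, unigrams, bigrams))


UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)

if __name__ == "__main__":
    print(pick_transcript(
        ["i want some ice cream", "i want some i scream"], UNIGRAMS, BIGRAMS))
    print(pick_transcript(
        ["ice cream at scary movies", "i scream at scary movies"], UNIGRAMS, BIGRAMS))
```

With this corpus, the first call prefers “ice cream” and the second prefers “I scream”, because each choice matches word pairs the model has seen before. More data gives the model more pairs to draw on, which is why data accumulation matters.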

Voice technology becomes the norm

Users have interacted with technology in a variety of ways and modalities over the years. At first, there was an interface with the keyboard, which evolved into the use of thumbs for typing on mobile phones. Voice is a novel way to interact with technology. 

It’s a new channel for creating things, and humans have a way of always trying to move forward, not backward. Voice has now unlocked people’s imagination and potential. Users understand that there are things that they can do with voice that they couldn’t do before. Brands also are beginning to ponder what the limit is to voice AI, and it really comes down to comfort, repeatability, and predictability.

People are generally willing to try something once or twice. If they have success, they will adopt it. If it doesn’t work, they’ll shy away from it. Daily, popular use cases of voice assistants, such as playing music or setting timers, are here to stay. For a lot of customers, that’s an incremental step: they’re now using voice to do the things they used to do with the touch screen. It’s also a category shift because they’re now in an entirely new arena, and it’s up to brands to keep evolving and broadening the range of use cases.

The next generation of voice AI solutions

As we enter the voice-first era and with more companies investing in voice AI technology, brands have to consider how to differentiate their voice assistant and stay ahead of the competition. 

Here are four categories of innovation that the voice AI industry is moving toward:

  • Improving accuracy 
  • Developing emotion detection
  • Expanding multilingual and accented language capabilities
  • Delving into monetization

The continued effort to filter background noise and understand people in noisy environments is essential for voice assistants. Noise disrupts the speech patterns that are being picked up by the microphone. The ability to remove that noise opens the door to interacting with the voice assistant in a variety of environments, such as in cars, on the street, or in other areas with a lot of background noise. It is also a solution for handling multiple speakers. If three or four people are speaking, the voice assistant needs to make sure that it’s only picking up the voice of one speaker through speaker identification.

Then, there are environmental innovations that are constantly being improved upon and explored, such as emotion detection. When a user is speaking, the tone of voice might indicate the user’s state of mind. If voice assistants can pick up the emotional intent of the user, then the voice AI can respond and adjust for it. If the user is angry, the voice AI can become calmer. This would be especially useful for situations where customer service is essential, such as call centers and kiosks in quick-service restaurants (QSRs) or hospitality.

In addition, when looking at speech technology providers, it’s important to ask how many languages they provide and whether they have the underlying technology to support multiple languages and accents. Regional accents can be almost like an entirely different language. A Canadian speaking French is going to sound different from an American speaking French. Multiple languages and accents are two issues that voice AI platforms need to constantly address and evolve for.

Finally, there is commerce and monetization, and the ability to combine the two to give users opportunities to discover the right solutions in a way that is mutually beneficial. Through non-intrusive voice ads, brands can prove a return on their voice investments, and users can find real value through the right suggestions.

Interested in learning more about the history and evolution of voice AI? Watch the entire HERE Directions video here until November 21st and learn more about the HERE Directions event here.

At SoundHound Inc., we have all the tools and expertise needed to create custom voice assistants and a consistent brand voice. Explore our independent voice AI platform and register for a free account. Want to learn more? Talk to us about how we can help bring your voice strategy to life.
