Automatic Speech Recognition
Apr 28, 2021
7 MIN READ

How Automatic Speech Recognition Works to Improve Transcription Accuracy

Accurate transcription can be the difference between, “The meeting has been canceled and is scheduled for next Thursday,” and “The meet in today has been capital an is skidoo for next Thursday.” While you may get the gist of what has been said, the accuracy has been lost.

In the healthcare, legal, finance, and even academic and media industries, clear and concise communication is paramount. In other instances, slight mistakes in transcription don’t mean life or death, but they can be annoying. After all, if you have to refer to the recorded content to understand what’s been said, the convenience of a transcript is significantly lessened.

The frustrations and miscommunications caused by imprecise transcriptions have long been tolerated due to the lack of any real alternatives—or the perception that there are no other options. Yet, advances in voice AI technology, including acoustic and language modeling, make highly accurate speech-to-text transcription possible today.

If you’re still working with legacy speech-to-text transcription platforms, you may not be aware of the advancements in Automated Speech Recognition (ASR) technology that could turn your “just ok” transcriptions into error-free and accurate documentation.

Word recognition accuracy

Accuracy at the individual word level has greatly improved in the last few years through machine learning technology and the integration of Natural Language Understanding (NLU) components. Neural network-based ASR can now transcribe complex speech with greater precision.
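Word-level accuracy is commonly quantified as Word Error Rate (WER): the number of word substitutions, insertions, and deletions needed to turn the transcript into the reference, divided by the reference length. A minimal, self-contained sketch (the example reuses the garbled transcript from the introduction):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

ref = "the meeting has been canceled and is scheduled for next thursday"
hyp = "the meet in today has been capital an is skidoo for next thursday"
print(round(wer(ref, hyp), 2))  # 6 errors over 11 reference words ≈ 0.55
```

Lower is better; production benchmarks typically also normalize case and punctuation before scoring.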

The advancements in ASR technology are further enhanced by the ability to include custom vocabulary. That means that the words and phrases relevant to your industry or individual business can be included, allowing users to speak just as they would to another person in the same environment—using the terms and lexicon unique to your industry or company.

Machine learning infrastructures in ASR engines can support language libraries with millions of words, including specific vocabulary, acronyms, and proper nouns. These neural network-based speech recognition models increase word accuracy as more specific training data is collected and added—creating language understanding for specific contexts.
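One common way custom vocabulary is applied is rescoring (often called shallow fusion): candidate transcripts from the acoustic model are re-ranked with a score bonus for in-domain terms, so domain words win close calls. The sketch below is purely illustrative; the term list, scores, and `rescore` helper are hypothetical, not any particular vendor's API:

```python
# Custom-vocabulary terms and their log-score boosts (illustrative values).
CUSTOM_VOCAB = {"metoprolol": 2.0, "stat": 1.0}

def rescore(hypotheses, vocab=CUSTOM_VOCAB):
    """hypotheses: list of (text, acoustic_score) pairs.
    Returns the best transcript after adding vocabulary bonuses."""
    def biased(item):
        text, score = item
        words = text.split()
        return score + sum(boost for term, boost in vocab.items() if term in words)
    return max(hypotheses, key=biased)[0]

candidates = [
    ("take meta pro lol twice daily", -4.1),  # slightly better acoustic score
    ("take metoprolol twice daily", -4.3),    # contains a custom-vocabulary term
]
print(rescore(candidates))  # the in-domain hypothesis wins
```

The same idea generalizes to acronyms and proper nouns: the engine does not need to retrain from scratch to prefer them, it only needs their scores nudged.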

For instance, the ability to understand medical terms and prescription details is critical to the healthcare market. Precise clinical documentation and transcription are essential to delivering unambiguous patient care instructions and accurate electronic health records. Beyond the medical industry, financial transactions, legal proceedings, and even construction reports require precise terms, values, and measurements to be recorded accurately.

For these industries, and more, advanced ASR engines have evolved to include not just custom vocabulary, but custom pronunciations and automatic punctuation. While this may seem unnecessary, knowing where one statement ends and the next begins can change the meaning of what is being said. The automatic inclusion of periods, commas, and question marks in the transcript increases the accuracy and comprehensibility of the written word.

Noise filtering and speaker identification

Rarely are speech-to-text transcriptions recorded in an environment without other sounds. In most real-world scenarios, there are multiple speakers, ambient noises, interruptions, and echoes caused by sound bouncing off the objects and walls in the room—creating unique challenges for ASR technology.

These challenges are being met through a variety of technical advances and design best practices. By augmenting the training data with the unique characteristics of the user’s environment (such as the whirring of medical devices, background office noise, or other people talking), ASR models are better able to isolate the correct sounds to record and transcribe.
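A typical augmentation step mixes recorded background noise into clean training audio at a controlled signal-to-noise ratio (SNR). A minimal sketch, using plain Python lists in place of real audio buffers:

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that adding it to `clean` yields the target
    signal-to-noise ratio in decibels, then return the noisy samples."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]  # 1 s tone
noise = [random.uniform(-1.0, 1.0) for _ in range(16000)]                # white noise
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Training on many such noisy copies, at varying SNRs and with noise recorded from the target environment, is what teaches the model to ignore the background.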

Speaker identification tools and speaker verification features determine who is speaking and verify if the speaker has been granted the proper permissions to add to the record. These safeguards ensure that sensitive records are kept secure and include accurate attribution.
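Speaker verification is typically done by comparing a fixed-length "voiceprint" embedding of the current speaker against an enrolled one, and accepting the match above a similarity threshold. The embeddings and threshold below are toy values; real systems derive embeddings from a trained speaker-encoder network, often with hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verify(enrolled_embedding, test_embedding, threshold=0.75):
    """Accept the speaker if their voiceprint is close enough to the enrolled one."""
    return cosine(enrolled_embedding, test_embedding) >= threshold

enrolled = [0.9, 0.1, 0.4, 0.2]        # toy enrolled voiceprint
same_speaker = [0.85, 0.15, 0.38, 0.25]
impostor = [0.1, 0.9, 0.2, 0.7]
print(verify(enrolled, same_speaker), verify(enrolled, impostor))
```

Identification (who is speaking) works the same way, comparing against every enrolled voiceprint and taking the best match.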

Using legacy technology, transcriptions of meetings where multiple people are present and talking—often interrupting each other—frequently attribute the end of one speaker’s sentence to the speaker of the next statement. In any setting, including medical, education, legal, or business, transcription errors lead to misattributions that require reviewing the original audio files, wasting time and resources.

Data augmentation, far-field recognition, noisy environment filtering, and speaker ID create an ASR engine with a level of precision not previously possible. The improved accuracy of speech-to-text transcription is opening up more areas where ASR can create greater efficiencies. 

Advanced ASR use cases

Real-time transcription and the ability to process ASR and NLU simultaneously have widened the application of ASR to more use cases in more industries. Conversational AI now tolerates imprecise speech, including filler words like “uh” or “um,” without losing the meaning.

This means that people using the technology don’t have to learn how to speak to the device. Instead, the device understands not only the words being said, but the context in which they are spoken. Customizable solutions further enhance the usability of these systems.

Contact centers, industrial settings, financial services, media creators, healthcare, and businesses each have unique problems and challenges. A customized voice solution equipped with training data tailored to the unique environment, common user scenarios, specialized vocabularies, accents, and multiple languages opens the door to endless possibilities.

Here are a few possible use cases for advanced ASR in a variety of industries:

Contact centers

Popular use cases for advanced ASR technology include customer contact centers, where voice AI can monitor conversations, provide instant transcriptions, and run analysis on caller sentiment and satisfaction. In this environment, voice AI can also be employed as a virtual assistant that increases efficiency by resolving calls that don’t require the creativity of a human agent.

Industrial and logistics

The implementation of voice AI is already growing in the industrial and logistics space. Warehouse management tasks, including inventory logging and voice picking, can be done faster and more accurately using a voice interface. Inventory picks, pulls, and record updates can be performed simultaneously, giving businesses the most up-to-date inventory records possible. In addition, ASR technology can be embedded in walkie-talkies and elevators to allow for hands-free operation.

Financial services

Already, leading financial services companies like Bank of America are implementing voice assistant strategies for their customer interfaces. “Erica” performs account servicing, voice payment, and account processing through a voice-enabled mobile app and in the customer service center. The success and popularity of the voice assistant is creating a surge in voice-first strategies for other banking institutions.

In the banking and finance sectors, ASR solutions can also be used for legal transcription, contract term recording, and internal business operations—including internal and conferencing communication and transcription. External-facing devices, such as ATMs and banking kiosks, can be voice-enabled to provide greater functionality and contactless, more hygienic experiences for customers.

Journalism and media

Voice AI can be used for both active and passive transcriptions in the journalism and media industries. Whether a journalist is recording an interview to ensure accuracy in reporting or a video is being equipped with captions, the same technology can be used.

For the journalist, listening and comprehending while quickly taking notes can be challenging—especially when an inaccurate quote can lead to legal action. The ability to focus on the interview and receive a real-time transcription means reporters can write the story as soon as the call has ended.

Multimedia journalists can meet FCC captioning requirements for video content using the automatic transcription features of ASR technology. In addition, the easy inclusion of captions allows faster video production in an environment where deadlines are always looming.

Healthcare

The healthcare industry is one of the most obvious applications for voice-enabled transcription services. Offering doctors and practitioners a way to record information, while maintaining eye contact with patients, provides a level of connection not possible when the physician is staring at a computer screen and typing. The growing popularity of virtual healthcare has spurred the need for accurate records of patient visits, doctor instructions, and prescription ordering.

In other healthcare settings, radiology devices and system voice control allow the technician to remain a safe distance from the device while decreasing the amount of time a patient must remain still or posed in uncomfortable positions. 

With better speech recognition technology, eldercare facilities and home-based healthcare can provide older adults with device control and voice user interfaces built to understand them. Perhaps more importantly, customized voice solutions can be designed to respond with a rate of speech and tone that’s easily understandable—elevating elder experiences and giving them increased autonomy.

Business operations

Large group meetings, video conferences, interviews, research groups, and training materials are made more efficient through the convenience and responsiveness of a voice assistant. Long-form transcription provides meeting logs that accurately record not just what was said, but who said it. Higher sentence accuracy through advanced machine learning models takes the guesswork out of transcription materials and provides a valuable record that can be referred to later.
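Combining transcription with speaker identification is what turns raw text into a usable meeting log. A minimal sketch of the formatting step, assuming diarized segments of (start time in seconds, speaker, text) are already available:

```python
def format_meeting_log(segments):
    """segments: list of (start_seconds, speaker, text), sorted by time.
    Consecutive segments from the same speaker are merged into one entry."""
    entries = []
    for start, speaker, text in segments:
        if entries and entries[-1][1] == speaker:
            # Same speaker kept talking: extend the previous entry.
            entries[-1] = (entries[-1][0], speaker, entries[-1][2] + " " + text)
        else:
            entries.append((start, speaker, text))
    return "\n".join(f"[{s // 60:02d}:{s % 60:02d}] {sp}: {t}" for s, sp, t in entries)

log = format_meeting_log([
    (0, "Alice", "Let's review the quarterly numbers."),
    (6, "Alice", "Starting with revenue."),
    (12, "Bob", "Revenue is up eight percent."),
])
print(log)
```

The attribution step is exactly where legacy systems fail in overlapping speech: if the diarizer assigns a sentence ending to the wrong speaker, the merged log above inherits the error.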

Additionally, precise and real-time speech-to-text transcriptions democratize the workplace and increase accessibility to important information for employees who are differently abled. Hiring practices that include diversity and inclusion strategies are further expanded through easier access to more aspects of business operations. 

Multiple languages and accented language accuracy

Regardless of where your business headquarters is located, chances are you have employees, guests, customers, and patients for whom English is not a first language. Providing the same level of convenience and efficiency to non-English speakers as to those who are fluent shows a level of understanding and care for your communities.

When choosing a voice AI solution for your ASR or full-voice assistant needs, look for providers with a growing library of languages. Once a voice AI platform has an established library of languages, the available language data can be used to train highly accurate models for new languages much faster than when only one or two languages are present.

In addition, acoustic models exposed to training data from a wide range of speakers, both native and second-language, deliver greater accuracy for those with accents or regional speech variations. Greater accuracy is achieved when voice models include training data gathered from distinct regions and large populations with known variations. Benchmarking those accented language differences ensures that future improvements in accuracy can be measured.

When advanced ASR technology is coupled with NLU components, the possibilities are endless. There are many ways to leverage ASR technology with solutions ranging from stand-alone ASR integrations to a complete voice AI strategy and omnichannel presence.

Companies and organizations can choose from a range of connectivity options: solely embedded with no cloud connectivity, a hybrid approach that includes both embedded and cloud capabilities, or a cloud-only solution. Whatever the use case, you can achieve your goals with an advanced ASR solution.

At SoundHound Inc., we have all the tools and expertise needed to create custom voice assistants and a consistent brand voice. Explore Houndify’s independent voice AI platform at Houndify.com and register for a free account. Want to learn more? Talk to us about how we can help bring your voice strategy to life.

Karen Scates is a storyteller with a passion for helping others through content. Argentine tango, good books and great wine round out Karen’s interests.
