
How Tech Found its Voice

“Alexa, how were you made?” With AI voice technology now part of daily life, we investigate its design, development and growing dominance.

Whether or not you realize it, chances are you've recently had a conversation with a robot. Maybe it was apparent – Alexa giving the morning weather report or Siri helping with a shopping list – but perhaps it was more subtle, a suspiciously monotone customer service representative or an eerily predictive text message. Since Apple’s introduction of Siri in 2011, AI voice technology has exploded in popularity, with voice-enabled device sales reaching over $1 billion in 2018 and one in six Americans now possessing a “smart speaker” like Google Home or the Amazon Echo.

Researchers predict that more than 100 million smartphone users will access voice assistants by 2020, and this nascent technology is already enabling companies to streamline operations, cull invaluable customer data, and provide a perpetually friendly voice at the end of the line. But how are AI voices crafted and implemented? Who is utilizing this evolving technology – and for what? Will AI voices soon be indistinguishable from real human intonations – and is this something to be celebrated or feared?

With the rise and proliferation of deepfakes, synthetic images and audio that use artificial intelligence to create new, entirely too-lifelike likenesses of real people, there has been a wave of moral, industry-wide hand-wringing. But the desire for AI voice advancements persists. According to a new forecast from UK-based analytics and tech consulting firm Juniper Research, there will be eight billion digital voice assistants in use by 2023, landing in industries from healthcare to banking. (Bank of America recently unveiled its Erica chatbot, which uses human-backed research to provide AI-powered financial assistance.) And the software behind this AI is rapidly advancing – between 2013 and 2017, Google's word accuracy rate rose from 80% to an impressive 95%, meaning the days of embarrassing Alexa slip-ups may soon be in the rearview.

"The number of voice-based applications is on the rise, but why is voice-activated technology quickly replacing fingertips? It's convenient, and convenience is king," said Ahmed Shafei, a business and technology leader focusing on security, voice UI, AI, and the Internet of Things (IoT). "Technology is meant to save us time and make us more productive, and that's exactly what voice offers. It feeds our love and addiction for speed and removes the friction between users and devices." Shafei explained that artificial intelligence, neuro-linguistic programming (NLP), machine learning (ML) and deep learning capabilities are all changing the way businesses and entire industries work – and we can only expect the evolution continue.

Amazon's Echo Plus integrates the company's AI voice assistant Alexa

The first AI system to successfully recognize the human voice was "Audrey," a machine created by Bell Labs in 1952 capable of understanding spoken numbers. This breakthrough led to two decades of innovation that included Shoebox, a machine able to process up to 16 spoken words, and an Automatic Call Identification system that let engineers talk to a machine and receive spoken responses (both created by IBM). These modest successes paved the way for modern AI speech-related technologies, with IBM's Watson being the most visible heir. Speech recognition software works by absorbing and analyzing sounds, filtering and digitizing human words into a format the machine can "read" and understand. Based on algorithms and previous input, the software can then make a highly accurate predictive language guess.
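
At a high level, that capture-digitize-recognize pipeline can be sketched in a few lines of code. The example below is a minimal illustration only, assuming the open-source Python SpeechRecognition library and a placeholder audio file; it is not the software behind Watson, Alexa or any other product mentioned here.

```python
# A minimal sketch of the capture -> digitize -> recognize pipeline described above,
# using the open-source SpeechRecognition library (pip install SpeechRecognition).
# The audio file name is a placeholder.
import speech_recognition as sr

recognizer = sr.Recognizer()

# Load the recording and digitize it into a format the recognizer can "read"
with sr.AudioFile("morning_question.wav") as source:
    audio = recognizer.record(source)

try:
    # The engine compares the audio against its trained acoustic and language
    # models and returns its most probable transcription
    text = recognizer.recognize_google(audio)
    print("Best guess:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible to the recognizer")
except sr.RequestError as err:
    print("Recognition service unavailable:", err)
```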

But there are limits – if the speech recognition software is only used by one person, it will be trained specifically on that person's "phonemes", the distinct parts of sound in a given language that distinguish one word from another. This is because AI systems work similarly to the brain. Newborns, first hearing their parents speak, are unable to comprehend and reply, yet still absorb verbal cues and pronunciations. This human "input", once absorbed by the brain, eventually forms patterns and connections – and ultimately words and language comprehension skills. Speech recognition technology works similarly, but is the product of countless hours of data, research, and innovation. And new technologies such as cloud-based processing have only improved AI's ability to absorb and understand a larger variety of words, languages, and accents.
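
To make the idea of patterns formed from repeated input concrete, here is a deliberately tiny, hypothetical sketch: a toy next-word predictor that counts which words have followed which in previously "heard" phrases, then makes its best predictive guess. Real systems rely on vastly larger models trained on far more data, but the underlying intuition is similar.

```python
# A toy illustration (not any vendor's actual system) of how repeated "input"
# builds patterns that support a predictive language guess.
from collections import Counter, defaultdict

# Hypothetical transcripts the system has "heard" before
heard = [
    "play the morning weather report",
    "play the morning news",
    "play the weather report",
]

# Count which word tends to follow which: the learned "patterns"
patterns = defaultdict(Counter)
for sentence in heard:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        patterns[current][following] += 1

def predict_next(word: str) -> str:
    """Guess the most likely next word based on previously absorbed input."""
    if word not in patterns:
        return "<unknown>"
    return patterns[word].most_common(1)[0][0]

print(predict_next("morning"))  # -> "weather", the most common continuation
```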

This technology is now making its way into the classroom. At ISTE 2018, the world's largest edtech conference, Frontline Education demoed a program for educators that utilizes Amazon Echo to provide "on-demand" real-time data, allowing for remote teaching and learning opportunities. At Northeastern University, after a successful pilot, students are able to use Alexa to access resource materials and data instantly.

Voice-command technology may also soon help students with learning or physical disabilities who are unable to write or type by enabling speech dictation, while students with dyslexia could use voice-enabled tools to assist with reading and spelling. There is also the potential for ESL teachers to integrate voice technology into the curriculum to improve students' grammar skills, collecting real-time educational data on each student's unique language needs along the way.

Similar software is already being utilized by Great Britain's National Health Service (NHS), which recently collaborated with Amazon to empower blind, elderly, and disabled patients to more easily access auditory information through Alexa. It's hoped this will free up resources within the NHS, as Amazon's algorithm enables patients to access information on symptoms, treatment options, and available resources. (Unfortunately, the NHS cannot give patients free Amazon Echos, but those interested can access the program through an app.)

Alexa is integrated into BMW vehicles to assist drivers

Realistically, most AI voice technology will be developed and utilized by private companies. Potentially a trillion-dollar business, voice technology has already been used by call centers to increase volume and conversational capabilities, with companies like Salesforce investing in tools such as its Einstein Voice Assistant, which is able to automatically process data on upload. Even burger giant McDonald's has announced plans to utilize AI to create interactive menus.

It is estimated that by 2020, 50% of all searches will be voice searches, with many businesses already integrating their products with AI, including BMW, which received positive nods from tech enthusiasts for successfully integrating Alexa into its in-car systems. While it's anticipated this type of automation will lead to widespread layoffs, it's also predicted that, as consumers become increasingly acquainted with voice technology, there will be a greater need for designers with expertise in voice interface design and voice app development.

But there is still one challenge slowing down the rate of AI voice automation: software's ability to comprehend regional accents and intonation. Recognizing the stumbling block, the BBC is launching a rival to Amazon's Alexa called Beeb to decipher British accents (many US-developed products struggle to understand strong regional accents). To create the language database, the BBC asked staff throughout the UK to record their voices to help train the software. It's ultimately hoped Beeb will allow viewers, no matter how thick their accent, to rapidly search through programs and other online services.

"When people are more comfortable with a certain voice, they feel more encouraged to talk naturally," said a former AI contractor for Google, explaining the drive to create a more natural-sounding AI. "If a wide range of people use the device, the program will get better at recognizing a wide variety of voices, and can react more accurately." This has put pressure on AI developers to seek out a wider range of voice researchers, across regions, races, and socio-economic spectrums. "The lack of representation in all stages of training AI means a lack of accuracy in recognizing all kinds of voices and all kinds of faces," they explained, noting that much of the software they tested tended to be more responsive to the subtle intonations of Caucasian American voices, as these were the phonemes the software was trained on.

The Apple HomePod is among a wide range of products that incorporate voice assistants

It is anticipated that as voice assistants become more sophisticated, they will offer more individualized experiences, with many companies already investing in smarter voice processing systems. For example, Google Translate has evolved from a simple text translation service into an advanced translator capable of handling dozens of languages in seconds. It is also predicted that users will become more emotionally attached to their assistants. "A third of regular voice users in Singapore have admitted to having a sexual fantasy about their voice assistant, which demonstrates just how easy it is to anthropomorphize this form of AI technology," Ida Siow, head of planning for Singapore & SEA at J Walter Thompson, explained in a 2017 Forbes article. "It's just easier to connect with a voice."

While we may be far off from a future in which our AI devices become our best friends and potential partners (remember the 2013 movie Her?), there is no denying that we’re heading into a brave new world of tech. And, it seems, we’ll have plenty of voices leading the way.

Text by Laura Feinstein.