CAAI Special Column | Xuedong Huang: Speech and Language to AI Evolution

Source: Special column, 《中國人工智能學會通訊》 (CAAI Communications), 2018, Issue 1


1 Speech and Language to Homo Sapiens

Amongst all creatures, the human species stands unique in Darwin's natural selection process because of our ability to communicate, our ability to manipulate symbols, and our ability to construct language. Speech and language provide the means by which we pass our collective intelligence from one generation to the next. It is no exaggeration to state that it is speech and language that differentiated human intelligence from animal intelligence over the course of evolution:

● We communicate and collaborate with speech, write down what we know, and pass our knowledge from one generation to another.

● We create laws to properly govern ourselves.

● We express our deep insights and feelings.

The power of speech and language to influence our culture, art, science, and innovation is unlimited. The impact of speech and language on the evolution of Artificial Intelligence (AI) should be as foundational as their impact on the evolution of Homo sapiens!

2 Speech and Language to AI

AI has been envisioned by science fiction for years, but today’s AI technologies still bear little resemblance to the human-like capabilities that are a staple of the fictional realm. In part, this is due to the persistent limitations of the speech and language understanding capabilities in AI agents.

There are two broad sets of AI capabilities:

● Perceptual intelligence: speech and vision, mostly pattern recognition capabilities (other forms of perceptual intelligence include smell, touch, and taste, but these are not uniquely human).

● Cognitive intelligence: language understanding, learning, planning, and reasoning. There is a continuum from low-level perception to high-level cognitive reasoning, and high-level feedback constrains and guides low-level decisions.

We have witnessed computers' ever-improving ability to handle perceptual tasks, from speech recognition to image classification. For truly cognitive tasks, however, deep challenges remain unsolved.

3 Speech Recognition Lessons

In a historical review of speech recognition[1] published in 2014, the authors projected that the speech community was on its way to passing a Turing test for everyday speech recognition scenarios within the next 40 years. Less than three years after that review appeared in Communications of the ACM, Microsoft researchers demonstrated that speech recognition is on par with human transcribers on the widely used Switchboard research benchmark, in which two people converse naturally over a telephone line on a wide range of everyday topics. The Switchboard task has challenged speech recognition researchers for decades, as illustrated in Figure 1 (adapted from an Economist cover story[2] reporting the latest milestone[3] of a 5.1% word error rate). This historic milestone is the first step toward significantly broadening speech recognition for everyday scenarios.


Figure 1 Historical speech recognition word error rate chart on a wide range of public benchmark tasks (adapted from Economist, January 2017 cover story). Microsoft's latest Switchboard error rate of 5.1% added in green

The establishment of machine learning algorithms, supported by powerful computing infrastructure and massive training data, has been the most significant driving force in advancing speech capabilities. To reach the human parity milestone, Microsoft's Switchboard speech system[4] applied these pillars of machine learning: modeling research, data, and compute power. We used multiple acoustic models and multiple language models, fine-tuned by thousands of experiments on massive GPU clusters. The need to run many thousands of speech recognition experiments in a timely manner drove Microsoft to produce the most efficient deep learning toolkit, the Microsoft Cognitive Toolkit (CNTK[5]), which was instrumental in achieving human parity.

While the Switchboard human parity milestone is impressive, the system behind it is not yet available as a real-time production service in Azure Cognitive Services. In addition, recognition of far-field, noisy, and accented speech remains challenging. The error rate under these conditions can be significantly higher than that of near-field speech recognition, depending on microphone distance, reverberation, environmental noise, and speaker characteristics.

4 Speech Impact to Other Fields

The speech recognition community has a long history of pioneering foundational machine learning approaches and technologies that have deeply influenced related fields in AI.

In the 1980s, the traditional AI approach of encoding expert knowledge in procedural and rule-based systems for speech recognition yielded to the statistical machine learning framework best exemplified by Hidden Markov Models (HMMs). While HMMs were well known before, it was the advent of the Baum-Welch algorithm that made it possible to estimate the model parameters efficiently. HMMs treat phonetic, word, syntactic, and semantic knowledge representations in a unified manner. As such, we no longer need explicit segmentation and labeling of phonetic strings: phonetic matching and word verification are unified with word sequence generation such that the overall probability of a hypothesis is maximized.
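To make the statistical framework concrete, below is a minimal sketch of the HMM forward-backward computation and one Baum-Welch-style re-estimation step. The toy two-state model, its probabilities, and the observation sequence are assumptions chosen only to keep the example runnable; they do not represent the large-vocabulary systems discussed in this article.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward probabilities alpha[t, i] = P(o_1..o_t, state_t = i)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """Backward probabilities beta[t, i] = P(o_{t+1}..o_T | state_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Toy 2-state, 3-symbol model and an observation sequence (all assumptions).
A = np.array([[0.7, 0.3], [0.4, 0.6]])          # state transitions
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # discrete emission probabilities
pi = np.array([0.6, 0.4])                        # initial state distribution
obs = [0, 1, 2, 1]

alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
print("sequence likelihood:", alpha[-1].sum())   # P(observations | model)

# Re-estimate A from expected transition counts (Baum-Welch M-step).
# The missing 1/P(observations) factor cancels in the row normalization.
xi = np.zeros_like(A)
for t in range(len(obs) - 1):
    xi += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A
A_new = xi / xi.sum(axis=1, keepdims=True)
print("re-estimated transitions:\n", A_new)
```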

In 2010, Microsoft researchers[6] successfully demonstrated that HMM output distributions can be modeled with feed-forward deep neural networks (DNNs) trained with a senone-based cross-entropy criterion. The new system significantly improved acoustic model accuracy for large-vocabulary speech recognition. More recently, recurrent neural networks such as long short-term memory (LSTM) networks have further improved on DNN performance by modeling the time-varying characteristics of speech and language. Deeper convolutional neural networks, originally applied to computer vision tasks[7], also benefited speech when combined with LSTMs and other types of DNNs.
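As an illustration of the hybrid DNN-HMM idea, the sketch below maps a window of stacked acoustic frames to senone posteriors with a feed-forward network trained under cross entropy. All sizes (440 stacked features, 9,000 senones, six hidden layers of 2,048 units) are assumed values for the sketch, not the configuration of the cited Microsoft system.

```python
import torch
import torch.nn as nn

FEATURE_DIM, NUM_SENONES = 440, 9000

# Feed-forward acoustic model: stacked frames -> senone logits.
acoustic_model = nn.Sequential(
    *[layer
      for i in range(6)
      for layer in (nn.Linear(FEATURE_DIM if i == 0 else 2048, 2048), nn.Sigmoid())],
    nn.Linear(2048, NUM_SENONES),   # softmax is folded into the loss below
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(acoustic_model.parameters(), lr=0.01)

# One training step on a random mini-batch standing in for aligned frames.
frames = torch.randn(256, FEATURE_DIM)                   # stacked acoustic frames
senone_targets = torch.randint(0, NUM_SENONES, (256,))   # forced-alignment labels

logits = acoustic_model(frames)
loss = criterion(logits, senone_targets)
loss.backward()
optimizer.step()

# At decoding time, log-posteriors (divided by senone priors in practice)
# serve as the HMM state likelihoods in place of GMM output densities.
scaled_likelihoods = torch.log_softmax(logits, dim=-1)
```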

It is now well understood that DNNs are well suited to learning the nonlinear mapping between two sequences. Given a large amount of training data, a DNN can generalize the sequence-to-sequence mapping well compared with other machine learning approaches. For speech recognition, the task is to map the speech waveform sequence to the corresponding word sequence, and the objective is to minimize the word error rate. While direct sequence-to-sequence mapping between the speech waveform and the word sequence with a DNN has been shown to be possible, it is typically more parsimonious to build a hybrid model using prior speech and language knowledge, which avoids the cost of learning the entire input-output mapping from scratch.
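Since the objective above is phrased in terms of word error rate (WER), here is a compact reference implementation of that metric: the word-level edit distance between the hypothesis and the reference transcript, normalized by the reference length. The example sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitution,
                             dist[i - 1][j] + 1,   # deletion
                             dist[i][j - 1] + 1)   # insertion
    return dist[len(ref)][len(hyp)] / len(ref)

# WER can exceed 1.0 when the hypothesis has many insertions.
print(word_error_rate("we recognize speech", "we wreck a nice beach"))  # ~1.33
```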

Lessons learned in speech recognition have directly influenced machine translation[8], which went through a similar paradigm shift. Statistical machine translation (SMT) significantly improved on the quality of rule-based MT approaches. Neural MT (NMT) further improved on SMT, as illustrated by both Microsoft's and Google's quality benchmarks in Figure 2. Both companies brought NMT to market around the same time, with significantly improved translation quality that benefited consumers broadly.


Figure 2 Both Microsoft and Google brought NMT to the market for consumers around the same time with significantly improved MT quality

5 Language Understanding Remains Challenging

While deep learning has helped well-defined perceptual AI tasks such as speech and image recognition, it remains unclear whether natural language understanding breakthroughs are within our reach, due to many challenges including:

● A lack of abundant domain-independent training data for general-purpose language understanding.

● A clearly defined language understanding target domain does not exist. Whereas speech recognition, speech synthesis, and machine translation tasks have well-defined inputs and outputs, the output domain for language understanding is not well defined.

● Language understanding requires both common sense and domain knowledge.

The final point is a chicken-and-egg problem, since acquiring knowledge requires language understanding, and language understanding requires knowledge to start with. It is important to note that developing common-sense knowledge also requires fully sensing the world, including social interaction, to grasp the fundamental meanings associated with speech and language.

Most of today's natural language understanding approaches depend on labor-intensive, rule-based engineering to acquire a consistent knowledge base (KB). These rules are often fragile, with limited ability to generalize to novel patterns of language. While traditional methods have been augmented with statistical and deep learning approaches, the improvements have been limited and do not transfer well to general-purpose tasks. As a result, most of today's conversational dialog systems remain either limited to web search (Bing/Cortana and Google Search/Google Assistant) or effective only for a limited set of specific domains.

Since no general-purpose language understanding training data is available, it is possible that MT may hold the key to future breakthroughs, thanks to the abundance of parallel multilingual data. While NMT does not directly address language understanding problems, NMT models may essentially contain the underlying semantic representation of languages. For example, NMT embedding vectors can substantially improve language understanding performance on Stanford's SQuAD[9] question-answering task.
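As a hedged illustration of how NMT-derived representations might feed a downstream understanding model, the sketch below concatenates contextual vectors from a stand-in pretrained translation encoder with ordinary word embeddings before a task-specific head. The frozen bidirectional LSTM, all dimensions, and the tiny classifier are assumptions for illustration, not the specific SQuAD system referenced above.

```python
import torch
import torch.nn as nn

VOCAB, EMB_DIM, NMT_DIM, NUM_CLASSES = 10000, 300, 600, 5

word_embeddings = nn.Embedding(VOCAB, EMB_DIM)

# Stand-in for the encoder of a pretrained NMT system; in practice its
# weights would be loaded from the translation model and kept frozen.
nmt_encoder = nn.LSTM(EMB_DIM, NMT_DIM // 2, batch_first=True, bidirectional=True)
for p in nmt_encoder.parameters():
    p.requires_grad = False

# Task-specific head for a hypothetical downstream understanding task.
task_head = nn.Sequential(nn.Linear(EMB_DIM + NMT_DIM, 256), nn.ReLU(),
                          nn.Linear(256, NUM_CLASSES))

tokens = torch.randint(0, VOCAB, (8, 20))       # a batch of token ids
emb = word_embeddings(tokens)                   # (8, 20, 300)
context, _ = nmt_encoder(emb)                   # (8, 20, 600) contextual vectors
features = torch.cat([emb, context], dim=-1)    # word + translation-derived features
logits = task_head(features.mean(dim=1))        # pooled, then classified
```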

For domain-specific language understanding, the outcome can be very impressive. One such example is Microsoft’s Dynamics 365 AI solution[10] for customer care – an intelligent dialog system designed to support enterprise customer care (Project Toronto) as illustrated in Figure 3. It has been successfully deployed at companies such as Microsoft[11], HP, and Macy’s.


Figure 3 Microsoft's Project Toronto for customer support conversational dialog

As expected, knowledge extraction remains labor-intensive, relying on hand-crafted or semi-automated dialog construction.

The diagnosis policy learner model is obtained by augmenting existing search relevance models and processes, engaging customers in multi-turn dialogs to clarify and resolve the targeted intent. Deep reinforcement learning is embedded in the live-agent session to close the feedback loop for continuous improvement; this requires long-term usage data from live agents to form a virtuous cycle, as illustrated in Figure 4.


Figure 4 System architecture of Project Toronto for customer support services
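To illustrate the reinforcement-learning feedback loop described above in the simplest possible terms, the sketch below updates a tabular dialog policy from a logged reward signal. The states, actions, reward, and Q-learning update here are purely illustrative assumptions; they do not describe Project Toronto's actual design.

```python
import random
from collections import defaultdict

# A dialog policy chooses, at each turn, whether to ask a clarifying question,
# present a solution, or escalate; it is updated from a reward such as
# "issue resolved without escalating to a live agent".
ACTIONS = ["ask_clarifying_question", "present_solution", "escalate_to_agent"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

q_table = defaultdict(float)  # (state, action) -> estimated return

def choose_action(state):
    """Epsilon-greedy selection over the dialog actions."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

def update(state, action, reward, next_state, done):
    """One Q-learning update from a logged dialog turn."""
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in ACTIONS)
    target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])

# Example: replaying one logged customer-care turn; the +1 reward stands in
# for a successful self-service resolution.
state, next_state = "intent_unclear", "intent_resolved"
action = choose_action(state)
update(state, action, reward=1.0, next_state=next_state, done=True)
print(action, q_table[(state, action)])
```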

While Project Toronto can provide a much deeper dialog service, its knowledge is limited to a company's specific customer support domain. Web search services such as Bing and Google have much broader domain coverage, but their answers are not as deep as those of Project Toronto, which is optimized for a specific domain. These two examples illustrate the current dilemma of language understanding, depicted as a T-shaped diagram in Figure 5.


Figure 5 Language understanding/conversational systems in a T-shaped breadth vs. depth dilemma

6 Speech and Language Outlook

I have argued here that speech and language have a role in the evolution of AI that mirrors their key role in the development of human intelligence. With the advent of ambient computing devices, from the Amazon Echo with Alexa[12] to the Harman/Kardon Invoke with Cortana[13], speech and language technologies, albeit still primitive, can already dramatically help people with a wide range of tasks, from Office 365 calendaring to enjoying music at home. Another example is Microsoft's Chinese chit-chat bot Xiaoice[14], whose popularity is attributed to its distinct, teenage-girl-like personality.

Xiaoice makes unique use of advanced speech technologies, as illustrated in Figure 6. On the input side, speech recognition together with gender, age, emotion, and music identification feeds language understanding to provide more targeted, personalized responses. On the output side, words can be read, sung, or pronounced in a sad, happy, angry, or upset manner, bringing out the personality of the bot. These additional sensing signals are very helpful to the overall user experience.


Figure 6 Advanced speech technologies used in Xiaoice (a) Speech Input (b) Speech Output

The research community has made dramatic progress on speech recognition, speech synthesis, machine translation, sentiment analysis, and domain-specific language understanding. However, general-purpose language understanding and contextual reasoning with a continuous learning capability remain an elusive goal for all of us. Only after we fill the box in Figure 5 with many vertical domains can we claim we are approaching general language understanding for the dawn of AI. Of course, the whole box must make sense, with a cohesive story built on a knowledge graph spanning all domains.

With the relentless pursuit of human parity in general-purpose language understanding, I hope AI will eventually understand all books and all media ever created by humans across different fields. With such an ability to fully comprehend everything humans have ever created, AI's societal impact will exceed the most dramatic visions of science fiction we have ever seen.

[1] https://cacm.acm.org/magazines/2014/1/170863-a-historical-perspective-of-speech-recognition/fulltext
[2] http://www.economist.com/technology-quarterly/2017-05-01/language
[3] https://www.microsoft.com/en-us/research/blog/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone/
[4] https://www.microsoft.com/en-us/research/publication/microsoft-2017-conversational-speech-recognition-system/
[5] https://github.com/Microsoft/CNTK
[6] https://www.microsoft.com/en-us/research/blog/speech-recognition-leaps-forward/
[7] https://blogs.microsoft.com/ai/microsoft-researchers-win-imagenet-computer-vision-challenge/
[8] https://blogs.msdn.microsoft.com/translation/
[9] https://rajpurkar.github.io/SQuAD-explorer/
[10] https://www.microsoft.com/en-us/AI/ai-solutions
[11] https://support.microsoft.com/en-us/contact/virtual-agent/?flowId=smc-home-hero
[12] https://www.amazon.com/Amazon-Echo-And-Alexa-Devices/b?ie=UTF8&node=9818047011
[13] https://www.microsoft.com/en-us/windows/cortana
[14] http://www.msxiaoice.com/


Xuedong Huang, Ph.D., is a Technical Fellow in Microsoft AI and Research. He currently leads Microsoft's global teams in the United States, China, Germany, and Israel, responsible for the research and development of Microsoft's speech and language AI products and technologies. As Microsoft's Chief Speech Scientist, he led the speech and dialog research team that achieved a historic speech recognition milestone in 2016.
