Source: Machine learning algorithm full stack engineer

Author: Mengkang

Editor: Wang Shuwei

About 6,050 words in total; recommended reading time is 10 minutes.
This article will help you understand the research methods and future trends of intelligent dialogue systems.

While doing research recently, the author came across a very good paper on dialogue systems, "A Survey on Dialogue Systems: Recent Advances and New Frontiers". The paper comes from the JD.com (Jingdong) data team and cites 124 recent papers. It is a comprehensive and carefully written introduction to dialogue systems, so today we focus on it and present it to readers.


A virtual assistant or chat partner with enough intelligence once seemed illusory, something that would exist only in science fiction films for a long time to come. In recent years, more and more researchers have turned their attention to human-computer conversation because of its potential and its attractive commercial value.

With the development of big data and deep learning, building an automated human-computer conversation system to serve as a personal assistant or chat partner is no longer a fantasy.

Dialogue systems are currently attracting growing attention in many fields, and steady progress in deep learning has greatly advanced their development. For dialogue systems, deep learning can use large amounts of data to learn feature representations and response generation strategies with only a small amount of manual work.

Today we can easily obtain conversational "big data" from the Internet, from which we may be able to learn how to respond to almost any input. This makes it far more feasible to build data-driven, open-domain dialogue systems.

On the other hand, deep learning has proven effective at capturing complex patterns in big data, and it powers many research fields such as computer vision, natural language processing, and recommender systems. In the paper, the authors give an overview of these recent advances in dialogue systems from different perspectives and discuss some possible research directions.

Specifically, dialogue systems can be roughly divided into two types:

Task-oriented dialogue systems, and

Non-task-oriented dialogue systems (also known as chatbots).

The purpose of a task-oriented system is to help users complete practical, specific tasks, such as finding products or booking hotels and restaurants.

The widely used approach in task-oriented systems is to treat the dialogue response as a pipeline, as shown in the figure below:

The system first understands the message the human conveys and represents it as an internal state. It then selects a series of actions according to the dialogue policy for that state, and finally converts the chosen action into natural language.

Although language understanding is handled by statistical models, most deployed dialogue systems still rely on hand-crafted features or rules for state and action space representation, intent detection, and slot filling.

A non-task-oriented dialogue system interacts with people to provide reasonable responses and entertainment, and generally focuses on conversing in open domains. Although such a system appears to be merely chatting, it plays a role in many practical applications.

Data shows that in online shopping scenarios, nearly 80% of utterances are chit-chat, and how these messages are handled is closely tied to the user experience.

Generally speaking, there are currently two main approaches to non-task-oriented dialogue systems:

Generative methods, such as sequence-to-sequence models (seq2seq), generate an appropriate response during the conversation. Generative chatbots are a hot research topic. Unlike retrieval-based chatbots, they can produce entirely new replies, which makes them relatively flexible. They also have shortcomings: they sometimes make grammatical errors or generate meaningless replies.

Retrieval-based methods learn to select a reply to the current conversation from a predefined index. Their drawback is a heavy dependence on data quality: if the selected data is poor, all the earlier effort is likely to be wasted.
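To make the retrieval idea concrete, here is a minimal sketch, not the method of any particular paper. It assumes a tiny hand-written index of (message, reply) pairs and uses bag-of-words cosine similarity as the matching function; the names `INDEX`, `bow`, `cosine`, and `retrieve_reply` are hypothetical. Real systems learn the matching function rather than counting shared words.

```python
import math
from collections import Counter

# Hypothetical predefined index of (message, reply) pairs, for illustration only.
INDEX = [
    ("what time do you open", "We open at 9 am every day."),
    ("where is my order", "You can track your order on the orders page."),
    ("do you ship overseas", "Yes, we ship to most countries."),
]

def bow(text):
    """Lowercased bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_reply(message, index=INDEX):
    """Return the stored reply whose indexed message best matches the input."""
    query = bow(message)
    best = max(index, key=lambda pair: cosine(query, bow(pair[0])))
    return best[1]
```

Because the reply is copied from the index, it is always fluent, but it can never be better than the data the index was built from, which is the dependence on data quality described above.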

In recent years, the rapid development of big data and deep learning has greatly advanced both task-oriented and non-task-oriented dialogue systems.

In the paper, the authors aim to:

Give an overview of dialogue systems, especially recent advances based on deep learning;

Discuss possible research directions.

Task-oriented systems

Task-oriented dialogue systems are an important branch of dialogue systems. In this section, the authors summarize the pipeline methods and end-to-end methods for task-oriented dialogue systems.

Pipeline methods

The typical structure of a task-oriented dialogue system is shown in the earlier figure. It consists of four key components:

Natural language understanding (Natural Language Understanding, NLU): parses user input into predefined semantic slots.

Given an utterance, natural language understanding maps it to semantic slots. The slots are predefined for different scenarios.

The figure above shows an example of a natural language representation, in which "New York" is the value of the location slot, and the domain and intent are also specified. Typically there are two types of representation. One is utterance-level classification, such as the user's intent and the utterance category. The other is word-level information extraction, such as named entity recognition and slot filling. Intent detection identifies the user's intent by classifying the utterance into one of the predefined intents.
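The two kinds of representation can be illustrated with a toy semantic frame. This is only a sketch under stated assumptions: the keyword lists, the location gazetteer, the example utterance, and the names `Frame` and `understand` are all hypothetical, and a real NLU module would use trained classifiers and sequence taggers instead of string matching.

```python
from dataclasses import dataclass, field

# Hypothetical gazetteer and intent keywords, for illustration only.
LOCATIONS = ["new york", "boston", "beijing"]
INTENT_KEYWORDS = {
    "find_restaurant": ["restaurant", "eat"],
    "book_hotel": ["hotel", "room"],
}

@dataclass
class Frame:
    """Semantic frame: one utterance-level intent plus word-level slot values."""
    intent: str
    slots: dict = field(default_factory=dict)

def understand(utterance):
    text = utterance.lower()
    words = text.split()
    # Utterance-level classification: first intent with a matching keyword.
    intent = next(
        (name for name, keywords in INTENT_KEYWORDS.items()
         if any(k in words for k in keywords)),
        "unknown",
    )
    # Word-level extraction: fill the location slot from the gazetteer.
    slots = {}
    for location in LOCATIONS:
        if location in text:
            slots["location"] = location.title()
            break
    return Frame(intent=intent, slots=slots)
```

For example, `understand("Show restaurant at New York tomorrow")` yields the intent `find_restaurant` with the `location` slot set to `New York`.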

Dialogue state tracking (Dialogue State Tracker, DST). Dialogue state tracking is the core component that ensures the robustness of a dialogue system. It estimates the user's goal at each turn of the conversation, taking each turn's input and the dialogue history and outputting the current dialogue state. This typical state structure is often called slot filling or a semantic frame. Traditional methods, widely used in most commercial implementations, usually apply hand-crafted rules to select the most likely result. However, these rule-based systems are prone to frequent errors, because the most likely result is not always the desired one.

A recent deep learning approach uses a sliding window to output a sequence of probability distributions over an arbitrary number of possible values. Although it is trained in one domain, it can be transferred easily to new domains. The most commonly used models here are multi-domain RNN dialogue state tracking models and the Neural Belief Tracker (NBT).
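The hand-crafted rule described above, keeping the most likely value for each slot, might look like the following sketch. The state layout and the function name `update_state` are assumptions made for illustration, and the confidence scores are invented inputs that an NLU module would normally supply.

```python
def update_state(state, nlu_hypotheses):
    """Rule-based tracker.

    state: dict mapping slot -> (value, score)
    nlu_hypotheses: list of (slot, value, score) tuples for the current turn
    """
    new_state = dict(state)
    for slot, value, score in nlu_hypotheses:
        # Hand-written rule: keep the highest-scoring value seen so far.
        if slot not in new_state or score > new_state[slot][1]:
            new_state[slot] = (value, score)
    return new_state
```

The sketch also shows why such rules fail. If the user says "New York" with score 0.9 in the first turn and then changes their mind to "Boston", recognized with score 0.6, the rule keeps "New York": the most likely value is not the one the user now wants.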

Dialogue policy learning. Based on the state representation from the state tracker, policy learning generates the next available system action. Both supervised learning and reinforcement learning can be used to optimize the policy. Supervised learning is applied to the actions produced by rules. In an online shopping scenario, for example, if the dialogue state is "recommend", the "recommend" action is triggered and the system retrieves products from the product database. Reinforcement learning can then be introduced to train the dialogue policy further and guide the system toward its final policy. In actual experiments, reinforcement learning outperforms rule-based and supervised methods.
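The online shopping rule can be written down directly. The sketch below is a hand-written policy under assumed names: `PRODUCTS` is a made-up product database, and the state keys `status` and `category` as well as the `request` and `greet` actions are hypothetical additions. A supervised or reinforcement-learned policy would replace `rule_policy` with a trained function of the same shape.

```python
# Hypothetical product database, for illustration only.
PRODUCTS = [
    {"name": "Phone A", "category": "phone"},
    {"name": "Laptop B", "category": "laptop"},
    {"name": "Phone C", "category": "phone"},
]

def rule_policy(state):
    """Map a dialogue state to the next system action with hand-written rules."""
    if state.get("status") == "recommend":
        category = state.get("category")
        if category is None:
            # Not enough information yet: ask the user for the missing slot.
            return {"action": "request", "slot": "category"}
        # The "recommend" state triggers the "recommend" action and a database lookup.
        items = [p["name"] for p in PRODUCTS if p["category"] == category]
        return {"action": "recommend", "items": items}
    return {"action": "greet"}
```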

Natural language generation (Natural Language Generation, NLG). It maps the selected action to a generated reply.

A good generator usually depends on several factors: adequacy, fluency, readability, and variation. Conventional NLG methods usually perform sentence planning. They map the input semantic symbols to an intermediate form that represents the utterance, such as a tree or template structure, and then convert that intermediate structure into the final response through surface realization. A mature deep learning approach is an LSTM-based encoder-decoder, which combines the question information, semantic slot values, and dialogue act type to generate the correct answer. An attention mechanism is also used to attend to the key information for the decoder's current decoding state, so that different responses are generated for different act types.
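The conventional template route can be sketched in a few lines. The template strings, the action format, and the function name `generate` are assumptions for illustration; the neural encoder-decoder generators described above replace the fixed templates with a learned decoder.

```python
# Hypothetical templates keyed by dialogue act type.
TEMPLATES = {
    "recommend": "How about {items}?",
    "request": "Which {slot} are you interested in?",
    "greet": "Hello, how can I help you?",
}

def generate(action):
    """Turn a system action into text by filling the template for its act type."""
    template = TEMPLATES[action["action"]]
    # Join list-valued fields so they read naturally inside the sentence.
    fields = {
        key: ", ".join(value) if isinstance(value, list) else value
        for key, value in action.items()
        if key != "action"
    }
    return template.format(**fields)
```

Templates give fluent and readable output but little variation, which is one reason learned generators are attractive.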

End-to-end methods

Traditional task-oriented dialogue systems involve a great deal of domain-specific hand-crafting, which makes them hard to adapt to new domains. In recent years, with the development of end-to-end neural generative models, end-to-end trainable frameworks have been built for task-oriented dialogue systems. Note that more details on neural generative models are discussed when we introduce non-task-oriented dialogue systems. Unlike the traditional pipeline model, an end-to-end model uses a single module and interacts with a structured external database.

The model above is a network-based, end-to-end trainable task-oriented dialogue system. It treats dialogue system learning as the problem of learning a mapping from dialogue history to system reply, and trains an encoder-decoder model to do so. However, the system is trained in a supervised fashion: it not only requires a lot of training data, but may also fail to find a good policy because the training data does not sufficiently explore dialogue control.

With the development of reinforcement learning research, the model above was the first to propose an end-to-end reinforcement learning approach that jointly trains dialogue state tracking and dialogue policy learning within dialogue management, so that the system's actions are optimized more effectively.

Non-task-oriented systems

Unlike task-oriented dialogue systems, whose goal is to complete specific tasks for users, non-task-oriented dialogue systems (also known as chatbots) focus on conversing with people in open domains. Generally speaking, chatbots are implemented with either generative methods or retrieval-based methods.

Generative models can produce more appropriate responses that may never have appeared in the corpus, whereas retrieval-based models have the advantage of informative and fluent responses.

1. Neural Generative Models

The successful application of deep learning to machine translation, namely neural machine translation, has sparked enthusiasm for research on neural generative dialogue. The current hot research topics for neural generative models are as follows.

1.1 Sequence-to-Sequence Models

Given an input sequence (message) $X = (x_1, x_2, \ldots, x_T)$ of $T$ words and a target sequence (response) $Y = (y_1, y_2, \ldots, y_{T'})$ of length $T'$, the model maximizes the conditional probability of $Y$ given $X$:

$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$$

Specifically, the Seq2Seq model has an encoder-decoder structure, shown in the figure below:

The encoder reads $X$ word by word and, through a recurrent neural network (RNN), represents it as a context vector $c$. The decoder then takes $c$ as input and estimates the generation probability of $Y$.

Encoder:

The encoder process is simple: it directly uses an RNN (usually an LSTM) to generate the semantic vector:

$$h_t = f(x_t, h_{t-1})$$

where $f$ is a nonlinear function such as an LSTM or GRU, $h_{t-1}$ is the output of the previous hidden node, and $x_t$ is the input at the current time step. The vector $c$ is usually the last hidden node of the RNN ($h$, the hidden state), or a weighted sum of several hidden nodes.
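As a rough illustration of the encoder, the sketch below applies the usual recurrence, in which each hidden state is a nonlinear function of the current input and the previous hidden state, and returns the last hidden state as the context vector c. It is an assumption-laden simplification: a plain tanh RNN stands in for the LSTM or GRU, the weight matrices `W` and `U` are supplied by the caller rather than learned, and the function names are hypothetical.

```python
import math

def rnn_step(x, h_prev, W, U):
    """One step h_t = tanh(W x_t + U h_{t-1}); a plain RNN stands in for LSTM/GRU."""
    return [
        math.tanh(
            sum(W[i][j] * x[j] for j in range(len(x)))
            + sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))
        )
        for i in range(len(W))
    ]

def encode(inputs, W, U):
    """Read the word vectors left to right; return the last hidden state as c."""
    h = [0.0] * len(W)  # initial hidden state
    for x in inputs:
        h = rnn_step(x, h, W, U)
    return h
```

Here `W` has one row per hidden unit and one column per input dimension, and `U` is square with the hidden size on both sides.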

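Decoder:

The decoder uses another RNN to predict the current output symbol $y_t$ from the current hidden state $s_t$. Here $s_t$ and $y_t$ both depend on the previous hidden state and the previous output. The objective function of Seq2Seq is defined as:

$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = p(y_1 \mid c) \prod_{t=2}^{T'} p(y_t \mid c, y_1, \ldots, y_{t-1})$$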
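In practice the objective is evaluated in log space: by the chain rule, the log-probability of a response is the sum of the per-step log-probabilities of each target word given the context and the words before it. The sketch below assumes the decoder has already produced one score per vocabulary word at each step; `step_scores`, `softmax`, and `sequence_log_prob` are hypothetical names, and the scores in a real model would come from the decoder RNN.

```python
import math

def softmax(scores):
    """Turn raw vocabulary scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_scores, target_ids):
    """Sum over steps of log p(y_t | c, y_1..y_{t-1}).

    step_scores: one list of vocabulary scores per decoding step
    target_ids: index of the correct word at each step
    """
    return sum(
        math.log(softmax(scores)[y])
        for scores, y in zip(step_scores, target_ids)
    )
```

Training maximizes this quantity over the message-response pairs in the corpus, which is equivalent to minimizing the usual cross-entropy loss.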

1.2 Dialogue Context

Taking the contextual information of a dialogue into account is key to building a dialogue system, because it keeps the conversation coherent and improves the user experience. A hierarchical RNN model captures the meaning of individual utterances and then integrates them into a complete dialogue.

The hierarchical structure is also extended with attention mechanisms at the word level and the sentence level.

Experiments show that:

Hierarchical RNNs usually outperform non-hierarchical RNNs;

With context information taken into account, neural networks tend to produce longer, more meaningful, and more diverse responses.
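The hierarchical idea, encoding each utterance first and then encoding the sequence of utterance vectors, can be sketched as two stacked encoders. This is a simplified illustration rather than any paper's implementation: a plain tanh RNN is used at both levels, attention is omitted, the weights are passed in instead of learned, and the names `rnn_encode` and `encode_dialogue` are hypothetical.

```python
import math

def rnn_encode(vectors, W, U):
    """Plain tanh RNN over a list of vectors; returns the final hidden state."""
    h = [0.0] * len(W)
    for x in vectors:
        # The right-hand side reads the previous h before h is reassigned.
        h = [
            math.tanh(
                sum(W[i][j] * x[j] for j in range(len(x)))
                + sum(U[i][k] * h[k] for k in range(len(h)))
            )
            for i in range(len(W))
        ]
    return h

def encode_dialogue(dialogue, word_params, context_params):
    """dialogue: list of utterances, each a list of word vectors.

    word_params and context_params are (W, U) pairs for the two levels.
    """
    # Word level: one vector per utterance.
    utterance_vectors = [rnn_encode(utt, *word_params) for utt in dialogue]
    # Context level: a second RNN runs over the utterance vectors.
    return rnn_encode(utterance_vectors, *context_params)
```

The input dimension of the context-level `W` must equal the hidden size of the word-level encoder, since the utterance vectors are its inputs.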

In the figure above, the problem of context-sensitive reply generation is addressed by representing the entire dialogue history (including the current message) with continuous representations, or embeddings, of words and phrases.

In the structure shown in the figure above, the authors introduce a two-level attention mechanism that lets the model automatically learn the importance of words and sentences, so that it can better generate the next turn of the dialogue.

At the sentence level, learning runs in reverse: the message in a later sentence is more likely to contain the information of the previous one, so overall the model learns the dialogue by using the content of each turn in reverse order.

1.3 Response Diversity

Generative models tend to produce meaningless replies such as "I don't know" and "I am OK".

The paper "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models" uses a latent variable.

1.4 Topic and Personality




1.5 Outside Knowledge Base



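The figure above shows the fully data-driven, knowledge-grounded dialogue model proposed by the authors. World Facts is a collection of sentences, some verified by authoritative sources and some possibly inaccurate, that serves as the knowledge base.

Given an input S and the dialogue history, the relevant facts must be retrieved from the Fact collection. An IR engine performs the retrieval, and the retrieved facts are then passed to the Fact Encoder.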


1.6 Evaluation



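Compute the BLEU score, that is, directly measure the word overlap between the ground truth and the generated reply. Because one utterance can have many valid replies, BLEU may in some respects be poorly suited to dialogue evaluation.

Compute embedding distance. These methods come in three variants: direct summation followed by averaging, taking absolute values before averaging, and greedy matching.

Measure diversity, which mainly depends on the number of distinct n-grams and the entropy value.

Run a Turing test, using a retrieval-based discriminator to evaluate the generated replies.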
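Three of the automatic measures listed above can be sketched with the standard library alone. The code is illustrative, not a reference implementation: `ngram_precision` is only the clipped overlap term inside BLEU (no brevity penalty and no averaging over n-gram orders), `distinct_n` divides the distinct count by the total count, which is a common convention assumed here, and `embedding_average_similarity` expects a caller-supplied word-vector table `vectors`. All function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision: the word-overlap term inside BLEU."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / total

def distinct_n(replies, n=1):
    """Distinct n-grams divided by total n-grams across all replies."""
    all_ngrams = [g for reply in replies for g in ngrams(reply, n)]
    return len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0

def embedding_average_similarity(candidate, reference, vectors):
    """Cosine similarity between the mean word vectors of two replies."""
    dim = len(next(iter(vectors.values())))

    def mean(tokens):
        known = [vectors[t] for t in tokens if t in vectors]
        if not known:
            return [0.0] * dim
        return [sum(v[i] for v in known) / len(known) for i in range(dim)]

    a, b = mean(candidate), mean(reference)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

A reply can score zero on word overlap and still be a perfectly good answer, which is the weakness of BLEU for dialogue noted above. Embedding-based measures soften this by rewarding replies whose words are close in vector space.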

2. Retrieval-Based Methods

2.1 Single-Turn Response Matching

2.2 Multi-Turn Response Matching

2.3 Hybrid Methods