
ACL 2022 | DialogVED: a pre-trained latent variable encoder-decoder model for dialogue response generation

2022-04-23 16:41:47 PaperWeekly


Author | Hanscal

Research direction | Knowledge graphs, dialogue systems


Paper title:

DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation

Paper link:

https://openreview.net/forum?id=WuVA5LBX5zf

This is a dialogue pre-training work at ACL 2022 from the Data Intelligence and Social Computing Laboratory of Fudan University (Fudan DISC). The paper focuses on the important research question of how to generate relevant and diverse responses in the open domain, and proposes a new dialogue pre-training framework, DialogVED, which introduces continuous latent variables into an enhanced encoder-decoder framework to improve the relevance and diversity of generated responses.

Dialogue generation is well known to be a one-to-many problem: a single conversation context can be followed by multiple reasonable responses. Existing work introduces latent variables to model this. Recently, PLATO introduced discrete latent variables into dialogue pre-training and showed significant performance improvements on multiple downstream response generation tasks. Besides discrete latent variables, continuous latent variables are also commonly used to model one-to-many mappings in dialogue systems, but the potential of combining continuous latent variables with large-scale language pre-training has rarely been explored.


Optimization objectives

In this paper, the authors propose DialogVED, a model that introduces continuous latent variables into the encoder-decoder framework, and pre-train it with the following four objectives: 1) a masked language model loss, to enhance the encoder's understanding of the context; 2) an n-gram response generation loss, to improve the planning ability of the decoder; 3) a Kullback-Leibler divergence loss, to minimize the difference between the posterior and prior distributions of the latent variable; and 4) a bag-of-words loss, to reduce posterior collapse. In addition, the effects of absolute and relative position encoding on model performance are discussed.


Model structure

DialogVED consists of an encoder, a decoder, and latent variables. The encoder determines the distribution of the latent space, latent variables are sampled from that space, and the encoder and the latent variable jointly guide the decoder. The overall framework is shown in the figure below; the four tasks in the yellow boxes are the objectives to be optimized:


▲ DialogVED pre-training and fine-tuning framework



Encoder

A multi-layer Transformer encoder encodes the dialogue context. To improve the encoder's understanding ability and robustness to noise, span masking is used to randomly mask part of the context. A simple procedure selects the spans to mask: 1) randomly select n tokens in the context, denoted S; 2) expand each selected token into a text span of fixed length m; 3) after de-duplication and boundary checking, mask all selected tokens in order. Similar to BERT, the masked tokens account for about 15% of the context, and the encoded hidden states of the masked tokens are used to predict the tokens themselves. Span masking is applied only in the pre-training stage. The loss function is:

▲ Cross-entropy loss for predicting the masked tokens (LSM is the log-softmax function)
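The span-selection steps above can be sketched as follows (a minimal illustration with hypothetical helper names and parameter values, not the paper's actual code):

```python
import random

def span_mask(tokens, mask_ratio=0.15, span_len=3, mask_token="[MASK]"):
    """Sketch of the span-masking procedure: pick random anchor tokens,
    expand each into a fixed-length span, de-duplicate and boundary-check,
    then mask roughly `mask_ratio` of the context."""
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    # 1) randomly select anchor tokens S
    n_anchors = max(1, n_to_mask // span_len)
    anchors = random.sample(range(len(tokens)), n_anchors)
    # 2) expand each anchor into a span of length span_len
    positions = set()
    for a in anchors:
        for i in range(a, min(a + span_len, len(tokens))):  # boundary check
            positions.add(i)                                # set de-duplicates
    # 3) mask all selected tokens
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)

random.seed(0)
ctx = ["how", "are", "you", "doing", "today", "my", "friend", "?"] * 3
masked, pos = span_mask(ctx)
```

The encoder would then be trained to recover the original tokens at the masked positions from their hidden states.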


Latent variables

Intuitively, introducing latent variables provides a hierarchical generation procedure: a high-level semantic variable is first determined, and then decoded into sentence-level syntactic and lexical details. As in a variational autoencoder, two loss terms are minimized: the reconstruction loss (i.e., the negative log-likelihood) and a KL regularization term:

▲ Reconstruction loss


▲ KL Regularization term

During training, the KL loss tends to drop rapidly to 0 (KL vanishing, also known as posterior collapse), and the latent space loses its expressive power. This paper therefore introduces two remedies for KL vanishing: one is Free Bits, which turns the KL loss into a hinge-style objective with a constant threshold λ; the other is bag-of-words prediction, a loss that makes the latent variable predict the response words without the autoregressive paradigm.

In Free Bits, to allow more information to be encoded into the latent variable, each dimension of the KL term is given "a little reserved space": if the KL of a dimension is below a threshold, it is not optimized; only once it grows beyond the threshold does it contribute to the loss.

In bag-of-words prediction, the latent variable predicts the words in the response in a non-autoregressive way, encouraging the latent variable to encode as much of the response's lexical information as possible. This can be seen as increasing the weight of the reconstruction loss, making the model pay more attention to optimizing the reconstruction term.

▲ The first is the Free Bits loss, the second is the bag-of-words loss
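Assuming a diagonal Gaussian posterior and a standard normal prior (the usual VAE setup; the helper names below are illustrative assumptions, not the paper's code), the two remedies can be sketched as:

```python
import numpy as np

def free_bits_kl(mu, logvar, lam=0.5):
    """Free Bits variant of the KL term (sketch): per-dimension KL between
    the posterior N(mu, sigma^2) and the prior N(0, I). Dimensions whose KL
    is already below the threshold `lam` are clamped, so they are not
    pushed further toward zero."""
    kl_per_dim = 0.5 * (mu**2 + np.exp(logvar) - logvar - 1.0)
    return np.maximum(kl_per_dim, lam).sum()

def bow_loss(z_logits, response_ids):
    """Bag-of-words loss (sketch): a single softmax over the vocabulary,
    computed from the latent variable, predicts every response token
    non-autoregressively."""
    log_probs = z_logits - np.log(np.exp(z_logits).sum())  # log-softmax
    return -log_probs[response_ids].sum()
```

For example, with a 4-dimensional posterior exactly matching the prior (mu = 0, logvar = 0), the raw KL is 0, but Free Bits with λ = 0.5 keeps the loss at 2.0, leaving the latent dimensions "room" to carry information.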

The authors add a special classification token [CLS] at the beginning of the context; its hidden state represents the global conversation context. The posterior distribution of the latent vector is assumed to be Gaussian, and MLP layers map the hidden state corresponding to [CLS] to the mean and log-variance of the latent space. Sampling from this Gaussian yields the latent variable, which is then fed into the decoder.
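This [CLS]-to-latent step is the standard reparameterization trick; a minimal sketch, assuming single linear projections in place of the paper's MLPs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(h_cls, W_mu, W_logvar):
    """Sketch of latent sampling: the [CLS] hidden state is mapped to the
    mean and log-variance of a diagonal Gaussian, and z is drawn with the
    reparameterization trick so gradients can flow through the sampling."""
    mu = h_cls @ W_mu
    logvar = h_cls @ W_logvar
    eps = rng.standard_normal(mu.shape)   # noise ~ N(0, I)
    z = mu + np.exp(0.5 * logvar) * eps   # z = mu + sigma * eps
    return z, mu, logvar

d_h, d_z = 8, 4                           # illustrative sizes
h_cls = rng.standard_normal(d_h)
z, mu, logvar = sample_latent(h_cls,
                              rng.standard_normal((d_h, d_z)),
                              rng.standard_normal((d_h, d_z)))
```

The same mu and logvar are what the KL and Free Bits terms above would regularize toward the prior.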

To summarize, the loss function consists of four terms: 1) the masked language loss on the encoder side; 2) the n-gram response generation loss based on future prediction; 3) the KL divergence loss between the prior and posterior distributions of the latent variable; and 4) the bag-of-words prediction loss.


Decoder

The decoder uses a future-prediction strategy: unlike standard decoding, where each time step predicts only the next token, it predicts the next n future tokens simultaneously. Specifically, the original Seq2Seq model optimizes the conditional likelihood of the next token given the prefix, while the future-prediction strategy optimizes the likelihood of the next n consecutive future tokens given the prefix. The future n-gram prediction loss explicitly encourages the model to plan ahead for future token prediction and prevents overfitting to strong local correlations.
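The target construction for future n-gram prediction can be illustrated as follows (a sketch; the function name and the choice n=2 are illustrative assumptions, not the paper's code):

```python
def future_ngram_targets(tokens, n=2):
    """Sketch of future n-gram targets: at each decoding step t, instead of
    only the next token y_{t+1}, the model is trained to predict the next n
    tokens y_{t+1}, ..., y_{t+n} simultaneously (fewer near the end)."""
    targets = []
    for t in range(len(tokens) - 1):
        targets.append(tokens[t + 1 : t + 1 + n])
    return targets

tgt = future_ngram_targets(["<s>", "a", "b", "c"], n=2)
```

With n=2, the step after "<s>" must predict both "a" and "b", the step after "a" must predict "b" and "c", and so on, which is what forces the decoder to plan beyond the immediately next token.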

Meanwhile, the decoder adopts the n-stream self-attention mechanism proposed in ProphetNet: in addition to the main stream, there are n extra predicting streams of self-attention, which are used to predict the n consecutive future tokens. For details of the main stream and predicting streams, readers are advised to read ProphetNet [2].

Finally, to connect the latent variable to the decoder, the authors adopt the Memory Scheme proposed in OPTIMUS [3]: the latent variable is mapped to an additional memory vector, which acts as an extra key-value pair. The memory vector is equivalent to adding a virtual token during decoding that participates in the main-stream self-attention computation, while the predicting streams are implicitly influenced by the memory vector through their interaction with the main stream. In this way, the latent variable guides every generation step of the decoder through the memory vector.
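The memory scheme can be sketched as attention in which the latent variable contributes one extra key-value pair (a single-head, single-query simplification; all names and shapes are assumptions for illustration):

```python
import numpy as np

def attention_with_memory(q, K, V, z, W_k, W_v):
    """Sketch of the OPTIMUS-style memory scheme: the latent z is projected
    into one extra key/value pair (the "memory vector") and prepended to the
    decoder self-attention, so z can influence every decoding step."""
    mem_k = (z @ W_k)[None, :]                   # 1 extra key from z
    mem_v = (z @ W_v)[None, :]                   # 1 extra value from z
    K_aug = np.concatenate([mem_k, K], axis=0)   # memory acts like a virtual token
    V_aug = np.concatenate([mem_v, V], axis=0)
    scores = K_aug @ q / np.sqrt(q.shape[0])     # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over keys + memory
    return weights @ V_aug                       # output attends to memory too

rng = np.random.default_rng(1)
d = 4
out = attention_with_memory(rng.standard_normal(d),          # query
                            rng.standard_normal((3, d)),     # 3 token keys
                            rng.standard_normal((3, d)),     # 3 token values
                            rng.standard_normal(d),          # latent z
                            rng.standard_normal((d, d)),     # hypothetical W_k
                            rng.standard_normal((d, d)))     # hypothetical W_v
```

Because the memory pair sits in the same softmax as the ordinary keys, the latent variable competes for attention at every step rather than only conditioning the first token.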


Summary

This article can be summarized as follows: 1) a pre-trained dialogue model is proposed that incorporates continuous latent variables into an enhanced encoder-decoder pre-training framework; 2) the effects of latent variable size, different decoding strategies, and turn and role position encodings on model performance are explored; 3) experiments show that the model achieves good performance on multiple downstream tasks, with better relevance and diversity in response generation.


References


[1] DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation

[2] ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training

[3] Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space
