ACL 2022 | DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation
2022-04-23 16:41:47 【PaperWeekly】
Author | Hanscal
Research interests | Knowledge graphs, dialogue systems
Paper title:
DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation
Paper link:
This is a dialogue pre-training work at ACL 2022 from the Data Intelligence and Social Computing Laboratory of Fudan University (Fudan DISC). The paper addresses an important research question in open-domain dialogue: how to generate responses that are both relevant and diverse. It proposes a new dialogue pre-training framework, DialogVED, which introduces continuous latent variables into an enhanced encoder-decoder framework to improve the relevance and diversity of generated responses.
It is well known that dialogue generation suffers from a one-to-many problem: a single dialogue context can be followed by multiple reasonable responses. Existing work introduces latent variables to model this phenomenon. Recently, PLATO introduced discrete latent variables into a pre-trained dialogue model and showed significant improvements on multiple downstream response generation tasks. Besides discrete latent variables, continuous latent variables are also commonly used to model one-to-many mappings in dialogue systems, but the potential of combining continuous latent variables with large-scale language pre-training has rarely been explored.
In this paper, the authors propose DialogVED, a model that introduces continuous latent variables into the encoder-decoder framework and is pre-trained with the following four objectives: 1) a masked language model loss, to strengthen the encoder's understanding of the context; 2) a response generation loss with future n-gram prediction, to improve the decoder's planning ability; 3) a Kullback-Leibler divergence loss, to minimize the gap between the posterior and prior distributions of the latent variable; and 4) a bag-of-words loss, to mitigate posterior collapse. In addition, the effects of absolute and relative position encodings on model performance are discussed.
DialogVED consists of an encoder, a decoder and a latent variable. The encoder determines the distribution of the latent space, the latent variable is sampled from that space, and the encoder and the latent variable jointly guide the decoder. The overall framework is shown in the figure below; the four tasks in the yellow boxes are the objectives to be optimized:
▲ The DialogVED pre-training and fine-tuning framework
A multi-layer Transformer encoder encodes the dialogue context. To improve the encoder's comprehension ability and robustness to noise, part of the context is randomly masked with span masking. A simple procedure is used to cover each span: 1) randomly select n tokens in the context, denoted S; 2) expand each selected token into a text span of fixed length m; 3) mask the selected tokens after ordering, de-duplication and boundary checking. Similar to BERT, the masked tokens account for roughly 15% of the context, and the hidden representations of the masked tokens are used to predict the tokens themselves. Span masking of the context is applied only during pre-training (a sketch of the procedure follows the loss below). The loss function is:
▲ Cross-entropy loss for masked-token prediction (LSM denotes the log-softmax function)
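The following is a minimal sketch of the span-masking procedure described above; the function name, the span length `m`, and the 15% ratio heuristic are illustrative assumptions, not the paper's exact implementation.

```python
import random

def span_mask(token_ids, mask_id, m=3, mask_ratio=0.15):
    """Return (masked_ids, target_positions) for masked-token prediction."""
    seq_len = len(token_ids)
    n_starts = max(1, int(seq_len * mask_ratio / m))        # number of span anchors
    starts = random.sample(range(seq_len), k=min(n_starts, seq_len))

    positions = set()
    for s in starts:
        # expand each anchor to a fixed-length span, with a boundary check
        positions.update(range(s, min(s + m, seq_len)))

    masked = list(token_ids)
    for p in sorted(positions):                             # ordering + de-duplication via the set
        masked[p] = mask_id
    return masked, sorted(positions)

# Usage: the encoder's hidden states at `positions` are then used to predict
# the original tokens with a cross-entropy (log-softmax) loss.
```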
Intuitively, introducing a latent variable provides a hierarchical way of generating: a high-level semantic variable is first determined, and it is then decoded into sentence-level syntactic and lexical details. As in a variational autoencoder, two loss terms are minimized: the reconstruction loss (i.e. negative log-likelihood) and a KL regularization term:
▲ Reconstruction loss
▲ KL regularization term
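For reference, these two terms take the standard conditional-VAE form; the notation below (context $c$, response $r$, latent $z$, posterior $q_\phi$, prior $p_\theta$) is assumed here rather than copied from the paper:

```latex
\mathcal{L}_{\mathrm{rec}} = -\,\mathbb{E}_{q_\phi(z \mid c,\, r)}\big[\log p_\theta(r \mid z,\, c)\big]
\qquad
\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}\big(q_\phi(z \mid c,\, r)\;\|\;p_\theta(z \mid c)\big)
```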
Because the KL loss quickly drops to 0 during training (KL vanishing, or posterior collapse), the latent space loses its expressiveness. The paper therefore introduces two remedies for KL vanishing. One is Free Bits, which adds a hinge-style threshold λ to the KL term of the loss. The other is bag-of-words prediction, an auxiliary loss in which the latent variable predicts the response words outside the autoregressive paradigm.
In Free Bits, to allow more information to be encoded into the latent variable, each dimension of the KL term is given "a little reserved space". Concretely, if the KL of a dimension is already very small, it is not optimized further; it is only optimized again once it grows beyond a threshold.
In bag-of-words prediction, the latent variable predicts the words in the response in a non-autoregressive way, encouraging the latent variable to carry as much of the response's lexical information as possible. This can be viewed as increasing the weight of the reconstruction loss, making the model pay more attention to optimizing the reconstruction term.
▲ The first equation is the Free Bits loss, the second is the bag-of-words loss
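Below is a minimal sketch of the two remedies, assuming a standard normal prior for simplicity and a `bow_head` such as `nn.Linear(latent_dim, vocab_size)`; variable names, the per-dimension clamping form, and the padding handling are illustrative assumptions rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def free_bits_kl(mu, logvar, lam=0.5):
    """Per-dimension KL(q || N(0, I)), clamped below lambda so that already-small
    dimensions are not pushed further toward zero (the 'reserved space')."""
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)   # [batch, latent_dim]
    return torch.clamp(kl_per_dim, min=lam).sum(dim=-1).mean()

def bag_of_words_loss(z, response_ids, bow_head, pad_id=0):
    """Predict every response token from z independently (non-autoregressively)."""
    logits = bow_head(z)                                   # [batch, vocab_size]
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(1, response_ids)         # [batch, resp_len]
    mask = (response_ids != pad_id).float()
    return -(token_logp * mask).sum(dim=1).mean()
```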
The authors prepend a special classification token [CLS] to the context, whose hidden representation stands for the global dialogue context. The posterior distribution of the latent vector is assumed to be Gaussian, and an MLP layer maps the [CLS] hidden state to the mean and log-variance of the latent space. Sampling from this Gaussian yields the latent variable, which is then fed into the decoder.
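A small sketch of this mapping, using the usual reparameterization trick; the class name, layer sizes, and the single-linear-layer "MLP" are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentBridge(nn.Module):
    """Map the [CLS] hidden state to a Gaussian latent and sample from it."""
    def __init__(self, hidden_size=768, latent_size=64):
        super().__init__()
        self.to_stats = nn.Linear(hidden_size, 2 * latent_size)   # mean and log-variance

    def forward(self, cls_hidden):                  # [batch, hidden_size]
        mu, logvar = self.to_stats(cls_hidden).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps      # sampled latent fed to the decoder
        return z, mu, logvar
```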
As can be seen, the loss function consists of four terms: 1) the masked language loss on the encoder side; 2) the response generation loss based on future n-gram prediction; 3) the KL divergence loss between the prior and posterior distributions of the latent variable; and 4) the bag-of-words prediction loss.
A future-prediction strategy is used in the decoder. Unlike predicting only the next token at each time step, it predicts the next n future tokens simultaneously. Concretely, the original Seq2Seq model optimizes the conditional likelihood $p(y_t \mid y_{<t}, x)$, whereas the future-prediction strategy optimizes $p(y_{t:t+n-1} \mid y_{<t}, x)$, where $y_{t:t+n-1}$ denotes the next n consecutive future tokens. The future n-gram prediction loss explicitly encourages the model to plan ahead for future tokens and prevents overfitting to strong local correlations.
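A minimal sketch of such a future n-gram objective; here `stream_logits[k]` is assumed to hold the logits of the (k+1)-th predicting stream with shape [batch, seq_len, vocab], and the uniform averaging over streams is an illustrative choice.

```python
import torch.nn.functional as F

def future_ngram_loss(stream_logits, targets, n=2, pad_id=0):
    """At each position, train the k-th stream to predict the token k steps ahead."""
    total = 0.0
    for k in range(n):
        shifted = targets[:, k:]                            # gold token k steps ahead
        logits = stream_logits[k][:, : shifted.size(1)]
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            shifted.reshape(-1),
            ignore_index=pad_id,
        )
    return total / n
```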
Meanwhile, the decoder uses the n-stream self-attention mechanism proposed in ProphetNet. In n-stream self-attention, besides the main stream there are n extra predicting streams, which are used to predict the n consecutive future tokens. For details of the main stream and the predicting streams, readers are referred to ProphetNet.
Finally, to connect the latent variable with the decoder, the paper adopts a memory scheme similar to the one proposed in OPTIMUS: the latent variable is mapped to an additional memory vector, i.e. an extra key-value pair. The memory vector is equivalent to adding a virtual token that participates in the main stream's self-attention during decoding, while the predicting streams are implicitly influenced by the memory vector through their interaction with the main stream. In this way, the latent variable can guide every step of the decoder's generation through the memory vector.
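The sketch below illustrates one way such a memory scheme can be wired up: the latent z is projected into one extra key/value pair per decoder layer and prepended to the self-attention keys and values. Class name, shapes, and the per-layer projection are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LatentMemory(nn.Module):
    """Project the latent z into one virtual key/value pair per decoder layer."""
    def __init__(self, latent_size=64, hidden_size=768, num_layers=12):
        super().__init__()
        self.proj = nn.Linear(latent_size, num_layers * 2 * hidden_size)
        self.num_layers, self.hidden_size = num_layers, hidden_size

    def forward(self, z):                               # [batch, latent_size]
        mem = self.proj(z).view(-1, self.num_layers, 2, 1, self.hidden_size)
        # per layer: (key, value) of a single "virtual token"
        return [(mem[:, l, 0], mem[:, l, 1]) for l in range(self.num_layers)]

# In each decoder layer, these are concatenated in front of the ordinary
# self-attention keys/values:  K = cat([mem_k, K], dim=1), V = cat([mem_v, V], dim=1)
```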
To summarize: 1) a pre-trained dialogue model is proposed that incorporates continuous latent variables into an enhanced encoder-decoder pre-training framework; 2) the effects of latent variable size, different decoding strategies, and turn/role position encodings on model performance are explored; 3) experiments show that the model achieves strong performance on multiple downstream tasks, with better relevance and diversity in response generation.
[1] DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation
[2] ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training
[3] OPTIMUS: Organizing Sentences via Pre-trained Modeling of a Latent Space