ACL 2022 | DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation
2022-04-23 16:41:47 【PaperWeekly】
Author | Hanscal
Research interests | Knowledge graphs, dialogue systems
Paper title:
DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation
Paper link:
This is a dialogue pre-training work at ACL 2022 from the Data Intelligence and Social Computing Laboratory of Fudan University (Fudan DISC). The paper addresses an important research question in open-domain dialogue: how to generate responses that are both relevant and diverse. It proposes a new dialogue pre-training framework, DialogVED, which introduces continuous latent variables into an enhanced encoder-decoder framework to improve the relevance and diversity of generated responses.
It is well known that dialogue generation suffers from a one-to-many problem: a single dialogue context can be followed by multiple reasonable responses. Existing work introduces latent variables to model this phenomenon. Recently, PLATO introduced discrete latent variables into a pre-trained dialogue model and showed significant improvements on multiple downstream response generation tasks. Besides discrete latent variables, continuous latent variables are also commonly used to model one-to-many mappings in dialogue systems, but the potential of combining continuous latent variables with large-scale language pre-training has rarely been explored.
In this paper, the authors propose DialogVED, a model that introduces continuous latent variables into the encoder-decoder framework and is pre-trained with the following four objectives: 1) a masked language model loss, to strengthen the encoder's understanding of the context; 2) a response generation loss with future n-gram prediction, to improve the decoder's planning ability; 3) a Kullback-Leibler divergence loss, to minimize the gap between the posterior and prior distributions of the latent variable; and 4) a bag-of-words loss, to mitigate posterior collapse. In addition, the effects of absolute and relative position encodings on model performance are discussed.
DialogVED consists of an encoder, a decoder and a latent variable. The encoder determines the distribution of the latent space, the latent variable is sampled from that space, and the encoder and the latent variable jointly guide the decoder. The overall framework is shown in the figure below; the four tasks in the yellow boxes are the objectives to be optimized:
▲ The DialogVED pre-training and fine-tuning framework
A multi-layer Transformer encoder encodes the dialogue context. To improve the encoder's comprehension ability and robustness to noise, part of the context is randomly masked with span masking. A simple procedure is used to cover each span: 1) randomly select n tokens in the context, denoted S; 2) expand each selected token into a text span of fixed length m; 3) mask the selected tokens after ordering, de-duplication and boundary checking. Similar to BERT, the masked tokens account for roughly 15% of the context, and the hidden representations of the masked tokens are used to predict the tokens themselves. Span masking of the context is applied only during pre-training (a sketch of the procedure follows the loss below). The loss function is:
▲ Cross-entropy loss for masked-token prediction (LSM denotes the log-softmax function)
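The following is a minimal sketch of the span-masking procedure described above; the function name, the span length `m`, and the 15% ratio heuristic are illustrative assumptions, not the paper's exact implementation.

```python
import random

def span_mask(token_ids, mask_id, m=3, mask_ratio=0.15):
    """Return (masked_ids, target_positions) for masked-token prediction."""
    seq_len = len(token_ids)
    n_starts = max(1, int(seq_len * mask_ratio / m))        # number of span anchors
    starts = random.sample(range(seq_len), k=min(n_starts, seq_len))

    positions = set()
    for s in starts:
        # expand each anchor to a fixed-length span, with a boundary check
        positions.update(range(s, min(s + m, seq_len)))

    masked = list(token_ids)
    for p in sorted(positions):                             # ordering + de-duplication via the set
        masked[p] = mask_id
    return masked, sorted(positions)

# Usage: the encoder's hidden states at `positions` are then used to predict
# the original tokens with a cross-entropy (log-softmax) loss.
```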
Intuitively, introducing a latent variable provides a hierarchical way of generating: a high-level semantic variable is first determined, and it is then decoded into sentence-level syntactic and lexical details. As in a variational autoencoder, two loss terms are minimized: the reconstruction loss (i.e. negative log-likelihood) and a KL regularization term:
▲ Reconstruction loss
▲ KL regularization term
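For reference, these two terms take the standard conditional-VAE form; the notation below (context $c$, response $r$, latent $z$, posterior $q_\phi$, prior $p_\theta$) is assumed here rather than copied from the paper:

```latex
\mathcal{L}_{\mathrm{rec}} = -\,\mathbb{E}_{q_\phi(z \mid c,\, r)}\big[\log p_\theta(r \mid z,\, c)\big]
\qquad
\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}\big(q_\phi(z \mid c,\, r)\;\|\;p_\theta(z \mid c)\big)
```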
Because the KL loss quickly drops to 0 during training (KL vanishing, or posterior collapse), the latent space loses its expressiveness. The paper therefore introduces two remedies for KL vanishing. One is Free Bits, which adds a hinge-style threshold λ to the KL term of the loss. The other is bag-of-words prediction, an auxiliary loss in which the latent variable predicts the response words outside the autoregressive paradigm.
In Free Bits, to allow more information to be encoded into the latent variable, each dimension of the KL term is given "a little reserved space". Concretely, if the KL of a dimension is already very small, it is not optimized further; it is only optimized again once it grows beyond a threshold.
In bag-of-words prediction, the latent variable predicts the words in the response in a non-autoregressive way, encouraging the latent variable to carry as much of the response's lexical information as possible. This can be viewed as increasing the weight of the reconstruction loss, making the model pay more attention to optimizing the reconstruction term.
▲ The first equation is the Free Bits loss, the second is the bag-of-words loss
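Below is a minimal sketch of the two remedies, assuming a standard normal prior for simplicity and a `bow_head` such as `nn.Linear(latent_dim, vocab_size)`; variable names, the per-dimension clamping form, and the padding handling are illustrative assumptions rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def free_bits_kl(mu, logvar, lam=0.5):
    """Per-dimension KL(q || N(0, I)), clamped below lambda so that already-small
    dimensions are not pushed further toward zero (the 'reserved space')."""
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)   # [batch, latent_dim]
    return torch.clamp(kl_per_dim, min=lam).sum(dim=-1).mean()

def bag_of_words_loss(z, response_ids, bow_head, pad_id=0):
    """Predict every response token from z independently (non-autoregressively)."""
    logits = bow_head(z)                                   # [batch, vocab_size]
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(1, response_ids)         # [batch, resp_len]
    mask = (response_ids != pad_id).float()
    return -(token_logp * mask).sum(dim=1).mean()
```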
The authors prepend a special classification token [CLS] to the context, whose hidden representation stands for the global dialogue context. The posterior distribution of the latent vector is assumed to be Gaussian, and an MLP layer maps the [CLS] hidden state to the mean and log-variance of the latent space. Sampling from this Gaussian yields the latent variable, which is then fed into the decoder.
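A small sketch of this mapping, using the usual reparameterization trick; the class name, layer sizes, and the single-linear-layer "MLP" are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentBridge(nn.Module):
    """Map the [CLS] hidden state to a Gaussian latent and sample from it."""
    def __init__(self, hidden_size=768, latent_size=64):
        super().__init__()
        self.to_stats = nn.Linear(hidden_size, 2 * latent_size)   # mean and log-variance

    def forward(self, cls_hidden):                  # [batch, hidden_size]
        mu, logvar = self.to_stats(cls_hidden).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps      # sampled latent fed to the decoder
        return z, mu, logvar
```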
As can be seen, the loss function consists of four terms: 1) the masked language loss on the encoder side; 2) the response generation loss based on future n-gram prediction; 3) the KL divergence loss between the prior and posterior distributions of the latent variable; and 4) the bag-of-words prediction loss.
A future-prediction strategy is used in the decoder. Unlike predicting only the next token at each time step, it predicts the next n future tokens simultaneously. Concretely, the original Seq2Seq model optimizes the conditional likelihood $p(y_t \mid y_{<t}, x)$, whereas the future-prediction strategy optimizes $p(y_{t:t+n-1} \mid y_{<t}, x)$, where $y_{t:t+n-1}$ denotes the next n consecutive future tokens. The future n-gram prediction loss explicitly encourages the model to plan ahead for future tokens and prevents overfitting to strong local correlations.
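A minimal sketch of such a future n-gram objective; here `stream_logits[k]` is assumed to hold the logits of the (k+1)-th predicting stream with shape [batch, seq_len, vocab], and the uniform averaging over streams is an illustrative choice.

```python
import torch.nn.functional as F

def future_ngram_loss(stream_logits, targets, n=2, pad_id=0):
    """At each position, train the k-th stream to predict the token k steps ahead."""
    total = 0.0
    for k in range(n):
        shifted = targets[:, k:]                            # gold token k steps ahead
        logits = stream_logits[k][:, : shifted.size(1)]
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            shifted.reshape(-1),
            ignore_index=pad_id,
        )
    return total / n
```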
Meanwhile, the decoder uses the n-stream self-attention mechanism proposed in ProphetNet. In n-stream self-attention, besides the main stream there are n extra predicting streams, which are used to predict the n consecutive future tokens. For details of the main stream and the predicting streams, readers are referred to ProphetNet.
Finally, to connect the latent variable with the decoder, the paper adopts a memory scheme similar to the one proposed in OPTIMUS: the latent variable is mapped to an additional memory vector, i.e. an extra key-value pair. The memory vector is equivalent to adding a virtual token that participates in the main stream's self-attention during decoding, while the predicting streams are implicitly influenced by the memory vector through their interaction with the main stream. In this way, the latent variable can guide every step of the decoder's generation through the memory vector.
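The sketch below illustrates one way such a memory scheme can be wired up: the latent z is projected into one extra key/value pair per decoder layer and prepended to the self-attention keys and values. Class name, shapes, and the per-layer projection are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LatentMemory(nn.Module):
    """Project the latent z into one virtual key/value pair per decoder layer."""
    def __init__(self, latent_size=64, hidden_size=768, num_layers=12):
        super().__init__()
        self.proj = nn.Linear(latent_size, num_layers * 2 * hidden_size)
        self.num_layers, self.hidden_size = num_layers, hidden_size

    def forward(self, z):                               # [batch, latent_size]
        mem = self.proj(z).view(-1, self.num_layers, 2, 1, self.hidden_size)
        # per layer: (key, value) of a single "virtual token"
        return [(mem[:, l, 0], mem[:, l, 1]) for l in range(self.num_layers)]

# In each decoder layer, these are concatenated in front of the ordinary
# self-attention keys/values:  K = cat([mem_k, K], dim=1), V = cat([mem_v, V], dim=1)
```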
To summarize: 1) a pre-trained dialogue model is proposed that incorporates continuous latent variables into an enhanced encoder-decoder pre-training framework; 2) the effects of latent variable size, different decoding strategies, and turn/role position encodings on model performance are explored; 3) experiments show that the model achieves strong performance on multiple downstream tasks, with better relevance and diversity in response generation.
[1] DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation
[2] ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training
[3] OPTIMUS: Organizing Sentences via Pre-trained Modeling of a Latent Space