
[Natural Language Processing] [Vector Representation] PairSupCon: Pairwise Supervised Contrastive Learning for Sentence Representations

2022-08-10 19:26:00 【BQW_】

PairSupCon: Pairwise Supervised Contrastive Learning for Sentence Representations

Paper: https://arxiv.org/pdf/2109.05424.pdf

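Related posts:
[Natural Language Processing] [Contrastive Learning] SimCSE: Sentence Embeddings Based on Contrastive Learning
[Natural Language Processing] BERT-Whitening
[Natural Language Processing] [PyTorch] Implementing SimCSE from Scratch
[Natural Language Processing] [Vector Retrieval] Multi-View Document Representation Learning for Open-Domain Dense Retrieval
[Natural Language Processing] [Vector Representation] AugSBERT: A Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks
[Natural Language Processing] [Vector Representation] PairSupCon: Pairwise Supervised Contrastive Learning for Sentence Representations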

I. Introduction

Learning high-quality sentence embeddings is a fundamental task in $\text{NLP}$. The goal is to map similar sentences to nearby points in the representation space and dissimilar sentences to distant points. Recent work achieved success by training on $\text{NLI}$ datasets, where the task is to classify each sentence pair into one of three categories: entailment, contradiction, or neutral.

Although the results are encouraging, previous work has a drawback: the sentences that form a contradiction pair do not necessarily belong to different semantic categories. Optimizing the model only to distinguish entailment from contradiction is therefore not sufficient for it to encode high-level categorical concepts. In addition, the standard siamese (triplet) loss learns only from individual sentence pairs (triplets), so it needs a large number of training samples to reach competitive performance. The siamese loss can also sometimes drive the model into poor local optima, which degrades its ability to encode high-level semantic concepts.

In this paper, inspired by self-supervised contrastive learning, the authors propose jointly optimizing a pairwise semantic reasoning objective with an instance discrimination objective. They call this method $\text{Pairwise Supervised Contrastive Learning (PairSupCon)}$. As noted in recent work, instance discrimination learning can group similar instances close together in the representation space without any explicit guidance. $\text{PairSupCon}$ exploits this implicit grouping effect to bring together representations from the same semantic category while also strengthening the model's entailment and contradiction reasoning ability.

Previous work has mainly focused on pairwise semantic similarity evaluation. In this paper, the authors argue that how well high-level semantic concepts are encoded in the representations is also an important aspect of evaluation. Models that previously performed best on the semantic textual similarity ($\text{STS}$) tasks turn out to produce embeddings with a degraded categorical semantic structure. On the other hand, capturing high-level semantic concepts better can in turn improve lower-level entailment and contradiction reasoning. This assumption is consistent with the way humans categorize, from the top down. $\text{PairSupCon}$ achieves an average improvement of 10%-13% on 8 short text clustering tasks and an improvement of 5%-6% on $\text{STS}$ tasks.

II. Method

Following $\text{SBERT}$, $\text{SNLI}$ and $\text{MNLI}$ are used as training data, and for convenience the merged data is referred to as $\text{NLI}$. The $\text{NLI}$ data consists of labeled sentence pairs, each of the form $\text{(premise, hypothesis, label)}$. Each $\text{premise}$ sentence is selected from an existing text source, and each $\text{premise}$ is paired with several manually written $\text{hypothesis}$ sentences. Each $\text{label}$ indicates the type of the $\text{hypothesis}$ and classifies the semantic relation of the $\text{premise}$-$\text{hypothesis}$ pair into one of three types: $\text{entailment}$, $\text{contradiction}$, and $\text{neutral}$.
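The triple format can be pictured with a small data structure. The following is only a sketch: the class name `NLIExample`, the helper `to_training_pairs`, and the toy sentences are made up for illustration and are not taken from SNLI, MNLI, or the paper's code. It also anticipates two conventions this post uses later: neutral pairs are dropped, and the remaining labels are encoded as $y=\pm 1$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NLIExample:
    """One labeled sentence pair in (premise, hypothesis, label) form."""
    premise: str
    hypothesis: str
    label: str  # "entailment", "contradiction", or "neutral"


# Label encoding used later in this post: +1 for entailment, -1 for contradiction.
LABEL_TO_Y = {"entailment": 1, "contradiction": -1}


def to_training_pairs(examples):
    """Drop neutral pairs and map the remaining labels to y = +1 / -1."""
    return [
        (ex.premise, ex.hypothesis, LABEL_TO_Y[ex.label])
        for ex in examples
        if ex.label in LABEL_TO_Y
    ]


# Toy sentences written for illustration only (not from SNLI or MNLI).
toy_examples = [
    NLIExample("A man is playing a guitar.", "A person is making music.", "entailment"),
    NLIExample("A man is playing a guitar.", "The man is asleep.", "contradiction"),
    NLIExample("A man is playing a guitar.", "The man is on a stage.", "neutral"),
]

training_pairs = to_training_pairs(toy_examples)
```

One premise appears with several hypotheses here, mirroring how each premise in the merged data is paired with multiple annotated hypotheses.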

Previous work independently optimizes either a siamese loss or a triplet loss on $\text{NLI}$. The authors instead aim to exploit the implicit grouping effect of instance discrimination learning to better capture the high-level categorical semantic structure of the data, while also promoting better convergence of the low-level entailment and contradiction reasoning objective.

1. Instance Discrimination

The authors use the positive ($\text{entailment}$) pairs of $\text{NLI}$ to optimize an instance-level discrimination objective, which tries to separate each positive pair from all other sentences. Let $\mathcal{D}=\{(x_j,x_{j'}),y_j\}_{j=1}^M$ denote a randomly sampled minibatch, where $y_j=\pm 1$ indicates entailment or contradiction. For the premise sentence $x_i$ of a positive pair $(x_i,x_{i'})$, the goal is to separate the hypothesis sentence $x_{i'}$ from the other $2M-2$ sentences in the same batch $\mathcal{D}$. Specifically, let $\mathcal{I}=\{i,i'\}_{i=1}^M$ denote the indices of the samples in $\mathcal{D}$, and minimize the following loss:
$\mathcal{l}_{\text{ID}}^i=-\log\frac{\exp(s(z_i,z_{i'})/\tau)}{\sum_{j\in\mathcal{I}\setminus i}\exp(s(z_i,z_{j})/\tau)} \tag{1}$
In the equation above, $z_j=h(\psi(x_j))$ denotes the output of the instance discrimination head, $\tau$ is the temperature parameter, and $s(\cdot)$ is the cosine similarity, i.e., $s(z_i,z_{i'})=z_i^\top z_{i'}/(\parallel z_i\parallel\parallel z_{i'}\parallel)$. Equation (1) can be interpreted as a softmax-based classification loss for classifying $z_i$ as $z_{i'}$.

Similarly, for the hypothesis sentence $x_{i'}$, the goal is to discriminate the premise sentence $x_i$ from all other sentences in $\mathcal{D}$. The corresponding loss $\mathcal{l}_{\text{ID}}^{i'}$ is therefore defined by swapping the roles of $x_{i'}$ and $x_i$ in Equation (1). The final loss is the average over all positive pairs in $\mathcal{D}$:
$\mathcal{L}_{\text{ID}}=\frac{1}{P_M}\sum_{i=1}^M\mathbb{1}_{y_i=1}\cdot(\mathcal{l}_{\text{ID}}^i+\mathcal{l}_{\text{ID}}^{i'}) \tag{2}$
Here, $\mathbb{1}_{(\cdot)}$ is the indicator function and $P_M$ is the number of positive pairs in $\mathcal{D}$. Optimizing this loss not only helps implicitly encode categorical semantic information into the representations, but also better promotes pairwise semantic reasoning ability.
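Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the default temperature `tau=0.05` is an assumed value, and the inputs are taken to be precomputed head outputs $z_i$ and $z_{i'}$ stacked into two `(M, d)` arrays.

```python
import numpy as np


def cosine_sim_matrix(z):
    """Pairwise cosine similarity between the rows of z."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return z @ z.T


def instance_discrimination_loss(z, z_prime, y, tau=0.05):
    """Eq. (1)-(2): instance discrimination loss averaged over positive pairs.

    z, z_prime: (M, d) head outputs for premises and hypotheses.
    y: (M,) array with +1 (entailment) or -1 (contradiction).
    tau: temperature; 0.05 is an assumed default, not a value from the post.
    """
    M = z.shape[0]
    # Rows 0..M-1 hold z_i, rows M..2M-1 hold z_{i'}.
    all_z = np.concatenate([z, z_prime], axis=0)
    logits = cosine_sim_matrix(all_z) / tau
    # The denominator of Eq. (1) runs over all indices except the anchor itself.
    np.fill_diagonal(logits, -np.inf)
    # Numerically stable log of the denominator for every anchor.
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    # Index of each anchor's partner: i -> i' and i' -> i.
    partner = np.concatenate([np.arange(M, 2 * M), np.arange(M)])
    per_anchor = log_denominator - logits[np.arange(2 * M), partner]
    # l_ID^i + l_ID^{i'} for every pair, then keep only the positive pairs.
    per_pair = per_anchor[:M] + per_anchor[M:]
    positive = (y == 1)
    return per_pair[positive].sum() / max(positive.sum(), 1)
```

Contradiction pairs still contribute their sentences to the denominators of other anchors; only their own anchor losses are masked out by the indicator in Equation (2).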

2. Hard Negative Sample Learning

Equation (1) can be rewritten as
$\mathcal{l}_{\text{ID}}^i=\log\Big(1+\sum_{j\neq i,i'}\exp[\frac{s(z_i,z_j)-s(z_i,z_{i'})}{\tau}] \Big)$
This can be viewed as an extension of the standard triplet loss that treats the other $2M-2$ samples in the minibatch as negatives. However, the negatives are sampled uniformly from the training data, which ignores how informative each of them is. Ideally, hard negatives, i.e., samples that come from a different semantic group but are mapped close to the anchor, should be separated. Although $\text{NLI}$ provides no category-level supervision, the importance of each negative can be approximated as follows:
$\mathcal{l}_{\text{wID}}^i=\log\Big(1+\sum_{j\neq i,i'}\exp[\frac{\alpha_js(z_i,z_j)-s(z_i,z_{i'})}{\tau}] \Big) \tag{3}$
Here, $\alpha_j=\frac{\exp(s(z_i,z_j)/\tau)}{\frac{1}{2M-2}\sum_{k\neq i,i'}\exp(s(z_i,z_k)/\tau)}$, which can be interpreted as the relative importance of $z_j$ among all $2M-2$ negatives of the anchor $z_i$. This weighting rests on the assumption that hard negatives are those that lie closer to $z_i$ in the representation space.
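To make the weighting concrete, here is a small NumPy sketch of Equation (3) for a single anchor. It follows the formula exactly as written in this post, with $\alpha_j$ scaling the similarity inside the exponent. The function names and the default `tau=0.05` are assumptions for illustration only.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def weighted_id_loss(z_anchor, z_positive, z_negatives, tau=0.05):
    """Eq. (3) for one anchor z_i, its positive z_{i'}, and its 2M-2 negatives.

    Returns the loss and the importance weights alpha_j.
    tau=0.05 is an assumed default, not a value from the post.
    """
    s_pos = cosine(z_anchor, z_positive)
    s_neg = np.array([cosine(z_anchor, z_j) for z_j in z_negatives])

    # alpha_j = exp(s_ij / tau) / mean_k exp(s_ik / tau). Subtracting the max
    # before exponentiating leaves the ratio unchanged and avoids overflow.
    w = np.exp((s_neg - s_neg.max()) / tau)
    alpha = w / w.mean()

    # log(1 + sum_j exp(x_j)) as a stable log-sum-exp over [0, x_1, x_2, ...].
    x = np.concatenate([[0.0], (alpha * s_neg - s_pos) / tau])
    loss = x.max() + np.log(np.exp(x - x.max()).sum())
    return float(loss), alpha


# Toy vectors chosen for illustration: the first negative is close to the
# anchor (hard), the second points in the opposite direction (easy).
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negatives = np.array([[0.8, 0.6], [-1.0, 0.0]])
loss, alpha = weighted_id_loss(anchor, positive, negatives)
```

Because the weights are normalized by their mean, they average to 1 over the negatives, so the hard negative receives a weight above 1 and the easy one a weight below 1.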

3. Entailment and Contradiction Reasoning

The instance discrimination loss mainly serves to separate each positive pair from the other pairs, but it places no explicit constraint on distinguishing contradiction from entailment. For this purpose, a pairwise entailment and contradiction reasoning objective is optimized jointly. A softmax-based cross-entropy loss is used as the pairwise classification objective. Let $u_i=\psi(x_i)$ denote the vector representation of sentence $x_i$. For each labeled sentence pair $(x_i,x_{i'},y_i)$, minimize the following loss:

$\mathcal{l}_{C}^i=\text{CE}(f(u_i,u_{i'},|u_i-u_{i'}|),y_i) \tag{4}$
Here $f$ denotes a linear classification head and $\text{CE}$ is the cross-entropy loss. Unlike previous work, this work removes the neutral pairs from the original training set and focuses on the binary classification of entailment versus contradiction. The motivation is that the semantics of neutral pairs can be captured by the instance discrimination loss. Removing the neutral pairs therefore reduces the complexity of the two losses in $\text{PairSupCon}$ and improves learning efficiency.
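A minimal NumPy sketch of Equation (4) is given below. The parameters `W` and `b` stand in for the linear head $f$ and are hypothetical, as are the function names. Mapping $y=+1$ to class index 1 and $y=-1$ to class index 0 is an assumption made only for this illustration.

```python
import numpy as np


def pair_features(u, u_prime):
    """Concatenate (u, u', |u - u'|) per pair: (M, d) inputs -> (M, 3d)."""
    return np.concatenate([u, u_prime, np.abs(u - u_prime)], axis=1)


def pairwise_classification_loss(u, u_prime, y, W, b):
    """Eq. (4): per-pair softmax cross-entropy of a linear head.

    u, u_prime: (M, d) sentence representations of premises and hypotheses.
    y: (M,) array with +1 (entailment) or -1 (contradiction).
    W: (3d, 2) weights and b: (2,) bias of the hypothetical linear head f.
    """
    logits = pair_features(u, u_prime) @ W + b
    # Stable log-softmax over the two classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Assumed class indexing: 1 = entailment, 0 = contradiction.
    target = (y == 1).astype(int)
    return -log_probs[np.arange(len(y)), target]
```

With only two classes left after the neutral pairs are removed, the head needs just two output logits.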

Overall loss function
$\mathcal{L}=\sum_{i=1}^M \Big[\mathcal{l}_C^i+\beta\mathbb{1}_{y_i=1}\cdot(\mathcal{l}_{\text{wID}}^i+\mathcal{l}_{\text{wID}}^{i'})\Big] \tag{5}$
where $\mathcal{l}_C^i$ is defined by Equation (4) and $\mathcal{l}_{\text{wID}}^i$, $\mathcal{l}_{\text{wID}}^{i'}$ are defined by Equation (3). In Equation (5), $\beta$ is a balancing hyperparameter.
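Equation (5) amounts to a masked sum of per-pair terms. The sketch below assumes the per-pair losses have already been computed and are passed in as arrays. The function name and the default `beta=1.0` are illustrative assumptions, not values from the paper.

```python
import numpy as np


def total_loss(l_c, l_wid_premise, l_wid_hypothesis, y, beta=1.0):
    """Eq. (5): sum over the batch of l_C^i + beta * 1[y_i = 1] * (l_wID^i + l_wID^{i'}).

    l_c: (M,) classification losses from Eq. (4).
    l_wid_premise, l_wid_hypothesis: (M,) weighted losses from Eq. (3) with the
        premise and the hypothesis as anchor, respectively.
    y: (M,) array with +1 (entailment) or -1 (contradiction).
    beta: balancing hyperparameter; 1.0 is an assumed default.
    """
    l_c = np.asarray(l_c, dtype=float)
    contrastive = (np.asarray(l_wid_premise, dtype=float)
                   + np.asarray(l_wid_hypothesis, dtype=float))
    # The indicator keeps the contrastive term for entailment pairs only.
    is_positive = (np.asarray(y) == 1).astype(float)
    return float(np.sum(l_c + beta * is_positive * contrastive))
```

Every pair contributes its classification loss, while the contrastive term is added only for entailment pairs, so the values passed for contradiction pairs in the two `l_wid_*` arrays are ignored.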

III. Experiments

[Figure omitted: the image was not preserved in the source.]

Copyright notice
This article was written by [BQW_]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/222/202208101840139237.html
