SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

Last update: Jan 02, 2023

This is the repository for SNCSE.

SNCSE aims to alleviate feature suppression in contrastive learning for unsupervised sentence embedding. In this field, feature suppression means that models fail to distinguish and decouple textual similarity from semantic similarity. As a result, they may overestimate the semantic similarity of pairs with similar surface text, regardless of the actual semantic difference between them, and underestimate the semantic similarity of pairs with fewer words in common. (Please refer to Section 5 of our paper for several instances and detailed analysis.) To this end, we propose taking the negation of the original sentences as soft negative samples and introducing them into the traditional contrastive learning framework through a bidirectional margin loss (BML).

[Figure: structure of SNCSE]
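The bidirectional margin loss can be illustrated with a minimal Python sketch. It assumes the loss looks at the gap between the soft-negative similarity and the positive similarity and penalizes that gap whenever it leaves the band [-beta, -alpha]. This form, the function names, and the margin values are assumptions made for illustration, so please check the paper and train_SNCSE.sh for the actual definition and hyperparameters.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_margin_loss(anchor, positive, soft_negative,
                              alpha=0.1, beta=0.3):
    """Hypothetical BML sketch; alpha and beta are illustrative margins.

    delta is the soft-negative similarity minus the positive similarity.
    The loss is zero only when -beta <= delta <= -alpha, i.e. the negated
    sentence is less similar than the positive, but not by too much.
    """
    delta = cosine(anchor, soft_negative) - cosine(anchor, positive)
    too_close = max(0.0, delta + alpha)   # soft negative is too similar
    too_far = max(0.0, -delta - beta)     # soft negative is pushed too far
    return too_close + too_far
```

With these illustrative margins, a soft negative whose embedding equals the anchor is penalized for being too close, one that is orthogonal to the anchor is penalized for being too far, and one whose similarity sits about 0.2 below the positive incurs no loss.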

[Table: performance of SNCSE on the STS tasks with different encoders]

To reproduce the above results, please download the files and unzip them to replace the original file folder. Then download the models, modify the file path variables, and run:

python bert_prediction.py
python roberta_prediction.py

To train SNCSE, please download the training file and put it in /SNCSE/data. You can either run:

python generate_soft_negative_samples.py

to generate soft negative samples, or use our file at /Files/soft_negative_samples.txt. Then modify train_SNCSE.sh as needed and run it.
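For intuition about what a soft negative sample looks like, here is a toy Python sketch that negates a sentence by inserting "not" after its first auxiliary verb. This is a simplification written for illustration only: the word list and the `negate` function are hypothetical, and generate_soft_negative_samples.py may use a different and more robust procedure.

```python
# Toy negation rule: NOT the repository's script, only an illustration.
AUXILIARIES = {
    "is", "are", "was", "were", "am",
    "can", "could", "will", "would", "should", "must", "may", "might",
    "has", "have", "had", "do", "does", "did",
}


def negate(sentence):
    """Insert 'not' after the first auxiliary verb.

    Returns None when no auxiliary is found, so the caller can skip
    sentences this simple rule cannot handle.
    """
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token.lower() in AUXILIARIES:
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1:])
    return None


if __name__ == "__main__":
    for text in ["The cat is sleeping.", "Dogs bark loudly."]:
        print(text, "->", negate(text))
```

The negated sentence shares almost all of its words with the original while meaning the opposite, which is the property that makes it useful as a soft negative.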

To evaluate the checkpoints saved during training on the development set of the STSB task, please run:

python bert_evaluation.py
python roberta_evaluation.py
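STS benchmarks are commonly scored with the Spearman correlation between the predicted similarities and the gold scores. Assuming the evaluation scripts follow this convention (the scripts themselves are the authority here), the metric can be sketched in plain Python as below. The helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(gold))
```

Because only ranks matter, a checkpoint scores well when it orders sentence pairs the same way as the human annotations, even if its raw similarity values are on a different scale.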

Feel free to contact the authors at [email protected] for any questions.

Please cite SNCSE as:

Hao Wang, Yangguang Li, Zhen Huang, Yong Dou, Lingpeng Kong, Jing Shao. SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples. CoRR, abs/2201.05979, 2022.

Owner

Sense-GVT
