source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Last update: Dec 17, 2022

WhiteningBERT

Source code and data for the paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Preparation

git clone https://github.com/Jun-jie-Huang/WhiteningBERT.git
cd WhiteningBERT
pip install -r requirements.txt
cd examples/evaluation

Usage

Datasets

We evaluate on seven STS datasets: STSBenchmark, SICK-Relatedness, STS12, STS13, STS14, STS15, and STS16.

The processed data can be found in ./examples/datasets/.

Run

To run a quick demo:

python evaluation_stsbenchmark.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased

Set --pooling to cls to use the [CLS] token, or to aver to average all tokens. Use --layer_num to choose which layers to combine, given as a comma-separated list (for example, 1,12).
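The following is a minimal NumPy sketch of what these options describe: averaging the selected layers, pooling tokens into one vector per sentence, and then whitening the sentence vectors. It is an illustration under stated assumptions, not the repository's implementation. The function names pool_layers and whiten, the layer indexing convention, and the SVD-based whitening formula are all assumptions made for this example, and random arrays stand in for real encoder outputs.

```python
import numpy as np


def pool_layers(hidden_states, attention_mask, layers, pooling="aver"):
    """Combine selected layers, then pool tokens into one vector per sentence.

    hidden_states: sequence of arrays, each of shape (batch, seq_len, dim).
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding.
    layers: indices into hidden_states (the indexing convention is an assumption).
    """
    # Element-wise average of the selected layers.
    combined = np.mean([hidden_states[i] for i in layers], axis=0)
    if pooling == "cls":
        # Take the first token, where BERT-style models place [CLS].
        return combined[:, 0, :]
    # "aver": mean over real tokens only, ignoring padding.
    mask = attention_mask[:, :, None].astype(combined.dtype)
    return (combined * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)


def whiten(embeddings, eps=1e-9):
    """Center the embeddings and map their covariance to (approximately) identity."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = centered.T @ centered / embeddings.shape[0]
    # SVD of the symmetric covariance gives its eigenvectors and eigenvalues.
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + eps))
    return centered @ w


# Toy input standing in for encoder outputs: 13 "layers", 5 sentences, 6 tokens, 4 dims.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(5, 6, 4)) for _ in range(13)]
attention_mask = np.ones((5, 6), dtype=int)
attention_mask[:, 4:] = 0  # pretend the last two tokens are padding

sentence_vecs = pool_layers(hidden_states, attention_mask, layers=[1, 12])
whitened = whiten(sentence_vecs)
print(sentence_vecs.shape, whitened.shape)
```

In this sketch the whitening statistics (mean and covariance) are computed from the same set of sentences being embedded, so no labels are needed.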

To enumerate all possible combinations of two layers and evaluate each one in turn:

python evaluation_stsbenchmark_layer2.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased

To enumerate all possible combinations of N layers:

python evaluation_stsbenchmark_layerN.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased \
			--combination_num 4
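To get a feel for how many runs such an enumeration involves, here is a small sketch using itertools.combinations. It assumes a 12-layer encoder with layers numbered 1 to 12 and does not reproduce the script's actual logic (for instance, whether the embedding layer is included is not stated here). The helper name layer_combinations is hypothetical.

```python
from itertools import combinations


def layer_combinations(num_layers, combination_num):
    """Yield every choice of `combination_num` layers as a comma-separated string."""
    for combo in combinations(range(1, num_layers + 1), combination_num):
        yield ",".join(str(i) for i in combo)


combos = list(layer_combinations(12, 4))
print(len(combos))   # 495, i.e. 12 choose 4
print(combos[0])     # 1,2,3,4
print(combos[-1])    # 9,10,11,12
```

Each yielded string has the same shape as a --layer_num value, so the count grows quickly with --combination_num: under the same assumption, two layers give 66 combinations and four give 495.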

You can also save the sentence embeddings:

python evaluation_stsbenchmark_save_embed.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased \
			--summary_dir ./save_embeddings

A list of PLMs you can select:

bert-base-uncased, bert-large-uncased
roberta-base, roberta-large
bert-base-multilingual-uncased
sentence-transformers/LaBSE
albert-base-v1, albert-large-v1
microsoft/layoutlm-base-uncased, microsoft/layoutlm-large-uncased
SpanBERT/spanbert-base-cased, SpanBERT/spanbert-large-cased
microsoft/deberta-base, microsoft/deberta-large
google/electra-base-discriminator
google/mobilebert-uncased
microsoft/DialogRPT-human-vs-rand
distilbert-base-uncased
......

Acknowledgements

The code is adapted from the repositories of the EMNLP19 paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks and the EMNLP20 paper An Unsupervised Sentence Embedding Method by Mutual Information Maximization.
