TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning

Overview

TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning

Authors: Yixuan Su, Fangyu Liu, Zaiqiao Meng, Lei Shu, Ehsan Shareghi, and Nigel Collier

Code of our paper: TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning

Introduction:

Masked language models (MLMs) such as BERT and RoBERTa have revolutionized the field of Natural Language Understanding in the past few years. However, existing pre-trained MLMs often output an anisotropic distribution of token representations that occupies a narrow subset of the entire representation space. Such token representations are not ideal, especially for tasks that demand discriminative semantic meanings of distinct tokens. In this work, we propose TaCL (Token-aware Contrastive Learning), a novel continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations. TaCL is fully unsupervised and requires no additional data. We extensively test our approach on a wide range of English and Chinese benchmarks. The results show that TaCL brings consistent and notable improvements over the original BERT model. Furthermore, we conduct detailed analysis to reveal the merits and inner-workings of our approach

Main Results:

We show the comparison between TaCL (base version) and the original BERT (base version).

(1) English benchmark results on SQuAD (Rajpurkar et al., 2018) (dev set) and GLUE (Wang et al., 2019) average score.

Model SQuAD 1.1 (EM/F1) SQuAD 2.0 (EM/F1) GLUE Average
BERT 80.8/88.5 73.4/76.8 79.6
TaCL 81.6/89.0 74.4/77.5 81.2

(2) Chinese benchmark results (test set F1) on four NER tasks (MSRA, OntoNotes, Resume, and Weibo) and three Chinese word segmentation (CWS) tasks (PKU, CityU, and AS).

Model MSRA OntoNotes Resume Weibo PKU CityU AS
BERT 94.95 80.14 95.53 68.20 96.50 97.60 96.50
TaCL 95.44 82.42 96.45 69.54 96.75 98.16 96.75

Huggingface Models:

Model Name Model Address
English (cambridgeltl/tacl-bert-base-uncased) link
Chinese (cambridgeltl/tacl-bert-base-chinese) link

Example Usage:

import torch
# initialize model
from transformers import AutoModel, AutoTokenizer
model_name = 'cambridgeltl/tacl-bert-base-uncased'
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# create input ids
text = '[CLS] clbert is awesome. [SEP]'
tokenized_token_list = tokenizer.tokenize(text)
input_ids = torch.LongTensor(tokenizer.convert_tokens_to_ids(tokenized_token_list)).view(1, -1)
# compute hidden states
representation = model(input_ids).last_hidden_state # [1, seqlen, embed_dim]

Tutorial (in Chinese language) on how to use Chinese TaCL BERT to performance Name Entity Recognition and Chinese word segmentation:

Tutorial link

Tutorial on how to reproduce the results in our paper:

1. Environment Setup:

python version: 3.8
pip3 install -r requirements.txt

2. Train TaCL:

(1) Prepare pre-training data:

Please refer to details provided in ./pretraining_data directory.

(2) Train the model:

Please refer to details provided in ./pretraining directory.

3. Experiments on English Benchmarks:

Please refer to details provided in ./english_benchmark directory.

4. Experiments on Chinese Benchmarks:

(1) Chinese Benchmark Data Preparation:

chmod +x ./download_benchmark_data.sh
./download_benchmark_data.sh

(2) Fine-tuning and Inference:

Please refer to details provided in ./chinese_benchmark directory.

5. Replicate Our Analysis Results:

We provide all essential code to replicate the results (the images below) provided in our analysis section. The related codes and instructions are located in ./analysis directory. Have fun!

Citation:

If you find our paper and resources useful, please kindly cite our paper:

@misc{su2021tacl,
      title={TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning}, 
      author={Yixuan Su and Fangyu Liu and Zaiqiao Meng and Lei Shu and Ehsan Shareghi and Nigel Collier},
      year={2021},
      eprint={2111.04198},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

If you have any questions, feel free to contact me via ([email protected]).

Owner
Yixuan Su
I am a final-year PhD student at the University of Cambridge, supervised by Professor Nigel Collier.
Yixuan Su
An implementation for the loss function proposed in Decoupled Contrastive Loss paper.

Decoupled-Contrastive-Learning This repository is an implementation for the loss function proposed in Decoupled Contrastive Loss paper. Requirements P

Ramin Nakhli 71 Dec 04, 2022
Like ThreeJS but for Python and based on wgpu

pygfx A render engine, inspired by ThreeJS, but for Python and targeting Vulkan/Metal/DX12 (via wgpu). Introduction This is a Python render engine bui

139 Jan 07, 2023
Semantically Contrastive Learning for Low-light Image Enhancement

Semantically Contrastive Learning for Low-light Image Enhancement Here, we propose an effective semantically contrastive learning paradigm for Low-lig

48 Dec 16, 2022
A tutorial showing how to train, convert, and run TensorFlow Lite object detection models on Android devices, the Raspberry Pi, and more!

A tutorial showing how to train, convert, and run TensorFlow Lite object detection models on Android devices, the Raspberry Pi, and more!

Evan 1.3k Jan 02, 2023
Unofficial TensorFlow implementation of the Keyword Spotting Transformer model

Keyword Spotting Transformer This is the unofficial TensorFlow implementation of the Keyword Spotting Transformer model. This model is used to train o

Intelligent Machines Limited 8 May 11, 2022
The materials used in the SaxonJS tutorial presented at Declarative Amsterdam, 2021

SaxonJS-Tutorial-2021, version 1.0.4 Last updated on 4 November, 2021. Table of contents Background Prerequisites Starting a web server Running a Java

Saxonica 11 Oct 23, 2022
OCRA (Object-Centric Recurrent Attention) source code

OCRA (Object-Centric Recurrent Attention) source code Hossein Adeli and Seoyoung Ahn Please cite this article if you find this repository useful: For

Hossein Adeli 2 Jun 18, 2022
Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization

Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization Official PyTorch implementation for our URST (Ultra-Resolution Sty

czczup 148 Dec 27, 2022
Graph-based community clustering approach to extract protein domains from a predicted aligned error matrix

Using a predicted aligned error matrix corresponding to an AlphaFold2 model , returns a series of lists of residue indices, where each list corresponds to a set of residues clustering together into a

Tristan Croll 24 Nov 23, 2022
Implementation of "Scaled-YOLOv4: Scaling Cross Stage Partial Network" using PyTorch framwork.

YOLOv4-large This is the implementation of "Scaled-YOLOv4: Scaling Cross Stage Partial Network" using PyTorch framwork. YOLOv4-CSP YOLOv4-tiny YOLOv4-

Kin-Yiu, Wong 2k Jan 02, 2023
PyTorch Implement of Context Encoders: Feature Learning by Inpainting

Context Encoders: Feature Learning by Inpainting This is the Pytorch implement of CVPR 2016 paper on Context Encoders 1) Semantic Inpainting Demo Inst

321 Dec 25, 2022
Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

2D-TAN (Optimized) Introduction This is an optimized re-implementation repository for AAAI'2020 paper: Learning 2D Temporal Localization Networks for

Joya Chen 112 Dec 31, 2022
Efficient neural networks for analog audio effect modeling

micro-TCN Efficient neural networks for audio effect modeling

Christian Steinmetz 94 Dec 29, 2022
Workshop Materials Delivered on 28/02/2022

intro-to-cnn-p1 Repo for hosting workshop materials delivered on 28/02/2022 Questions you will answer in this workshop Learning Objectives What are co

Beginners Machine Learning 5 Feb 28, 2022
This is a five-step framework for the development of intrusion detection systems (IDS) using machine learning (ML) considering model realization, and performance evaluation.

AB-TRAP: building invisibility shields to protect network devices The AB-TRAP framework is applicable to the development of Network Intrusion Detectio

Lab-C2DC - Laboratory of Command and Control and Cyber-security 17 Jan 04, 2023
AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

AugMix Introduction We propose AugMix, a data processing technique that mixes augmented images and enforces consistent embeddings of the augmented ima

Google Research 876 Dec 17, 2022
An LSTM based GAN for Human motion synthesis

GAN-motion-Prediction An LSTM based GAN for motion synthesis has a few issues reading H3.6M data from A.Jain et al , will fix soon. Prediction of the

Amogh Adishesha 9 Jun 17, 2022
Code base for NeurIPS 2021 publication titled Kernel Functional Optimisation (KFO)

KernelFunctionalOptimisation Code base for NeurIPS 2021 publication titled Kernel Functional Optimisation (KFO) We have conducted all our experiments

2 Jun 29, 2022
This repo includes our code for evaluating and improving transferability in domain generalization (NeurIPS 2021)

Transferability for domain generalization This repo is for evaluating and improving transferability in domain generalization (NeurIPS 2021), based on

gordon 9 Nov 29, 2022
Pytorch implementation for "Density-aware Chamfer Distance as a Comprehensive Metric for Point Cloud Completion" (NeurIPS 2021)

Density-aware Chamfer Distance This repository contains the official PyTorch implementation of our paper: Density-aware Chamfer Distance as a Comprehe

Tong WU 93 Dec 15, 2022