Source codes for the paper "Local Additivity Based Data Augmentation for Semi-supervised NER"

Overview

LADA

This repo contains codes for the following paper:

Jiaao Chen*, Zhenghui Wang*, Ran Tian, Zichao Yang, Diyi Yang: Local Additivity Based Data Augmentation for Semi-supervised NER. In Proceedings of The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP'2020)

If you would like to refer to it, please cite the paper mentioned above.

Getting Started

These instructions will get you running the codes of LADA.

Requirements

  • Python 3.6 or higher
  • Pytorch >= 1.4.0
  • Pytorch_transformers (also known as transformers)
  • Pandas, Numpy, Pickle, faiss, sentence-transformers

Code Structure

├── code/
│   ├── BERT/
│   │   ├── back_translate.ipynb --> Jupyter Notebook for back translating the dataset
│   │   ├── bert_models.py --> Codes for LADA-based BERT models
│   │   ├── eval_utils.py --> Codes for evaluations
│   │   ├── knn.ipynb --> Jupyter Notebook for building the knn index file
│   │   ├── read_data.py --> Codes for data pre-processing
│   │   ├── train.py --> Codes for trianing BERT model
│   │   └── ...
│   ├── flair/
│   │   ├── train.py --> Codes for trianing flair model
│   │   ├── knn.ipynb --> Jupyter Notebook for building the knn index file
│   │   ├── flair/ --> the flair library
│   │   │   └── ...
│   │   ├── resources/
│   │   │   ├── docs/ --> flair library docs
│   │   │   ├── taggers/ --> save evaluation results for flair model
│   │   │   └── tasks/
│   │   │       └── conll_03/
│   │   │           ├── sent_id_knn_749.pkl --> knn index file
│   │   │           └── ... -> CoNLL-2003 dataset
│   │   └── ...
├── data/
│   └── conll2003/
│       ├── de.pkl -->Back translated training dataset with German as middle language
│       ├── labels.txt --> label index file
│       ├── sent_id_knn_700.pkl
│       └── ...  -> CoNLL-2003 dataset
├── eval/
│   └── conll2003/ --> save evaluation results for BERT model
└── README.md

BERT models

Downloading the data

Please download the CoNLL-2003 dataset and save under ./data/conll2003/ as train.txt, dev.txt, and test.txt.

Pre-processing the data

We utilize Fairseq to perform back translation on the training dataset. Please refer to ./code/BERT/back_translate.ipynb for details.

Here, we have put one example of back translated data, de.pkl, in ./data/conll2003/ . You can directly use it for CoNLL-2003 or generate your own back translated data following ./code/BERT/back_translate.ipynb.

We also provide the kNN index file for the first 700 training sentences (5%) ./data/conll2003/sent_id_knn_700.pkl. You can directly use it for CoNLL-2003 or generate your own kNN index file following ./code/BERT/knn.ipynb

Training models

These section contains instructions for training models on CoNLL-2003 using 5% training data.

Training BERT+Intra-LADA model

python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700  --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10  --beta 1.5 --alpha 60  --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio 1 

Training BERT+Inter-LADA model

python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700  --eval-batch-size 128 \ 
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \ 
--mix-layers-set 8 9 10  --beta 1.5 --alpha 60  --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio -1  

Training BERT+Semi-Intra-LADA model

python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700  --eval-batch-size 128 \
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10  --beta 1.5 --alpha 60  --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio 1 \
--u-batch-size 32 --semi --T 0.6 --sharp --weight 0.05 --semi-pkl-file 'de.pkl' \
--semi-num 10000 --semi-loss 'mse' --ignore-last-n-label 4  --warmup-semi --num-semi-iter 1 \
--semi-loss-method 'origin' 

Training BERT+Semi-Inter-LADA model

python ./code/BERT/train.py --data-dir 'data/conll2003' --model-type 'bert' \
--model-name 'bert-base-multilingual-cased' --output-dir 'eval/conll2003' --gpu '0,1' \
--labels 'data/conll2003/labels.txt' --max-seq-length 164 --overwrite-output-dir \
--do-train --do-eval --do-predict --evaluate-during-training --batch-size 16 \
--num-train-epochs 20 --save-steps 750 --seed 1 --train-examples 700  --eval-batch-size 128 \ 
--pad-subtoken-with-real-label --eval-pad-subtoken-with-first-subtoken-only --label-sep-cls \
--mix-layers-set 8 9 10  --beta 1.5 --alpha 60  --mix-option --use-knn-train-data \
--num-knn-k 5 --knn-mix-ratio 0.5 --intra-mix-ratio -1 \
--u-batch-size 32 --semi --T 0.6 --sharp --weight 0.05 --semi-pkl-file 'de.pkl' \
--semi-num 10000 --semi-loss 'mse' --ignore-last-n-label 4  --warmup-semi --num-semi-iter 1 \
--semi-loss-method 'origin' 

flair models

flair is a BiLSTM-CRF sequence labeling model, and we provide code for flair+Inter-LADA

Downloading the data

Please download the CoNLL-2003 dataset and save under ./code/flair/resources/tasks/conll_03/ as eng.train, eng.testa (dev), and eng.testb (test).

Pre-processing the data

We also provide the kNN index file for the first 749 training sentences (5%, including the -DOCSTART- seperator) ./code/flair/resources/tasks/conll_03/sent_id_knn_749.pkl. You can directly use it for CoNLL-2003 or generate your own kNN index file following ./code/flair/knn.ipynb

Training models

These section contains instructions for training models on CoNLL-2003 using 5% training data.

Training flair+Inter-LADA model

CUDA_VISIBLE_DEVICES=1 python ./code/flair/train.py --use-knn-train-data --num-knn-k 5 \
--knn-mix-ratio 0.6 --train-examples 749 --mix-layer 2  --mix-option --alpha 60 --beta 1.5 \
--exp-save-name 'mix'  --mini-batch-size 64  --patience 10 --use-crf 
Owner
GT-SALT
Social and Language Technologies Lab
GT-SALT
[ICCV'21] PlaneTR: Structure-Guided Transformers for 3D Plane Recovery

PlaneTR: Structure-Guided Transformers for 3D Plane Recovery This is the official implementation of our ICCV 2021 paper News There maybe some bugs in

73 Nov 30, 2022
Code for the CIKM 2019 paper "DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting".

Dual Self-Attention Network for Multivariate Time Series Forecasting 20.10.26 Update: Due to the difficulty of installation and code maintenance cause

Kyon Huang 223 Dec 16, 2022
nfelo: a power ranking, prediction, and betting model for the NFL

nfelo nfelo is a power ranking, prediction, and betting model for the NFL. Nfelo take's 538's Elo framework and further adapts it for the NFL, hence t

6 Nov 22, 2022
Anonymous implementation of KSL

k-Step Latent (KSL) Implementation of k-Step Latent (KSL) in PyTorch. Representation Learning for Data-Efficient Reinforcement Learning [Paper] Code i

1 Nov 10, 2021
Autonomous Driving on Curvy Roads without Reliance on Frenet Frame: A Cartesian-based Trajectory Planning Method

C++/ROS Source Codes for "Autonomous Driving on Curvy Roads without Reliance on Frenet Frame: A Cartesian-based Trajectory Planning Method" published in IEEE Trans. Intelligent Transportation Systems

Bai Li 88 Dec 23, 2022
Implementations of paper Controlling Directions Orthogonal to a Classifier

Classifier Orthogonalization Implementations of paper Controlling Directions Orthogonal to a Classifier , ICLR 2022, Yilun Xu, Hao He, Tianxiao Shen,

Yilun Xu 33 Dec 01, 2022
Implementing DeepMind's Fast Reinforcement Learning paper

Fast Reinforcement Learning This is a repo where I implement the algorithms in the paper, Fast reinforcement learning with generalized policy updates.

Marcus Chiam 6 Nov 28, 2022
Fast Soft Color Segmentation

Fast Soft Color Segmentation

3 Oct 29, 2022
UCSD Oasis platform

oasis UCSD Oasis platform Local project setup Install Docker Compose and make sure you have Pip installed Clone the project and go to the project fold

InSTEDD 4 Jun 16, 2021
This is a official repository of SimViT.

SimViT This is a official repository of SimViT. We will open our models and codes about object detection and semantic segmentation soon. Our code refe

ligang 57 Dec 15, 2022
Weakly-supervised object detection.

Wetectron Wetectron is a software system that implements state-of-the-art weakly-supervised object detection algorithms. Project CVPR'20, ECCV'20 | Pa

NVIDIA Research Projects 342 Jan 05, 2023
The official codes for the ICCV2021 presentation "Uniformity in Heterogeneity: Diving Deep into Count Interval Partition for Crowd Counting"

UEPNet (ICCV2021 Poster Presentation) This repository contains codes for the official implementation in PyTorch of UEPNet as described in Uniformity i

Tencent YouTu Research 15 Dec 14, 2022
python debugger and anti-vm that checks if you're in a virtual machine or if someones trying to debug your file

Anti-Debug was made by Love ❌ code ✅ 🎉 ・What it checks for ・ Kills tools that can be used to debug your file ・ Exits if ran in vm (supports different

Rdimo 31 Aug 09, 2022
EPSANet:An Efficient Pyramid Split Attention Block on Convolutional Neural Network

EPSANet:An Efficient Pyramid Split Attention Block on Convolutional Neural Network This repo contains the official Pytorch implementaion code and conf

Hu Zhang 175 Jan 07, 2023
Raindrop strategy for Irregular time series

Graph-Guided Network For Irregularly Sampled Multivariate Time Series Overview This repository contains processed datasets and implementation code for

Zitnik Lab @ Harvard 74 Jan 03, 2023
Code for CVPR 2021 oral paper "Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts"

Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts The rapid progress in 3D scene understanding has come with growing dem

Facebook Research 182 Dec 30, 2022
JugLab 33 Dec 30, 2022
A Keras implementation of CapsNet in the paper: Sara Sabour, Nicholas Frosst, Geoffrey E Hinton. Dynamic Routing Between Capsules

NOTE This implementation is fork of https://github.com/XifengGuo/CapsNet-Keras , applied to IMDB texts reviews dataset. CapsNet-Keras A Keras implemen

Lauro Moraes 5 Oct 23, 2022
Library for fast text representation and classification.

fastText fastText is a library for efficient learning of word representations and sentence classification. Table of contents Resources Models Suppleme

Facebook Research 24.1k Jan 01, 2023
This repo is customed for VisDrone.

Object Detection for VisDrone(无人机航拍图像目标检测) My environment 1、Windows10 (Linux available) 2、tensorflow = 1.12.0 3、python3.6 (anaconda) 4、cv2 5、ensemble

53 Jul 17, 2022