Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Overview

This is the code base for weakly supervised NER.

We provide a three-stage framework:

  • Stage I: Domain continual pre-training;
  • Stage II: Noise-aware weakly supervised pre-training;
  • Stage III: Fine-tuning.

This code base provides basic building blocks that can be combined into arbitrary sequences of stages. We also provide example scripts for reproducing our results on BioMedical NER.

See details in arXiv.

Performance Benchmark

BioMedical NER

Method (F1)   BC5CDR-chem   BC5CDR-disease   NCBI-disease
BERT          89.99         79.92            85.87
bioBERT       92.85         84.70            89.13
PubMedBERT    93.33         85.62            87.82
Ours          94.17         90.69            92.28

See more in bio_script/README.md

Dependency

pytorch==1.6.0
transformers==3.3.1
allennlp==1.1.0
flashtool==0.0.10
ray==0.8.7

Install requirements

pip install -r requirements.txt

(If allennlp and transformers are incompatible, install allennlp first and then update transformers. Since we only use a few small functions from allennlp, it should work fine.)

File Structure:

├── bert-ner          #  Python Code for Training NER models
│   └── ...
└── bio_script        #  Shell Scripts for Training BioMedical NER models
    └── ...

Usage

See examples in bio_script

Hyperparameter Explanation

Here we explain the hyperparameters used in the scripts in ./bio_script.

Training Scripts:

Scripts

  • roberta_mlm_pretrain.sh
  • weak_weighted_selftrain.sh
  • finetune.sh

Hyperparameter

  • GPUID: Choose the GPU for training. It can also be specified by xxx.sh 0,1,2,3.
  • MASTER_PORT: automatically constructed (avoid conflicts) for distributed training.
  • DISTRIBUTE_GPU: use distributed training or not
  • PROJECT_ROOT: automatically detected, the root path of the project folder.
  • DATA_DIR: Directory of the training data. It contains train.txt, test.txt, dev.txt, labels.txt, weak_train.txt (weak data), and aug_train.txt (optional).
  • USE_DA: whether to augment the training data, i.e., combine train.txt + aug_train.txt in DATA_DIR for training.
  • BERT_MODEL: the model backbone, e.g., roberta-large. See transformers for details.
  • BERT_CKP: see BERT_MODEL_PATH.
  • BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
  • LOSSFUNC: nll is the normal loss function; corrected_nll is the noise-aware risk (i.e., it adds a weighted log-unlikelihood regularization: wei*nll + (1-wei)*null).
  • MAX_WEIGHT: The maximum weight of a sample in the loss.
  • MAX_LENGTH: max sentence length.
  • BATCH_SIZE: batch size per GPU.
  • NUM_EPOCHS: number of training epochs.
  • LR: learning rate.
  • WARMUP: learning rate warmup steps.
  • SAVE_STEPS: the frequency of saving models.
  • EVAL_STEPS: the frequency of testing on validation.
  • SEED: random seed.
  • OUTPUT_DIR: the directory for saving model and code. Some parameters will be automatically appended to the path.
    • roberta_mlm_pretrain.sh: It's better to manually check where you want to save the model.
    • finetune.sh: It will be saved in ${BERT_MODEL_PATH}/finetune_xxxx.
    • weak_weighted_selftrain.sh: It will be saved in ${BERT_MODEL_PATH}/selftrain/${FBA_RULE}_xxxx (see FBA_RULE below)
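
To make the LOSSFUNC=corrected_nll formula above concrete, here is a minimal per-token sketch in plain Python. It is not the code in bert-ner. It assumes that the second term in wei*nll + (1-wei)*null is the negative log-unlikelihood, -log(1 - p), of the weak label, and that wei is a confidence in [0, 1] that the weak label is correct. The function names and the clamping constant EPS are hypothetical.

```python
import math

EPS = 1e-12  # hypothetical clamp so that log() never receives 0


def nll(p_label: float) -> float:
    """Negative log-likelihood of the weak label: -log(p)."""
    p = min(max(p_label, EPS), 1.0 - EPS)
    return -math.log(p)


def neg_log_unlikelihood(p_label: float) -> float:
    """Negative log-unlikelihood of the weak label: -log(1 - p).

    Assumption: this is what the regularization term refers to.
    It pushes probability mass away from the weak label.
    """
    p = min(max(p_label, EPS), 1.0 - EPS)
    return -math.log(1.0 - p)


def corrected_nll(p_label: float, wei: float) -> float:
    """Weighted mix of the two terms: wei * nll + (1 - wei) * unlikelihood."""
    if not 0.0 <= wei <= 1.0:
        raise ValueError("wei is assumed to be a confidence in [0, 1]")
    return wei * nll(p_label) + (1.0 - wei) * neg_log_unlikelihood(p_label)


if __name__ == "__main__":
    # A trusted weak label (wei close to 1) behaves like the normal nll loss;
    # a distrusted one (wei close to 0) is penalized for being predicted.
    for wei in (1.0, 0.5, 0.0):
        print(wei, corrected_nll(0.9, wei))
```

With wei = 1 the sketch reduces to the normal nll loss, which matches the description of LOSSFUNC=nll. How MAX_WEIGHT enters the loss is not modeled here.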

Some additional parameters need to be set for weakly supervised learning (weak_weighted_selftrain.sh).

Profiling Script

Scripts

  • profile.sh

The profiling script uses the same entry point as the training scripts, bert-ner/run_ner.py, but only performs evaluation.

Hyperparameter

Basically the same as for the training scripts.

  • PROFILE_FILE: can be train, dev, test, or a specific path to a txt data file. E.g., to profile the weak data:

    PROFILE_FILE=weak_train_100.txt
    PROFILE_FILE=$DATA_DIR/$PROFILE_FILE

  • OUTPUT_DIR: the results will be saved in ${BERT_MODEL_PATH}/predict/profile.

Weakly Supervised Data Refinement Script

Scripts

  • profile2refinedweakdata.sh

Hyperparameter

  • BERT_CKP: see BERT_MODEL_PATH.
  • BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
  • WEI_RULE: rule for generating weight for each weak sample.
    • uni: all are 1
    • avgaccu: confidence estimate for new labels generated by all_overwrite
    • avgaccu_weak_non_O_promote: confidence estimate for new labels generated by non_O_overwrite
  • PRED_RULE: rule for generating new weak labels.
    • non_O_overwrite: non-entity ('O') is overwritten by prediction
    • all_overwrite: all use prediction, i.e., self-training
    • no: use original weak labels
    • non_O_overwrite_all_overwrite_over_accu_xx: non_O_overwrite, plus, if the confidence is higher than xx, all tokens use the prediction as new labels

The generated data will be saved in ${BERT_MODEL_PATH}/predict/weak_${PRED_RULE}-WEI_${WEI_RULE}. WEAK_RULE, specified in weak_weighted_selftrain.sh, is essentially the name of the folder weak_${PRED_RULE}-WEI_${WEI_RULE}.
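
The PRED_RULE options above can be illustrated with a small token-level sketch. This is one reading of the rule descriptions, not the logic of profile2refinedweakdata.py: it assumes non_O_overwrite means that tokens whose weak label is 'O' take the model prediction while weak entity labels are kept. The function names and the example tags are hypothetical, and the confidence-threshold variant is left out because the confidence computation is not described here.

```python
from typing import List


def refine_labels(weak: List[str], pred: List[str], pred_rule: str) -> List[str]:
    """Combine weak labels and model predictions for one sentence."""
    if len(weak) != len(pred):
        raise ValueError("weak and pred must have the same number of tokens")
    if pred_rule == "no":
        # Keep the original weak labels.
        return list(weak)
    if pred_rule == "all_overwrite":
        # Use the prediction everywhere, i.e., self-training.
        return list(pred)
    if pred_rule == "non_O_overwrite":
        # Assumed reading: only weak 'O' tokens are replaced by the prediction.
        return [p if w == "O" else w for w, p in zip(weak, pred)]
    raise ValueError(f"rule not covered by this sketch: {pred_rule}")


def output_folder(pred_rule: str, wei_rule: str) -> str:
    """Folder name pattern stated above: weak_${PRED_RULE}-WEI_${WEI_RULE}."""
    return f"weak_{pred_rule}-WEI_{wei_rule}"


if __name__ == "__main__":
    weak = ["O", "B-Chemical", "O"]   # hypothetical weak labels
    pred = ["B-Disease", "O", "O"]    # hypothetical model predictions
    print(refine_labels(weak, pred, "non_O_overwrite"))
    print(output_folder("non_O_overwrite", "avgaccu"))
```

Under this reading, non_O_overwrite can only add entities to the weak labels, while all_overwrite can also remove them.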

More Rounds of Training, Try Different Combinations

  1. To do training with weakly supervised data from any model checkpoint directory:
    • i) Set BERT_CKP appropriately;
    • ii) Create profile data, e.g., run ./bio_script/profile.sh for the dev set and the weak set;
    • iii) Generate data with weak labels from the profile data, e.g., run ./bio_script/profile2refinedweakdata.sh. You can use different rules to generate weights for each sample (WEI_RULE) and different rules to refine weak labels (PRED_RULE). See more details in ./bert-ner/profile2refinedweakdata.py;
    • iv) Do training with ./bio_script/weak_weighted_selftrain.sh.
  2. To do fine-tuning with human labeled data from any model checkpoint directory:
    • i) Set BERT_CKP appropriately;
    • ii) Run ./bio_script/finetune.sh.
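
The order of the steps above can be written down as a dry-run plan. The sketch below only builds and prints the command sequence; it does not run anything. It assumes that BERT_CKP, PROFILE_FILE, WEI_RULE, and PRED_RULE are edited inside the scripts beforehand, and that GPU ids are passed as the first argument (documented above only for the training scripts, so passing them to profile.sh is an assumption). The function names are hypothetical.

```python
from typing import List, Tuple

SCRIPT_DIR = "./bio_script"
Step = Tuple[str, List[str]]  # (description, command)


def weak_round_plan(gpu_ids: str = "0") -> List[Step]:
    """One round of weakly supervised training from a checkpoint."""
    return [
        # profile.sh is run twice; PROFILE_FILE is assumed to be edited between runs.
        ("profile dev set", ["bash", f"{SCRIPT_DIR}/profile.sh", gpu_ids]),
        ("profile weak set", ["bash", f"{SCRIPT_DIR}/profile.sh", gpu_ids]),
        ("refine weak labels", ["bash", f"{SCRIPT_DIR}/profile2refinedweakdata.sh"]),
        ("weak training", ["bash", f"{SCRIPT_DIR}/weak_weighted_selftrain.sh", gpu_ids]),
    ]


def finetune_plan(gpu_ids: str = "0") -> List[Step]:
    """Fine-tuning with human labeled data from a checkpoint."""
    return [("fine-tune", ["bash", f"{SCRIPT_DIR}/finetune.sh", gpu_ids])]


if __name__ == "__main__":
    for description, command in weak_round_plan("0,1") + finetune_plan("0,1"):
        print(f"{description}: {' '.join(command)}")
```

Each command could be handed to subprocess.run(command, check=True) once the variables inside the scripts point at the intended checkpoint.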

Reference

@inproceedings{Jiang2021NamedER,
  title={Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data},
  author={Haoming Jiang and Danqing Zhang and Tianyue Cao and Bing Yin and T. Zhao},
  booktitle={ACL/IJCNLP},
  year={2021}
}

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
