This repository provides the official code for GeNER (an automated dataset Generation framework for NER).

Overview

GeNER

This repository provides the official code for GeNER (an automated dataset Generation framework for NER).

Overview of GeNER

GeNER allows you to build NER models for specific entity types of interest without human-labeled data and and rich dictionaries. The core idea is to ask simple natural language questions to an open-domain question answering (QA) system and then retrieve phrases and sentences, as shown in the query formulation and retrieval stages in the figure below. Please see our paper (Simple Questions Generate Named Entity Recognition Datasets) for details.

Requirements

Please follow the instructions below to set up your environment and install GeNER.

# Create a conda virtual environment
conda create -n GeNER python=3.8
conda activate GeNER

# Install PyTorch
conda install pytorch=1.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge

# Install GeNER
git clone https://github.com/dmis-lab/GeNER.git
cd GeNER
pip install -r requirements.txt

NER Benchmarks

Run unzip data/benchmarks.zip -d ./data to unpack (pre-processed) NER benchmarks.

QA Model and Phrase Index: DensePhrases

We use DensePhrases and a Wikipedia index precomputed by DensePhrases in order to automatically generate NER datasets. After installing DensePhrases v1.0.0, please download the DensePhrases model (densephrases-multi-query-multi) and the phrase index (densephrases-multi_wiki-20181220) in the official DensePhrases repository.

AutoPhrase (Optional)

Using AutoPhrase in the dictionary matching stage usually improves final NER performance. If you are using AutoPhrase to apply Rule 10 (i.e., refining entity boundaries), please check the system requirements in the AutoPhrase repository. If you are not using AutoPhrase, set refine_boundary to false in a configuration file in the configs directory.

Computational Resource

Please see the resource requirement of DensePhrases and self-training, and check available resources of your machine.

  • 100GB RAM and a single 11G GPU to run DensePhrases
  • Single 9G GPU to perform self-training (based on batch size 16)

Reproducing Experiments

GeNER is implemented as a pipeline of DensePhrases, dictionary matching, and AutoPhrase. The entire pipeline is controlled by configuration files located in the configs directory. Please see configs/README.md for details.

We have already set up configuration files and optimal hyperparameters for all benchmarks and experiments so that you can easily reproduce similar or better performance to those presented in our paper. Just follow the instructions below for reproduction!

Example: low-resource NER (CoNLL-2003)

This example is intended to reproduce the experiment in the low-resource NER setting on the CoNLL-2003 benchmark. If you want to reproduce other experiments, you will need to change some arguments including --gener_config_path according to the target benchmark.

Retrieval

Running retrieve.py will create *.json and *.raw files in the data/retrieved/conll-2003 directory.

export CUDA_VISIBLE_DEVICES=0
export DENSEPHRASES_PATH={enter your densephrases path here}
export CONFIG_PATH=./configs/conll_config.json

python retrieve.py \
      --run_mode eval \
      --model_type bert \
      --cuda \
      --aggregate \
      --truecase \
      --return_sent \
      --pretrained_name_or_path SpanBERT/spanbert-base-cased \
      --dump_dir $DENSEPHRASES_PATH/outputs/densephrases-multi_wiki-20181220/dump/ \
      --index_name start/1048576_flat_OPQ96  \
      --load_dir $DENSEPHRASES_PATH/outputs/densephrases-multi-query-multi/  \
      --gener_config_path $CONFIG_PATH

Applying AutoPhrase (optional)

apply_autophrase.sh takes as input all *.raw files in the data/retrieved/conll-2003 directory and outputs *.autophrase files in the same directory.

bash autophrase/apply_autophrase.sh data/retrieved/conll-2003

Dictionary matching

Running annotate.py will create train.json and train_hf.json files in the data/annotated/conll-2003 directory. The first JSON file is used in this repository, especially in the self-training stage. The second one has the same data format as the Hugging Face Transformers library and is provided for your convenience.

python annotate.py --gener_config_path $CONFIG_PATH

Self-training

Finally, you can get the final NER model and see its performance. The model and training logs are stored in the ./outputs directory. See the Makefile file for running experiments on other benchmarks.

make conll-low

Fine-tuning GeNER

While GeNER performs well without any human-labeled data, you can further boost GeNER's performance using some training examples. The way to do this is very simple: load a trained GeNER model from the ./outputs directory and fine-tune it on training examples you have by a standard NER objective (i.e., token classification). We provide a fine-tuning script in this repository (self-training/run_ner.py) and datasets to reproduce fine-grained and few-shot NER experiments (data/fine-grained and data/few-shot directories).

export CUDA_VISIBLE_DEVICES=0

python self-training/run_ner.py \
      --data_dir data/few-shot/conll-2003/conll-2003_0 \
      --model_type bert \
      --model_name_or_path outputs/{enter GeNER model path here} \
      --output_dir outputs/{enter GeNER model path here} \
      --num_train_epochs 100 \
      --per_gpu_train_batch_size 64 \
      --per_gpu_eval_batch_size 64 \
      --learning_rate 1e-5 \
      --do_train \
      --do_eval \
      --do_test \
      --evaluate_during_training

# Note that this hyperparameter setup may not be optimal. It is recommended to search for more effective hyperparameters, especially the learning rate.

Building NER Models for Your Specific Needs

The main benefit of GeNER is that you can create NER datasets of new and different entity types you want to extract. Suppose you want to extract fighter aircraft names. The first thing you have to do is to formulate your needs as natural language questions such as "Which fighter aircraft?." At this stage, we recommend using the DensePhrases demo to manually check the feasibility of your questions. If relevant phrases are retrieved well, you can proceed to the next step.

Next, you should make a configuration file (e.g., fighter_aircraft_config.json) and set up its values. You can reflect questions you made in the configuration file as follows: "subtype": "fighter aircraft". Also, you can fine-tune some hyperparameters such as top_k and normalization rules. See configs/README.md for detailed descriptions of configuration files.

{
    "retrieved_path": "data/retrieved/{file name}",
    "annotated_path": "data/annotated/{file name}",
    "add_abbreviation": true,
    "refine_boundary" : true,
    "subquestion_configs": [
        {
            "type": "{the name of pre-defined entity type}",
            "subtype" : "fighter aircraft",
            "top_k" : 5000,
            "split_composite_mention": true,
            "remove_lowercase_phrase": true,
            "remove_the": false,
            "skip_lowercase_ngram": 1
        }
    ]
}

For subsequent steps (i.e., retrieval, dictionary matching, and self-training), refer to the CoNLL-2003 example described above.

References

Please cite our paper if you consider GeNER to be related to your work. Thanks!

@article{kim2021simple,
      title={Simple Questions Generate Named Entity Recognition Datasets}, 
      author={Hyunjae Kim and Jaehyo Yoo and Seunghyun Yoon and Jinhyuk Lee and Jaewoo Kang},
      year={2021},
      eprint={2112.08808},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

Feel free to email Hyunjae Kim ([email protected]) if you have any questions.

License

See the LICENSE file for details.

Owner
DMIS Laboratory - Korea University
Data Mining & Information Systems Laboratory @ Korea University
DMIS Laboratory - Korea University
code for our ECCV 2020 paper "A Balanced and Uncertainty-aware Approach for Partial Domain Adaptation"

Code for our ECCV (2020) paper A Balanced and Uncertainty-aware Approach for Partial Domain Adaptation. Prerequisites: python == 3.6.8 pytorch ==1.1.0

32 Nov 27, 2022
[ICCV2021] Safety-aware Motion Prediction with Unseen Vehicles for Autonomous Driving

Safety-aware Motion Prediction with Unseen Vehicles for Autonomous Driving Safety-aware Motion Prediction with Unseen Vehicles for Autonomous Driving

Xuanchi Ren 44 Dec 03, 2022
4st place solution for the PBVS 2022 Multi-modal Aerial View Object Classification Challenge - Track 1 (SAR) at PBVS2022

A Two-Stage Shake-Shake Network for Long-tailed Recognition of SAR Aerial View Objects 4st place solution for the PBVS 2022 Multi-modal Aerial View Ob

LinpengPan 5 Nov 09, 2022
Garbage Detection system which will detect objects based on whether it is plastic waste or plastics or just garbage.

Garbage Detection using Yolov5 on Jetson Nano 2gb Developer Kit. Garbage detection system which will detect objects based on whether it is plastic was

Rishikesh A. Bondade 2 May 13, 2022
A Pytorch implementation of CVPR 2021 paper "RSG: A Simple but Effective Module for Learning Imbalanced Datasets"

RSG: A Simple but Effective Module for Learning Imbalanced Datasets (CVPR 2021) A Pytorch implementation of our CVPR 2021 paper "RSG: A Simple but Eff

120 Dec 12, 2022
Deeprl - Standard DQN and dueling network for simple games

DeepRL This code implements the standard deep Q-learning and dueling network with experience replay (memory buffer) for playing simple games. DQN algo

Yao Zhou 6 Apr 12, 2020
Image-generation-baseline - MUGE Text To Image Generation Baseline

MUGE Text To Image Generation Baseline Requirements and Installation More detail

23 Oct 17, 2022
[Official] Exploring Temporal Coherence for More General Video Face Forgery Detection(ICCV 2021)

Exploring Temporal Coherence for More General Video Face Forgery Detection(FTCN) Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, Fang Wen Accepted b

57 Dec 28, 2022
PyTorch implementation of our CVPR2021 (oral) paper "Prototype Augmentation and Self-Supervision for Incremental Learning"

PASS - Official PyTorch Implementation [CVPR2021 Oral] Prototype Augmentation and Self-Supervision for Incremental Learning Fei Zhu, Xu-Yao Zhang, Chu

67 Dec 27, 2022
Official PyTorch Implementation of "Self-supervised Auxiliary Learning with Meta-paths for Heterogeneous Graphs". NeurIPS 2020.

Self-supervised Auxiliary Learning with Meta-paths for Heterogeneous Graphs This repository is the implementation of SELAR. Dasol Hwang* , Jinyoung Pa

MLV Lab (Machine Learning and Vision Lab at Korea University) 48 Nov 09, 2022
Awesome Graph Classification - A collection of important graph embedding, classification and representation learning papers with implementations.

A collection of graph classification methods, covering embedding, deep learning, graph kernel and factorization papers

Benedek Rozemberczki 4.5k Jan 01, 2023
2021-MICCAI-Progressively Normalized Self-Attention Network for Video Polyp Segmentation

2021-MICCAI-Progressively Normalized Self-Attention Network for Video Polyp Segmentation Authors: Ge-Peng Ji*, Yu-Cheng Chou*, Deng-Ping Fan, Geng Che

Ge-Peng Ji (Daniel) 85 Dec 30, 2022
prior-based-losses-for-medical-image-segmentation

Repository for papers: Benchmark: Effect of Prior-based Losses on Segmentation Performance: A Benchmark Midl: A Surprisingly Effective Perimeter-based

Rosana EL JURDI 9 Sep 07, 2022
Autoencoders pretraining using clustering

Autoencoders pretraining using clustering

IITiS PAN 2 Dec 16, 2021
TensorFlow implementation of AlexNet and its training and testing on ImageNet ILSVRC 2012 dataset

AlexNet training on ImageNet LSVRC 2012 This repository contains an implementation of AlexNet convolutional neural network and its training and testin

Matteo Dunnhofer 161 Nov 25, 2022
Python interface for the DIGIT tactile sensor

DIGIT-INTERFACE Python interface for the DIGIT tactile sensor. For updates and discussions please join the #DIGIT channel at the www.touch-sensing.org

Facebook Research 35 Dec 22, 2022
Semantic Segmentation with SegFormer on Drone Dataset.

SegFormer_Segmentation Semantic Segmentation with SegFormer on Drone Dataset. You can check out the blog on Medium You can also try out the model with

Praneet 8 Oct 20, 2022
Face Recognition plus identification simply and fast | Python

PyFaceDetection Face Recognition plus identification simply and fast Ubuntu Setup sudo pip3 install numpy sudo pip3 install cmake sudo pip3 install dl

Peyman Majidi Moein 16 Sep 22, 2022
The code for MM2021 paper "Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning"

The Code for MM2021 paper "Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning" Setting up and using the repo Get the dataset. Follow

4 Apr 20, 2022
Time Series Cross-Validation -- an extension for scikit-learn

TSCV: Time Series Cross-Validation This repository is a scikit-learn extension for time series cross-validation. It introduces gaps between the traini

Wenjie Zheng 222 Jan 01, 2023