Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

Related tags

Text Data & NLPNERDA
Overview

NERDA

Build status codecov PyPI PyPI - Downloads License

Not only is NERDA a mesmerizing muppet-like character. NERDA is also a python package, that offers a slick easy-to-use interface for fine-tuning pretrained transformers for Named Entity Recognition (=NER) tasks.

You can also utilize NERDA to access a selection of precooked NERDA models, that you can use right off the shelf for NER tasks.

NERDA is built on huggingface transformers and the popular pytorch framework.

Installation guide

NERDA can be installed from PyPI with

pip install NERDA

If you want the development version then install directly from GitHub.

Named-Entity Recogntion tasks

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.1

Example Task:

Task

Identify person names and organizations in text:

Jim bought 300 shares of Acme Corp.

Solution

Named Entity Type
'Jim' Person
'Acme Corp.' Organization

Read more about NER on Wikipedia.

Train Your Own NERDA Model

Say, we want to fine-tune a pretrained Multilingual BERT transformer for NER in English.

Load package.

from NERDA.models import NERDA

Instantiate a NERDA model (with default settings) for the CoNLL-2003 English NER data set.

from NERDA.datasets import get_conll_data
model = NERDA(dataset_training = get_conll_data('train'),
              dataset_validation = get_conll_data('valid'),
              transformer = 'bert-base-multilingual-uncased')

By default the network architecture is analogous to that of the models in Hvingelby et al. 2020.

The model can then be trained/fine-tuned by invoking the train method, e.g.

model.train()

Note: this will take some time depending on the dimensions of your machine (if you want to skip training, you can go ahead and use one of the models, that we have already precooked for you in stead).

After the model has been trained, the model can be used for predicting named entities in new texts.

# text to identify named entities in.
text = 'Old MacDonald had a farm'
model.predict_text(text)
([['Old', 'MacDonald', 'had', 'a', 'farm']], [['B-PER', 'I-PER', 'O', 'O', 'O']])

This means, that the model identified 'Old MacDonald' as a PERson.

Please note, that the NERDA model configuration above was instantiated with all default settings. You can however customize your NERDA model in a lot of ways:

  • Use your own data set (finetune a transformer for any given language)
  • Choose whatever transformer you like
  • Set all of the hyperparameters for the model
  • You can even apply your own Network Architecture

Read more about advanced usage of NERDA in the detailed documentation.

Use a Precooked NERDA model

We have precooked a number of NERDA models for Danish and English, that you can download and use right off the shelf.

Here is an example.

Instantiate a multilingual BERT model, that has been finetuned for NER in Danish, DA_BERT_ML.

from NERDA.precooked import DA_BERT_ML()
model = DA_BERT_ML()

Down(load) network from web:

model.download_network()
model.load_network()

You can now predict named entities in new (Danish) texts

# (Danish) text to identify named entities in:
# 'Jens Hansen har en bondegård' = 'Old MacDonald had a farm'
text = 'Jens Hansen har en bondegård'
model.predict_text(text)
([['Jens', 'Hansen', 'har', 'en', 'bondegård']], [['B-PER', 'I-PER', 'O', 'O', 'O']])

List of Precooked Models

The table below shows the precooked NERDA models publicly available for download.

Model Language Transformer Dataset F1-score
DA_BERT_ML Danish Multilingual BERT DaNE 82.8
DA_ELECTRA_DA Danish Danish ELECTRA DaNE 79.8
EN_BERT_ML English Multilingual BERT CoNLL-2003 90.4
EN_ELECTRA_EN English English ELECTRA CoNLL-2003 89.1

F1-score is the micro-averaged F1-score across entity tags and is evaluated on the respective test sets (that have not been used for training nor validation of the models).

Note, that we have not spent a lot of time on actually fine-tuning the models, so there could be room for improvement. If you are able to improve the models, we will be happy to hear from you and include your NERDA model.

Model Performance

The table below summarizes the performance (F1-scores) of the precooked NERDA models.

Level DA_BERT_ML DA_ELECTRA_DA EN_BERT_ML EN_ELECTRA_EN
B-PER 93.8 92.0 96.0 95.1
I-PER 97.8 97.1 98.5 97.9
B-ORG 69.5 66.9 88.4 86.2
I-ORG 69.9 70.7 85.7 83.1
B-LOC 82.5 79.0 92.3 91.1
I-LOC 31.6 44.4 83.9 80.5
B-MISC 73.4 68.6 81.8 80.1
I-MISC 86.1 63.6 63.4 68.4
AVG_MICRO 82.8 79.8 90.4 89.1
AVG_MACRO 75.6 72.8 86.3 85.3

'NERDA'?

'NERDA' originally stands for 'Named Entity Recognition for DAnish'. However, this is somewhat misleading, since the functionality is no longer limited to Danish. On the contrary it generalizes to all other languages, i.e. NERDA supports fine-tuning of transformers for NER tasks for any arbitrary language.

Background

NERDA is developed as a part of Ekstra Bladet’s activities on Platform Intelligence in News (PIN). PIN is an industrial research project that is carried out in collaboration between the Technical University of Denmark, University of Copenhagen and Copenhagen Business School with funding from Innovation Fund Denmark. The project runs from 2020-2023 and develops recommender systems and natural language processing systems geared for news publishing, some of which are open sourced like NERDA.

Shout-outs

Read more

The detailed documentation for NERDA including code references and extended workflow examples can be accessed here.

Contact

We hope, that you will find NERDA useful.

Please direct any questions and feedbacks to us!

If you want to contribute (which we encourage you to), open a PR.

If you encounter a bug or want to suggest an enhancement, please open an issue.

Owner
Ekstra Bladet
GitHub of Ekstra Bladet Analyse
Ekstra Bladet
An assignment from my grad-level data mining course demonstrating some experience with NLP/neural networks/Pytorch

NLP-Pytorch-Assignment An assignment from my grad-level data mining course (before I started personal projects) demonstrating some experience with NLP

David Thorne 0 Feb 06, 2022
ConvBERT: Improving BERT with Span-based Dynamic Convolution

ConvBERT Introduction In this repo, we introduce a new architecture ConvBERT for pre-training based language model. The code is tested on a V100 GPU.

YITUTech 237 Dec 10, 2022
Host your own GPT-3 Discord bot

GPT3 Discord Bot Host your own GPT-3 Discord bot i'd host and make the bot invitable myself, however GPT3 terms of service prohibit public use of GPT3

[something hillarious here] 8 Jan 07, 2023
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism This repository is the official PyTorch implementation of our AAAI-2022 paper, in

Jinglin Liu 829 Jan 07, 2023
DeBERTa: Decoding-enhanced BERT with Disentangled Attention

DeBERTa: Decoding-enhanced BERT with Disentangled Attention This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Dis

Microsoft 1.2k Jan 03, 2023
ElasticBERT: A pre-trained model with multi-exit transformer architecture.

This repository contains finetuning code and checkpoints for ElasticBERT. Towards Efficient NLP: A Standard Evaluation and A Strong Baseli

fastNLP 48 Dec 14, 2022
texlive expressions for documents

tex2nix Generate Texlive environment containing all dependencies for your document rather than downloading gigabytes of texlive packages. Installation

Jörg Thalheim 70 Dec 26, 2022
End-to-End Speech Processing Toolkit

ESPnet: end-to-end speech processing toolkit system/pytorch ver. 1.0.1 1.1.0 1.2.0 1.3.1 1.4.0 1.5.1 1.6.0 1.7.1 1.8.1 ubuntu18/python3.8/pip ubuntu18

ESPnet 5.9k Jan 03, 2023
Tools and data for measuring the popularity & growth of various programming languages.

growth-data Tools and data for measuring the popularity & growth of various programming languages. Install the dependencies $ pip install -r requireme

3 Jan 06, 2022
A collection of GNN-based fake news detection models.

This repo includes the Pytorch-Geometric implementation of a series of Graph Neural Network (GNN) based fake news detection models. All GNN models are implemented and evaluated under the User Prefere

SafeGraph 251 Jan 01, 2023
Reformer, the efficient Transformer, in Pytorch

Reformer, the Efficient Transformer, in Pytorch This is a Pytorch implementation of Reformer https://openreview.net/pdf?id=rkgNKkHtvB It includes LSH

Phil Wang 1.8k Dec 30, 2022
【原神】自动演奏风物之诗琴的程序

疯物之诗琴 读取midi并自动演奏原神风物之诗琴。 可以自定义配置文件自动调整音符来适配风物之诗琴。 (原神1.4直播那天就开始做了!到现在才能放出来。。) 如何使用 在Release页面中下载打包好的程序和midi压缩包并解压。 双击运行“疯物之诗琴.exe”。 在原神中打开风物之诗琴,软件内输入

435 Jan 04, 2023
Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

Status: Archive (code is provided as-is, no updates expected) Update August 2020: For an example repository that achieves state-of-the-art modeling pe

OpenAI 1.3k Dec 28, 2022
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

Life4 3k Jan 06, 2023
Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS)

Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS) Yoonhyung Lee, Joongbo Shin, Kyomin Jung Abstract: Although early

LEE YOON HYUNG 147 Dec 05, 2022
Black for Python docstrings and reStructuredText (rst).

Style-Doc Style-Doc is Black for Python docstrings and reStructuredText (rst). It can be used to format docstrings (Google docstring format) in Python

Telekom Open Source Software 13 Oct 24, 2022
A fast, efficient universal vector embedding utility package.

Magnitude: a fast, simple vector embedding utility library A feature-packed Python package and vector storage file format for utilizing vector embeddi

Plasticity 1.5k Jan 02, 2023
AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

凌逆战 75 Dec 05, 2022
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering.

SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering. Contents Inst

0 Oct 21, 2021