A project for developing transformer-based models for clinical relation extraction

Overview

Clinical Relation Extraction with Transformers

Aim

This package enables researchers to easily apply state-of-the-art transformer models for extracting relations from clinical notes. No prior knowledge of transformers is required. We handle the whole process, from data preprocessing to training to prediction.

Dependency

The package is built on top of the Transformers library developed by HuggingFace. We provide requirement.txt to specify the packages required to run the project.

Background

Our training strategy is inspired by the paper https://arxiv.org/abs/1906.03158. We only support a train-dev mode, but you can run 5-fold cross-validation yourself by training once per fold.

Available models

  • BERT
  • XLNet
  • RoBERTa
  • ALBERT
  • DeBERTa
  • Longformer

We will keep adding new models.

Usage and examples

  • data format

See the sample_data directory (train.tsv and test.tsv) for the training and test data format.

The sample data is a small subset of the data prepared from the 2018 UMass MADE 1.0 challenge corpus.

# data format: tsv file with 8 columns:
1. relation_type: adverse
2. sentence_1: ALLERGIES : [s1] Penicillin [e1] .
3. sentence_2: [s2] ALLERGIES [e2] : Penicillin .
4. entity_type_1: Drug
5. entity_type_2: ADE
6. entity_id_1: T1
7. entity_id_2: T2
8. file_id: 13_10

note:
1) the entity between [s1] and [e1] is the first entity in the relation; the second entity in the relation is between [s2] and [e2]
2) even if the two entities occur in the same sentence, they still must be provided separately as sentence_1 and sentence_2
3) in test.tsv you can set all labels to neg, no_relation, or anything else, because the labels are not used during prediction
4) we recommend evaluating test performance in a separate process based on the predictions (see the post-processing section below)
5) we recommend using the official evaluation scripts so that the reported results are reliable
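
For a quick sanity check of the format, the sketch below loads a TSV with pandas using the 8 columns listed above; the column names and the assumption that the file has no header row are ours for illustration, not something the package requires, so adjust as needed.

# Sketch only: inspect a train/test TSV against the 8-column layout above.
# Assumes no header row; column names are illustrative, not required by the package.
import pandas as pd

COLUMNS = [
    "relation_type", "sentence_1", "sentence_2",
    "entity_type_1", "entity_type_2",
    "entity_id_1", "entity_id_2", "file_id",
]

df = pd.read_csv("sample_data/train.tsv", sep="\t", header=None, names=COLUMNS)

assert df.shape[1] == 8, "expected 8 tab-separated columns"
assert df["sentence_1"].str.contains(r"\[s1\].*\[e1\]").all(), "missing [s1]/[e1] tags"
assert df["sentence_2"].str.contains(r"\[s2\].*\[e2\]").all(), "missing [s2]/[e2] tags"

print(df["relation_type"].value_counts())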
  • preprocess data (see the preprocess.ipynb notebook for more details on usage)

We do not provide a universal script for generating training and test data.

Instead, we provide a Jupyter notebook that preprocesses the 2018 n2c2 data as an example.

You can follow this example to generate your own dataset.

  • special tags

We use 4 special tags to mark the two entities in a relation.

# the default tags defined in the repo are

EN1_START = "[s1]"
EN1_END = "[e1]"
EN2_START = "[s2]"
EN2_END = "[e2]"

If you need to customize these tags, you can change them in config.py.
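
As an illustration of how these tags are used, the hypothetical helper below wraps entity spans (given character offsets) to produce sentence_1 / sentence_2 strings matching the sample data above; the real preprocessing lives in preprocess.ipynb, so treat this only as a sketch.

# Hypothetical helper (not from the package): wrap entity spans with the default tags.
EN1_START, EN1_END = "[s1]", "[e1]"
EN2_START, EN2_END = "[s2]", "[e2]"

def tag_entity(text, start, end, open_tag, close_tag):
    """Insert open/close tags around the character span [start, end)."""
    return f"{text[:start]}{open_tag} {text[start:end]} {close_tag}{text[end:]}"

sent = "ALLERGIES : Penicillin ."
# offsets are illustrative: entity 1 = "Penicillin", entity 2 = "ALLERGIES"
sentence_1 = tag_entity(sent, 12, 22, EN1_START, EN1_END)
sentence_2 = tag_entity(sent, 0, 9, EN2_START, EN2_END)
print(sentence_1)  # ALLERGIES : [s1] Penicillin [e1] .
print(sentence_2)  # [s2] ALLERGIES [e2] : Penicillin .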
  • training

Please refer to the wiki page for details of all parameter flags.

export CUDA_VISIBLE_DEVICES=1
data_dir=./sample_data
nmd=./new_model
pof=./predictions.txt
log=./log.txt

# NOTE: more options are available; check our wiki for more information
python ./src/relation_extraction.py \
		--model_type bert \
		--data_format_mode 0 \
		--classification_scheme 1 \
		--pretrained_model bert-base-uncased \
		--data_dir $data_dir \
		--new_model_dir $nmd \
		--predict_output_file $pof \
		--overwrite_model_dir \
		--seed 13 \
		--max_seq_length 256 \
		--cache_data \
		--do_train \
		--do_lower_case \
		--train_batch_size 4 \
		--eval_batch_size 4 \
		--learning_rate 1e-5 \
		--num_train_epochs 3 \
		--gradient_accumulation_steps 1 \
		--do_warmup \
		--warmup_ratio 0.1 \
		--weight_decay 0 \
		--max_num_checkpoints 1 \
		--log_file $log
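
The command above trains on a single train/dev split. If you want the 5-fold cross-validation mentioned in the Background section, you have to drive it yourself; the sketch below assumes you have already split your corpus into fold_0 ... fold_4 directories, each with its own train.tsv and test.tsv (an assumed layout, not something the package creates), and simply reruns the same command per fold.

# External 5-fold CV driver (sketch). Assumes ./cv_data/fold_0 ... fold_4 each
# contain their own train.tsv and test.tsv; the flags mirror the command above.
import subprocess

for fold in range(5):
    data_dir = f"./cv_data/fold_{fold}"      # assumed directory layout
    model_dir = f"./cv_models/fold_{fold}"
    cmd = [
        "python", "./src/relation_extraction.py",
        "--model_type", "bert",
        "--data_format_mode", "0",
        "--classification_scheme", "1",
        "--pretrained_model", "bert-base-uncased",
        "--data_dir", data_dir,
        "--new_model_dir", model_dir,
        "--predict_output_file", f"{model_dir}/predictions.txt",
        "--overwrite_model_dir",
        "--seed", "13",
        "--max_seq_length", "256",
        "--do_train", "--do_predict", "--do_lower_case",
        "--train_batch_size", "4",
        "--eval_batch_size", "4",
        "--learning_rate", "1e-5",
        "--num_train_epochs", "3",
        "--do_warmup", "--warmup_ratio", "0.1",
        "--max_num_checkpoints", "1",
        "--log_file", f"{model_dir}/log.txt",
    ]
    subprocess.run(cmd, check=True)  # evaluate each fold's predictions afterwards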
  • prediction
export CUDA_VISIBLE_DEVICES=1
data_dir=./sample_data
nmd=./new_model
pof=./predictions.txt
log=./log.txt

# at minimum, set data_dir, new_model_dir, model_type, data_format_mode, eval_batch_size, and log_file
python ./src/relation_extraction.py \
		--model_type bert \
		--data_format_mode 0 \
		--classification_scheme 1 \
		--pretrained_model bert-base-uncased \
		--data_dir $data_dir \
		--new_model_dir $nmd \
		--predict_output_file $pof \
		--overwrite_model_dir \
		--seed 13 \
		--max_seq_length 256 \
		--cache_data \
		--do_predict \
		--do_lower_case \
		--eval_batch_size 4 \
		--log_file $log
  • post-processing (currently we only support conversion to the brat format)
# see --help for more information
data_dir=./sample_data
pof=./predictions.txt

python src/data_processing/post_processing.py \
		--mode mul \
		--predict_result_file $pof \
		--entity_data_dir ./test_data_entity_only \
		--test_data_file ${data_dir}/test.tsv \
		--brat_result_output_dir ./brat_output
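
The post-processing step writes brat standoff annotations into brat_result_output_dir. As a rough illustration, a predicted relation typically appears in the corresponding .ann file as a line such as R1	adverse Arg1:T1 Arg2:T2 (relation ID, a tab, then the relation type and its two entity arguments), where T1/T2 are the entity IDs from the entity-only annotation files; the exact IDs depend on your data.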

Using a JSON file for experiment configuration instead of the command line

  • To simplify using the package, we support configuration via a JSON file
  • With JSON, you can define all parameters in a separate file instead of passing them on the command line
  • config_experiment_sample.json is a sample file you can follow to develop your own
  • To run an experiment with a JSON config, follow run_json.sh
export CUDA_VISIBLE_DEVICES=1

python ./src/relation_extraction_json.py \
		--config_json "./config_experiment_sample.json"
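
As a rough illustration of what such a config can look like, the snippet below builds a dictionary mirroring the command-line flags used earlier and dumps it to JSON; the authoritative key names are the ones in config_experiment_sample.json, so treat the names and the output file name here as assumptions.

# Illustrative only: generate a JSON config mirroring the CLI flags shown above.
# The authoritative key names are in config_experiment_sample.json.
import json

config = {
    "model_type": "bert",
    "data_format_mode": 0,
    "classification_scheme": 1,
    "pretrained_model": "bert-base-uncased",
    "data_dir": "./sample_data",
    "new_model_dir": "./new_model",
    "predict_output_file": "./predictions.txt",
    "overwrite_model_dir": True,
    "seed": 13,
    "max_seq_length": 256,
    "cache_data": True,
    "do_train": True,
    "do_lower_case": True,
    "train_batch_size": 4,
    "eval_batch_size": 4,
    "learning_rate": 1e-5,
    "num_train_epochs": 3,
    "gradient_accumulation_steps": 1,
    "do_warmup": True,
    "warmup_ratio": 0.1,
    "weight_decay": 0,
    "max_num_checkpoints": 1,
    "log_file": "./log.txt",
}

with open("my_experiment_config.json", "w") as f:
    json.dump(config, f, indent=2)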

Baseline (baseline directory)

  • We also implemented some baselines for relation extraction using traditional machine learning approaches
  • The baselines are for comparison only
  • The baselines are based on SVM
  • The extracted features may not be optimal for every dataset (they cover the most commonly used lexical and semantic features)
  • See baseline/run.sh for an example

Issues

Please raise an issue if you run into problems.

Citation

Please cite our paper:

# We have a preprint at
https://arxiv.org/abs/2107.08957

Clinical Pre-trained Transformer Models

We have a series of transformer models pre-trained on MIMIC-III. You can find them here:

Comments
  • prediction on large corpus

    The package will have issues with prediction on a large corpus (e.g., thousands of notes). We need to develop a batch process to avoid OOM issues, and possibly parallelization to speed things up.

    enhancement 
    opened by bugface 2
  • Not able to get the prediction for Test.csv

    Hi

    I am just trying to run the code to get the predictions for the test.csv. I am trying with the pre-trained model at https://transformer-models.s3.amazonaws.com/mimiciii_bert_10e_128b.zip.

    While running the code I am getting the error: AttributeError: 'BertConfig' object has no attribute 'tags'


    opened by vikasgoel2000 1
  • Binary classification with BCELoss or Focal Loss

    For binary mode we currently still use CrossEntropyLoss, but BCELoss is designed for binary classification. We need to add options to use BCELoss or Focal Loss in binary mode.

    enhancement 
    opened by bugface 1
  • Confused on usage

    The input to the prediction model is a .tsv file where the first column is the relation type. So it is unclear to me why we need the model to predict the relation type again.

    Am I misunderstanding? For predicting relations for new data, will the first column be autofilled with NonRel?

    opened by jiwonjoung 1
  • roberta question

    Thank you for providing and actively maintaining this repository. I'm trying to run roberta on the sample data, but I'm encountering an error (I have tested bert and deberta, and both worked well without any error).

    Here is the code I ran

    export CUDA_VISIBLE_DEVICES=1
    data_dir=./sample_data
    nmd=./roberta_re_model
    pof=./roberta_re_predictions.txt
    log=./roberta_re_log.txt
    
    python ./src/relation_extraction.py \
    		--model_type roberta \
    		--data_format_mode 0 \
    		--classification_scheme 2 \
    		--pretrained_model roberta-base \
    		--data_dir $data_dir \
    		--new_model_dir $nmd \
    		--predict_output_file $pof \
    		--overwrite_model_dir \
    		--seed 13 \
    		--max_seq_length 256 \
    		--cache_data \
    		--do_train \
    		--do_lower_case \
                    --do_predict \
    		--train_batch_size 4 \
    		--eval_batch_size 4 \
    		--learning_rate 1e-5 \
    		--num_train_epochs 3 \
    		--gradient_accumulation_steps 1 \
    		--do_warmup \
    		--warmup_ratio 0.1 \
    		--weight_decay 0 \
    		--max_num_checkpoints 1 \
    		--log_file $log \
    

    but I ran into this error:

    2022-05-12 06:07:50 - Transformer_Relation_Extraction - ERROR - Training error:
    Traceback (most recent call last):
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/relation_extraction.py", line 59, in app
        task_runner.train()
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/task.py", line 100, in train
        batch_output = self.model(**batch_input)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/models.py", line 159, in forward
        output_hidden_states=output_hidden_states
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 849, in forward
        past_key_values_length=past_key_values_length,
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 133, in forward
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 160, in forward
        self.norm_type, self.scale_grad_by_freq, self.sparse)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2183, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/relation_extraction.py", line 181, in <module>
        app(args)
      File "/content/drive/MyDrive/Colab Notebooks/ClinicalTransformer/src/relation_extraction.py", line 63, in app
        raise RuntimeError()
    RuntimeError
    

    Any help would be much appreciated. Thanks for your project!

    opened by jeonge1 4
  • save trained model as a RE model and a core model with only transformer layers

    We need to separately save the whole RE model and a core transformer model containing only the transformer layers, so that the model can be reused for other training tasks.

    enhancement 
    opened by bugface 0
  • ELECTRA and GPT2 support

    Hi,

    I'm wondering how to add ELECTRA and GPT2 support to this module.

    Neither ELECTRA nor GPT2 has a pooled output, unlike BERT/RoBERTa-based models.

    I noticed in models.py the model is implemented as follows:

            outputs = self.roberta(
                input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                position_ids=position_ids,
                head_mask=head_mask,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states
            )
    
            pooled_output = outputs[1]
            seq_output = outputs[0]
            logits = self.output2logits(pooled_output, seq_output, input_ids)
    
            return self.calc_loss(logits, outputs, labels)
    

    There is no pooled_output for ELECTRA/GPT2 sequence classification models; only seq_output is present in the outputs variable.

    How to get around this limitation and get a working version of ELECTRA/GPT2? Thank you!

    opened by Stochastic-Adventure 2
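
A note on the question above: one common workaround (not an official ELECTRA/GPT2 integration in this package) is to derive a substitute for pooled_output from seq_output, for example the first token's hidden state or a masked mean over tokens, and pass that into output2logits. The sketch below only illustrates the idea; the function name and pooling strategies are assumptions.

# Sketch only: derive a pooled representation when the backbone has no pooler.
# Assumes seq_output is the last hidden state of shape (batch, seq_len, hidden).
import torch

def pool_without_pooler(seq_output, attention_mask, strategy="first"):
    if strategy == "first":
        # use the first token's hidden state ([CLS]-like position)
        return seq_output[:, 0, :]
    # otherwise: masked mean over real (non-padding) tokens
    mask = attention_mask.unsqueeze(-1).type_as(seq_output)
    return (seq_output * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# inside a forward() analogous to the RoBERTa snippet above (illustrative):
# seq_output = outputs[0]
# pooled_output = pool_without_pooler(seq_output, attention_mask)
# logits = self.output2logits(pooled_output, seq_output, input_ids)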