EMNLP'2021: Can Language Models be Biomedical Knowledge Bases?

Overview

BioLAMA

BioLAMA

BioLAMA is biomedical factual knowledge triples for probing biomedical LMs. The triples are collected and pre-processed from three sources: CTD, UMLS, and Wikidata. Please see our paper Can Language Models be Biomedical Knowledge Bases? (Sung et al., 2021) for more details.

* The dataset for the BioLAMA probe is available at data.tar.gz

Getting Started

After the installation, you can easily try BioLAMA with manual prompts. When a subject is "flu" and you want to probe its symptoms from an LM, the input should be like "Flu has symptom such as [Y]."

# Set MODEL to bert-base-cased for BERT or dmis-lab/biobert-base-cased-v1.2 for BioBERT
MODEL=./RoBERTa-base-PM-Voc/RoBERTa-base-PM-Voc-hf
python ./BioLAMA/cli_demo.py \
    --model_name_or_path ${MODEL}

Result:

Please enter input (e.g., Flu has symptoms such as [Y].):
hepatocellular carcinoma has symptoms such as [Y].
-------------------------
Rank    Prob    Pred
-------------------------
1       0.648   jaundice
2       0.223   abdominal pain
3       0.127   jaundice and ascites
4       0.11    ascites
5       0.086   hepatomegaly
6       0.074   obstructive jaundice
7       0.06    abdominal pain and jaundice
8       0.059   ascites and jaundice
9       0.043   anorexia and jaundice
10      0.042   fever and jaundice
-------------------------
Top1 prediction sentence:
"hepatocellular carcinoma has symptoms such as jaundice."

Quick Link

Installation

# Install torch with conda (please check your CUDA version)
conda create -n BioLAMA python=3.7
conda activate BioLAMA
conda install pytorch=1.8.0 cudatoolkit=10.2 -c pytorch

# Install BioLAMA
git clone https://github.com/dmis-lab/BioLAMA.git
cd BioLAMA
pip install -r requirements.txt

Resources

Models

For BERT and BioBERT, we use checkpoints provided in the Huggingface Hub:

Bio-LM is not provided in the Huggingface Hub. Therefore, we use the Bio-LM checkpoint released in link. Among the various versions of Bio-LMs, we use `RoBERTa-base-PM-Voc-hf'.

wget https://dl.fbaipublicfiles.com/biolm/RoBERTa-base-PM-Voc-hf.tar.gz
tar -xzvf RoBERTa-base-PM-Voc-hf.tar.gz 
rm -rf RoBERTa-base-PM-Voc-hf.tar.gz

Datasets

The dataset will take about 78 MB of space. Download data.tar.gz and uncompress it.

tar -xzvf data.tar.gz
rm -rf data.tar.gz

The directory tree of the data is like:

data
├── ctd
│   ├── entities
│   ├── meta
│   ├── prompts
│   └── triples_processed
│       └── CD1
│           ├── dev.jsonl
│           ├── test.jsonl
│           └── train.jsonl
├── wikidata
│   ├── entities
│   ├── meta
│   ├── prompts
│   └── triples_processed
│       └── P2175
│           ├── dev.jsonl
│           ├── test.jsonl
│           └── train.jsonl
└── umls
    ├── meta
    └── prompts

Important: Triples of UMLS is not provided due to the license. For those who want to probe LMs using triples of UMLS, we provide the pre-processing scripts for UMLS. Please follow this instruction.

Experiments

We provide two ways of probing PLMs with BioLAMA:

Manual Prompt

Manual Prompt probes PLMs using pre-defined manual prompts. The predictions and scores will be logged in '/output'.

# Set TASK to 'ctd' for CTD or 'umls' for UMLS
# Set MODEL to 'bert-base-cased' for BERT or 'dmis-lab/biobert-base-cased-v1.2' for BioBERT
TASK=wikidata
MODEL=./RoBERTa-base-PM-Voc/RoBERTa-base-PM-Voc-hf
PROMPT_PATH=./data/${TASK}/prompts/manual.jsonl
TEST_PATH=./data/${TASK}/triples_processed/*/test.jsonl

python ./BioLAMA/run_manual.py \
    --model_name_or_path ${MODEL} \
    --prompt_path ${PROMPT_PATH} \
    --test_path "${TEST_PATH}" \
    --init_method confidence \
    --iter_method none \
    --num_mask 10 \
    --max_iter 10 \
    --beam_size 5 \
    --batch_size 16 \
    --output_dir ./output/${TASK}_manual

Result:

PID     [email protected]   [email protected]
-------------------------
P2175   9.40    21.11
P2176   22.46   39.75
P2293   2.24    11.43
P4044   9.47    19.47
P780    16.30   37.85
-------------------------
MACRO   11.97   25.92

OptiPrompt

OptiPrompt probes PLMs using embedding-based prompts starting from embeddings of manual prompts. The predictions and scores will be logged in '/output'.

# Set TASK to 'ctd' for CTD or 'umls' for UMLS
# Set MODEL to 'bert-base-cased' for BERT or 'dmis-lab/biobert-base-cased-v1.2' for BioBERT
TASK=wikidata
MODEL=./RoBERTa-base-PM-Voc/RoBERTa-base-PM-Voc-hf
PROMPT_PATH=./data/${TASK}/prompts/manual.jsonl
TRAIN_PATH=./data/${TASK}/triples_processed/*/train.jsonl
DEV_PATH=./data/${TASK}/triples_processed/*/dev.jsonl
TEST_PATH=./data/${TASK}/triples_processed/*/test.jsonl
PROMPT_PATH=./data/${TASK}/prompts/manual.jsonl

python ./BioLAMA/run_optiprompt.py \
    --model_name_or_path ${MODEL} \
    --train_path "${TRAIN_PATH}" \
    --dev_path "${DEV_PATH}" \
    --test_path "${TEST_PATH}" \
    --prompt_path ${PROMPT_PATH} \
    --num_mask 10 \
    --init_method confidence \
    --iter_method none \
    --max_iter 10 \
    --beam_size 5 \
    --batch_size 16 \
    --lr 3e-3 \
    --epochs 10 \
    --seed 0 \
    --prompt_token_len 5 \
    --init_manual_template \
    --output_dir ./output/${TASK}_optiprompt

Result:

PID     [email protected]   [email protected]
-------------------------
P2175   9.47    24.94
P2176   20.14   39.57
P2293   2.90    9.21
P4044   7.53    18.58
P780    12.98   33.43
-------------------------
MACRO   7.28    18.51

Acknowledgement

Parts of the code are modified from genewikiworld, X-FACTR, and OptiPrompt. We appreciate the authors for making their projects open-sourced.

Citations

@inproceedings{sung2021can,
    title={Can Language Models be Biomedical Knowledge Bases},
    author={Sung, Mujeen and Lee, Jinhyuk and Yi, Sean and Jeon, Minji and Kim, Sungdong and Kang, Jaewoo},
    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2021},
}
Owner
DMIS Laboratory - Korea University
Data Mining & Information Systems Laboratory @ Korea University
DMIS Laboratory - Korea University
SimCTG - A Contrastive Framework for Neural Text Generation

A Contrastive Framework for Neural Text Generation Authors: Yixuan Su, Tian Lan,

Yixuan Su 345 Jan 03, 2023
Generate text line images for training deep learning OCR model (e.g. CRNN)

Generate text line images for training deep learning OCR model (e.g. CRNN)

532 Jan 06, 2023
A PyTorch Implementation of End-to-End Models for Speech-to-Text

speech Speech is an open-source package to build end-to-end models for automatic speech recognition. Sequence-to-sequence models with attention, Conne

Awni Hannun 647 Dec 25, 2022
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022
A Paper List for Speech Translation

Keyword: Speech Translation, Spoken Language Processing, Natural Language Processing

138 Dec 24, 2022
CredData is a set of files including credentials in open source projects

CredData is a set of files including credentials in open source projects. CredData includes suspicious lines with manual review results and more information such as credential types for each suspicio

Samsung 19 Sep 07, 2022
BERN2: an advanced neural biomedical namedentity recognition and normalization tool

BERN2 We present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by

DMIS Laboratory - Korea University 99 Jan 06, 2023
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023
Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

Expediting Vision Transformers via Token Reorganizations This repository contain

Youwei Liang 101 Dec 26, 2022
A CSRankings-like index for speech researchers

Speech Rankings This project mimics CSRankings to generate an ordered list of researchers in speech/spoken language processing along with their possib

Mutian He 19 Nov 26, 2022
InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective This is the official code base for our ICLR 2021 paper

AI Secure 71 Nov 25, 2022
Multilingual word vectors in 78 languages

Aligning the fastText vectors of 78 languages Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; mean

Babylon Health 1.2k Dec 17, 2022
Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization

Line as a Visual Sentence with LineTR This repository contains the inference code, pretrained model, and demo scripts of the following paper. It suppo

SungHo Yoon 158 Dec 27, 2022
Sploitus - Command line search tool for sploitus.com. Think searchsploit, but with more POCs

Sploitus Command line search tool for sploitus.com. Think searchsploit, but with

watchdog2000 5 Mar 07, 2022
SDL: Synthetic Document Layout dataset

SDL is the project that synthesizes document images. It facilitates multiple-level labeling on document images and can generate in multiple languages.

Sơn Nguyễn 0 Oct 07, 2021
PyTorch implementation of the NIPS-17 paper "Poincaré Embeddings for Learning Hierarchical Representations"

Poincaré Embeddings for Learning Hierarchical Representations PyTorch implementation of Poincaré Embeddings for Learning Hierarchical Representations

Facebook Research 1.6k Dec 29, 2022
PyTranslator é simultaneamente um editor e tradutor de texto com diversos recursos e interface feito com coração e 100% em Python

PyTranslator O Que é e para que serve o PyTranslator? PyTranslator é simultaneamente um editor e tradutor de texto em com interface gráfica que usa a

Elizeu Barbosa Abreu 1 May 12, 2022
Tools to download and cleanup Common Crawl data

cc_net Tools to download and clean Common Crawl as introduced in our paper CCNet. If you found these resources useful, please consider citing: @inproc

Meta Research 483 Jan 02, 2023
This is a MD5 password/passphrase brute force tool

CROWES-PASS-CRACK-TOOl This is a MD5 password/passphrase brute force tool How to install: Do 'git clone https://github.com/CROW31/CROWES-PASS-CRACK-TO

9 Mar 02, 2022