LMs for biomedical KG completion

This repository contains code to run the experiments described in:

Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study (arXiv link)
Rahul Nadkarni, David Wadden, Iz Beltagy, Noah A. Smith, Hannaneh Hajishirzi, Tom Hope

Data

The edge splits we used for our experiments can be downloaded using the following links:

Link	File size
RepoDB, transductive split	11 MB
RepoDB, inductive split	11 MB
Hetionet, transductive split	49 MB
Hetionet, inductive split	49 MB
MSI, transductive split	813 MB
MSI, inductive split	813 MB

Each of these files should be placed in the subgraph directory before running any of the experiment scripts. Please see the README.md file in the subgraph directory for more information on the edge split files. If you would like to recreate the edge splits yourself or construct new ones, use the scripts named script/create_*_dataset.py.
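
To check that the downloads landed where the experiment scripts expect them, a small sketch like the one below lists what is in the subgraph directory along with each file's size in MB, for comparison against the table above. The helper name list_split_files is hypothetical and not part of this repository; the actual file names are described in the README.md in the subgraph directory.

```python
from pathlib import Path


def list_split_files(directory="subgraph"):
    """Return (file name, size in MB) pairs for regular files in `directory`.

    Hypothetical helper for illustration; not part of this repository.
    Sizes use 1 MB = 10**6 bytes.
    """
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Directory not found: {root}")
    # Only regular files are reported; subdirectories are skipped.
    files = sorted(p for p in root.iterdir() if p.is_file())
    return [(p.name, p.stat().st_size / 1e6) for p in files]
```

Calling list_split_files() from the repository root and printing the pairs gives a quick way to spot a missing or truncated download before launching an experiment.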

Environment

The environment.yml file contains all of the necessary packages to use this code. We recommend using Anaconda/Miniconda to set up an environment, which you can do with the command

conda env create -f environment.yml

Entity names and descriptions

The files that contain entity names and descriptions for all of the datasets can be found in the data/processed directory. If you would like to recreate these files yourself, use the per-dataset scripts in data/script.

Pre-tokenization

The main training script for the LMs, src/lm/run.py, can take pre-tokenized entity names and descriptions as input, and several of the training scripts in script/training are set up to do so. If you would like to pre-tokenize text before fine-tuning, follow the instructions in script/pretokenize.py. Alternatively, you can pass one of the .tsv files found in data/processed to the --info_filename argument instead of a file with pre-tokenized text.
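
The idea behind pre-tokenization is to run the tokenizer over each entity's text once and cache the result, rather than re-tokenizing on every training run. The sketch below illustrates that idea only. It assumes each row of the .tsv is an entity ID, a name, and an optional description separated by tabs, and that tokenize is any callable mapping a string to a list of token IDs. The function name pretokenize_tsv and the JSON output layout are hypothetical; the format that src/lm/run.py actually expects is defined by script/pretokenize.py.

```python
import csv
import json


def pretokenize_tsv(tsv_path, out_path, tokenize):
    """Tokenize each entity's name and description once and cache as JSON.

    Assumptions (not taken from this repository): rows look like
    `entity_id <TAB> name <TAB> description`, and `tokenize` maps a
    string to a list of token IDs.
    """
    cache = {}
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters in free-text descriptions literal.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed rows
            entity_id, name = row[0], row[1]
            description = row[2] if len(row) > 2 else ""
            cache[entity_id] = {
                "name": tokenize(name),
                "description": tokenize(description),
            }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(cache, f)
    return cache
```

In practice, tokenize could be the encode method of a Hugging Face tokenizer; any callable with the same shape works for trying out the sketch.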

Training

All of the scripts for training models can be found in the src directory. The script for training all KGE models is src/kge/run.py, while the script for training LMs is src/lm/run.py. Our code for training KGE models is heavily based on code from the Open Graph Benchmark GitHub repository. Examples of how to use each of these scripts, including training with Slurm, can be found in the script/training directory, which includes all of the scripts we used to run the experiments reported in the paper.
