"Inductive Entity Representations from Text via Link Prediction" @ The Web Conference 2021

This repository contains the code used for the experiments in the paper "Inductive entity representations from text via link prediction", presented at The Web Conference, 2021. To refer to our work, please use the following:

@inproceedings{daza2021inductive,
    title = {Inductive Entity Representations from Text via Link Prediction},
    author = {Daniel Daza and Michael Cochez and Paul Groth},
    booktitle = {Proceedings of The Web Conference 2021},
    year = {2021},
    doi = {10.1145/3442381.3450141},
}

In this work, we show how a BERT-based text encoder can be fine-tuned with a link prediction objective, in a graph where entities have an associated textual description. We call the resulting model BLP. There are three interesting properties of a trained BLP model:

  • It can predict a link between entities, even if one or both were not present during training.
  • It produces representations that are useful for a classifier, without retraining the encoder.
  • It improves an information retrieval system by better matching entities and questions about them.
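
As a rough illustration of the first property, the sketch below scores a triple with a TransE-style function over entity vectors computed from text alone. The `encode` function is a hypothetical stand-in (a hashed bag of words) for the fine-tuned BERT encoder, and the relation vector is built by hand; neither comes from this repository. The only point is that any entity with a description can be scored, whether or not it was seen during training.

```python
import zlib

DIM = 8  # toy embedding size, chosen only for illustration


def encode(description):
    """Hypothetical stand-in for the text encoder: a hashed bag of words."""
    vec = [0.0] * DIM
    for word in description.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    return vec


def transe_score(head, rel, tail):
    """TransE-style score: negative L1 distance between head + rel and tail."""
    return -sum(abs(h + r - t) for h, r, t in zip(head, rel, tail))


# Entities are represented by their descriptions, so a new entity needs
# no entry in a lookup table.
head = encode("capital city of France")
tail = encode("country in western Europe")

# Toy relation vector that maps head exactly onto tail (assumption for the demo).
rel = [t - h for h, t in zip(head, tail)]
zero_rel = [0.0] * DIM

# The matching relation can never score lower than an arbitrary one.
print(transe_score(head, rel, tail) >= transe_score(head, zero_rel, tail))
```

In the actual model the encoder and the relation embeddings are learned jointly with the link prediction objective; here they are fixed by hand to keep the example self-contained.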

Usage

Follow the instructions below to reproduce our experiments and to train a model with your own data.

1. Install the requirements

Creating a new environment (e.g. with conda) is recommended. Use requirements.txt to install the dependencies:

conda create -n blp python=3.7
conda activate blp
pip install -r requirements.txt

2. Download the data

Download the required compressed datasets into the data folder:

Download link                 Size (compressed)
UMLS (small graph for tests)  121 KB
WN18RR                        6.6 MB
FB15k-237                     21 MB
Wikidata5M                    1.4 GB
GloVe embeddings              423 MB
DBpedia-Entity                1.3 GB

Then use tar to extract the files, e.g.

tar -xzvf WN18RR.tar.gz

Note that the KG-related files above contain both transductive and inductive splits. Transductive splits are commonly used to evaluate lookup-table methods like ComplEx, while inductive splits contain entities in the test set that are not present in the training set. Files with triples for the inductive case have the ind prefix, e.g. ind-train.txt.
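
The difference between the two kinds of split can be checked with a few lines of code. This is a minimal sketch over in-memory triples; the helper names and the toy triples are ours, not part of the repository.

```python
def entities(triples):
    """Collect every entity that appears as head or tail."""
    found = set()
    for head, _, tail in triples:
        found.add(head)
        found.add(tail)
    return found


def unseen_entities(train, test):
    """Entities that occur in the test triples but not in the training triples."""
    return entities(test) - entities(train)


train = [("a", "r1", "b"), ("b", "r2", "c")]
transductive_test = [("a", "r2", "c")]  # every entity was seen in training
inductive_test = [("d", "r1", "a")]     # "d" never appears in training

print(unseen_entities(train, transductive_test))  # set()
print(unseen_entities(train, inductive_test))     # {'d'}
```

A lookup-table method has no embedding for an entity such as "d", which is why it can only be evaluated on the transductive splits.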

3. Reproduce the experiments

Link prediction

To check that all dependencies are correctly installed, run a quick test on a small graph (this should take less than 1 minute on a GPU):

./scripts/test-umls.sh

The following table is adapted from our paper. The "Script" column contains the name of the script that reproduces the experiment for the corresponding model and dataset. For example, to reproduce the results of BLP-TransE on FB15k-237, run

./scripts/blp-transe-fb15k237.sh

              WN18RR                         FB15k-237                        Wikidata5M
Model         MRR    Script                  MRR    Script                    MRR    Script
GloVe-BOW     0.170  glove-bow-wn18rr.sh     0.172  glove-bow-fb15k237.sh     0.343  glove-bow-wikidata5m.sh
BE-BOW        0.180  bert-bow-wn18rr.sh      0.173  bert-bow-fb15k237.sh      0.362  bert-bow-wikidata5m.sh
GloVe-DKRL    0.115  glove-dkrl-wn18rr.sh    0.112  glove-dkrl-fb15k237.sh    0.282  glove-dkrl-wikidata5m.sh
BE-DKRL       0.139  bert-dkrl-wn18rr.sh     0.144  bert-dkrl-fb15k237.sh     0.322  bert-dkrl-wikidata5m.sh
BLP-TransE    0.285  blp-transe-wn18rr.sh    0.195  blp-transe-fb15k237.sh    0.478  blp-transe-wikidata5m.sh
BLP-DistMult  0.248  blp-distmult-wn18rr.sh  0.146  blp-distmult-fb15k237.sh  0.472  blp-distmult-wikidata5m.sh
BLP-ComplEx   0.261  blp-complex-wn18rr.sh   0.148  blp-complex-fb15k237.sh   0.489  blp-complex-wikidata5m.sh
BLP-SimplE    0.239  blp-simple-wn18rr.sh    0.144  blp-simple-fb15k237.sh    0.493  blp-simple-wikidata5m.sh
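
MRR in the table stands for mean reciprocal rank. The sketch below shows how the metric is computed from the rank of the correct entity in each test query. The function names are ours, and details such as candidate filtering and tie handling are assumptions that may differ from the evaluation code in this repository.

```python
def rank_of(true_score, candidate_scores):
    """1-based rank of the correct entity; higher scores rank first.

    Ties are resolved in favour of the correct entity (an assumption).
    """
    return 1 + sum(1 for score in candidate_scores if score > true_score)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all test queries."""
    if not ranks:
        raise ValueError("at least one rank is required")
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Three queries where the correct entity was ranked 1st, 2nd and 4th.
ranks = [1, 2, 4]
print(round(mean_reciprocal_rank(ranks), 3))  # 0.583

# Two other candidates score higher than the correct entity, so its rank is 3.
print(rank_of(0.7, [0.9, 0.8, 0.1]))  # 3
```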

Entity classification

After training for link prediction, a tensor of embeddings for all entities is computed and saved in a file named ent_emb-[ID].pt, where [ID] is the id of the experiment in the database (we use Sacred to manage experiments). Another file, ents-[ID].pt, contains the entity identifier for every row in the tensor of embeddings.
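
The two files can be combined to look up the embedding of a given entity. The following is a minimal sketch, assuming that ents-[ID].pt holds a one-dimensional tensor of integer identifiers and that both files were saved with torch.save; the helper names are hypothetical.

```python
def row_index(entity_ids):
    """Map each entity identifier to its row in the embedding tensor."""
    return {int(ent): row for row, ent in enumerate(entity_ids)}


def load_entity_embeddings(output_dir, run_id):
    """Load the embeddings and the identifier-to-row map of one experiment."""
    import torch  # imported here so that row_index works without PyTorch

    ent_emb = torch.load(f"{output_dir}/ent_emb-{run_id}.pt", map_location="cpu")
    ents = torch.load(f"{output_dir}/ents-{run_id}.pt", map_location="cpu")
    # Assumption: ents is a 1-D integer tensor, one identifier per row of ent_emb.
    return ent_emb, row_index(ents.tolist())


# The row lookup on its own, with made-up identifiers:
index = row_index([42, 7, 19])
print(index[7])  # 1
```

With the files of an experiment in place, `ent_emb, index = load_entity_embeddings("output", 199)` followed by `ent_emb[index[some_id]]` would return the vector of one entity.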

To ease reproducibility, we provide these tensors, which are required in the entity classification task. Click on the ID, download the file into the output folder, and decompress it. An experiment can be reproduced using the following command:

python train.py node_classification with checkpoint=ID dataset=DATASET

where DATASET is either WN18RR or FB15k-237. For example:

python train.py node_classification with checkpoint=199 dataset=WN18RR

              WN18RR       FB15k-237
Model         Acc.   ID    Acc. Bal.  ID
GloVe-BOW     55.3   219   34.4       293
BE-BOW        60.7   218   28.3       296
GloVe-DKRL    55.5   206   26.6       295
BE-DKRL       48.8   207   30.9       294
BLP-TransE    81.5   199   42.5       297
BLP-DistMult  78.5   200   41.0       298
BLP-ComplEx   78.1   201   38.1       300
BLP-SimplE    83.0   202   45.7       299
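
The two metric columns differ in how they treat class imbalance. The sketch below assumes that "Acc. Bal." denotes balanced accuracy, that is, recall averaged over classes; the labels are made up for the example.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Recall computed per class, then averaged with equal weight per class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# A classifier that always predicts the majority class.
y_true = ["person", "person", "person", "film"]
y_pred = ["person", "person", "person", "person"]

print(accuracy(y_true, y_pred))           # 0.75
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Balanced accuracy penalizes a classifier that ignores rare classes, which plain accuracy does not.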

Information retrieval

This task uses a pre-trained model saved from the link prediction task. For example, if the trained model is blp with transe and it was saved as model.pt, run the information retrieval task with:

python retrieval.py with model=blp rel_model=transe \
checkpoint='output/model.pt'

Using your own data

If you have a knowledge graph where entities have textual descriptions, you can train a BLP model for inductive link prediction and, if you also have labels for entities, for entity classification.

To do this, add a new folder inside the data folder (let's call it my-kg). Store in it a file containing the triples in your KG. This should be a text file with one tab-separated triple per line (let's call it all-triples.tsv).
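
Before generating splits, it can help to verify that every line of the file has exactly three tab-separated fields. This is a minimal sketch; the function name, the toy identifiers and the head, relation, tail field order are assumptions, not taken from the repository.

```python
import os
import tempfile


def read_triples(path):
    """Read one tab-separated triple per line, failing on malformed lines."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # ignore blank lines
            fields = line.split("\t")
            if len(fields) != 3:
                raise ValueError(
                    f"{path}:{line_no}: expected 3 tab-separated fields, "
                    f"got {len(fields)}"
                )
            triples.append(tuple(fields))
    return triples


# Write a tiny example file and read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "all-triples.tsv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("e1\tr1\te2\ne2\tr2\te3\n")
    triples = read_triples(path)

print(triples)  # [('e1', 'r1', 'e2'), ('e2', 'r2', 'e3')]
```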

To generate inductive splits, you can use data/utils.py. If you run

python utils.py drop_entities --file=my-kg/all-triples.tsv

this will generate ind-train.tsv, ind-dev.tsv, ind-test.tsv inside my-kg (see Appendix A in our paper for details on how these are generated). You can then train BLP-TransE with
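
To give an idea of what an inductive split involves, the sketch below holds out a random subset of entities and moves every triple that touches one of them out of the training set. It is a simplified illustration under our own assumptions, not the procedure of data/utils.py or Appendix A, and it omits the dev split.

```python
import random


def inductive_split(triples, held_out_fraction=0.2, seed=0):
    """Split triples so that held-out entities never appear in training."""
    rng = random.Random(seed)
    all_entities = sorted({e for h, _, t in triples for e in (h, t)})
    n_held_out = max(1, int(len(all_entities) * held_out_fraction))
    held_out = set(rng.sample(all_entities, n_held_out))

    # Training triples must not mention any held-out entity.
    train = [(h, r, t) for h, r, t in triples
             if h not in held_out and t not in held_out]
    # Every remaining triple involves at least one unseen entity.
    test = [(h, r, t) for h, r, t in triples
            if h in held_out or t in held_out]
    return train, test, held_out


triples = [
    ("a", "r1", "b"), ("b", "r1", "c"), ("c", "r2", "d"),
    ("d", "r2", "e"), ("e", "r1", "a"), ("a", "r2", "c"),
]
train, test, held_out = inductive_split(triples)

print(len(train) + len(test) == len(triples))  # True
```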

python train.py with dataset='my-kg'

Owner
Daniel Daza
PhD student at VU Amsterdam and the University of Amsterdam, working on machine learning and knowledge graphs.