BERTMap: A BERT-Based Ontology Alignment System

About

BERTMap is a BERT-based ontology alignment system that exploits the textual knowledge of ontologies to fine-tune BERT and make predictions. It also incorporates sub-word inverted indices for candidate selection, plus (graph-based) extension and (logic-based) repair modules for mapping refinement.
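
To give an intuition for candidate selection, below is a minimal sketch of a sub-word inverted index; the tokenizer choice, the toy labels, and the overlap-count scoring are illustrative assumptions, not the repository's actual implementation:

from collections import defaultdict
from transformers import AutoTokenizer

# Illustrative tokenizer; BERTMap would use the tokenizer set in config.json.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Toy class labels (IRI -> label); real labels come from the ontologies.
labels = {
    "onto:Heart": "heart",
    "onto:Myocardium": "heart muscle",
    "onto:Femur": "thigh bone",
}

# Inverted index: sub-word token -> classes whose labels contain it.
index = defaultdict(set)
for cls, label in labels.items():
    for tok in tokenizer.tokenize(label):
        index[tok].add(cls)

def candidates(query, limit=50):
    # Rank classes by the number of sub-word tokens shared with the query.
    scores = defaultdict(int)
    for tok in tokenizer.tokenize(query):
        for cls in index.get(tok, ()):
            scores[cls] += 1
    return sorted(scores, key=scores.get, reverse=True)[:limit]

print(candidates("cardiac muscle"))  # e.g. ['onto:Myocardium']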

Essential dependencies

The following packages are necessary, but may not be sufficient, for running BERTMap:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch  # pytorch
pip install cython  # the optimized parser of owlready2 relies on Cython
pip install owlready2  # for managing ontologies
pip install tensorboard  # tensorboard logging (optional)
pip install transformers  # huggingface library
pip install datasets  # huggingface datasets

Running BERTMap

IMPORTANT NOTICE: BERTMap relies on class labels for training, but different ontologies use different annotation properties to define aliases (synonyms), so preprocessing is required to add all synonyms to rdf:label before running BERTMap. The preprocessed ontologies used in our paper, together with their reference mappings, are available in data.zip.
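
As an example of such preprocessing, the sketch below copies synonyms into rdf:label with owlready2; the file names and the synonym property (oboInOwl's hasExactSynonym) are assumptions that vary across ontologies:

from owlready2 import get_ontology, IRIS

# Hypothetical ontology file; replace with your own.
onto = get_ontology("file://./src_onto.owl").load()

# The alias property differs per ontology; oboInOwl's hasExactSynonym is
# only one common choice (the IRIS lookup yields None if it is absent).
syn_prop = IRIS["http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"]

if syn_prop is not None:
    for cls in onto.classes():
        for syn in syn_prop[cls]:  # annotation values for this class
            if syn not in cls.label:
                cls.label.append(syn)  # expose the synonym via rdfs:label

onto.save(file="src_onto_preprocessed.owl")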

Clone the repository and run:

# fine-tune and evaluate bertmap prediction
python run_bertmap.py -c config.json -m bertmap

# mapping extension (-e specifies which mapping set {src, tgt, combined} to extend)
python extend_bertmap.py -c config.json -e src

# evaluate extended bertmap 
python eval_bertmap.py -c config.json -e src

# repair and evaluate final outputs (-t specifies the best validation threshold)
python repair_bertmap.py -c config.json -e src -t 0.999

# baseline models (edit similarity and pretrained bert embeddings)
python run_bertmap.py -c config.json -m nes
python run_bertmap.py -c config.json -m bertembeds

Once the data have been constructed for the first time, the script skips data construction so that all models share the same set of pre-processed data.

The fine-tuning model is implemented with the Hugging Face Trainer, which by default uses all available GPUs. To restrict training to GPUs of specified indices, run (for example):

# only devices 1 and 2 are visible to the script
CUDA_VISIBLE_DEVICES=1,2 python run_bertmap.py -c config.json -m bertmap 

Configurations

The following explains the variables used in config.json for customizing a BERTMap run; a minimal example skeleton is sketched after the list.

  • data:
    • task_dir: directory for saving all the output files.
    • src_onto: source ontology name.
    • tgt_onto: target ontology name.
    • task_suffix: any suffix of the task if needed, e.g. the LargeBio track has 'small' and 'whole'.
    • src_onto_file: source ontology file in .owl format.
    • tgt_onto_file: target ontology file in .owl format.
    • properties: list of textual properties used for constructing semantic data; default is class labels: ["label"].
    • cut: threshold length for the keys of the sub-word inverted index; a key is kept only if its length > cut, default is 0.
  • corpora:
    • sample_rate: number of (soft) negative samples for each positive sample generated in the corpora (not the final fine-tuning data).
    • src2tgt_mappings_file: reference mapping file for evaluation and the semi-supervised learning setting, in .tsv format with columns "Entity1", "Entity2", and "Value".
    • ignored_mappings_file: file in the same .tsv format that stores mappings to be ignored by the evaluator.
    • train_map_ratio: proportion of training mappings to be used in the semi-supervised setting, default is 0.2.
    • val_map_ratio: proportion of validation mappings to be used in the semi-supervised setting, default is 0.1.
    • test_map_ratio: proportion of test mappings to be used in the semi-supervised setting, default is 0.7.
    • io_soft_neg_rate: number of soft negative samples for each positive sample generated in the fine-tuning data at the intra-ontology level.
    • io_hard_neg_rate: number of hard negative samples for each positive sample generated in the fine-tuning data at the intra-ontology level.
    • co_soft_neg_rate: number of soft negative samples for each positive sample generated in the fine-tuning data at the cross-ontology level.
    • depth_threshold: classes at depths larger than this threshold will not be considered in hard negative generation, default is null.
    • depth_strategy: strategy for computing class depths if a threshold is set; default is max, choices are max and min.
  • bert:
    • pretrained_path: local path or Hugging Face model hub path for pretrained BERT, e.g. "emilyalsentzer/Bio_ClinicalBERT" (BioClinicalBERT).
    • tokenizer_path: local path or Hugging Face model hub path for the BERT tokenizer, e.g. "emilyalsentzer/Bio_ClinicalBERT" (BioClinicalBERT).
  • fine-tune:
    • include_ids: whether or not to include identity synonyms in the positive samples.
    • learning: choice of learning setting: ss (semi-supervised) or us (unsupervised).
    • warm_up_ratio: proportion of warm-up steps.
    • max_length: maximum length for the tokenizer (highly important for large tasks!).
    • num_epochs: number of training epochs, default is 3.0.
    • batch_size: batch size for fine-tuning BERT.
    • early_stop: whether or not to apply early stopping (patience has been set to 10), default is false.
    • resume_checkpoint: path to previous checkpoint if any, default is null.
  • map:
    • candidate_limits: list of candidate limits used for mapping computation, suggested values are [25, 50, 100, 150, 200].
    • batch_size: batch size used for mapping computation.
    • nbest: number of top results to be considered.
    • string_match: whether or not to use string matching before other scoring strategies.
    • strategy: strategy for classifier scoring method, default is mean.
  • eval:
    • automatic: whether or not automatically evaluate the mappings.
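
To make the structure concrete, here is a minimal illustrative config.json skeleton assembled from the variables above; values without stated defaults (file names, task names, batch sizes, negative-sampling rates, etc.) are placeholders, not a verified shipped configuration:

{
  "data": {
    "task_dir": "./largebio_task",
    "src_onto": "fma",
    "tgt_onto": "nci",
    "task_suffix": "small",
    "src_onto_file": "./data/fma.small.owl",
    "tgt_onto_file": "./data/nci.small.owl",
    "properties": ["label"],
    "cut": 0
  },
  "corpora": {
    "sample_rate": 10,
    "src2tgt_mappings_file": "./data/fma2nci.tsv",
    "ignored_mappings_file": "./data/ignored.tsv",
    "train_map_ratio": 0.2,
    "val_map_ratio": 0.1,
    "test_map_ratio": 0.7,
    "io_soft_neg_rate": 1,
    "io_hard_neg_rate": 1,
    "co_soft_neg_rate": 2,
    "depth_threshold": null,
    "depth_strategy": "max"
  },
  "bert": {
    "pretrained_path": "emilyalsentzer/Bio_ClinicalBERT",
    "tokenizer_path": "emilyalsentzer/Bio_ClinicalBERT"
  },
  "fine-tune": {
    "include_ids": false,
    "learning": "us",
    "warm_up_ratio": 0.0,
    "max_length": 128,
    "num_epochs": 3.0,
    "batch_size": 32,
    "early_stop": false,
    "resume_checkpoint": null
  },
  "map": {
    "candidate_limits": [25, 50, 100, 150, 200],
    "batch_size": 32,
    "nbest": 1,
    "string_match": true,
    "strategy": "mean"
  },
  "eval": {
    "automatic": true
  }
}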

Should you need any further customizations, especially on the evaluation part, please set eval: automatic to false and use your own evaluation script.
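
For instance, a custom evaluation script could compute precision, recall, and F1 along the following lines, given the .tsv mapping format described above (columns "Entity1", "Entity2", "Value"); the file names and the prediction-file format are assumptions:

import csv

def load_mappings(path):
    # Read a .tsv mapping file with columns Entity1, Entity2, Value.
    with open(path, newline="") as f:
        return {(r["Entity1"], r["Entity2"]) for r in csv.DictReader(f, delimiter="\t")}

# Hypothetical file names; adapt to your task_dir layout.
preds = load_mappings("./largebio_task/predicted_mappings.tsv")
refs = load_mappings("./data/fma2nci.tsv")
ignored = load_mappings("./data/ignored.tsv")

preds -= ignored  # mappings the evaluator should not count either way
refs -= ignored

tp = len(preds & refs)
precision = tp / len(preds) if preds else 0.0
recall = tp / len(refs) if refs else 0.0
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")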

Acknowledgements

The repair module is credited to Ernesto Jiménez-Ruiz et al., and the code can be found here.
