Author Disambiguation using Knowledge Graph Embeddings with Literals

Related tags

Deep Learningand-kge
Overview

Author Name Disambiguation with Knowledge Graph Embeddings using Literals

This is the repository for the master thesis project on Knowledge Graph Embeddings for Author Name Disambiguation presented by Cristian Santini at Digital Humanities and Digital Knowledge - University of Bologna, with the collaboration of Information Service Engineering - FIZ Karlsruhe, in a.y. 2020/2021.

Datasets

This repository contains notebooks and scripts used for a research on Author Name Disambiguation using Knowledge Graph Embeddings (KGEs) with literals. Due to the unavailability of an established benchmark for evaluating our approach, we extracted two Knowledge Graphs (KGs) from the following publicly available resources: 1) a triplestore available on Zenodo [1] covering information about the journal Scientometrics and modelled according to the OpenCitations Data Model and 2) a publicly available benchmark for author disambiguation available at this link by AMiner.
The Knowledge Graphs extracted are available on Zenodo as OpenCitations-782K [2] and AMiner-534K [3]. Each dataset is organized as a collection of RDF triples stored in TSV format. Literal triples are stored separately in order to train multimodal Knowledge Graph Embedding models.
Each dataset contains a JSON file called and_eval.json which contains a list of publications in the scholarly KGs labelled for evaluating AND algorithms. For the evaluation, while for AMiner-534K the set of publications was already manually annotated by a team of experts, for OC-782K we used the ORCID iDs associated with the authors in the triplestore in order to create an evaluation dataset.

PyKEEN extension

The pykeen-extension directory contains extension files compatible with PyKEEN (Release: v1.4.0.). In this directory we implemented some extensions of the LiteralE model [4] which allow to train multimodal knowledge graph embeddings by also using textual information contained in entity descriptions. Details about the models and on how to install the extension files are available in this README.md file.
The extended library provides an implementation of the following models:

  • DistMultText: an extension of the DistMult model [5] for training KGEs by using entity descriptions attached to entities.
  • ComplExText: an extension of the ComplEx model [6] which allows to train KGEs by using information coming from short text descriptions attached to entities.
  • DistMult_gate_text: an extension of the DistMult model which allows to train KGEs by using information coming from short text descriptions and numeric value associated with entities in KGs.
    Entity descriptions are encoded by using SPECTER [7], a BERT language model for scientific documents.

Code

Scripts used in our research are available in the src directory. The disambiguation.py file in the src/disambiguation folder contains the functions that we developed for carrying author name disambiguation by using knowledge graph embeddings. More specifically it contains:

  • the do_blocking() function, which is used to preliminarily group the authors in the KG into different sub-sets by means of their last name and first initial,
  • the cluster_KGEs() function, which takes as input the output of the do_blocking function and disambiguates the authors by means of Knowledge Graph Embeddings and Hierarchical Agglomerative Clustering.
  • the evaluation functions that we used in our experiments. The src folder also contains the various scripts used for extracting the scholarly KGs from the original sources and creating an evaluation dataset for AND.

Results

Knowledge Graph Embedding Evaluation

For evaluating the quality of our KGE models in representing the components of the studied KGs, OpenCitations-782K and AMiner-534K, we used entity prediction, one of the most common KG-completion tasks. In our experiments, we compared three architectures:

  • A DistMult model trained with only structural triples, i.e. triples connecting just two entities.
  • A DistMultText model which was trained by using titles of scholarly resources, i.e. journals and publications, along with structural triples.
  • A DistMult_gate_text model which was trained using titles and publication dates of scholarly resources in order to leverage the representations of the entities associated with them.
    Hyper-parameters were obtained by doing hyper-parameter optimization with PyKEEN. Details about the configuration files are available in the kge-evaluation folder.

The following table shows the results of our experiments for OC-782K.

Model MR MRR [email protected] [email protected] [email protected] [email protected]
DistMult 59901 0.3570 0.3157 0.3812 0.402 0.4267
DistMultText 60495 0.3568 0.3158 0.3809 0.4013 0.4252
DistMult_gate_text 61812 0.3534 0.3130 0.3767 0.3971 0.4218

The following table shows the results of our experiments for AMiner-534K.

Model MR MRR [email protected] [email protected] [email protected] [email protected]
DistMult 3585 0.3285 0.1938 0.3996 0.4911 0.5940
DistMultText 3474 0.3443 0.2139 0.4123 0.5014 0.6019
DistMult_gate_text 3560 0.3452 0.2163 0.4123 0.5009 0.6028

Author Name Disambiguation

We compared our architecture for Author Name Disambiguation (AND) for KGs with a simple Rule-based method inspired by Caron and Van Eck [8] on OC-782K and with other state-of-the-art graph embedding models on the AMiner benchmark (results taken from [9]). The results are reported below.

Model Precision Recall F1
Caron & Van Eck [8] 84.66 50.20 63.03
DistMult 91.71 67.11 77.50
DistMultText 89.63 66.98 76.67
DistMult_gate_text 82.76 67.59 74.40

Model Precision Recall F1
Zhang and Al Hasan [10] 70.63 59.53 62.81
Zhang et Al. [9] 77.96 63.03 67.79
DistMult 78.36 59.68 63.36
DistMultText 77.24 61.21 64.18
DistMult_gate_text 77.62 59.91 63.07

References

[1] Massari, Arcangelo. (2021). Bibliographic dataset based on Scientometrics, containing provenance information compliant with the OpenCitations Data Model and non disambigued authors (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5151264

[2] Santini, Cristian, Alam, Mehwish, Gesese, Genet Asefa, Peroni, Silvio, Gangemi, Aldo, & Sack, Harald. (2021). OC-782K: Knowledge Graph of "Scientometrics" modelled according to the OpenCitations Data Model [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5675787

[3] Santini, Cristian, Alam, Mehwish, Gesese. Genet Asefa, Peroni, Silvio, Gangemi, Aldo, & Sack, Harald. (2021). AMiner-534K: Knowledge Graph of AMiner benchmark for Author Name Disambiguation [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5675801

[4] Kristiadi A., Khan M.A., Lukovnikov D., Lehmann J., Fischer A. (2019) Incorporating Literals into Knowledge Graph Embeddings. In: Ghidini C. et al. (eds) The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11778. Springer, Cham. https://doi.org/10.1007/978-3-030-30793-6_20.

[5] Yang, B., Yih, W., He, X., Gao, J., & Deng, L. (2015). Embedding Entities and Relations for Learning and Inference in Knowledge Bases. ArXiv:1412.6575 [Cs]. http://arxiv.org/abs/1412.6575

[6] Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., & Bouchard, G. (2016). Complex Embeddings for Simple Link Prediction. ArXiv:1606.06357 [Cs, Stat]. http://arxiv.org/abs/1606.06357

[7] Cohan, A., Feldman, S., Beltagy, I., Downey, D., & Weld, D. S. (2020). SPECTER: Document-level Representation Learning using Citation-informed Transformers. ArXiv:2004.07180 [Cs]. http://arxiv.org/abs/2004.07180

[8] Caron, E., & van Eck, N.-J. (2014). Large scale author name disambiguation using rule-based scoring and clustering: International conference on science and technology indicators. Proceedings of the Science and Technology Indicators Conference 2014, 79–86. http://sti2014.cwts.nl

[9] Zhang, Y., Zhang, F., Yao, P., & Tang, J. (2018). Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1002–1011. https://doi.org/10.1145/3219819.3219859

[10] Zhang, B., & Al Hasan, M. (2017). Name Disambiguation in Anonymized Graphs using Network Embedding. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 1239–1248. https://doi.org/10.1145/3132847.3132873

Acknowledgments

The software and data here available are the result of a master thesis carried in collaboration between the FICLIT department of the University of Bologna and the research department FIZ - Information Service Engineering (ISE) of the Karlsruhe Institute of Technology (KIT). The thesis has been supervised by Prof. Aldo Gangemi and Prof. Silvio Peroni from the University of Bologna, and Prof. Harald Sack, Dr. Mehwish Alam and Genet Asefa Gesese from FIZ-ISE.

PyTorch implementation for our paper Learning Character-Agnostic Motion for Motion Retargeting in 2D, SIGGRAPH 2019

Learning Character-Agnostic Motion for Motion Retargeting in 2D We provide PyTorch implementation for our paper Learning Character-Agnostic Motion for

Rundi Wu 367 Dec 22, 2022
Relative Uncertainty Learning for Facial Expression Recognition

Relative Uncertainty Learning for Facial Expression Recognition The official implementation of the following paper at NeurIPS2021: Title: Relative Unc

35 Dec 28, 2022
PyTorch Implementation for "ForkGAN with SIngle Rainy NIght Images: Leveraging the RumiGAN to See into the Rainy Night"

ForkGAN with Single Rainy Night Images: Leveraging the RumiGAN to See into the Rainy Night By Seri Lee, Department of Engineering, Seoul National Univ

Seri Lee 52 Oct 12, 2022
Code for reproducing experiments in "Improved Training of Wasserstein GANs"

Improved Training of Wasserstein GANs Code for reproducing experiments in "Improved Training of Wasserstein GANs". Prerequisites Python, NumPy, Tensor

Ishaan Gulrajani 2.2k Jan 01, 2023
Python and C++ implementation of "MarkerPose: Robust real-time planar target tracking for accurate stereo pose estimation". Accepted at LXCV @ CVPR 2021.

MarkerPose: Robust real-time planar target tracking for accurate stereo pose estimation This is a PyTorch and LibTorch implementation of MarkerPose: a

Jhacson Meza 47 Nov 18, 2022
Subpopulation detection in high-dimensional single-cell data

PhenoGraph for Python3 PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph ("network") repr

Dana Pe'er Lab 42 Sep 05, 2022
A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks

A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks without the use of any outside machine learning libraries - all from scratch.

Kordel K. France 2 Nov 14, 2022
Effect of Different Encodings and Distance Functions on Quantum Instance-based Classifiers

Effect of Different Encodings and Distance Functions on Quantum Instance-based Classifiers The repository contains the code to reproduce the experimen

Alessandro Berti 4 Aug 24, 2022
Hyperopt for solving CIFAR-100 with a convolutional neural network (CNN) built with Keras and TensorFlow, GPU backend

Hyperopt for solving CIFAR-100 with a convolutional neural network (CNN) built with Keras and TensorFlow, GPU backend This project acts as both a tuto

Guillaume Chevalier 103 Jul 22, 2022
Temporally Coherent GAN SIGGRAPH project.

TecoGAN This repository contains source code and materials for the TecoGAN project, i.e. code for a TEmporally COherent GAN for video super-resolution

Duc Linh Nguyen 2 Jan 18, 2022
Let's Git - Versionsverwaltung & Open Source Hausaufgabe

Let's Git - Versionsverwaltung & Open Source Hausaufgabe Herzlich Willkommen zu dieser Hausaufgabe für unseren MOOC: Let's Git! Wir hoffen, dass Du vi

1 Dec 13, 2021
Repository of best practices for deep learning in Julia, inspired by fastai

FastAI Docs: Stable | Dev FastAI.jl is inspired by fastai, and is a repository of best practices for deep learning in Julia. Its goal is to easily ena

FluxML 532 Jan 02, 2023
git《FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding》(CVPR 2021) GitHub: [fig8]

FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding (CVPR 2021) This repo contains the implementation of our state-of-the-art fewshot ob

233 Dec 29, 2022
Experimental code for paper: Generative Adversarial Networks as Variational Training of Energy Based Models

Experimental code for paper: Generative Adversarial Networks as Variational Training of Energy Based Models, under review at ICLR 2017 requirements: T

Shuangfei Zhai 18 Mar 05, 2022
Re-implememtation of MAE (Masked Autoencoders Are Scalable Vision Learners) using PyTorch.

mae-repo PyTorch re-implememtation of "masked autoencoders are scalable vision learners". In this repo, it heavily borrows codes from codebase https:/

Peng Qiao 1 Dec 14, 2021
Symbolic Music Generation with Diffusion Models

Symbolic Music Generation with Diffusion Models Supplementary code release for our work Symbolic Music Generation with Diffusion Models. Installation

Magenta 119 Jan 07, 2023
An end-to-end library for editing and rendering motion of 3D characters with deep learning [SIGGRAPH 2020]

Deep-motion-editing This library provides fundamental and advanced functions to work with 3D character animation in deep learning with Pytorch. The co

1.2k Dec 29, 2022
Implementation of Pooling by Sliced-Wasserstein Embedding (NeurIPS 2021)

PSWE: Pooling by Sliced-Wasserstein Embedding (NeurIPS 2021) PSWE is a permutation-invariant feature aggregation/pooling method based on sliced-Wasser

Navid Naderializadeh 3 May 06, 2022
AFL binary instrumentation

E9AFL --- Binary AFL E9AFL inserts American Fuzzy Lop (AFL) instrumentation into x86_64 Linux binaries. This allows binaries to be fuzzed without the

242 Dec 12, 2022
Vis2Mesh: Efficient Mesh Reconstruction from Unstructured Point Clouds of Large Scenes with Learned Virtual View Visibility ICCV2021

Vis2Mesh This is the offical repository of the paper: Vis2Mesh: Efficient Mesh Reconstruction from Unstructured Point Clouds of Large Scenes with Lear

71 Dec 25, 2022