OOD Dataset Curator and Benchmark for AI-aided Drug Discovery

🔥 DrugOOD 🔥 : OOD Dataset Curator and Benchmark for AI Aided Drug Discovery

This is the official implementation of the DrugOOD project. The project page is https://drugood.github.io/.

Environment Installation

You can install the conda environment using the drugood.yaml file provided:

git clone https://github.com/tencent-ailab/DrugOOD.git
cd DrugOOD
conda env create --name drugood --file=drugood.yaml
conda activate drugood

Then open the demo notebook at demo/demo.ipynb, which gives a quick walkthrough of how to use DrugOOD.

Demo

For a quick walkthrough of dataset curation and OOD benchmarking with DrugOOD, see demo/demo.ipynb.

Dataset Curator

First, generate the required DrugOOD dataset with our code. The dataset curator currently focuses on generating datasets from ChEMBL. It supports the following two tasks:

  • Ligand Based Affinity Prediction (LBAP).
  • Structure Based Affinity Prediction (SBAP).

For OOD domain annotations, it supports the following five choices:

  • Assay.
  • Scaffold.
  • Size.
  • Protein. (only for SBAP task)
  • Protein Family. (only for SBAP task)

For noise annotations, it supports the following three noise levels, which are implemented by filters of differing strictness:

  • Core.
  • Refined.
  • General.

In addition, because values of different measurement types (e.g. IC50, EC50, Ki, Potency) cannot easily be converted into one another, you need to specify the measurement type when generating a dataset.

How to Run and Reproduce the 96 Datasets?

First, specify the path to the ChEMBL database and the directory in which to save the data in the configuration file: configs/_base_/curators/lbap_defaults.py for the LBAP task or configs/_base_/curators/sbap_defaults.py for the SBAP task.
The setting source_root="YOUR_PATH/chembl_29_sqlite/chembl_29.db" is the path to the ChEMBL 29 SQLite file, and target_root="data/" is the folder in which the generated data is saved.
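Before launching a long curation run, it can help to confirm that the file named in source_root really is a readable SQLite database. The following is a minimal sketch using only the Python standard library; the helper name list_sqlite_tables is hypothetical and not part of DrugOOD.

```python
import sqlite3
from pathlib import Path


def list_sqlite_tables(db_path):
    """Return the sorted table names found in a SQLite file.

    Hypothetical helper (not part of DrugOOD). The database is opened
    read-only, so the check never creates or modifies the file.
    """
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"No database file at {path}")
    # A file: URI with mode=ro asks SQLite for a read-only connection.
    uri = path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    return sorted(name for (name,) in rows)
```

Calling list_sqlite_tables("YOUR_PATH/chembl_29_sqlite/chembl_29.db") should return a non-empty list of table names if the path is correct; a missing file raises FileNotFoundError, and a file that is not a SQLite database raises sqlite3.DatabaseError.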

Note that you can download the original ChEMBL 29 database in SQLite format from http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_29/chembl_29_sqlite.tar.gz.

The built-in configuration files are located in configs/curators/. Here we provide the 96 config files needed to reproduce the 96 datasets in our paper. You can also customize your own datasets by changing the config files.
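The count of 96 is consistent with a full cross product of the options listed above: 3 domains for LBAP and 5 for SBAP, each combined with 3 noise levels and 4 measurement types. The sketch below enumerates that grid. It rests on two assumptions: that the four measurement types named earlier are the complete set, and that file names follow the task_noise_measurement_domain.py pattern seen in the example commands. The protein_family spelling in particular is a guess.

```python
from itertools import product

# Options taken from the lists above; the lowercase spellings are assumed
# to match the config file names.
NOISE_LEVELS = ["core", "refined", "general"]
MEASUREMENTS = ["ic50", "ec50", "ki", "potency"]
DOMAINS = {
    "lbap": ["assay", "scaffold", "size"],
    # "protein_family" is a hypothetical spelling of the Protein Family domain.
    "sbap": ["assay", "scaffold", "size", "protein", "protein_family"],
}


def enumerate_config_names():
    """Build one assumed config file name per option combination."""
    names = []
    for task, domains in DOMAINS.items():
        for noise, measurement, domain in product(
            NOISE_LEVELS, MEASUREMENTS, domains
        ):
            names.append(f"{task}_{noise}_{measurement}_{domain}.py")
    return names


names = enumerate_config_names()
print(len(names))  # prints 96 (36 LBAP + 60 SBAP)
```

Compare the generated names with the actual contents of configs/curators/ before relying on them.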

Run tools/curate.py to generate a dataset. Here are some examples:

Generate a dataset for the LBAP task, with assay as the domain, core as the noise level, and IC50 as the measurement type:

python tools/curate.py --cfg configs/curators/lbap_core_ic50_assay.py

Generate a dataset for the SBAP task, with protein as the domain, refined as the noise level, and EC50 as the measurement type:

python tools/curate.py --cfg configs/curators/sbap_refined_ec50_protein.py
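To generate several datasets in one go, the command above can be wrapped in a small loop. This is a sketch under stated assumptions: it is run from the repository root, and it uses only the tools/curate.py --cfg invocation shown above. The helper names build_curate_command and curate_all are hypothetical.

```python
import subprocess
import sys


def build_curate_command(config_path, python=sys.executable):
    """Return the argument list for one curation run (hypothetical helper)."""
    return [python, "tools/curate.py", "--cfg", str(config_path)]


def curate_all(config_paths, dry_run=True):
    """Print, and optionally execute, one curation command per config."""
    commands = [build_curate_command(path) for path in config_paths]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the batch at the first failing run.
            subprocess.run(command, check=True)
    return commands
```

For example, curate_all(["configs/curators/lbap_core_ic50_assay.py"], dry_run=False) runs the first example above. The default dry_run=True only prints the commands, which is a cheap way to check the paths first.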

Benchmarking SOTA OOD Algorithms

Currently we support 6 different baseline algorithms:

  • ERM
  • IRM
  • GroupDro
  • Coral
  • MixUp
  • DANN

We also support various GNN backbones:

  • GIN
  • GCN
  • Weave
  • SchNet
  • GAT
  • MGCN
  • NF
  • ATi-FPGNN
  • GTransformer

And different backbones for protein sequence modeling:

  • Bert
  • ProteinBert

How to Run?

First, run the following command to install the package:

python setup.py develop

Run the LBAP task with the ERM algorithm:

python tools/train.py configs/algorithms/erm/lbap_core_ec50_assay_erm.py

If you would like to run ERM on other datasets, change the corresponding options inside the above config file. For example, ann_file = 'data/lbap_core_ec50_assay.json' specifies the input data.
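Before pointing ann_file at a newly generated dataset, it can be useful to confirm that the file exists and parses as JSON. The sketch below makes no assumption about the schema DrugOOD writes; it only reports the top-level structure. The helper name summarize_json is hypothetical.

```python
import json


def summarize_json(path, max_items=5):
    """Describe the top-level structure of a JSON file (hypothetical helper).

    The whole file is loaded into memory, so large datasets may take a
    while to parse.
    """
    with open(path, "r", encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        # Map the first few keys to the type name of their values.
        items = list(data.items())[:max_items]
        return {key: type(value).__name__ for key, value in items}
    if isinstance(data, list):
        item_type = type(data[0]).__name__ if data else None
        return {"length": len(data), "item_type": item_type}
    return {"type": type(data).__name__}
```

Calling summarize_json("data/lbap_core_ec50_assay.json") returns a small dictionary describing the file. A FileNotFoundError or json.JSONDecodeError at this point suggests the curation step did not finish.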

Similarly, run the SBAP task with the ERM algorithm:

python tools/train.py configs/algorithms/erm/sbap_core_ec50_assay_erm.py

Reference

😄 If you find this repo useful, please consider citing our paper:

@ARTICLE{2022arXiv220109637J,
    author = {{Ji}, Yuanfeng and {Zhang}, Lu and {Wu}, Jiaxiang and {Wu}, Bingzhe and {Huang}, Long-Kai and {Xu}, Tingyang and {Rong}, Yu and {Li}, Lanqing and {Ren}, Jie and {Xue}, Ding and {Lai}, Houtim and {Xu}, Shaoyong and {Feng}, Jing and {Liu}, Wei and {Luo}, Ping and {Zhou}, Shuigeng and {Huang}, Junzhou and {Zhao}, Peilin and {Bian}, Yatao},
    title = "{DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations}",
    journal = {arXiv e-prints},
    keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Quantitative Biology - Quantitative Methods},
    year = 2022,
    month = jan,
    eid = {arXiv:2201.09637},
    pages = {arXiv:2201.09637},
    archivePrefix = {arXiv},
    eprint = {2201.09637},
    primaryClass = {cs.LG}
}

Disclaimer

This is not an officially supported Tencent product.
