Implementation of Learning Gradient Fields for Molecular Conformation Generation (ICML 2021).

Related tags

Deep LearningConfGF
Overview

ConfGF


License: MIT

[PDF] | [Slides]

The official implementation of Learning Gradient Fields for Molecular Conformation Generation (ICML 2021 Long talk)

Installation

Install via Conda (Recommended)

# Clone the environment
conda env create -f env.yml

# Activate the environment
conda activate confgf

# Install Library
git clone https://github.com/DeepGraphLearning/ConfGF.git
cd ConfGF
python setup.py install

Install Manually

# Create conda environment
conda create -n confgf python=3.7

# Activate the environment
conda activate confgf

# Install packages
conda install -y -c pytorch pytorch=1.7.0 torchvision torchaudio cudatoolkit=10.2
conda install -y -c rdkit rdkit==2020.03.2.0
conda install -y scikit-learn pandas decorator ipython networkx tqdm matplotlib
conda install -y -c conda-forge easydict
pip install pyyaml

# Install PyTorch Geometric
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu102.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+cu102.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.7.0+cu102.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.7.0+cu102.html
pip install torch-geometric==1.6.3

# Install Library
git clone https://github.com/DeepGraphLearning/ConfGF.git
cd ConfGF
python setup.py install

Dataset

Offical Dataset

The offical raw GEOM dataset is avaiable [here].

Preprocessed dataset

We provide the preprocessed datasets (GEOM, ISO17) in a [google drive folder]. For ISO17 dataset, we use the default split of [GraphDG].

Prepare your own GEOM dataset from scratch (optional)

Download the raw GEOM dataset and unpack it.

tar xvf ~/rdkit_folder.tar.gz -C ~/GEOM

Preprocess the raw GEOM dataset.

python script/process_GEOM_dataset.py --base_path GEOM --dataset_name qm9 --confmin 50 --confmax 500
python script/process_GEOM_dataset.py --base_path GEOM --dataset_name drugs --confmin 50 --confmax 100

The final folder structure will look like this:

GEOM
|___rdkit_folder  # raw dataset
|   |___qm9 # raw qm9 dataset
|   |___drugs # raw drugs dataset
|   |___summary_drugs.json
|   |___summary_qm9.json
|   
|___qm9_processed
|   |___train_data_40k.pkl
|   |___val_data_5k.pkl
|   |___test_data_200.pkl
|   
|___drugs_processed
|   |___train_data_39k.pkl
|   |___val_data_5k.pkl
|   |___test_data_200.pkl
|
iso17_processed
|___iso17_split-0_train_processed.pkl
|___iso17_split-0_test_processed.pkl
|
...

Training

All hyper-parameters and training details are provided in config files (./config/*.yml), and free feel to tune these parameters.

You can train the model with the following commands:

python -u script/train.py --config_path ./config/qm9_default.yml
python -u script/train.py --config_path ./config/drugs_default.yml
python -u script/train.py --config_path ./config/iso17_default.yml

The checkpoint of the models will be saved into a directory specified in config files.

Generation

We provide the checkpoints of three trained models, i.e., qm9_default, drugs_default and iso17_default in a [google drive folder].

You can generate conformations of a molecule by feeding its SMILES into the model:

python -u script/gen.py --config_path ./config/qm9_default.yml --generator ConfGF --smiles c1ccccc1
python -u script/gen.py --config_path ./config/qm9_default.yml --generator ConfGFDist --smiles c1ccccc1

Here we use the models trained on GEOM-QM9 to generate conformations for the benzene. The argument --generator indicates the type of the generator, i.e., ConfGF vs. ConfGFDist. See the ablation study (Table 5) in the original paper for more details.

You can also generate conformations for an entire test set.

python -u script/gen.py --config_path ./config/qm9_default.yml --generator ConfGF \
                        --start 0 --end 200 \

python -u script/gen.py --config_path ./config/qm9_default.yml --generator ConfGFDist \
                        --start 0 --end 200 \

python -u script/gen.py --config_path ./config/drugs_default.yml --generator ConfGF \
                        --start 0 --end 200 \

python -u script/gen.py --config_path ./config/drugs_default.yml --generator ConfGFDist \
                        --start 0 --end 200 \

Here start and end indicate the range of the test set that we want to use. All hyper-parameters related to generation can be set in config files.

Conformations of some drug-like molecules generated by ConfGF are provided below.

Get Results

The results of all benchmark tasks can be calculated based on generated conformations.

We report the results of each task in the following tables. Results of ConfGF and ConfGFDist are re-evaluated based on the current code base, which successfully reproduce the results reported in the original paper. Results of other models are taken directly from the original paper.

Task 1. Conformation Generation

The COV and MAT scores on the GEOM datasets can be calculated using the following commands:

python -u script/get_task1_results.py --input dir_of_QM9_samples --core 10 --threshold 0.5  

python -u script/get_task1_results.py --input dir_of_Drugs_samples --core 10 --threshold 1.25  

Table: COV and MAT scores on GEOM-QM9

QM9 COV-Mean (%) COV-Median (%) MAT-Mean (\AA) MAT-Median (\AA)
ConfGF 91.06 95.76 0.2649 0.2668
ConfGFDist 85.37 88.59 0.3435 0.3548
CGCF 78.05 82.48 0.4219 0.3900
GraphDG 73.33 84.21 0.4245 0.3973
CVGAE 0.09 0.00 1.6713 1.6088
RDKit 83.26 90.78 0.3447 0.2935

Table: COV and MAT scores on GEOM-Drugs

Drugs COV-Mean (%) COV-Median (%) MAT-Mean (\AA) MAT-Median (\AA)
ConfGF 62.54 71.32 1.1637 1.1617
ConfGFDist 49.96 48.12 1.2845 1.2827
CGCF 53.96 57.06 1.2487 1.2247
GraphDG 8.27 0.00 1.9722 1.9845
CVGAE 0.00 0.00 3.0702 2.9937
RDKit 60.91 65.70 1.2026 1.1252

Task 2. Distributions Over Distances

The MMD metrics on the ISO17 dataset can be calculated using the following commands:

python -u script/get_task2_results.py --input dir_of_ISO17_samples

Table: Distributions over distances

Method Single-Mean Single-Median Pair-Mean Pair-Median All-Mean All-Median
ConfGF 0.3430 0.2473 0.4195 0.3081 0.5432 0.3868
ConfGFDist 0.3348 0.2011 0.4080 0.2658 0.5821 0.3974
CGCF 0.4490 0.1786 0.5509 0.2734 0.8703 0.4447
GraphDG 0.7645 0.2346 0.8920 0.3287 1.1949 0.5485
CVGAE 4.1789 4.1762 4.9184 5.1856 5.9747 5.9928
RDKit 3.4513 3.1602 3.8452 3.6287 4.0866 3.7519

Visualizing molecules with PyMol

Start Setup

  1. pymol -R
  2. Display - Background - White
  3. Display - Color Space - CMYK
  4. Display - Quality - Maximal Quality
  5. Display Grid
    1. by object: use set grid_slot, int, mol_name to put the molecule into the corresponding slot
    2. by state: align all conformations in a single slot
    3. by object-state: align all conformations and put them in separate slots. (grid_slot dont work!)
  6. Setting - Line and Sticks - Ball and Stick on - Ball and Stick ratio: 1.5
  7. Setting - Line and Sticks - Stick radius: 0.2 - Stick Hydrogen Scale: 1.0

Show Molecule

  1. To show molecules

    1. hide everything
    2. show sticks
  2. To align molecules: align name1, name2

  3. Convert RDKit mol to Pymol

    from rdkit.Chem import PyMol
    v= PyMol.MolViewer()
    rdmol = Chem.MolFromSmiles('C')
    v.ShowMol(rdmol, name='mol')
    v.SaveFile('mol.pkl')

Make the trajectory for Langevin dynamics

  1. load a sequence of pymol objects named traj*.pkl into the PyMol, where traji.pkl is the i-th conformation in the trajectory.
  2. Join states: join_states mol, traj*, 0
  3. Delete useless object: delete traj*
  4. Movie - Program - State Loop - Full Speed
  5. Export the movie to a sequence of PNG files: File - Export Movie As - PNG Images
  6. Use photoshop to convert the PNG sequence to a GIF with the transparent background.

Citation

Please consider citing the following paper if you find our codes helpful. Thank you!

@inproceedings{shi*2021confgf,
title={Learning Gradient Fields for Molecular Conformation Generation},
author={Shi, Chence and Luo, Shitong and Xu, Minkai and Tang, Jian},
booktitle={International Conference on Machine Learning},
year={2021}
}

Contact

Chence Shi ([email protected])

Owner
MilaGraph
Research group led by Prof. Jian Tang at Mila-Quebec AI Institute (https://mila.quebec/) focusing on graph representation learning and graph neural networks.
MilaGraph
S2s2net - Sentinel-2 Super-Resolution Segmentation Network

S2S2Net Sentinel-2 Super-Resolution Segmentation Network Getting started Install

Wei Ji 10 Nov 10, 2022
Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples

Welcome to the cuQuantum repository! This public repository contains two sets of files related to the NVIDIA cuQuantum SDK: samples: All C/C++ sample

NVIDIA Corporation 147 Dec 27, 2022
Analyzes your GitHub Profile and presents you with a report on how likely you are to become the next MLH Fellow!

Fellowship Prediction GitHub Profile Comparative Analysis Tool Built with BentoML Table of Contents: Features Disclaimer Technologies Used Contributin

Damir Temir 51 Dec 29, 2022
Linear Variational State Space Filters

Linear Variational State Space Filters To set up the environment, use the provided scripts in the docker/ folder to build and run the codebase inside

0 Dec 13, 2021
A Python script that creates subtitles of a given length from text paragraphs that can be easily imported into any Video Editing software such as FinalCut Pro for further adjustments.

Text to Subtitles - Python This python file creates subtitles of a given length from text paragraphs that can be easily imported into any Video Editin

Dmytro North 9 Dec 24, 2022
KAPAO is an efficient multi-person human pose estimation model that detects keypoints and poses as objects and fuses the detections to predict human poses.

KAPAO (Keypoints and Poses as Objects) KAPAO is an efficient single-stage multi-person human pose estimation model that models keypoints and poses as

Will McNally 664 Dec 30, 2022
Human4D Dataset tools for processing and visualization

HUMAN4D: A Human-Centric Multimodal Dataset for Motions & Immersive Media HUMAN4D constitutes a large and multimodal 4D dataset that contains a variet

tofis 15 Nov 09, 2022
PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short-Term Transformer for Online Action Detection".

Long Short-Term Transformer for Online Action Detection Introduction This is a PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short

77 Dec 16, 2022
Python based framework for Automatic AI for Regression and Classification over numerical data.

Python based framework for Automatic AI for Regression and Classification over numerical data. Performs model search, hyper-parameter tuning, and high-quality Jupyter Notebook code generation.

BlobCity, Inc 141 Dec 21, 2022
Simulator for FRC 2022 challenge: Rapid React

rrsim Simulator for FRC 2022 challenge: Rapid React out-1.mp4 Usage In order to run the simulator use the following: python3 rrsim.py [config_path] wh

1 Jan 18, 2022
Defending graph neural networks against adversarial attacks (NeurIPS 2020)

GNNGuard: Defending Graph Neural Networks against Adversarial Attacks Authors: Xiang Zhang ( Zitnik Lab @ Harvard 44 Dec 07, 2022

Awesome Artificial Intelligence, Machine Learning and Deep Learning as we learn it

Awesome Artificial Intelligence, Machine Learning and Deep Learning as we learn it. Study notes and a curated list of awesome resources of such topics.

mani 1.2k Jan 07, 2023
Improving Generalization Bounds for VC Classes Using the Hypergeometric Tail Inversion

Improving Generalization Bounds for VC Classes Using the Hypergeometric Tail Inversion Preface This directory provides an implementation of the algori

Jean-Samuel Leboeuf 0 Nov 03, 2021
Near-Optimal Sparse Allreduce for Distributed Deep Learning (published in PPoPP'22)

Near-Optimal Sparse Allreduce for Distributed Deep Learning (published in PPoPP'22) Ok-Topk is a scheme for distributed training with sparse gradients

Shigang Li 9 Oct 29, 2022
Resources complimenting the Machine Learning Course led in the Faculty of mathematics and informatics part of Sofia University.

Machine Learning and Data Mining, Summer 2021-2022 How to learn data science and machine learning? Programming. Learn Python. Basic Statistics. Take a

Simeon Hristov 8 Oct 04, 2022
ICCV2021, Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, ICCV 2021 Update: 2021/03/11: update our new results. Now our T2T-ViT-14 w

YITUTech 1k Dec 31, 2022
Official pytorch implementation of paper Dual-Level Collaborative Transformer for Image Captioning (AAAI 2021).

Dual-Level Collaborative Transformer for Image Captioning This repository contains the reference code for the paper Dual-Level Collaborative Transform

lyricpoem 160 Dec 11, 2022
Software associated to AAAI paper "Planning with Biological Neurons and Synapses"

jBrain Software associated with the AAAI 2022 paper Francesco D'Amore, Daniel Mitropolsky, Pierluigi Crescenzi, Emanuele Natale, Christos H. Papadimit

Pierluigi Crescenzi 1 Apr 10, 2022
Finite difference solution of 2D Poisson equation. Can handle Dirichlet, Neumann and mixed boundary conditions.

Poisson-solver-2D Finite difference solution of 2D Poisson equation Current version can handle Dirichlet, Neumann, and mixed (combination of Dirichlet

Mohammad Asif Zaman 34 Dec 23, 2022
SweiNet is an uncertainty-quantifying shear wave speed (SWS) estimator for ultrasound shear wave elasticity (SWE) imaging.

SweiNet SweiNet is an uncertainty-quantifying shear wave speed (SWS) estimator for ultrasound shear wave elasticity (SWE) imaging. SweiNet takes as in

Felix Jin 3 Mar 31, 2022