Reaction SMILES-AA mapping via language modelling

Last update: Dec 13, 2022

Related tags

Overview

rxn-aa-mapper

Reactions SMILES-AA sequence mapping

setup

conda env create -f conda.yml
conda activate rxn_aa_mapper

In the following we consider on examples provided to show how to use RXNAAMapper.

generate a vocabulary to be used with the `EnzymaticReactionBertTokenizer`

Create a vocabulary compatible with the enzymatic reaction tokenizer:

create-enzymatic-reaction-vocabulary ./examples/data-samples/biochemical ./examples/token_75K_min_600_max_750_500K.json /tmp/vocabulary.txt "*.csv"

use the tokenizer

Using the examples vocabulary and AA tokenizer provided, we can observe the enzymatic reaction tokenizer in action:

from rxn_aa_mapper.tokenization import EnzymaticReactionBertTokenizer

tokenizer = EnzymaticReactionBertTokenizer(
    vocabulary_file="./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    aa_sequence_tokenizer_filepath="./examples/token_75K_min_600_max_750_500K.json"
)
tokenizer.tokenize("NC(=O)c1ccc[n+]([C@@H]2O[[email protected]](COP(=O)(O)OP(=O)(O)OC[[email protected]]3O[C@@H](n4cnc5c(N)ncnc54)[[email protected]](O)[C@@H]3O)[C@@H](O)[[email protected]]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]")

train the model

The mlm-trainer script can be used to train a model via MTL:

mlm-trainer \
    ./examples/data-samples/biochemical ./examples/data-samples/biochemical \  # just a sample, simply split data in a train and a validation folder
    ./examples/vocabulary_token_75K_min_600_max_750_500K.txt /tmp/mlm-trainer-log \
    ./examples/sample-config.json "*.csv" 1 \  # for a more realistic config see ./examples/config.json
    ./examples/data-samples/organic ./examples/data-samples/organic \  # just a sample, simply split data in a train and a validation folder
    ./examples/token_75K_min_600_max_750_500K.json

Checkpoints will be stored in the /tmp/mlm-trainer-log for later usage in identification of active sites.

Those can be turned into an HuggingFace model by simply running:

checkpoint-to-hf-model /path/to/model.ckpt /tmp/rxnaamapper-pretrained-model ./examples/vocabulary_token_75K_min_600_max_750_500K.txt ./examples/sample-config.json ./examples/token_75K_min_600_max_750_500K.json

predict active site

The trained model can used to map reactant atoms to AA sequence locations that potentially represent the active site.

from rxn_aa_mapper.aa_mapper import RXNAAMapper

config_mapper = {
    "vocabulary_file": "./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    "aa_sequence_tokenizer_filepath": "./examples/token_75K_min_600_max_750_500K.json",
    "model_path": "/tmp/rxnaamapper-pretrained-model",
    "head": 3,
    "layers": [11],
    "top_k": 1,
}
mapper = RXNAAMapper(config=config_mapper)
mapper.get_reactant_aa_sequence_attention_guided_maps(["NC(=O)c1ccc[n+]([C@@H]2O[[email protected]](COP(=O)(O)OP(=O)(O)OC[[email protected]]3O[C@@H](n4cnc5c(N)ncnc54)[[email protected]](O)[C@@H]3O)[C@@H](O)[[email protected]]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]"])

citation

@article{dassi2021identification,
  title={Identification of Enzymatic Active Sites with Unsupervised Language Modeling},
  author={Dassi, Lo{\"\i}c Kwate and Manica, Matteo and Probst, Daniel and Schwaller, Philippe and Teukam, Yves Gaetan Nana and Laino, Teodoro},
  year={2021}
  conference={AI for Science: Mind the Gaps at NeurIPS 2021, ELLIS Machine Learning for Molecule Discovery Workshop 2021}
}

Reaction SMILES-AA mapping via language modelling

Related tags

Overview

rxn-aa-mapper

setup

generate a vocabulary to be used with the `EnzymaticReactionBertTokenizer`

use the tokenizer

train the model

predict active site

citation

Owner

HeartRate detector with ArduinoandPython - Use Arduino and Python create a heartrate detector.

Graph-total-spanning-trees - A Python script to get total number of Spanning Trees in a Graph

A PyTorch-based library for semi-supervised learning

PyTorch implementation of the implicit Q-learning algorithm (IQL)

MAU: A Motion-Aware Unit for Video Prediction and Beyond, NeurIPS2021

Road Crack Detection Using Deep Learning Methods

Torchserve server using a YoloV5 model running on docker with GPU and static batch inference to perform production ready inference.

Finetuning Pipeline

A PyTorch implementation of "Graph Wavelet Neural Network" (ICLR 2019)

Arabic Car License Recognition. A solution to the kaggle competition Machathon 3.0.

Unimodal Face Classification with Multimodal Training

Face Mask Detection on Image and Video using tensorflow and keras

Implementation of the method proposed in the paper "Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation"

SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Official Implementation of VAT

Self-supervised learning (SSL) is a method of machine learning

A data-driven approach to quantify the value of classifiers in a machine learning ensemble.

Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

A stock generator that assess a list of stocks and returns the best stocks for investing and money allocations based on users choices of volatility, duration and number of stocks

Reaction SMILES-AA mapping via language modelling

Related tags

Overview

rxn-aa-mapper

setup

generate a vocabulary to be used with the EnzymaticReactionBertTokenizer

use the tokenizer

train the model

predict active site

citation

Owner

HeartRate detector with ArduinoandPython - Use Arduino and Python create a heartrate detector.

Graph-total-spanning-trees - A Python script to get total number of Spanning Trees in a Graph

A PyTorch-based library for semi-supervised learning

PyTorch implementation of the implicit Q-learning algorithm (IQL)

MAU: A Motion-Aware Unit for Video Prediction and Beyond, NeurIPS2021

Road Crack Detection Using Deep Learning Methods

Torchserve server using a YoloV5 model running on docker with GPU and static batch inference to perform production ready inference.

Finetuning Pipeline

A PyTorch implementation of "Graph Wavelet Neural Network" (ICLR 2019)

Arabic Car License Recognition. A solution to the kaggle competition Machathon 3.0.

Unimodal Face Classification with Multimodal Training

Face Mask Detection on Image and Video using tensorflow and keras

Implementation of the method proposed in the paper "Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation"

SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Official Implementation of VAT

Self-supervised learning (SSL) is a method of machine learning

A data-driven approach to quantify the value of classifiers in a machine learning ensemble.

Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

A stock generator that assess a list of stocks and returns the best stocks for investing and money allocations based on users choices of volatility, duration and number of stocks

generate a vocabulary to be used with the `EnzymaticReactionBertTokenizer`