Explainer for black box models that predict molecule properties

Last update: Dec 19, 2022

Overview

Explaining why that molecule

exmol is a package to explain black-box predictions of molecules. The package uses model agnostic explanations to help users understand why a molecule is predicted to have a property.

Install

pip install exmol

Counterfactual Generation

Our package implements the Model Agnostic Counterfactual Compounds with STONED (MACCS) method to generate counterfactuals. A counterfactual can explain a prediction by showing what would have to change in the molecule to change its predicted class. Here is an example of a counterfactual:

This package is not popular. If the package had a logo, it would be popular.

In addition to having a changed prediction, a molecular counterfactual must be as similar as possible to its base molecule. Here is an example of a molecular counterfactual:

The counterfactual shows that if the carboxylic acid were an ester, the molecule would be active. It is up to the user to translate this set of structures into a meaningful sentence.
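
The two requirements described in this section (a changed prediction and maximal similarity to the base molecule) can be sketched in a few lines. This is a simplified illustration, not exmol's actual selection procedure; the candidate tuples, their similarity scores, and the function name pick_counterfactual are all hypothetical.

```python
from typing import List, Optional, Tuple

# (SMILES, similarity to base, predicted class); all values are made up
Candidate = Tuple[str, float, int]


def pick_counterfactual(
    base_class: int, candidates: List[Candidate]
) -> Optional[Candidate]:
    """Return the most similar candidate whose predicted class differs from the base."""
    flipped = [c for c in candidates if c[2] != base_class]
    if not flipped:
        return None  # no candidate changes the prediction
    return max(flipped, key=lambda c: c[1])


candidates = [
    ("CCCOC", 0.90, 0),     # very similar, but same class as the base
    ("CCCN", 0.75, 1),      # changed class, fairly similar
    ("c1ccccc1", 0.10, 1),  # changed class, dissimilar
]
best = pick_counterfactual(base_class=0, candidates=candidates)
print(best)
```

The most similar candidate overall is skipped because its class is unchanged; among the candidates with a changed class, the one closest to the base is kept.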

Usage

Let's assume you have a deep learning model my_model(s) that takes in one SMILES string and outputs a predicted binary class. To generate counterfactuals, we need to wrap our function so that it can take both SMILES and SELFIES, but it only needs to use one.
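
As a rough sketch of such a wrapper, the snippet below uses a toy stand-in for my_model (it is not a real property predictor) and a named function, wrapped_model, that accepts both representations but only forwards the SMILES string. Both names and the toy rule are assumptions made for illustration.

```python
def my_model(smiles: str) -> int:
    """Toy stand-in classifier: returns 1 if the SMILES string contains the character 'O'."""
    return int("O" in smiles)


def wrapped_model(smiles: str, selfies: str) -> int:
    """Accept both SMILES and SELFIES, but pass only the SMILES to the model."""
    return my_model(smiles)


print(wrapped_model("CCCO", "[C][C][C][O]"))
print(wrapped_model("CCC", "[C][C][C]"))
```

A named function like this can be passed wherever a two-argument lambda would be used.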

We first expand chemical space around the prediction of interest:

import exmol

# mol of interest
base = 'CCCO'

samples = exmol.sample_space(base, lambda smi, sel: my_model(smi), batched=False)

Here we use a lambda to wrap our function, and we pass batched=False to indicate that our function takes a single SMILES string rather than a list of them. Now we select counterfactuals from that space and plot them.

cfs = exmol.cf_explain(samples)
exmol.plot_cf(cfs)

We can also plot the space around the counterfactual. This is computed via PCA of the affinity matrix -- the similarity with the base molecule. Due to how similarity is calculated, the base is going to be the farthest from all other molecules. Thus your base should fall on the left (or right) extreme of your plot.

cfs = exmol.cf_explain(samples)
exmol.plot_space(samples, cfs)

Each counterfactual is a Python dataclass with information allowing it to be used in your own analysis:

print(cfs[0])

Examples(
  smiles='CCOC(=O)c1ccc(N=CN(Cl)c2ccccc2)cc1',
  selfies='[C][C][O][C][Branch1_2][C][=O][C][=C][C][=C][Branch1_1][#C][N][=C][N][Branch1_1][C][Cl][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=C][Ring1][S]',
  similarity=0.8181818181818182,
  yhat=-5.459493637084961,
  index=1807,
  position=array([-6.11371691,  1.24629293]),
  is_origin=False,
  cluster=26,
  label='Counterfactual')
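
A minimal sketch of using these fields in your own analysis follows. It relies on a hypothetical stand-in dataclass, Example, that mirrors only a subset of the fields printed above, so it runs without exmol; the helper name summarize and the sample values are likewise made up.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical stand-in with a subset of the fields printed above."""

    smiles: str
    similarity: float
    yhat: float
    is_origin: bool
    label: str


def summarize(examples: List[Example]) -> List[dict]:
    """Drop the base molecule and list the rest, most similar first."""
    rows = [asdict(e) for e in examples if not e.is_origin]
    return sorted(rows, key=lambda r: r["similarity"], reverse=True)


examples = [
    Example("CCCO", 1.0, 0.2, True, "Base"),
    Example("CCCN", 0.6, -1.3, False, "Counterfactual"),
    Example("CCCOC", 0.8, -0.9, False, "Counterfactual"),
]
for row in summarize(examples):
    print(row["smiles"], row["similarity"], row["yhat"])
```

Because the objects are dataclasses, the same attribute access (for example e.similarity or e.is_origin) should carry over to real results, assuming the field names match the printed output.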

Chemical Space

When calling exmol.sample_space you can pass preset=<preset>, which can be one of the following:

'narrow': Only one change to molecular structure, reduced set of possible bonds/elements
'medium': Default. One or two changes to molecular structure, reduced set of possible bonds/elements
'wide': One through five changes to molecular structure, large set of possible bonds/elements
'chemed': A restrictive set where only pubchem molecules are considered. Experimental

You can also pass num_samples as a "request" for the number of samples. You will typically end up with fewer due to degenerate molecules. See API for complete description.
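
The options above can be collected and checked before calling exmol.sample_space. The helper below, sample_space_options, is hypothetical and not part of exmol; it only validates the preset name against the four listed here and builds a keyword-argument dictionary.

```python
from typing import Any, Dict, Optional

# Preset names as listed above
PRESETS = ("narrow", "medium", "wide", "chemed")


def sample_space_options(
    preset: str = "medium", num_samples: Optional[int] = None
) -> Dict[str, Any]:
    """Validate the preset name and collect keyword arguments for sample_space."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")
    options: Dict[str, Any] = {"preset": preset}
    if num_samples is not None:
        if num_samples <= 0:
            raise ValueError("num_samples must be positive")
        options["num_samples"] = num_samples
    return options


options = sample_space_options("narrow", num_samples=500)
print(options)
# With exmol installed and a model function defined, this could be used as:
# samples = exmol.sample_space("CCCO", my_model_wrapper, batched=False, **options)
```

Since num_samples is only a request, the number of samples actually returned may be smaller than the value passed here.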

SVG

Molecules are drawn as PNGs by default. If you would like them drawn as SVGs, call insert_svg after calling plot_space or plot_cf:

import skunk

# cfs are the counterfactuals returned by exmol.cf_explain
exmol.plot_cf(cfs)
svg = exmol.insert_svg(cfs, mol_fontsize=16)

# for Jupyter Notebook
skunk.display(svg)

# To save to file
with open('myplot.svg', 'w') as f:
    f.write(svg)

This is done with the skunk 🦨 library.

API and Docs

Read API here. You should also read the paper (see below) for a more exact description of the methods and implementation.

Citation

Please cite Wellawatte et al.

 @article{wellawatte_seshadri_white_2021,
 place={Cambridge},
 title={Model agnostic generation of counterfactual explanations for molecules},
 DOI={10.33774/chemrxiv-2021-4qkg8},
 journal={ChemRxiv},
 publisher={Cambridge Open Engage},
 author={Wellawatte, Geemi P and Seshadri, Aditi and White, Andrew D},
 year={2021}}

This content is a preprint and has not been peer-reviewed.

Comments

Add LIME explanations
This is a big PR!

[x] Document LIME function

[x] Compute t-stats using examples that have non-zero weights

[x] Add plotting code for descriptors - needs SMARTS annotations for MACCS keys (166 files)

[x] Add plotting code for chemical space and fit

[x] Description in readme

[x] Clean up notebooks and add documentation

[x] Remove extra files

[x] Add LIME notebooks to CI?
opened by hgandhi2411 11

Error while plotting counterfactuals using plot_cf()

plot_cf() function errors out with the following error. This behavior is also consistent across all notebooks in paper/.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-b6c8ed26216e> in <module>
      1 fkw = {"figsize": (8, 6)}
      2 mpl.rc("axes", titlesize=12)
----> 3 exmol.plot_cf(exps, figure_kwargs=fkw, mol_size=(450, 400), nrows=1)
      4 
      5 plt.savefig("rf-simple.png", dpi=180)

/gpfs/fs2/scratch/hgandhi/exmol/exmol/exmol.py in plot_cf(exps, fig, figure_kwargs, mol_size, mol_fontsize, nrows, ncols)
    682         title += f"\nf(x) = {e.yhat:.3f}"
    683         axs[i].set_title(title)
--> 684         axs[i].imshow(np.asarray(img), gid=f"rdkit-img-{i}")
    685         axs[i].axis("off")
    686     for j in range(i, C * R):

~/.local/lib/python3.7/site-packages/matplotlib/__init__.py in inner(ax, data, *args, **kwargs)
   1359     def inner(ax, *args, data=None, **kwargs):
   1360         if data is None:
-> 1361             return func(ax, *map(sanitize_sequence, args), **kwargs)
   1362 
   1363         bound = new_sig.bind(ax, *args, **kwargs)

~/.local/lib/python3.7/site-packages/matplotlib/axes/_axes.py in imshow(self, X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, filternorm, filterrad, resample, url, **kwargs)
   5607                               resample=resample, **kwargs)
   5608 
-> 5609         im.set_data(X)
   5610         im.set_alpha(alpha)
   5611         if im.get_clip_path() is None:

~/.local/lib/python3.7/site-packages/matplotlib/image.py in set_data(self, A)
    699                 not np.can_cast(self._A.dtype, float, "same_kind")):
    700             raise TypeError("Image data of dtype {} cannot be converted to "
--> 701                             "float".format(self._A.dtype))
    702 
    703         if self._A.ndim == 3 and self._A.shape[-1] == 1:

TypeError: Image data of dtype <U14622 cannot be converted to float

opened by hgandhi2411 6

Error after installation

Hi,

First of all, thank you for your work! I am having a problem with your library, or rather, when I run "import exmol", I get an error: "No module named 'dataclasses'".

I installed it with: pip install exmol...

Thanks!

opened by PARODBE 6
CODEX Example

While messing around with CODEX, I noticed it wants to compute ECFP4 fingerprints using a different method and this gives slightly different similarities. @geemi725 could you double-check the ECFP4 implementation we have is correct, or is the CODEX one correct?

opened by whitead 6
Object has no attribute '__code__'
Hi there, I noticed that sample_space does not seem to work with class instances, because they do not have a __code__ attribute:

import exmol

class A:
    pass

exmol.sample_space('C', A(), batched=True)

AttributeError: 'A' object has no attribute '__code__'

Is there any way around this other than forcing the call to a separate function?
opened by oiao 5
The module 'exmol' has no attribute 'lime_explain'

In the notebook RF-lime.ipynb, the command

exmol.lime_explain(space, descriptor_type=descriptor_type)

gives an error: module 'exmol' has no attribute 'lime_explain'

Please, let me know how to fix this error. Thanks.

opened by andresilvapimentel 5
Easier usage of explain
Working through some examples, I've noted the following things:

Descriptor type should have a default - maybe MACCS since the plots will show-up

Maybe we should only save SVGs, rather than return unless prompted

We should do string comparison for descriptor types using lowercase strings, so that classic and Classic and ecfp are valid.

We probably shouldn't save without a filename - it is unexpected
opened by whitead 4
Allow using custom list of molecules
Hello @whitead, this is a very nice package!

I found the new chemed option very useful and thought extending it to any list of molecules would make sense.

Here is the main change to the API:

explanation = exmol.sample_space(
    "CCCC",
    model,
    preset="custom",  # use custom preset
    batched=False,
    data=data,  # provide list of smiles or molecules
)

Let me know if this PR makes sense.
opened by maclandrol 4
Target molecule frequently on the edge of sample space visualization

In your example provided in the code, the target molecule is on the edge of the sampled distribution (in the PCA plot). I also find this happens very frequently with my experiments on my model. I think this suggests that the sampling produces molecules that are not evenly distributed around the target. I just want to verify that this is a property of the STONED sampling algorithm, and not an artifact of the visualization code (which it does not seem to be). I've attached an example of my own, for both "narrow" and "medium" presets.

preset="narrow", nmols=10

preset="medium", nmols=10

opened by adamoyoung 3
Sanitizing SMILES removes chirality information

On this line of sample_space(), chirality information of origin_smiles is removed. The output is then unsuitable as input to a chirality-aware ML model, e.g. to distinguish L vs. D amino acids which are important in models of binding affinity. Could the option to skip this sanitization step be provided to the user?

PS: Great code base and beautiful visualizations! We're finding it very useful in explaining our Gaussian Process models. The future of SAR ←→ ML looks exciting.

opened by tianyu-lu 2
Release 0.5.0 on pypi

Are you planning to release 0.5.0 on pypi? I am maintaining the conda package of exmol and I would like to bump it to 0.5.0. See https://github.com/conda-forge/exmol-feedstock

Thanks!

opened by hadim 2
run_STONED couldn't generate SMILES after 30 minutes

For certain SMILES, run_STONED() fails to generate samples even after running for a long time. So far, one SMILES known to cause this issue is

[Na+].[Na+].[Na+].[Na+].[Na+].[O-][S](=O)(=O)OCC[S](=O)(=O)c1cccc(Nc2nc(Cl)nc(Nc3cc(cc4C=C(\C(=N/Nc5ccc6c(cccc6[S]([O-])(=O)=O)c5[S]([O-])(=O)=O)C(=O)c34)[S]([O-])(=O)=O)[S]([O-])(=O)=O)n2)c1

Here is how I use the function: exmol.run_stoned(smiles, num_samples=10, max_mutations=1).

opened by qcampbel 2

Releases(v2.2.1)

v2.2.1(Dec 7, 2022)
Fixed bug in sorting for text explanations

Fixed empty plot names saying None

Added priority for naming and removed invalid names

Added more names (methyl, ethyl, butyl, etc)

Fixed sample_space to accept partials or objects

Added openai prompting

Added name_morgan_bit as external facing

v2.2.0(Nov 1, 2022)
Added natural language explanation method

Added names to ECFP plots and naming of ECFP fragments

v2.1.1(Jun 4, 2022)
Fixed bug in plot_descriptors

v2.1.0(Jun 3, 2022)
plot_descriptors will no longer save to file without filename

v2.0.1(May 31, 2022)

Made default run_stoned argument use basic instead of semantically robust alphabet, as claimed in documentation
v2.0.0(May 31, 2022)
Added surrogate model explanation method

Added support for attributing ECFP, MACCS fingerprints, rdkit descriptors and plotting them

Example notebooks for new method

Fixed chirality stripping in sanitize

Made it possible to use multiple base molecules for ECFP descriptors

v2.0.0.dev2(May 18, 2022)
Pre-release 2

Added surrogate model explanation method

Added support for attributing ECFP, MACCS fingerprints, rdkit descriptors and plotting them

Example notebooks for new method

Fixed chirality stripping in sanitize

Made it possible to use multiple base molecules for ECFP descriptors

v2.0.0-dev1(May 6, 2022)
Pre-release

Added surrogate model explanation method

Added support for attributing ECFP, MACCS fingerprints, rdkit descriptors and plotting them

Example notebooks for new method

v1.1.0(May 3, 2022)

Removed need for model(s) functions to take both SMILES and SELFIES.
v1.0.2(May 3, 2022)
Switched to bulk Tanimoto to improve speed

Tightened chemed api limit

v1.0.1(Apr 4, 2022)
Added new quiet mode to disable progress bars

v1.0.0(Jan 21, 2022)
Possibly Breaking changes

Switched to SELFIES v2.0, so custom alphabets will need to be ported.

Changes

Removed "experimental" tag from Chemed and Custom methods

Bug fixes

Type annotations now pass mypy

Paper models now generate SVGs correctly and fixed token issues

v0.6.0(Jan 17, 2022)
Changed behavior of num_samples so that it is not affected by mutation count in STONED

v0.5.2(Jan 4, 2022)
Fixed SMILES escaping in URL in chemed

v0.5.1(Nov 23, 2021)
Fixed similarity float vs int in chemed

v0.5.0(Oct 28, 2021)
Added custom lists for counterfactual source (contributed by @maclandrol)

v0.4.01(Sep 29, 2021)
Fixed bug in sequence mutation lengths in STONED algorithm

v0.4.0(Sep 17, 2021)
Refactored code into files

Added SVG rewrite so mol structures are SVGs

SVGs are handled with skunk

Added cartoon style for scatter plot in plot_space

v0.3.2(Sep 2, 2021)
Added SMILES sanitization before generating sample space

v0.3.1(Aug 26, 2021)
Fixed num_samples not passed correctly

Fixed unparsable SMILES coming from pubchem

v0.3.0(Aug 26, 2021)

Added CHEMED method and progress bar
v0.3.0-dev1(Aug 24, 2021)
Added new "ZINCED" method for restricting chemical space to purchasable compounds

Added progress bar

v0.2.0(Aug 14, 2021)

Initial packaged release

Owner

White Laboratory

GitHub Repository https://ur-whitelab.github.io/exmol/

