Benchmarking "One Molecular Fingerprint to Rule Them All"

In a preprint that is currently on ChemRxiv, Capecchi and coworkers describe a new molecular fingerprint that can be used for similarity searching and machine learning. In the paper, the authors demonstrate that their new fingerprint, called MAP4, performs well when used to build classification models. I wanted to see how well the MAP4 fingerprints performed when used to build regression models.

Please note that my intent here is not to be critical of the paper or the authors. I was curious to see how the method would perform with the sorts of models I typically build and thought it might be useful to share my results. I commend the authors for releasing their work as a preprint and for making their code available. In my opinion, this is how science should be done. As usual, all of the code and data that I used for this analysis are available on GitHub. Rather than detail every step of my analysis here, I'll point the interested reader to the corresponding Jupyter notebook, which provides a complete walkthrough.

Note: As Pen pointed out on his blog, there is a bug in the MAP4 code that prevents the generation of folded fingerprints. The version I used in these tests includes Pen's fix for this bug. Thank you, Iwatobipen.

I did the analysis using a set of 24 datasets from ChEMBL that were originally included in a 2019 paper by Andreas Bender's group. These datasets contain between 203 and 5,207 molecules with associated IC50 data. The evaluation method I used was pretty simple. 

For each dataset:

  • Calculate MAP4 fingerprints
  • Calculate Morgan fingerprints using the RDKit (radius=2)
  • For 10 cross-validation folds:
    • Use scikit-learn to split the dataset into training and test sets
    • Build and test an XGBoost regression model with MAP4 fingerprints
      • Build an XGBoost regression model with MAP4 fingerprints
      • Use this model to predict the activity of the test set molecules
      • Calculate R**2 between predictions and experimental values for the test set
    • Build and test an XGBoost regression model with Morgan fingerprints
      • Build an XGBoost regression model with Morgan fingerprints
      • Use this model to predict the activity of the test set molecules
      • Calculate R**2 between predictions and experimental values for the test set
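
The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the code from the notebook, and it rests on several assumptions: the function names (morgan_matrix, make_xgb_model, compare_fingerprints) are hypothetical, the Morgan fingerprints are folded to 2048 bits, the folds come from a shuffled KFold, XGBoost runs with default hyperparameters, and R**2 is scikit-learn's r2_score. The MAP4 fingerprints are assumed to have been calculated separately and passed in as a NumPy array.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


def morgan_matrix(smiles_list, radius=2, n_bits=2048):
    """Morgan bit-vector fingerprints as a 2D array (n_bits is an assumption)."""
    # Imported lazily so the rest of the sketch runs without RDKit installed.
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smiles}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.uint8)
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.array(rows)


def make_xgb_model():
    """XGBoost regressor with default hyperparameters (an assumption)."""
    from xgboost import XGBRegressor

    return XGBRegressor()


def compare_fingerprints(X_a, X_b, y, n_folds=10, seed=0, make_model=make_xgb_model):
    """Return one (r2_a, r2_b) pair per fold, using the same split for both."""
    X_a, X_b, y = np.asarray(X_a), np.asarray(X_b), np.asarray(y)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X_a):
        fold_scores = []
        for X in (X_a, X_b):
            model = make_model()  # fresh model for every fold and fingerprint
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            fold_scores.append(r2_score(y[test_idx], pred))
        scores.append(tuple(fold_scores))
    return scores
```

Calling compare_fingerprints(X_map4, X_morgan, y) for each dataset would give ten paired R**2 values per dataset. Using the same train/test split for both fingerprints is what makes the later paired comparison meaningful.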

After going through the above procedure for each of the 24 datasets, we have data that we can use to generate a boxplot of the R**2 values for each method across the 10 folds. As can be seen below, the models generated using the Morgan fingerprints tend to produce higher R**2 values than the models that use the MAP4 fingerprints.

In order to be a bit more quantitative about this, we will calculate the effect size for each of the comparisons. If you read my previous blog post on the topic, you'll recall that the effect size, Cohen's d, is defined as the mean of the differences between methods, divided by the standard deviation of the differences.

We can use the guidelines below to understand the magnitude of the effect size.
  • Small Effect Size: d=0.20
  • Medium Effect Size: d=0.50
  • Large Effect Size: d=0.80
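
As a small illustration of the definition and thresholds above, here is a sketch that uses only the Python standard library. The function names and the example R**2 values are made up for illustration. The sample standard deviation (n - 1 in the denominator) and the "negligible" label for |d| below 0.2 are assumptions, and the difference is taken as MAP4 minus Morgan, so a negative d means MAP4 scored lower.

```python
from statistics import mean, stdev


def cohens_d(scores_a, scores_b):
    """Paired effect size: mean of (a - b) divided by the std. dev. of (a - b)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # stdev() is the sample standard deviation; identical differences
    # give a standard deviation of zero and raise ZeroDivisionError here.
    return mean(diffs) / stdev(diffs)


def effect_label(d):
    """Map |d| onto the conventional small / medium / large thresholds."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical per-fold R**2 values, not results from this analysis.
map4_r2 = [0.50, 0.55, 0.48, 0.52]
morgan_r2 = [0.60, 0.62, 0.61, 0.58]

d = cohens_d(map4_r2, morgan_r2)
print(f"d = {d:.2f} ({effect_label(d)})")
```

Because the label is based on the absolute value, the sign of d has to be read separately to tell which method came out ahead.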


In the figure below, we plot the effect size for each of the comparisons. Lines are drawn at 0.2, 0.5, and 0.8, indicating small, medium, and large effect sizes. As we can see, the effect size is always negative, indicating that the MAP4 fingerprints do not perform as well as the Morgan fingerprints. In addition, in 20 of 24 cases, the effect size is less than -0.8, which is a large effect in favor of the Morgan fingerprints.

Based on the data I've generated so far, I'll probably stick with the Morgan fingerprints for my regression models. 
