Posts

We Need Better Benchmarks for Machine Learning in Drug Discovery

Most papers describing new methods for machine learning (ML) in drug discovery report some sort of benchmark comparing their algorithm and/or molecular representation with the current state of the art. In the past, I’ve written extensively about statistics and how methods should be compared. In this post, I’d like to focus instead on the datasets we use to benchmark and compare methods. Many papers I’ve read recently use the MoleculeNet dataset, released by the Pande group at Stanford in 2017, as the “standard” benchmark. This is a mistake. Here, I’ll use the MoleculeNet dataset to point out flaws in several widely used benchmarks. Beyond this, I’d like to propose some alternative strategies that could improve benchmarking efforts and help the field move forward. To begin, let’s examine the MoleculeNet benchmark, which, to date, has been cited more than 1,800 times. The MoleculeNet collection consists of 16 datasets divided into 4 categories.  Qua

A Simple Tool for Exploring Structural Alerts

When working in drug design, we often need filters to identify molecules containing functional groups that may be toxic, reactive, or could interfere with an assay. A few years ago, I collected the functional group filters available in the ChEMBL database and wrote some Python code that made it easy to apply these filters to an arbitrary set of molecules. This functionality is part of the pip-installable useful_rdkit_utils package, which is available on PyPI and GitHub. Applying these filters is easy. If we have a Pandas dataframe with a SMILES column, we can do something like this:

    import useful_rdkit_utils as uru
    reos = uru.REOS("BMS")  # optionally specify the rule set to use
    df[['rule_set','reos']] = df.SMILES.apply(reos.process_smiles).tolist()

This adds two columns, rule_set and reos, to the dataframe, holding the name of the rule set and the name of the rule matched by each molecule.  If the molecule doesn't match any rules, both columns

Getting Real with Molecular Property Prediction

Introduction
If you believe everything you read in the popular press, this AI business is easy. Just ask ChatGPT, and the perfect solution magically appears. Unfortunately, that's not the reality. In this post, I'll walk through a predictive modeling example and demonstrate that there are still a lot of subtleties to consider. In addition, I want to show that data is critical to building good machine learning (ML) models. If you don't have the appropriate data, a simple empirical approach may be better than an ML model. A recent paper from Cheng Fang and coworkers at Biogen presents prospective evaluations of machine learning models on several ADME endpoints. As part of this evaluation, the Biogen team released a large dataset of measured in vitro assay values for several thousand commercially available compounds. One component of this dataset is 2,173 solubility values measured at pH 6.8 using chemiluminescent nitrogen detection (CLND), a technique currently consider

Using Counterfactuals to Understand Machine Learning Models

While machine learning (ML) models have become integral to many drug discovery efforts, most of these models are "black boxes" that don't explain their predictions. There are several reasons we would like to be able to explain a prediction.
- Provide scientific insights that will guide the design of new compounds.
- Instill confidence among team members. As I've said before, a computational chemist only has two jobs: to convince someone to do an experiment and to convince someone not to do an experiment. These jobs are much easier when you can explain the "why" behind a prediction.
- Debug and improve models. Improving a model is easier if you can understand the rationale behind a prediction.
As I wrote in a previous post, identifying and highlighting the molecular features that drive an ML prediction can be difficult. One recent promising approach is the counterfactuals method published by Andrew White's group at the University of Rochester