Some Thoughts on Biotech vs Pharma for Computational Chemists

A recent editorial by Dean Brown in J Med Chem and follow-up posts by Keith Hornberger and Derek Lowe prompted me to think about how we train computational chemists and cheminformaticians for careers in drug discovery. It also brought to mind some notable differences between how computational chemistry is practiced in biotech and pharma. For those who haven't read Dean Brown's editorial and the subsequent reactions, I highly recommend them. In short, the authors focused on how medicinal chemists were trained in the past and how biotech and the growth of outsourcing are changing that model. Traditionally, most medicinal chemists received academic training in organic synthesis labs and then learned medicinal chemistry on the job from more experienced colleagues. Chemists would typically start at the bench and gradually transition to roles where they led groups and/or drug discovery project teams. With the rise of smaller biotechs and the advent of chemistry outsourcing, many medicinal chemists…

Comparing Classification Models - You’re Probably Doing It Wrong

In my last post, I discussed benchmark datasets for machine learning (ML) in drug discovery and several flaws in widely used datasets. In this installment, I'd like to focus on how methods are compared. Every year, dozens, if not hundreds, of papers present comparisons of ML methods or molecular representations. These papers typically conclude that one approach is superior to several others for a specific task. Unfortunately, in most cases, the conclusions presented in these papers are not supported by any statistical analysis. I thought it would be helpful to provide an example demonstrating some common mistakes and to recommend a few best practices. In this post, I'll focus on classification models; in a subsequent post, I'll present a similar example comparing regression models. The Jupyter notebook accompanying this post provides all the code for the examples and plots below. If you're interested in the short version, check out the Jupyter notebook…
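To make the point concrete, here is a minimal sketch of one way to compare two classifiers with an actual statistical test rather than a bare accuracy table: McNemar's test on paired predictions. The dataset and models below are invented for illustration (synthetic data, logistic regression vs. random forest); substitute your own features, labels, and methods.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real classification dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pred_a = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)
pred_b = RandomForestClassifier(random_state=42).fit(X_train, y_train).predict(X_test)

# Discordant pairs: test cases where exactly one of the two models is correct
b = int(np.sum((pred_a == y_test) & (pred_b != y_test)))
c = int(np.sum((pred_a != y_test) & (pred_b == y_test)))

# McNemar statistic with continuity correction; p-value from chi-square(1)
stat = (abs(b - c) - 1) ** 2 / (b + c) if (b + c) else 0.0
p_value = chi2.sf(stat, df=1)
print(f"b={b}, c={c}, p={p_value:.3f}")
```

A large p-value here means the observed difference between the two models could easily arise by chance, which is exactly the caveat most comparison papers omit.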

We Need Better Benchmarks for Machine Learning in Drug Discovery

Most papers describing new methods for machine learning (ML) in drug discovery report some sort of benchmark comparing their algorithm and/or molecular representation with the current state of the art. In the past, I've written extensively about statistics and how methods should be compared. In this post, I'd like to focus instead on the datasets we use to benchmark and compare methods. Many papers I've read recently use the MoleculeNet dataset, released by the Pande group at Stanford in 2017, as the "standard" benchmark. This is a mistake. In this post, I'd like to use the MoleculeNet dataset to point out flaws in several widely used benchmarks. Beyond this, I'd like to propose some alternate strategies that could improve benchmarking efforts and help the field move forward. To begin, let's examine the MoleculeNet benchmark, which, to date, has been cited more than 1,800 times. The MoleculeNet collection consists of 16 datasets divided into 4 categories…
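One class of flaw that turns up in widely used benchmark sets is duplicate structures, sometimes with conflicting labels. A minimal sanity check along these lines can be sketched with pandas; the tiny table below is invented for illustration, and in a real workflow you would first canonicalize the SMILES (e.g., with the RDKit) so that different spellings of the same molecule group together.

```python
import pandas as pd

# Invented example: CCO appears twice (consistent), CCN appears twice
# with conflicting labels
df = pd.DataFrame({
    "SMILES": ["CCO", "CCO", "c1ccccc1", "CCN", "CCN"],
    "label":  [1, 1, 0, 0, 1],
})

# Rows whose structure already appeared earlier in the table
dup_count = int(df.duplicated(subset="SMILES").sum())

# Structures that appear with more than one distinct label
conflicts = (
    df.groupby("SMILES")["label"].nunique().loc[lambda s: s > 1].index.tolist()
)
print(f"duplicate rows: {dup_count}, conflicting structures: {conflicts}")
```

Running a check like this before fitting anything costs a few lines of code and can save a benchmark comparison from being built on contradictory data.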

A Simple Tool for Exploring Structural Alerts

When working in drug design, we often need filters to identify molecules containing functional groups that may be toxic or reactive, or that could interfere with an assay. A few years ago, I collected the functional group filters available in the ChEMBL database and wrote some Python code that makes it easy to apply these filters to an arbitrary set of molecules. This functionality is available in the pip-installable useful_rdkit_utils package on PyPI and GitHub. Applying the filters is easy. If we have a Pandas dataframe with a SMILES column, we can do something like this:

import useful_rdkit_utils as uru

reos = uru.REOS("BMS")  # optionally specify the rule set to use
df[['rule_set', 'reos']] = df.SMILES.apply(reos.process_smiles).tolist()

This adds two columns, rule_set and reos, to the dataframe with the name of the rule set and the name of the rule matched by each molecule. If the molecule doesn't match any rules, both columns…

Getting Real with Molecular Property Prediction

Introduction

If you believe everything you read in the popular press, this AI business is easy. Just ask ChatGPT, and the perfect solution magically appears. Unfortunately, that's not the reality. In this post, I'll walk through a predictive modeling example and demonstrate that there are still a lot of subtleties to consider. In addition, I want to show that data is critical to building good machine learning (ML) models. If you don't have the appropriate data, a simple empirical approach may be better than an ML model. A recent paper from Cheng Fang and coworkers at Biogen presents prospective evaluations of machine learning models on several ADME endpoints. As part of this evaluation, the Biogen team released a large dataset of measured in vitro assay values for several thousand commercially available compounds. One component of this dataset is 2,173 solubility values measured at pH 6.8 using chemiluminescent nitrogen detection (CLND), a technique currently considered…
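The claim that any model is only as good as its data can be illustrated with a deliberately simple sketch: compare a trivial mean-predictor baseline against a random forest. Everything below is synthetic and hypothetical (it is not the Biogen solubility data); the point is only that a baseline gives you the yardstick an ML model has to beat, and with uninformative or insufficient data it often won't.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic case where the features genuinely carry signal:
# the target depends on the first feature plus noise
X = rng.normal(size=(500, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: always predict the training-set mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_model = mean_absolute_error(y_te, model.predict(X_te))
print(f"baseline MAE={mae_baseline:.2f}, RF MAE={mae_model:.2f}")
```

Reporting the model's error alongside a baseline like this makes it obvious whether the ML machinery is actually adding value, which is a good habit before trusting any predicted property.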