Using Counterfactuals to Understand Machine Learning Models

While machine learning (ML) models have become integral to many drug discovery efforts, most of these models are "black boxes" that don't explain their predictions. There are several reasons we would like to be able to explain a prediction. First, an explanation can provide scientific insights that will guide the design of new compounds. Second, it can instill confidence among team members. As I've said before, a computational chemist only has two jobs: to convince someone to do an experiment and to convince someone not to do an experiment. These jobs are much easier when you can explain the "why" behind a prediction. Third, explanations help with debugging and improving models, since improving a model is easier if you can understand the rationale behind a prediction. As I wrote in a previous post, identifying and highlighting the molecular features that drive an ML prediction can be difficult. One recent promising approach is the counterfactuals method published by Andrew White's group at the University of Rochester

Build a QSAR Model in 8 Lines of Python

This post is just a pointer to a Jupyter notebook. The code is in this git repo.

Getting Inside the Mind of the Medicinal Chemist with Machine Learning

For over two decades, many people, including me, have been writing programs attempting to replicate "medicinal chemistry intuition". This ability to identify molecules that would be considered drug-like can be valuable in various areas, including purchasing compounds for screening collections and prioritizing molecules output by generative algorithms or other de novo design methods. Currently, the most widely used approach for evaluating drug-likeness is the QED method, published by Andrew Hopkins and coworkers at Pfizer in 2012. QED uses a weighted combination of calculated properties and structural alerts to generate a drug-likeness score for a molecule, with a higher score indicating a more drug-like molecule. Most recently, many generative molecular design methods have used QED as part of their objective function. A new paper from scientists at Novartis and Microsoft presents an alternate approach, called MolSkill, for quantifying drug-likeness. In this study,
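
To make the "weighted combination" idea concrete, here is a minimal sketch of the kind of aggregation QED performs. As I understand the published formulation, each property is first mapped to a desirability between 0 and 1, and the final score is a weighted geometric mean of those desirabilities. The function name, the property subset, and every number below are hypothetical and chosen only for illustration; they are not the fitted desirability functions or weights from the paper.

```python
import math


def weighted_desirability_score(desirabilities, weights):
    """Weighted geometric mean of per-property desirabilities.

    desirabilities: dict of property name -> value in (0, 1]
    weights:        dict of property name -> non-negative weight
    """
    if desirabilities.keys() != weights.keys():
        raise ValueError("desirabilities and weights must use the same property names")
    if any(d <= 0.0 or d > 1.0 for d in desirabilities.values()):
        raise ValueError("each desirability must be in the interval (0, 1]")
    total_weight = sum(weights.values())
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive number")
    # exp(sum(w_i * ln d_i) / sum(w_i)) is the weighted geometric mean
    log_sum = sum(weights[name] * math.log(desirabilities[name]) for name in weights)
    return math.exp(log_sum / total_weight)


# Hypothetical desirabilities for one molecule (NOT values from the QED paper)
example_desirabilities = {"MW": 0.9, "ALOGP": 0.8, "HBD": 0.95, "ALERTS": 0.5}
# Hypothetical equal weights (NOT the published QED weights)
example_weights = {"MW": 1.0, "ALOGP": 1.0, "HBD": 1.0, "ALERTS": 1.0}

score = weighted_desirability_score(example_desirabilities, example_weights)
print(f"toy drug-likeness score: {score:.3f}")
```

Because the mean is geometric, a single poor desirability (here the structural alerts term) pulls the overall score down more than it would in a simple average. For real work, RDKit provides a QED implementation in `rdkit.Chem.QED`, so there is no need to hand-roll the published functions.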

Working With Drug Data from the ChEMBL Database

When working on drug discovery projects, it's handy to have access to a set of chemical structures and associated data for marketed drugs. If you're considering introducing new functionality, someone invariably asks whether that functionality has been used in a marketed drug. It's also helpful to compare the properties of a new compound or compounds to those of marketed drugs. Early in my career, I remember a new medicinal chemist asking Josh Boger, the founder of Vertex Pharmaceuticals, what they should do on their first day of work. Boger responded, "read the Merck Index so you can see what a drug is supposed to look like". Recently, a few papers have been published showing how the properties of drugs have changed over time. I thought it might be helpful to create a notebook showing how to extract and clean drug data from ChEMBL and use it for subsequent analysis. The Jupyter notebook is available here on GitHub and can also be run here on Google Colab. In th

Generative Molecular Design - We Need to Raise the Bar

While it's great that we're now seeing papers describing the experimental validation of generative algorithms for molecular design, we need to consider the significance of these findings and put them into the appropriate context. Over the last five years, we've seen an explosion in the number of papers describing methods for generative molecular design. The 2018 paper by Gómez-Bombarelli, which launched the field, has already been cited more than 2,100 times. For those unfamiliar with the area, generative molecular design algorithms learn the distributions and associations of chemical functionality from a training set, then sample these distributions to generate new molecules. This molecule generation task can be coupled with one or more scoring functions to generate molecules meeting a specific objective, such as a predicted binding affinity. These methods can be considered similar in spirit to techniques for generating photorealistic images, art, or text that have be

AI in Drug Discovery 2022 - A Highly Opinionated Literature Review

Here’s a roundup of some of the papers I found interesting in 2022. This list is heavily slanted to my interests, which lean toward the application of machine learning (ML) in drug design. I’ve added commentary to most of the papers to explain why I found them compelling. I’ve done my best to arrange the papers according to themes. If I omitted a paper, please let me know. I’d be happy to update this summary. This review ended up being longer than I had anticipated, and there are several topics I didn’t cover. If I have some time, this post may get a sequel. Contents 1. Are Deep Neural Networks Better for QSAR? 2. Deep Learning Methods Provide New Approaches to Protein-Ligand Docking 3. Protein Structure Prediction - Pushing AlphaFold2 in New Directions 4. Model Interpretability 5. QM Methods 6. Ultralarge Chemical Libraries 7. Active Learning 8. Molecular Representation 1. Are Deep Neural Networks Better for QSAR? Based on papers I read and reviewed in 2022, there seems to be

Mining Ring Systems in Molecules for Fun and Profit

I've been a longtime fan of Peter Ertl's work on identifying and classifying the ring systems in molecules. I wanted a Python implementation for some of my work, so I coded something similar in spirit to what Peter has published. In this post, I begin by highlighting some of Peter's papers and showing some interesting analyses that can be performed with a tool for extracting ring systems. After introducing the motivation for the work, we get into the geeky details and explore one approach to identifying ring systems. Finally, we will look at a simple application of the method and explore the ring systems in marketed drugs. In an upcoming post, I'll show another, more interesting, application of the method. The code accompanying this post is in a Jupyter notebook on GitHub. In addition, the core code for extracting ring systems from molecules has been incorporated into the latest version of my pip-installable useful_rdkit_utils package. I've also incorpora