Working With Drug Data from the ChEMBL Database

When working on drug discovery projects, it's handy to have access to a set of chemical structures and associated data for marketed drugs. If you're considering introducing new functionality, someone invariably asks whether that functionality has been used in a marketed drug. It's also helpful to compare the properties of a new compound or compounds to those of marketed drugs. Early in my career, I remember a new medicinal chemist asking Josh Boger, the founder of Vertex Pharmaceuticals, what they should do on their first day of work. Boger responded, "Read the Merck Index so you can see what a drug is supposed to look like." Recently, a few papers have been published showing how the properties of drugs have changed over time. I thought it might be helpful to create a notebook showing how to extract and clean drug data from ChEMBL and use it for subsequent analysis. The Jupyter notebook is available here on GitHub and can also be run here on Google Colab. In th

Generative Molecular Design - We Need to Raise the Bar

While it's great that we're now seeing papers describing the experimental validation of generative algorithms for molecular design, we need to consider the significance of these findings and put them into the appropriate context. Over the last five years, we've seen an explosion in the number of papers describing methods for generative molecular design. The 2018 paper by Gómez-Bombarelli, which launched the field, has already been cited more than 2,100 times. For those unfamiliar with the area, generative molecular design algorithms learn the distributions and associations of chemical functionality from a training set, then sample these distributions to generate new molecules. This molecule generation task can be coupled with one or more scoring functions to generate molecules meeting a specific objective, such as a predicted binding affinity. These methods can be considered similar in spirit to techniques for generating photorealistic images, art, or text that have be

AI in Drug Discovery 2022 - A Highly Opinionated Literature Review

Here’s a roundup of some of the papers I found interesting in 2022. This list is heavily slanted to my interests, which lean toward the application of machine learning (ML) in drug design. I’ve added commentary to most of the papers to explain why I found them compelling. I’ve done my best to arrange the papers according to themes. If I omitted a paper, please let me know. I’d be happy to update this summary. This review ended up being longer than I had anticipated, and there are several topics I didn’t cover. If I have some time, this post may get a sequel.

Contents
1. Are Deep Neural Networks Better for QSAR?
2. Deep Learning Methods Provide New Approaches to Protein-Ligand Docking
3. Protein Structure Prediction - Pushing AlphaFold2 in New Directions
4. Model Interpretability
5. QM Methods
6. Ultralarge Chemical Libraries
7. Active Learning
8. Molecular Representation

1. Are Deep Neural Networks Better for QSAR?
Based on papers I read and reviewed in 2022, there seems to be

Mining Ring Systems in Molecules for Fun and Profit

I've been a longtime fan of Peter Ertl's work on identifying and classifying the ring systems in molecules. I wanted a Python implementation for some of my work, so I coded something similar in spirit to what Peter has published. In this post, I begin by highlighting some of Peter's papers and showing some interesting analyses that can be performed with a tool for extracting ring systems. After introducing the motivation for the work, we get into the geeky details and explore one approach to identifying ring systems. Finally, we will look at a simple application of the method and explore the ring systems in marketed drugs. In an upcoming post, I'll show another, more interesting, application of the method. The code accompanying this post is in a Jupyter notebook on GitHub. In addition, the core code for extracting ring systems from molecules has been incorporated into the latest version of my pip-installable useful_rdkit_utils package. I've also incorpora

Clustering Fragment Screening Hits With a Self-Organizing Map

In a paper, "Fragment binding to the Nsp3 macrodomain of SARS-CoV-2 identified through crystallographic screening and computational docking", published last year by scientists from UCSF and the Diamond Light Source, the authors reported more than 200 structures of fragments bound to the Nsp3 macrodomain of SARS-CoV-2. I wanted to dig into the supporting data from this paper and compare fragments that were binding to the same site. I couldn't find a good tool to do this, so I decided to write one. After writing the code, I thought that others might find the methodology instructive. I have a Jupyter notebook with the code on GitHub. There is also a link to run the notebook on Google Colab. Here's a quick outline of the workflow.

1. Download the structures from the PDB
The supporting information for the paper has a list of the PDB IDs for the structures. In addition, I was able to obt

The Solubility Forecast Index

Introduction
Recently, I've seen a number of deep learning models designed to predict the aqueous solubility of drug-like molecules. Despite the advantages brought about by techniques like graph neural networks, I have yet to see a commercial or open-source method that outperforms the venerable Solubility Forecast Index (SFI). I've written about the challenges associated with predicting aqueous solubility before, so I won't revisit that discussion. Needless to say, this is a difficult problem. The SFI, published in 2010 by Alan Hill and Robert Young at GSK, provides a simple, elegant equation for estimating aqueous solubility.

SFI = cLogD(pH 7.4) + #Ar

Here, cLogD(pH 7.4) is the calculated partition coefficient of all neutral and ionic species of a molecule between pH 7.4 buffer and an organic phase, and #Ar is the number of aromatic rings. This seems pretty simple and should be easy to calculate. The number of aromatic rings can be trivially calcul
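As a minimal sketch, the SFI equation above can be written as a small Python function. The function name and argument names are hypothetical and not taken from the original post, and the sketch assumes both the cLogD at pH 7.4 and the aromatic ring count have already been computed by whatever tools you have available.

```python
def solubility_forecast_index(clogd_ph74: float, num_aromatic_rings: int) -> float:
    """Return SFI = cLogD(pH 7.4) + number of aromatic rings.

    Both inputs are assumed to be precomputed elsewhere; this function
    only performs the addition from the SFI equation.
    """
    if num_aromatic_rings < 0:
        # A negative ring count is not physically meaningful
        raise ValueError("num_aromatic_rings must be non-negative")
    return clogd_ph74 + num_aromatic_rings


# Illustrative values only, not data for a real compound
example_sfi = solubility_forecast_index(2.5, 2)
print(example_sfi)  # 4.5
```

Keeping the two terms as separate inputs makes it easy to swap in a different cLogD estimate without touching the rest of the calculation.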

Useful RDKit Utilities

There's a lot of useful functionality in the RDKit. My problem is remembering where all of the most useful bits are, and how to use them. In order to make my life, and perhaps yours, a little easier, I put together a Python package called "useful_rdkit_utils". Some of what's in there is simply a repackaging of existing functionality to make it easier to use (at least for me). In other cases, there are functions I borrowed from elsewhere, and there are a few new ideas introduced. One interesting component in the library is a REOS class that encapsulates the functionality in the rd_filters package I released a few years ago. I made the package easy to install. All you have to do is "pip install useful_rdkit_utils". The GitHub repo also has Jupyter notebooks that demonstrate some of the functions in the package. I'm planning to continue to add to the package, and I'm very open to pull requests with corrections and additions. This is m