Clustering Fragment Screening Hits With a Self-Organizing Map

In a paper, "Fragment binding to the Nsp3 macrodomain of SARS-CoV-2 identified through crystallographic screening and computational docking", published last year by scientists from UCSF and the Diamond Light Source, the authors reported more than 200 structures of fragments bound to the Nsp3 macrodomain of SARS-CoV-2. I wanted to dig into the supporting data from this paper and compare fragments that were binding to the same site. I couldn't find a good tool to do this, so I decided to write one. After writing the code, I thought that others might find the methodology instructive. I have a Jupyter notebook with the code on GitHub. There is also a link to run the notebook on Google Colab. Here's a quick outline of the workflow. 1. Download the structures from the PDB. The supporting information for the paper has a list of the PDB IDs for the structures. In addition, I was able to obt

The Solubility Forecast Index

Introduction
Recently, I've seen a number of deep learning models designed to predict the aqueous solubility of drug-like molecules. Despite the advantages brought about by techniques like graph neural networks, I have yet to see a commercial or open-source method that outperforms the venerable Solubility Forecast Index (SFI). I've written about the challenges associated with predicting aqueous solubility before, so I won't revisit that discussion. Needless to say, this is a difficult problem. The SFI, published in 2010 by Alan Hill and Robert Young at GSK, provides a simple, elegant equation for estimating aqueous solubility.
SFI = cLogD_pH7.4 + #Ar
Where cLogD_pH7.4 is the calculated partition coefficient of all neutral and ionic species of a molecule between pH 7.4 buffer and an organic phase, and #Ar is the number of aromatic rings. This seems pretty simple and should be easy to calculate. The number of aromatic rings can be trivially calcul
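As a quick worked example of the SFI equation above, here is a minimal sketch in plain Python. The function name and the input values are made up for illustration. The cLogD at pH 7.4 and the aromatic ring count are assumed to have been computed elsewhere, and nothing here reproduces the code from the post.

```python
def solubility_forecast_index(clogd_ph74, num_aromatic_rings):
    """Return SFI = cLogD(pH 7.4) + number of aromatic rings."""
    if num_aromatic_rings < 0:
        raise ValueError("num_aromatic_rings must be non-negative")
    return clogd_ph74 + num_aromatic_rings


# Hypothetical (name, cLogD at pH 7.4, aromatic ring count) tuples.
# These numbers are invented for illustration, not measured or calculated data.
examples = [
    ("mol_1", 2.5, 3),
    ("mol_2", -0.5, 1),
]

for name, clogd, n_ar in examples:
    sfi = solubility_forecast_index(clogd, n_ar)
    print(f"{name}: SFI = {sfi:.1f}")
```

The two terms are simply added, so every additional aromatic ring raises the SFI by one unit, the same as a one log unit increase in cLogD.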

Useful RDKit Utilities

There's a lot of useful functionality in the RDKit. My problem is remembering where all of the most useful bits are, and how to use them. In order to make my life, and perhaps yours, a little easier, I put together a Python package called "useful_rdkit_utils". Some of what's in there is simply a repackaging of existing functionality to make it easier to use (at least for me). In other cases, there are functions I borrowed from elsewhere, and there are a few new ideas introduced. One interesting component in the library is a REOS class that encapsulates the functionality in the rd_filters package I released a few years ago. I made the package easy to install. All you have to do is "pip install useful_rdkit_utils". The GitHub repo also has Jupyter notebooks that demonstrate some of the functions in the package. I'm planning to continue to add to the package, and I'm very open to pull requests with corrections and additions. This is m

Picking the Highest Scoring Molecule(s) From Each Cluster

Here's a quick post based on a conversation with a friend who wanted to be able to cluster a set of docked molecules based on fingerprints and select the highest scoring molecule(s) from each cluster. As usual, Pandas made this super easy. The code for this example can be run on Colab and is also available as a Gist.
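The linked notebook does this with Pandas. To show the underlying logic without any dependencies, here is a rough sketch in plain Python that groups records by cluster and keeps the best n from each. The field names ("name", "cluster", "score"), the sample data, and the higher_is_better flag are assumptions for illustration; many docking programs report scores where lower is better, so set the flag accordingly.

```python
from collections import defaultdict


def top_per_cluster(records, n=1, higher_is_better=True):
    """Return the n best-scoring records from each cluster, ordered by cluster id."""
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec["cluster"]].append(rec)

    picked = []
    for cluster_id in sorted(by_cluster):
        # Rank the members of this cluster, best score first.
        ranked = sorted(
            by_cluster[cluster_id],
            key=lambda rec: rec["score"],
            reverse=higher_is_better,
        )
        picked.extend(ranked[:n])
    return picked


# Hypothetical docked molecules with precomputed cluster ids and scores.
docked = [
    {"name": "mol_a", "cluster": 0, "score": 7.1},
    {"name": "mol_b", "cluster": 0, "score": 8.4},
    {"name": "mol_c", "cluster": 1, "score": 6.2},
    {"name": "mol_d", "cluster": 1, "score": 5.9},
    {"name": "mol_e", "cluster": 2, "score": 9.0},
]

best = top_per_cluster(docked, n=1)
print([mol["name"] for mol in best])
```

With a DataFrame, the same idea is usually written as a sort followed by a group-by, along the lines of `df.sort_values("score", ascending=False).groupby("cluster").head(n)`.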

Exploratory Data Analysis With mols2grid and Bemis-Murcko Frameworks

One of the most common tasks in Cheminformatics is exploratory data analysis (EDA). Given a new dataset, we often need to rapidly explore the chemistry in a set containing hundreds, or even thousands, of molecules. One useful technique for EDA is the Bemis-Murcko framework. This technique, originally published by Guy Bemis and Mark Murcko, provides a simple but elegant means of grouping molecules. Bemis-Murcko frameworks (also known as scaffolds) are created by successively removing monovalent atoms until only ring atoms and linker atoms remain. There are a few nuances having to do with the removal of exocyclic double bonds and the maintenance of aromaticity, but the method itself is very easy to understand. There are two versions of the Bemis-Murcko framework, which are sometimes confused. In the first version, illustrated in the top row of the figure below, monovalent atoms are removed until only ring atoms and linker atoms remain. In the second version, a generic framework

Similarity Search and Some Cool Pandas Tricks

In this post, we're going to take a look at molecular similarity searches. Molecular similarity is central to a lot of what we do in Cheminformatics. It's important for identifying analogs and understanding SAR. Molecular similarity is also at the core of many clustering methods that we use to understand datasets or design screening libraries. In this example, we'll be using the chemfp package by Andrew Dalke. Chemfp has both free and paid tiers. With the free tier, you can perform similarity searches on smaller datasets, like the one we're using here. For larger datasets, you need to purchase the paid version. Chemfp is a great package. If you're using it for production drug discovery, you should buy a license. In addition to performing searches with chemfp, we'll also go over a few Pandas tricks that will enable us to rapidly process the output from chemfp. Here's a link to the tutorial notebook on Google Colab and on GitHub. I'd like
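To make the idea of a fingerprint similarity search concrete, here is a small sketch in plain Python. It does not use chemfp or reproduce its API. It assumes fingerprints are represented as sets of "on" bit indices and that similarity is measured with the Tanimoto coefficient, which is a common choice for fingerprint comparisons. The function names, threshold, and toy fingerprints are all hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similarity_search(query_fp, target_fps, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, most similar first."""
    hits = []
    for name, fp in target_fps.items():
        score = tanimoto(query_fp, fp)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints (sets of on-bit indices), invented for illustration.
query = {1, 2, 3, 4}
targets = {
    "t1": {1, 2, 3, 4},
    "t2": {1, 2, 3, 5},
    "t3": {7, 8},
}

for name, score in similarity_search(query, targets, threshold=0.5):
    print(f"{name}: {score:.2f}")
```

A dedicated package is much faster on real datasets than this brute-force loop; the sketch only shows what a thresholded similarity search computes.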

Building a multiclass classification model

 A pointer to the fastpages site.