Posts

Using Reaction Transforms to Understand SAR

One of the most effective ways of understanding structure-activity relationships (SAR) is by comparing pairs of compounds that differ by a single, consistent feature. By doing this, we can often understand the impact of this feature on biological activity or physical properties. To identify these pairs efficiently, we sometimes need an automated method that can sift through large datasets. In this post, we will show how we can use the chemical reaction capabilities in a Cheminformatics toolkit to identify interesting pairs of compounds. As usual, the code to accompany this post is on GitHub. The idea of comparing the biological activity of pairs of compounds that differ by only a single feature has been a key concept since the beginning of medicinal chemistry. Over the last 15 years, software tools for generating "matched molecular pairs" (MMP) have become a common component of Cheminformatic analyses. For those unfamiliar with t...
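To give a flavor of the approach, here is a minimal sketch of applying a reaction transform with RDKit. The SMARTS pattern and molecule below are illustrative examples I chose, not the transforms from the post; the general idea is that applying a transform to every compound in a dataset and matching the products back against the dataset surfaces matched pairs.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Illustrative transform: replace an aromatic chlorine with fluorine.
# Running this over a dataset and looking up each product among the
# existing compounds is one way to enumerate Cl -> F matched pairs.
rxn = AllChem.ReactionFromSmarts("[c:1]Cl>>[c:1]F")

mol = Chem.MolFromSmiles("Clc1ccccc1")  # chlorobenzene
products = rxn.RunReactants((mol,))
prod = products[0][0]
Chem.SanitizeMol(prod)  # products from RunReactants need sanitization
print(Chem.MolToSmiles(prod))  # the fluorinated analog
```

Comparing the measured activity of the input and any matching product is then a direct read-out of what the Cl-to-F change does.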

Where's the code?

Yesterday I posted this tweet, which seemed to resonate with quite a few people. Ok, at least it got more likes than my tweets usually do. Of course, while writing snarky tweets is fun, it doesn’t do anything to address the underlying problem. Computational Chemistry and Cheminformatics (including applications of machine learning in drug discovery) are among the few remaining fields where publishing computational methods without source code is still deemed acceptable. I spend a lot of time reading the literature and am constantly frustrated by how difficult it is to reproduce published work. Given the ready availability of repositories for code and data, as well as tools like Jupyter notebooks that make it easy to share computational workflows, this shouldn’t be the case. I really liked this tweet from Egon Willighagen, one of the editors of the Journal of Cheminformatics (J. ChemInf.). Shouldn’t we be able to download...

Clustering 2.1 Million Compounds for $5 With a Little Help From Amazon & Facebook

In this post, I'll talk about how we can use FAISS, an open-source library from Facebook, to cluster a large set of chemical structures. As usual, the code associated with this post is on GitHub. As I wrote in a previous post, k-means clustering can be a useful tool when you want to partition a dataset into a predetermined number of clusters. While there are a number of tricks for speeding up k-means (also mentioned in the previous post), it can still take a long time to run when the number of clusters, or the number of items being clustered, is large. One of the great things about Cheminformatics these days is that we can take advantage of advances in other fields. One such advance comes in the form of a software library called FAISS that was released as open source by Facebook. FAISS is a library of routines for performing extremely efficient similarity searches. While many of us think about similarity searches in terms of chemic...
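For readers unfamiliar with the algorithm FAISS accelerates, here is a minimal NumPy sketch of Lloyd's k-means iteration, treating fingerprints as ordinary float vectors. This is only meant to show the assign/update loop; FAISS's `Kmeans` performs the same partitioning orders of magnitude faster on large datasets.

```python
import numpy as np

def kmeans(X, k, n_iter=25, seed=0):
    """Minimal Lloyd's k-means: alternate nearest-center assignment
    and center recomputation. Illustrative only - use FAISS at scale."""
    rng = np.random.default_rng(seed)
    # initialize centers from k distinct random points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

The quadratic point-by-center distance matrix is exactly the step that dominates the run time, which is why a highly optimized similarity-search library helps so much.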

Multiple Comparisons, Non-Parametric Statistics, and Post-Hoc Tests

In Cheminformatics, we frequently run into situations where we want to compare more than two datasets. When comparing multiple datasets, we have a higher probability of chance correlation, so we must make a few adjustments to the ways in which we compare our data. In this post, we will examine the impact of multiple comparisons and talk about methods known as post-hoc tests that can be used to correct p-values when multiple comparisons are performed. We will also make a brief foray into non-parametric statistics, a set of techniques appropriate for dealing with the skewed data distributions that we often encounter in drug discovery. As usual, all of the data and the code used to perform the analyses in this post are available on GitHub. I always find it easier to understand a method when I’m presented with an example. For this case, we’ll look at a 2011 paper from Li and coworkers. In this paper, the authors compared the impact...
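As a small, self-contained illustration of the p-value correction idea, here is the Holm-Bonferroni step-down adjustment written with only the standard library. This is my own sketch, not necessarily the post-hoc test used in the post; libraries such as statsmodels (`statsmodels.stats.multitest.multipletests`) provide this and several other corrections.

```python
def holm_correction(pvals):
    """Holm's step-down adjustment for multiple comparisons.
    Returns adjusted p-values in the original order."""
    m = len(pvals)
    # process raw p-values from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[idx])
        # enforce monotonicity: an adjusted p-value can never be
        # smaller than one for a smaller raw p-value
        running_max = max(running_max, adj)
        adjusted[idx] = running_max
    return adjusted
```

For example, three raw p-values of 0.01, 0.04, and 0.03 adjust to 0.03, 0.06, and 0.06, so only the first remains significant at the usual 0.05 level.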

Plotting Distributions

In Cheminformatics, we often deal with distributions. We may want to look at the distributions of biological activity in one or more datasets. If we're building predictive models, we may want to look at the distributions of prediction errors. A recent Twitter thread on "dynamite plots", and the fact that they should be banned from all scientific discourse, got me thinking about a few different ways to look at distributions. I thought it might be useful to put together some code examples showing how one can plot distributions. As usual, all of the code I used for analysis and plotting can be found on GitHub.
Dynamite Plots (don't use these)
To demonstrate this concept, we'll look at the dataset published by Lee et al., based on an earlier publication by Wang et al., that I used in my last post. This dataset consists of molecules with a range of activities against eight different targets. One way to plot this data is to use t...
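As one alternative to a dynamite plot (a bar of the mean with an error whisker, which hides the shape of the data), here is a minimal box plot sketch with matplotlib. The data below is synthetic and only stands in for the activity distributions discussed in the post.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# synthetic, skewed "activity" values for two hypothetical datasets
data = [rng.lognormal(0.0, 0.5, 200), rng.lognormal(0.4, 0.5, 200)]

fig, ax = plt.subplots()
# a box plot shows median, quartiles, and outliers,
# rather than collapsing each dataset to mean +/- SEM
ax.boxplot(data)
ax.set_xticklabels(["dataset A", "dataset B"])
ax.set_ylabel("activity (arbitrary units)")
fig.savefig("distributions.png")
```

Swapping `ax.boxplot` for `ax.violinplot` shows the full density estimate instead of the five-number summary, which can be even more informative for skewed data.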

Some Thoughts on Evaluating Predictive Models

I'd like to use this post to provide a few suggestions for those writing papers that report the performance of predictive models. This isn't meant to be a definitive checklist, just a few techniques that can make it easier for the reader to assess the performance of a method or model. As is often the case, this post was motivated by a number of papers in the recent literature. I'll use one of these papers to demonstrate a few things that I believe should be included, as well as a few that I believe should be avoided. My intent is not to malign the authors or their work; I simply want to illustrate a few points that I consider to be important. As usual, all of the code I used to create this analysis is on GitHub. For this example, I'll use a recent paper on ChemRxiv which compares the performance of two methods for carrying out free energy calculations.
1. Avoid putting multiple datasets on the same plot, especially if the combined per...
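Point 1 can be made concrete with a small synthetic example of my own (not data from the paper): when two datasets occupy different activity ranges, pooling them can produce an impressive correlation even when the model has no predictive power within either dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
# two synthetic "targets" whose true and predicted values are pure noise
# within each set, but which sit in different affinity ranges
true_a, pred_a = rng.normal(5.0, 0.3, 50), rng.normal(5.0, 0.3, 50)
true_b, pred_b = rng.normal(9.0, 0.3, 50), rng.normal(9.0, 0.3, 50)

def pearson_r(x, y):
    return float(np.corrcoef(x, y)[0, 1])

r_a = pearson_r(true_a, pred_a)  # near zero: no signal within set A
r_b = pearson_r(true_b, pred_b)  # near zero: no signal within set B
r_pooled = pearson_r(np.concatenate([true_a, true_b]),
                     np.concatenate([pred_a, pred_b]))
# r_pooled is large because the between-dataset offset dominates
```

Reporting per-dataset statistics, or at least plotting each dataset separately, avoids giving the reader this misleading impression.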

My Response to Peter Kenny's Comments on "AI in Drug Discovery - A Practical View From the Trenches"

As I've said before, my goal is not to use this blog as a soapbox. I prefer to talk about code, but I thought I should respond to Peter Kenny's comments on my post, AI in Drug Discovery - A Practical View From the Trenches. I wanted to just leave this as a comment on Peter's blog. Alas, what I wrote is too long for a comment, so here goes. Thanks for the comments, Pete. I need to elaborate on a few areas where I may have been unclear. In defining ML as “a relatively well-defined subfield of AI”, I was simply attempting to establish the scope of the discussion. I wasn’t implying that every technique used to model relationships between chemical structure and physical or biological properties is ML or AI. I should have expanded a bit on the statement that ML is “assigning labels based on data”, a description that I borrowed from Cassie Kozyrkov at Google. I never meant to imply that I was only talking about classification problems. The way...