Showing posts from May, 2020

Using the Structure-Activity Landscape Index (SALI) to Analyze Data From the SARS-CoV-2 MPro Screen

Last week the good folks at the COVID Moonshot Project released their first set of screening data for compounds designed based on the fragment crystal structures released by the Diamond Light Source. Like many others in the community, I was eager to see the data. I thought it might be useful to share some of my thoughts and some of the techniques that I use to sift through screening data and decide what to do next. All of the code used to perform the analyses in this post is available in a Jupyter notebook on GitHub.

Introduction

In this post, I'm going to focus on a metric known as the Structure-Activity Landscape Index (SALI). This technique was first published in 2008 by Rajarshi Guha and John Van Drie and provides a simple means of identifying pairs of compounds where a small change in chemical structure brings about a large change in biological activity or physical properties. These changes can often help us to identify the parts of the molecule that are most imp
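To make the idea concrete, here is a minimal sketch of a SALI calculation in plain Python. It uses the usual definition, SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)), where A is the activity and sim is a structural similarity between 0 and 1. This is not the code from the notebook mentioned above: the function names, the `eps` guard, and the toy compounds (fingerprints written as sets of "on" bits, activities on a pIC50-like scale) are all made up for illustration. In practice the fingerprints would come from a cheminformatics toolkit.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 1.0


def sali(act_i, act_j, similarity, eps=1e-6):
    """SALI = |A_i - A_j| / (1 - sim); eps (an assumption) avoids division by zero."""
    return abs(act_i - act_j) / max(1.0 - similarity, eps)


def rank_pairs(compounds):
    """Return (sali, name_i, name_j) for every pair, highest SALI first.

    `compounds` maps a name to a (fingerprint_bits, activity) tuple.
    """
    results = []
    for (name_i, (fp_i, act_i)), (name_j, (fp_j, act_j)) in combinations(
        compounds.items(), 2
    ):
        sim = tanimoto(fp_i, fp_j)
        results.append((sali(act_i, act_j, sim), name_i, name_j))
    return sorted(results, reverse=True)


# Hypothetical toy data: set of fingerprint bits and an activity value.
toy_compounds = {
    "cpd_a": ({1, 2, 3, 4}, 7.5),
    "cpd_b": ({1, 2, 3, 5}, 5.0),
    "cpd_c": ({6, 7, 8}, 5.2),
}

for score, name_i, name_j in rank_pairs(toy_compounds):
    print(f"{name_i} / {name_j}: SALI = {score:.2f}")
```

In this toy set, cpd_a and cpd_b share most of their bits but differ a lot in activity, so that pair comes out on top. Dissimilar pairs score low even when their activities differ, which is what makes the index useful for spotting activity cliffs.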

Some Thoughts on Comparing Classification Models

I end up reviewing a lot of papers on applications of machine learning in drug discovery, and many of these papers are quite similar. The authors will use one or more datasets to compare the performance of a few different predictive models. For instance, a group may compare the performance of some neural network variant with an established method like Random Forest. These comparisons invariably use the same plot: a bar chart showing the mean value of a metric like the ROC AUC across five or ten cycles of cross-validation. In some, but not all, cases the authors will include a "whisker" showing the standard deviation across the cross-validation cycles. The authors will then point out that their method has the highest mean AUC and declare victory. At this point, I'll typically write a review where I point out that the authors have failed to perform any sort of statistical analysis to demonstrate that their method is significantly better, or even di
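As a rough illustration of the kind of check that goes beyond comparing mean AUCs, here is a minimal sketch of an exact paired sign-flip permutation test on per-fold scores. This is one simple option chosen for the example, not necessarily the analysis the post goes on to recommend. The function name and the AUC values are invented for illustration. Cross-validation folds are also not fully independent, so the resulting p-value should be read with caution.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired per-fold score differences.

    Under the null hypothesis that the two models perform equally, the sign
    of each per-fold difference is arbitrary, so every sign pattern is
    enumerated and compared with the observed mean difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired fold by fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        # Small tolerance so floating point noise does not drop exact ties.
        if flipped >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Made-up per-fold ROC AUC values for two hypothetical models.
auc_model_a = [0.80, 0.82, 0.79, 0.81, 0.83]
auc_model_b = [0.78, 0.80, 0.78, 0.79, 0.80]

p_value = paired_sign_flip_pvalue(auc_model_a, auc_model_b)
print(f"mean AUC A = {mean(auc_model_a):.3f}, B = {mean(auc_model_b):.3f}")
print(f"two-sided p-value = {p_value:.4f}")
```

With only five folds there are just 32 sign patterns, so the smallest two-sided p-value this test can return is 2/32. That alone shows how little a five-fold bar chart can say about whether one model is really better than another.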