Multiple Comparisons, Non-Parametric Statistics, and Post-Hoc Tests

In Cheminformatics, we frequently run into situations where we want to compare more than two datasets.  When we make multiple comparisons, the probability of finding a difference by chance increases, so we must make a few adjustments to the ways in which we compare our data.  In this post, we will examine the impact of multiple comparisons and talk about methods known as post-hoc tests that can be used to correct p-values when multiple comparisons are performed.  We will also make a brief foray into non-parametric statistics, a set of techniques appropriate for dealing with the skewed data distributions that we often encounter in drug discovery.  As usual, all of the data and the code used to perform the analyses in this post are available on GitHub.
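To give a flavor of the corrections we'll be discussing, here is a minimal sketch of two common p-value adjustments, Bonferroni and Holm.  The function names are my own, and in practice you'd reach for a library implementation (e.g. statsmodels), but the logic fits in a few lines.

```python
def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of tests."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def holm(p_values):
    """Holm step-down correction: less conservative than Bonferroni."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the rank-th smallest p-value by (n - rank),
        # enforcing monotonicity with a running maximum
        running_max = max(running_max, p_values[idx] * (n - rank))
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

For three raw p-values of 0.01, 0.02, and 0.04, Bonferroni gives roughly 0.03, 0.06, and 0.12, while the less conservative Holm procedure gives roughly 0.03, 0.04, and 0.04.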
I always find it easier to understand a method when I'm presented with an example.  For this case, we'll look at a 2011 paper from Li and coworkers.  In this paper, the authors compared the impact of nine different charge assignment methods on the perfo…

Plotting Distributions

In Cheminformatics, we often deal with distributions.  We may want to look at the distributions of biological activity in one or more datasets.  If we're building predictive models, we may want to look at the distributions of prediction errors.  A recent Twitter thread on "dynamite plots", and the argument that they should be banned from scientific discourse, got me thinking about a few different ways to look at distributions.  I thought it might be useful to put together some code examples showing how one can plot distributions.  As usual, all of the code I used for analysis and plotting can be found on GitHub.

Dynamite Plots (don't use these)
To demonstrate this concept, we'll look at the dataset published by Lee et al., based on an earlier publication by Wang et al., that I used in my last post.  This dataset consists of molecules with a range of activities against eight different targets.  One way to plot this data is to use the classic dynamite plot.  The…

Some Thoughts on Evaluating Predictive Models

I'd like to use this post to provide a few suggestions for those writing papers that report the performance of predictive models.  This isn't meant to be a definitive checklist, just a few techniques that can make it easier for the reader to assess the performance of a method or model.

As is often the case, this post was motivated by a number of papers in the recent literature.  I'll use one of these papers to demonstrate a few things that I believe should be included as well as a few that I believe should be avoided.  My intent is not to malign the authors or the work, I simply want to illustrate a few points that I consider to be important.  As usual, all of the code I used to create this analysis is on GitHub.
For this example, I'll use a recent paper on ChemRxiv that compares the performance of two methods for carrying out free energy calculations.
1. Avoid putting multiple datasets on the same plot, especially if the combined performance is not relevant.  In t…

My Response to Peter Kenny's Comments on "AI in Drug Discovery - A Practical View From the Trenches"

As I've said before, my goal is not to use this blog as a soapbox.  I prefer to talk about code, but I thought I should respond to Peter Kenny's comments on my post, AI in Drug Discovery - A Practical View From the Trenches.  I wanted to just leave this as a comment on Peter's blog.  Alas, what I wrote is too long for a comment, so here goes.

Thanks for the comments, Pete. I need to elaborate on a few areas where I may have been unclear.

In defining ML as “a relatively well-defined subfield of AI” I was simply attempting to establish the scope of the discussion.  I wasn’t implying that every technique used to model relationships between chemical structure and physical or biological properties is ML or AI.

I should have expanded a bit on the statement that ML is “assigning labels based on data”, a description that I borrowed from Cassie Kozyrkov at Google.  I never meant to imply that I was only talking about classification problems.  The way I think about it, a numeric …

K-means Clustering

In Cheminformatics, we frequently run into situations where we want to select a subset from a larger set of molecules.  K-means clustering, a simple but often overlooked technique, can provide a useful solution.  Let's look at a couple of situations where we might need to choose a subset from a larger set of molecules.  After that, we'll briefly describe the method and look at how we can apply it using an Open Source implementation that is available on GitHub.
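To make the selection workflow concrete, here is a minimal numpy sketch (my own toy version, not the GitHub implementation referenced above): cluster the pool into k groups, then take the point closest to each cluster center as its representative.  In practice, the rows of X would be molecular fingerprints or descriptors rather than the invented 2-D blobs used here.

```python
import numpy as np


def kmeans_representatives(X, k, n_iter=25, seed=0):
    """Simple Lloyd's k-means; returns the index of the point
    nearest each final centroid, one representative per cluster."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=0)  # one representative index per cluster


# Toy "descriptor" data: three well-separated blobs of 20 points each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.1, size=(20, 2)) for loc in (0.0, 5.0, 10.0)])
reps = kmeans_representatives(X, k=3)
```

With 1.75 million compounds you'd want something smarter than this brute-force distance matrix (e.g. mini-batch k-means), but the idea of "one representative per cluster" is the same.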

Use Cases
Let's say that a vendor has a set of 2 million compounds available for purchase, but we only have the budget to purchase 50,000.  How would we choose the 50,000?  The first thing we would probably want to do is to filter the 2 million to remove compounds that contain undesirable functionality or fall outside a desired property range.  I wrote an earlier blog post on filtering chemical libraries so I won't spend any time on that here.  Let's imagine that we did the filtering and we still have 1.75 m…

AI in Drug Discovery - A Practical View From the Trenches

It has never been my intent to use this blog as a personal soapbox, but I feel the need to respond to a recent article on AI in drug discovery.  
A recent viewpoint by Allan Jordan in ACS Medicinal Chemistry Letters suggests that we are nearing the zenith of the hype curve for Artificial Intelligence (AI) in drug discovery and that this hype will be followed by an inevitable period of disillusionment.  Jordan goes on to discuss the hype created around computer-aided drug design and draws parallels to current work to incorporate AI technologies in drug discovery.  While the author does make some reasonable points, he fails to highlight specific problems or to define what he means by AI.  This is understandable.  While the term AI is used frequently, most available definitions are still unclear.  Wikipedia defines AI as “intelligence demonstrated by machines”, not a particularly helpful phrase.  We wouldn’t consider a person who can walk around a room without bumping into thing…

Self-Organizing Maps - 90s Fad or Useful Tool? (Part 1)

In this post, I will explain how self-organizing maps (SOMs) work.  In this first part, I'll cover the theoretical underpinnings of the technique.  If you're impatient and just want to get to the implementation, skip to part 2.

A few years ago, I was having a discussion with a computational chemistry colleague, and the topic of self-organizing maps (SOMs) came up.  My colleague remarked, "Weren't SOMs one of those 90s fads, kind of like Furbys?"  While there were a lot of publications on SOMs in the early 1990s, I would argue that SOMs continue to be a useful and somewhat underappreciated technique.

What Problem Are We Trying to Solve?

In many situations in drug discovery, we want to be able to arrange a set of molecules in some sort of logical order.  This can be useful in a number of cases.
Clustering.  Sometimes we want to be able to put a set of molecules into groups and select representatives from each group.  This may be the case when we only h…