Showing posts from January, 2019

My Response to Peter Kenny's Comments on "AI in Drug Discovery - A Practical View From the Trenches"

As I've said before, my goal is not to use this blog as a soapbox. I prefer to talk about code, but I thought I should respond to Peter Kenny's comments on my post, "AI in Drug Discovery - A Practical View From the Trenches". I wanted to just leave this as a comment on Peter's blog. Alas, what I wrote is too long for a comment, so here goes.

Thanks for the comments, Pete. I need to elaborate on a few areas where I may have been unclear.

In defining ML as “a relatively well-defined subfield of AI”, I was simply attempting to establish the scope of the discussion. I wasn’t implying that every technique used to model relationships between chemical structure and physical or biological properties is ML or AI.

I should have expanded a bit on the statement that ML is “assigning labels based on data”, a description that I borrowed from Cassie Kozyrkov at Google.  I never meant to imply that I was only talking about classification problems.  The way I think about it, a numeric …

K-means Clustering

In cheminformatics, we frequently run into situations where we want to select a subset from a larger set of molecules. K-means clustering, a simple but often overlooked technique, can provide a useful solution. Let's look at a couple of situations where we might need to choose a subset from a larger set of molecules. After that, we'll briefly describe the method and look at how we can apply it using an Open Source implementation that is available on GitHub.
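To make the idea of "clustering to pick a subset" concrete, here is a minimal pure-Python sketch. It is not the GitHub implementation mentioned above. It assumes each molecule has already been turned into a numeric descriptor vector (the toy 2-D vectors below are made up), runs a basic Lloyd-style k-means, and then keeps the one molecule closest to each cluster center. The function names `kmeans` and `pick_representatives` are hypothetical, and a real library of millions of compounds would need fingerprints and a much more scalable implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20, seed=0):
    """Basic Lloyd's algorithm: returns (centers, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave the center alone if its cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments


def pick_representatives(points, k, seed=0):
    """Return the index of the point nearest to each cluster center."""
    centers, assignments = kmeans(points, k, seed=seed)
    picks = []
    for c in range(k):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            picks.append(
                min(members, key=lambda i: squared_distance(points[i], centers[c]))
            )
    return picks


# Toy descriptors standing in for molecules (illustrative values only).
descriptors = [
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
    [10.0, 0.0], [10.1, 0.1], [10.0, 0.1],
]
picks = pick_representatives(descriptors, k=3)
print("selected molecule indices:", picks)
```

The number of clusters plays the role of the purchase budget: asking for k clusters yields at most k selected molecules, one per non-empty cluster.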

Use Cases

Let's say that a vendor has a set of 2 million compounds available for purchase, but we only have the budget to purchase 50,000. How would we choose the 50,000? The first thing we would probably want to do is to filter the 2 million to remove compounds that contain undesirable functionality or fall outside a desired property range. I wrote an earlier blog post on filtering chemical libraries so I won't spend any time on that here. Let's imagine that we did the filtering and we still have 1.75 m…