Showing posts from April, 2019

Clustering 2.1 Million Compounds for $5 With a Little Help From Amazon & Facebook

In this post, I'll talk about how we can use FAISS, an open-source library from Facebook, to cluster a large set of chemical structures. As usual, the code associated with this post is on GitHub. As I wrote in a previous post, k-means clustering can be a useful tool when you want to partition a dataset into a predetermined number of clusters. While there are a number of tricks for speeding up k-means (also mentioned in the previous post), it can still take a long time to run when the number of clusters, or the number of items being clustered, is large.

One of the great things about cheminformatics these days is that we can take advantage of advances in other fields. One such advance comes in the form of a software library called FAISS that was released as open source by Facebook. FAISS is a library of routines for performing extremely efficient similarity searches. While many of us think about similarity searches in terms of chemical fingerprints, similar techniques