Clustering 2.1 Million Compounds for $5 With a Little Help From Amazon & Facebook
In this post, I'll talk about how we can use FAISS, an open source library from Facebook, to cluster a large set of chemical structures. As usual, the code associated with this post is on GitHub.
As I wrote in a previous post, K-means clustering can be a useful tool when you want to partition a dataset into a predetermined number of clusters. While there are a number of tricks for speeding up k-means (also mentioned in the previous post), it can still take a long time to run when the number of clusters, or the number of items being clustered, is large.
One of the great things about Cheminformatics these days is that we can take advantage of advances in other fields. One such advance comes in the form of a software library called FAISS, which was released as open source by Facebook. FAISS is a library of routines for performing extremely efficient similarity searches. While many of us think about similarity searches in terms of chemical fingerprints, similar techniques are at the heart of modern commerce. Our characteristics, such as likes, dislikes, and purchases, are represented as vectors, and similarity searches among these vectors enable websites to tailor content and point out that people who bought this also bought that. Of course, Facebook would like to do this quickly, thus the motivation behind FAISS.
In addition to similarity search, FAISS also provides routines to perform k-means clustering. As implemented, FAISS can run on either a CPU or a GPU. Having seen the performance increases that can be achieved on a GPU, I was initially drawn to the GPU implementation. Unfortunately, I found it difficult to get the data I was using into GPU memory. This is something I'll spend some more time with in the future, but for now, I have work to do.
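For orientation, here is a minimal sketch of the FAISS k-means API showing the CPU/GPU switch. The array sizes are toy placeholders rather than the real dataset, and the gpu option requires the faiss-gpu build and enough GPU memory to hold the data.

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu for the GPU build)

d = 256   # fingerprint length
k = 100   # number of clusters (50,000 in the real run)
data = np.random.rand(10_000, d).astype('float32')  # stand-in for a fingerprint matrix

# The same Kmeans wrapper covers both back ends: it runs on the CPU by default,
# and gpu=True moves training onto the GPU if the data fits in GPU memory.
kmeans = faiss.Kmeans(d, k, niter=20, verbose=True)
# kmeans = faiss.Kmeans(d, k, niter=20, verbose=True, gpu=True)
kmeans.train(data)
```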
Being pragmatic, I decided to try out the CPU implementation. Fortunately, FAISS scales across multiple CPUs, and CPU memory is a lot easier to come by. On AWS, I can rent an m5.24xlarge instance with 96 CPUs for $4.61/hr. That's a lot of CPU! Let's see what we can do with it.
OK, here's the problem we want to solve. We have a commercial library of 2.1 million molecules, and we would like to cluster it into 50K clusters. We can break this problem down into two steps. The first step is to generate the fingerprints and write them to disk.
Step 1 - Creating a Fingerprint File
In a previous post, I used the parquet library to store a Pandas dataframe containing fingerprints to disk. After some experimentation, I found that this approach was pretty slow when the dataset got bigger. For datasets containing more than a million fingerprints, I found that storing the data in an HDF5 file was much faster, and the h5py library makes this pretty easy. The other nice aspect of HDF5 files is that it's easy to integrate other information, like SMILES strings and molecule names, into the same file.
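As a concrete illustration, here's a minimal h5py sketch for writing fingerprints, SMILES, and names into one HDF5 file and reading them back. The dataset names are my own choices for illustration, not necessarily the ones gen_fp.py uses.

```python
import h5py
import numpy as np

# Toy data standing in for the real arrays; HDF5 wants fixed-width byte strings.
smiles = np.array(["c1ccccc1", "CCO"], dtype="S")
names = np.array(["mol_0001", "mol_0002"], dtype="S")
fps = np.zeros((2, 256), dtype=np.float32)  # one 256-bit fingerprint per row

# Fingerprints, SMILES, and names all live in the same file as separate datasets.
with h5py.File("example.h5", "w") as h5:
    h5.create_dataset("fp", data=fps)
    h5.create_dataset("smiles", data=smiles)
    h5.create_dataset("name", data=names)

# Reading everything back is just as direct.
with h5py.File("example.h5", "r") as h5:
    fps_in = h5["fp"][:]
    smiles_in = [s.decode() for s in h5["smiles"][:]]
```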
The GitHub repo has a Python script called gen_fp.py which generates fingerprints and writes them to an HDF5 file. I just went with a simple 256-bit Morgan fingerprint as the default; it's easy enough to change this if you want to use something else. Using the script is pretty simple.
gen_fp.py test2100K.smi test2100K.h5
It took me 11.9 minutes to generate fingerprints for 2.1 million molecules on a single core.
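For reference, the per-molecule calculation behind a 256-bit Morgan fingerprint is the standard RDKit recipe. The sketch below is my own minimal version of what such a script might do, not the gen_fp.py code itself.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.DataStructs import ConvertToNumpyArray

def morgan_fp(smiles, radius=2, n_bits=256):
    """Return a 256-bit Morgan fingerprint as a float32 row, or None if the SMILES won't parse."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int32)
    ConvertToNumpyArray(bv, arr)
    return arr.astype(np.float32)  # FAISS expects float32 input

# Stack the rows into the matrix that gets written to the HDF5 file.
fps = np.array([fp for fp in map(morgan_fp, ["c1ccccc1O", "CCN"]) if fp is not None])
```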
Step 2 - Clustering 2.1 million molecules into 50,000 clusters
The GitHub repo has another script called faiss_kmeans.py that performs the clustering. Again, the command line is pretty simple.
faiss_kmeans.py test2100K.h5 50000
This clustering will produce two output files.
- centers.smi contains the SMILES for the most central molecule in each cluster.
- detail.csv contains the SMILES, molecule name, distance to the cluster center, and cluster number for each molecule in the input file.
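To give a feel for the FAISS side of things, here's a rough sketch of what a script like this has to do: train k-means on the fingerprint matrix, assign each molecule to its nearest centroid, and pull out the molecule closest to each centroid. It assumes the HDF5 dataset names from the sketch in step 1 and is not the faiss_kmeans.py code itself.

```python
import h5py
import numpy as np
import faiss

N_CLUSTERS = 50_000

# Load the fingerprint matrix and SMILES written in step 1 (dataset names assumed).
with h5py.File("test2100K.h5", "r") as h5:
    fps = h5["fp"][:].astype(np.float32)
    smiles = [s.decode() for s in h5["smiles"][:]]

# Train k-means; FAISS spreads the work across all available CPU cores.
kmeans = faiss.Kmeans(fps.shape[1], N_CLUSTERS, niter=20, verbose=True)
kmeans.train(fps)

# Assign every molecule to its nearest centroid (the basis for detail.csv) ...
dist_to_center, cluster_id = kmeans.index.search(fps, 1)

# ... and find the molecule closest to each centroid (the basis for centers.smi).
index = faiss.IndexFlatL2(fps.shape[1])
index.add(fps)
_, center_idx = index.search(kmeans.centroids, 1)
center_smiles = [smiles[int(i)] for i in center_idx.ravel()]
```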
Here's a screenshot of htop while FAISS was clustering across all 96 cores; it's a beautiful thing!
Clustering 2.1 million molecules took 22 minutes on 92 cores, not too bad. It's amazing what you can do with on-demand computing and a little bit of open source code.
I haven't spent a lot of time pushing the limits of this method, but a lot more may be possible. Stay tuned.
Coda
A few questions came up on Twitter so I thought I'd answer them here.
1. Why are you doing this?
As I mentioned in my previous blog post on K-means, this is part of my workflow when I select compounds for a screening collection. I want a diverse set of compounds with lots of analogs that I can purchase. My workflow typically goes like this.
- Filter to remove molecules with objectionable properties or functionality. See my post on filtering chemical libraries for more information.
- Cluster the remaining molecules using K-means
- Select the molecule from each cluster with the largest number of commercially available analogs.
2. What is the size of the largest/smallest clusters? Or what does the distribution of cluster sizes look like?
The largest cluster has 296 members, the smallest has 1. The arithmetic mean is 42 and the median is 36. The distribution looks like this.
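If you want to reproduce these statistics (and the histogram) from detail.csv, a few lines of pandas will do it. The column name below is a guess about the output format, so adjust it to match the actual header.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("detail.csv")
sizes = df["cluster"].value_counts()  # "cluster" is an assumed column name
print(sizes.max(), sizes.min(), sizes.mean(), sizes.median())

sizes.hist(bins=50)  # cluster size distribution
plt.xlabel("cluster size")
plt.ylabel("count")
plt.show()
```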
3. Did you like what you saw, i.e. did the clusters look sensible?
I can't say that I looked at all or even most of the clusters, but the ones I looked at made sense.