Wicked Fast Cheminformatics with NVIDIA RAPIDS

Graphics Processing Units (GPUs) have revolutionized scientific computing. Scientists have been using GPUs to achieve significant speed-ups in fields ranging from molecular dynamics to machine learning. Unfortunately, programming GPUs is a rather painful process that requires considerable expertise. Fortunately for those of us who'd prefer to forgo the travails of CUDA programming, NVIDIA has released the RAPIDS library, which makes it easy to perform a wide array of data science operations on a GPU. In this post, I'll present a few examples of how we can use RAPIDS to speed up a few tasks that we commonly perform in Cheminformatics. As usual, a Jupyter notebook containing all of the code associated with this post is available on GitHub.

2020-06-23: I made a couple of changes to the code that slightly changed the runtimes and the trustworthiness values for t-SNE. The conclusions are the same: RAPIDS ROCKS!

I've been following RAPIDS since its initial public release and the installation process has gone from incredibly painful to very easy.  Here's how I installed the RDKit and RAPIDS into a conda environment.

1. Install the RDKit into a new conda environment and activate the environment.
conda create -c rdkit -n rdkit_2020_06 rdkit
conda activate rdkit_2020_06

2. Check the versions of Python and CUDA.
python --version
nvcc --version

3. Use the handy RAPIDS configuration tool to set up a conda install command to fit my versions of Python and CUDA.
conda install -c rapidsai -c nvidia -c conda-forge \
    -c defaults rapids=0.14 python=3.6 cudatoolkit=10.1

If you're running a Jupyter notebook on AWS, you'll probably need to execute this command before starting your notebook for the first time. 
pip install environment_kernels

At the heart of RAPIDS is a data structure known as the CUDA dataframe or cudf. One can think of a cudf as a Pandas dataframe that lives on a GPU. This functionality lives in a Python library of the same name, cudf. It's trivial to create a cudf from a Pandas dataframe.

import cudf
import pandas as pd
df = pd.read_csv("chembl_100k.smi.gz",sep=" ",names=["SMILES","Name"])
cu_df = cudf.from_pandas(df)

In the examples we'll be looking at here, we'll primarily be dealing with chemical fingerprints in a numpy array.  The RAPIDS examples contain a handy function for converting a numpy array to a cudf.

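def np2cudf(arr):
    # Convert a 2D numpy array to a cudf dataframe, one column per fingerprint bit
    df = pd.DataFrame({'fea%d'%i:arr[:,i] for i in range(arr.shape[1])})
    gdf = cudf.DataFrame()
    # Copy each column to the GPU, named by its integer position ("0", "1", ...)
    for c,column in enumerate(df):
        gdf[str(c)] = df[column]
    return gdf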

Note that this isn't the most memory-efficient way to hold fingerprints in memory.  If you're dealing with millions of molecules, you may run out of GPU memory. 
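To get a feel for when GPU memory becomes a problem, here is a rough back-of-the-envelope sketch. It assumes every fingerprint bit is stored as one dense dataframe value, as np2cudf does, and it ignores any overhead from cudf or from the algorithms that later run on the data. The function name and the 16 GB budget (the memory on the single V100 in a p3.2xlarge instance) are my own choices for illustration.

```python
def fingerprint_bytes(n_mols, n_bits=2048, bytes_per_value=4):
    """Approximate size of a dense fingerprint table, ignoring library overhead."""
    return n_mols * n_bits * bytes_per_value


GPU_BYTES = 16 * 1024**3  # assumed memory budget: a 16 GB card

# Compare a few dataset sizes and value widths against the budget
for n_mols in (100_000, 1_000_000, 10_000_000):
    for label, width in (("int8", 1), ("float32", 4), ("int64/float64", 8)):
        size = fingerprint_bytes(n_mols, bytes_per_value=width)
        print(f"{n_mols:>10,} mols  {label:<14} {size / 1024**3:7.2f} GiB  "
              f"fits: {size < GPU_BYTES}")
```

With one million 2048-bit fingerprints, 64-bit values need about 15 GiB for the data alone, which leaves almost no headroom on a 16 GB card. The width you actually get depends on the dtype of the numpy array you pass in.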

In addition to speeding up operations such as searching and sorting dataframes, RAPIDS contains optimized implementations of several machine learning algorithms that operate on a cudf. This functionality is in a Python library called cuml. In this post, we'll take a look at GPU implementations of a few unsupervised machine learning algorithms that I've covered in previous posts.

K-Means Clustering
As I discussed in my previous post, K-Means clustering provides an efficient means of dividing a set of molecules into a predefined number of clusters. We'll benchmark the performance of the RAPIDS implementation of k-means by comparing it with the MiniBatch K-Means implementation in scikit-learn. All of the benchmarks discussed here were run on a p3.2xlarge instance on AWS with an NVIDIA V100 GPU. In the plot below, we compare the time required to cluster 10,000 molecules into 100 to 1,000 clusters.

As we can see from the plot above, the RAPIDS GPU implementation is considerably faster than the CPU implementation in scikit-learn. To quantify the difference, we'll plot the ratio of CPU runtime to GPU runtime. From the plot below we can see that the RAPIDS GPU implementation is 5 to 45 times faster than the sklearn CPU implementation.

Finally, to get a better idea of performance, we'll run a benchmark with a larger set of 100,000 molecules. We see that we can cluster 100K fingerprints into 1,000 clusters in less than 8 seconds. I'm not sure why the run with 700 clusters was slower, but I ran this a few times and saw the same behavior. Note that, for comparison, I tried to run this clustering on a CPU, but the clustering was taking hours and I was getting bored. Datasets of this size should definitely be run on a GPU.
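A comparison like the one described above can be sketched roughly as follows. This is a minimal sketch, not the code from the notebook: the helper names time_call, speedup and benchmark_kmeans are hypothetical, and I'm assuming that cuml's KMeans accepts a float32 numpy array directly. If your version does not, convert the array with np2cudf first.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(cpu_seconds, gpu_seconds):
    """Ratio of CPU runtime to GPU runtime; values above 1 favor the GPU."""
    return cpu_seconds / gpu_seconds


def benchmark_kmeans(fp_array, n_clusters, repeats=3):
    """Time scikit-learn MiniBatchKMeans against cuml KMeans on one array."""
    # Imported here so the timing helpers above work without a GPU
    from sklearn.cluster import MiniBatchKMeans
    from cuml import KMeans as cuKMeans

    X = fp_array.astype("float32")  # assumption: cuml wants floating-point input
    cpu = time_call(lambda: MiniBatchKMeans(n_clusters=n_clusters).fit(X), repeats)
    gpu = time_call(lambda: cuKMeans(n_clusters=n_clusters).fit(X), repeats)
    return cpu, gpu, speedup(cpu, gpu)
```

Calling benchmark_kmeans(fp_array, 500) on a fingerprint array would return the two runtimes and their ratio. Taking the best of a few repeats keeps one-off costs, such as GPU start-up on the first call, from dominating the comparison.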

Visualizing Chemical Space with t-SNE and UMAP
In a previous post, I wrote about t-SNE, a dimensionality reduction technique that can be used to visualize the chemical space covered by a set of molecules. A couple of comments on that post suggested UMAP as an alternative to t-SNE. Fortunately, both of these algorithms are implemented in RAPIDS.

As a test, I used t-SNE to map the chemical space for a set of 1,495 drugs from the ChEMBL database. As above, the RAPIDS implementation was quite a bit faster. Mapping the 1,495 molecules took 6.7 seconds using the CPU implementation in scikit-learn, while the GPU implementation in RAPIDS only took 1 second. While it's nice that RAPIDS was faster, the results don't look the same as those obtained with scikit-learn.

To test this further, I calculated the trustworthiness, a measure of how faithfully the low-dimensional projection preserves each molecule's nearest neighbors from the high-dimensional space. This value ranges between 0 and 1, with higher values indicating that molecules that are close together in the projection were also close together in the original high-dimensional space. As you can see below, the scikit-learn implementation appears to have a slightly higher value of trustworthiness than the RAPIDS implementation.
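To make the trustworthiness score a bit more concrete, here is a small plain-Python sketch of the standard definition using Euclidean distances. It is only meant to show the idea and is far too slow for real datasets. scikit-learn provides the same measure as sklearn.manifold.trustworthiness, which is what I'd reach for in practice. The toy data at the bottom is made up for illustration.

```python
def trustworthiness(X_high, X_low, k=5):
    """Trustworthiness of the embedding X_low of the points X_high.

    Both arguments are equal-length sequences of coordinate sequences.
    Requires 0 < k < n / 2 so that the normalization stays positive.
    """
    n = len(X_high)
    if len(X_low) != n:
        raise ValueError("X_high and X_low must have the same number of points")
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def neighbors_by_distance(points, i):
        # All other points, closest first
        others = [j for j in range(n) if j != i]
        return sorted(others, key=lambda j: sq_dist(points[i], points[j]))

    penalty = 0
    for i in range(n):
        # Rank (1 = closest) of every other point in the original space
        ranked = neighbors_by_distance(X_high, i)
        rank_high = {j: r for r, j in enumerate(ranked, start=1)}
        # Penalize embedded neighbors that were not among the k closest originally
        for j in neighbors_by_distance(X_low, i)[:k]:
            penalty += max(0, rank_high[j] - k)
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


# Toy example: six points on a line, and a copy with the points shuffled
line = [[0.0], [1.0], [3.0], [6.0], [10.0], [15.0]]
shuffled = [[0.0], [10.0], [1.0], [15.0], [3.0], [6.0]]
print(trustworthiness(line, line, k=2))
print(trustworthiness(line, shuffled, k=2))
```

An embedding that keeps every point's nearest neighbors intact scores 1.0, while the shuffled copy scores lower because points that were far apart have become neighbors.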

While the RAPIDS implementation of t-SNE is quick, I need to understand it better before integrating it into my standard workflow.

I performed a similar comparison on the same ChEMBL drug dataset with a CPU implementation of UMAP and the GPU implementation in RAPIDS.  In this case, the resulting projections and the trustworthiness values were similar.   

Note that we shouldn't compare the trustworthiness between t-SNE and UMAP. The t-SNE results were calculated on the first 50 principal components of the fingerprints, while UMAP was calculated based on the full 2048-bit fingerprints.

The fact that the RAPIDS implementation of UMAP is almost 20x faster opens a lot of possibilities.  Among other things, this sort of speed makes it practical to examine the impact of parameter settings on the output of a method. For instance, in UMAP one can specify the number of neighbor molecules used to calculate the map.  Since the RAPIDS implementation is fast, these scans can be performed in seconds.  Here I evaluate the relationship between trustworthiness and the number of neighbors and show that it maxes out with three neighbors. 

We can also scan another parameter, the minimum distance allowed between points in the low-dimensional map, and show that it has a minimal impact on UMAP performance.
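A parameter scan like the ones above can be sketched in a few lines. Again, this is a sketch under assumptions rather than the notebook code: scan_parameter, best_setting and scan_umap_neighbors are hypothetical helper names, the list of neighbor values is arbitrary, and I'm assuming that cuml's UMAP returns an array that scikit-learn's trustworthiness function can consume when it is given a numpy array.

```python
def scan_parameter(values, embed, score):
    """Return [(value, score)] for each parameter value.

    embed(value) builds an embedding; score(embedding) returns a float.
    """
    return [(v, score(embed(v))) for v in values]


def best_setting(results):
    """Return the (value, score) pair with the highest score."""
    return max(results, key=lambda pair: pair[1])


def scan_umap_neighbors(fp_array, neighbor_values=(2, 3, 5, 10, 15, 25, 50)):
    """Trustworthiness of a cuml UMAP embedding for several n_neighbors values."""
    # Imported here so the generic helpers above work without a GPU
    from cuml import UMAP
    from sklearn.manifold import trustworthiness

    X = fp_array.astype("float32")  # assumption: cuml wants floating-point input
    return scan_parameter(
        neighbor_values,
        embed=lambda n: UMAP(n_neighbors=n).fit_transform(X),
        score=lambda emb: trustworthiness(X, emb),
    )
```

The same pattern works for the minimum distance: swap the embed function for one that builds UMAP(min_dist=d) and pass a list of distances instead of neighbor counts. Keeping the scan generic also makes it easy to run the identical loop against a CPU implementation for comparison.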

In this post, we've looked at a couple of unsupervised machine learning methods that have fast GPU implementations in NVIDIA's RAPIDS package. The functions in RAPIDS are very easy to use and can provide a simple means of speeding up your Cheminformatics workflows. However, as we saw with t-SNE, the results can sometimes be less than satisfactory. It's always good to compare with an established, gold-standard implementation to ensure that you're getting the results you expected. There's a lot more to RAPIDS and I'd urge you to try out some of the demos and adapt them to your work. I'll be covering more on RAPIDS in future posts.

