The Solubility Forecast Index

January 17, 2022

Introduction
Recently, I've seen a number of deep learning models designed to predict the aqueous solubility of drug-like molecules. Despite the advantages brought about by techniques like graph neural networks, I have yet to see a commercial or open-source method that outperforms the venerable Solubility Forecast Index (SFI). I've written about the challenges associated with predicting aqueous solubility before, so I won't revisit that discussion here. Suffice it to say, this is a difficult problem.

The SFI, published in 2010 by Alan Hill and Robert Young at GSK, provides a simple, elegant equation for estimating aqueous solubility.

SFI = cLogDpH7.4 + #Ar

Here, cLogDpH7.4 is the calculated log of the distribution coefficient, which describes how all neutral and ionic species of a molecule partition between pH 7.4 buffer and an organic phase, and #Ar is the number of aromatic rings. This seems pretty simple and should be easy to calculate. The number of aromatic rings can be trivially calculated using the RDKit. The problem is the distribution coefficient. While there are commercial LogD calculators, there are no open-source implementations of cLogD that I'm aware of. If you have access to a commercial LogD calculator, I'd recommend using it. If you don't, you can adopt the approach in this paper and build a model from the cLogD data in the ChEMBL database. It may seem odd to build a model for calculated data, but sometimes we do what we have to do. There are more than 2 million cLogD values in the ChEMBL database, so this should provide a reasonable dataset for building a machine learning (ML) model.
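As a minimal sketch of the equation above, the function below adds a cLogD value to the aromatic ring count from the RDKit. The name calc_sfi and its clogd argument are illustrative and not taken from the repo. The cLogD value is assumed to come from elsewhere, either a commercial calculator or a model trained on ChEMBL data.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors


def calc_sfi(smiles, clogd):
    """Return SFI = cLogD (pH 7.4) + number of aromatic rings.

    `clogd` must be supplied by the caller; the RDKit does not calculate it.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    num_aromatic_rings = rdMolDescriptors.CalcNumAromaticRings(mol)
    return clogd + num_aromatic_rings


# Naphthalene has two aromatic rings; the cLogD value here is made up
print(calc_sfi("c1ccc2ccccc2c1", clogd=3.0))
```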

Building a Machine Learning Model for cLogD

1. Download ChEMBL and extract the cLogD Data
In order to build an ML model, we first need to extract the cLogD data from the ChEMBL database. The GitHub repo associated with this post has a script called calc_descriptors.py that uses the chembl-downloader package written by Charles Tapley Hoyt to download the sqlite version of the ChEMBL database.
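To give a feel for the extraction step, here is a sketch of the kind of SQL query involved, run against a tiny in-memory SQLite database rather than the real ChEMBL file. The table and column names (compound_structures, compound_properties, cx_logd) are my assumption about the ChEMBL schema, and the rows are made up; this is not the query from calc_descriptors.py.

```python
import sqlite3

# Assumed ChEMBL schema: compound_structures holds the SMILES and
# compound_properties holds the calculated LogD in a column named cx_logd.
CLOGD_QUERY = """
SELECT cs.canonical_smiles, cp.cx_logd
FROM compound_structures cs
JOIN compound_properties cp ON cs.molregno = cp.molregno
WHERE cp.cx_logd IS NOT NULL
"""


def fetch_clogd(conn):
    """Return a list of (smiles, clogd) rows from an open SQLite connection."""
    return conn.execute(CLOGD_QUERY).fetchall()


# Tiny in-memory stand-in for the ChEMBL SQLite file, with made-up values
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound_structures (molregno INTEGER, canonical_smiles TEXT);
CREATE TABLE compound_properties (molregno INTEGER, cx_logd REAL);
INSERT INTO compound_structures VALUES (1, 'CCO'), (2, 'c1ccccc1'), (3, 'CC(=O)O');
INSERT INTO compound_properties VALUES (1, -0.2), (2, 2.1), (3, NULL);
""")
rows = fetch_clogd(conn)
print(rows)
```

With the real database, the same query string could be run against the SQLite file that chembl-downloader fetches. Check that package's documentation for the exact helper functions it provides.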

2. Calculate molecular descriptors
Once the data has been extracted from ChEMBL, we read it into a Pandas dataframe, and calculate the RDKit 2D descriptors. This script uses dask to run the calculations in parallel and get a bit of a speedup. Note that it takes about 7 hours to download ChEMBL and calculate the descriptors. However, fear not, the GitHub repo has a stored version of the cLogD model. You don't have to do this part unless you really want to.
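For a single molecule, the descriptor step can be sketched with the RDKit's Descriptors.descList, a list of (name, function) pairs. The function name calc_descriptors below is illustrative and this is not the code from the repo. It also runs serially, whereas the script uses dask to spread the work across cores.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Names of the RDKit 2D descriptors, in the order they are calculated
DESCRIPTOR_NAMES = [name for name, _ in Descriptors.descList]


def calc_descriptors(smiles):
    """Return a list of RDKit 2D descriptor values, or None for a bad SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return [func(mol) for _, func in Descriptors.descList]


values = calc_descriptors("CCO")
print(len(DESCRIPTOR_NAMES), len(values))
```

Returning None for unparseable structures makes it easy to drop those rows before model building.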

3. Build the ML model with LightGBM
Once we have the descriptors and the cLogD values, it's relatively straightforward to create an ML model to predict cLogD from a set of descriptors. Of course, there's one more catch: we have a large dataset. The ChEMBL cLogD dataset has more than 2 million molecules and associated cLogD values. With most ML algorithms, it would take hours to build this model. Fortunately, LightGBM enables us to build an accurate model from millions of rows of data in a few minutes. The notebook build_logd_model.ipynb provides the code for building and saving the ML model.

4. Calculate SFI and make a pretty plot
Finally, with the cLogD model in hand, it's easy for us to calculate SFI. For me, one of the most useful aspects of the original paper by Hill and Young is a figure like the one below. In this figure, the data is color-coded by aqueous solubility.

Red: <30 µM
Yellow: 30-200 µM
Green: >200 µM

This color-coding provides an estimate of the probability of a molecule being soluble given a specific value of SFI. The notebook provides code to generate a plot that closely replicates the one in the original paper.
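The idea behind the color-coded plot can be sketched in a few lines: assign each measured solubility to a color band, then compute the fraction of each band within every SFI bin. The function names and the example records are made up for illustration, and the handling of values exactly at 30 and 200 µM is my assumption. The notebook's plotting code is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def solubility_class(solubility_um):
    """Map a solubility in µM to the red/yellow/green bands."""
    if solubility_um < 30:
        return "red"
    if solubility_um <= 200:
        return "yellow"
    return "green"


def class_fractions_by_sfi(records):
    """Bin (sfi, solubility_um) pairs by SFI rounded down to an integer
    and return the fraction of each color class within every bin."""
    counts = defaultdict(Counter)
    for sfi, solubility_um in records:
        counts[math.floor(sfi)][solubility_class(solubility_um)] += 1
    return {
        sfi_bin: {color: n / sum(c.values()) for color, n in c.items()}
        for sfi_bin, c in sorted(counts.items())
    }


# Made-up (SFI, solubility in µM) pairs
records = [(2.3, 500.0), (2.8, 100.0), (6.1, 5.0)]
fractions = class_fractions_by_sfi(records)
print(fractions)
```

The fraction of green in a bin is the rough probability that a molecule with that SFI value is soluble.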

Conclusion

I hope that releasing this model and the associated code provides two benefits. First, it offers a reasonably reliable estimate of aqueous solubility for drug-like molecules. Second, it can act as a baseline for comparison with other ML models. As usual, the code to accompany this post can be found on GitHub.

Practical Cheminformatics
