### Predicting Aqueous Solubility - It's Harder Than It Looks

In this post, we're going to talk about aqueous solubility and how we can predict it.  We'll explore a few predictive models and talk about the best ways to compare these models.  We will compare ESOL, a simple empirical method that's been in existence for almost 15 years, with a few machine learning models generated with DeepChem.  As usual, all of the code associated with this post is on GitHub.

There was a recent post on the RDKit mailing list where someone was looking for an implementation of the ESOL solubility prediction method, originally published by John Delaney in 2004.  I've used ESOL in the past and found it somewhat useful.  I didn't have an implementation handy, so I threw some code together and put an RDKit implementation on my GitHub site.  Now that I have this code in hand, it gives me the opportunity to talk about a couple of topics that I think are under appreciated.
• Selecting an appropriate test set
• Comparing regression methods
A Bit of Background
Aqueous solubility is a key factor in drug discovery.  If a compound is not soluble, it will typically be poorly bioavailable, making it difficult to use in in-vivo studies, and ultimately to deliver to patients.  Solubility can even be a problem in early discovery.  In some cases, poorly soluble compounds can create problems for biochemical and cellular assays.

As originally defined by Yalkowsky, aqueous solubility is a function of both lipophilicity and crystal packing forces.  In his widely cited 2001 paper, Yalkowsky defined the General Solubility Equation (GSE).

logS =  0.5 - 0.01(MP - 25) - log Kow

Where logS is the aqueous solubility, MP is the melting point in degrees Celsius and log Kow is the log of the octanol-water partition coefficient (LogP) of the un-ionized species.

So far this sounds pretty simple.  All we need to do to predict solubility is predict LogP and melting point.  We're pretty good at predicting LogP.  Work on predicting LogP goes back more than 50 years to the work of Hansch and Fujita.  We have a large body of experimental LogP data on drug-like compounds, and we appear to be quite capable of building reliable models.  Predicting melting point is another story.  To my knowledge, we simply don't have reliable methods of predicting melting point from chemical structure.  In order to effectively predict melting point, we would probably have to know how a molecule stacks in a crystal structure.  While the field of crystal structure prediction has made some progress, the technique is far from routine.  Some of the best structure prediction methods can require months to generate a prediction for a single molecule. The problem is further compounded by the fact that molecules can be crystallized in different crystal forms, better known as polymorphs.  These polymorphs can have dramatically different melting points, and as a result, very different solubilities.  For instance, this paper reports aqueous solubility for 4 different forms for Diflunisal.  Note that there is a 100-fold difference in solubility between the most and least soluble forms.

Does My Test Set Represent My Data?
We can't easily use the GSE to predict the solubility of new compounds where we don't have the melting point, but there are alternate approaches we can take.  There is quite a bit of solubility data in the literature, most of which was originally curated by the Yalkowsky group in their AquaSol database.  We should be able to use this data to generate a predictive model.  In fact, over the last 20 years, several groups have used public datasets to generate models for aqueous solubility.  As examples see the work of Delaney, Huuskonen, and Jorgensen.  These groups have used a variety of descriptors and modeling methods to generate models with impressive correlations.  The question is whether these correlations reflect what will be observed in practice.

The plot below shows the distribution of LogS (log of the molar solubility) for the Delaney set, the distributions for the Huuskonen and Jorgensen work are similar.

This dataset spans a wide dynamic range, with the extremes spanning more than 13 logs.  Having a broad dynamic range makes it much easier to achieve a higher correlation with a regression model.   The problem is that the measured solubilities of molecules we encounter in a drug discovery program don't span such a broad dynamic range.  In a drug discovery program, we will typically have compounds with measured solubility values between 1 uM and 1 mM (the red box shown in the plot above).  Our data often only spans 2 logs, sometimes 3 if we're fortunate.  Correlations calculated from data with such a narrow dynamic range will be considerably lower than those reported in the literature.

If we're going to validate our model for predicting solubility, we should do so with a reliable dataset that represents pharmaceutically relevant compounds.  One such dataset is the DLS-100 dataset published by John B.O. Mitchell and co-workers at the University of St Andrews.  This dataset was collected by a single group, using a consistent, highly accurate, experimental method.  Anyone interested in solubility prediction, or measurement, should check out this paper by Palmer and Mitchell.   The authors point out a number of glaring discrepancies between their measurements and those reported in the literature.  The results in this paper also cast some doubt on the validity of the GSE.

Comparing Predictive Models
In this section, we will build a few models for predicting solubility and compare their performance.  Since the Delaney dataset is large and has been widely applied, we'll use this as a training set.  We can then use the DLS-100 set to validate the model.  Of course, as usual, life isn't simple.  It turns out that 44 of the compounds in the DLS-100 set are also in the Delaney set.  We don't want to train on the validation set so we will use 56 compounds from the DLS-100 set as our validation set.  This set is in the file dls_100_unique.csv in the GitHub repo.

First let's look at ESOL, which is a relatively simple empirical model for aqueous solubility published by Delaney in 2004.  The model, which was built based on literature data augmented by data collected at Syngenta, is represented by this relatively simple equation.

LogS = 0.16 -  0.63 cLogP - 0.0062 MW + 0.066 RB - 0.74 AP

Where LogS is the log of the aqueous solubility, cLogP is the octanol-water partition coefficient as calculated by the Daylight toolkit, MW is the molecular weight, RB is the number of rotatable bonds and AP is the aromatic proportion (number of aromatic atoms / total number of heavy atoms).  Delaney reported an R2 of 0.74, pretty reasonable.  In addition, the equation above tends to make intuitive sense.  Given the above equation, an implementation using the RDKit is easy.  The RDKit provides implementations of all of the descriptors above, however, the definitions in the RDKit are probably a little different from what Delaney used.  As such, I refit the linear model to generate a new set of coefficients, using data from the Delaney supporting material (in the file delaney.csv in the GitHub repo).

LogS = 0.26 -  0.74 LogP - 0.0066 MW + 0.0034 RB - 0.42 AP

These coefficients are relatively similar to those reported by Delaney, with the largest variance in the rotatable bond term.  Believe it or not, there are often large differences in the way that people calculate the number of rotatable bonds.  I should probably look into this a bit more when I have some time.  The code to refit and run the ESOL model is in the file esol.py.

As I mentioned above, I've been spending some time lately building models with the DeepChem package.  DeepChem has a tremendous amount of useful functionality, and  I'm planning on devoting a number of future posts to applications of DeepChem in Cheminformatics.  Fortunately, the DeepChem distribution has a number of examples of solubility models built on the Delaney dataset.  It was relatively straightforward to hack these examples and generate the script solubility_comparison.py, which is in the GitHub repo.  This script trains models on the delaney.csv dataset then tests the models on the dls_100_unique.csv set.  The script builds three different models.

• Random Forest is an ensemble decision tree method that supports both classification and regression.  Since the 2003 publication by the Merck group, this has become a standard for building QSAR models.  The method has been widely implemented and seems to be included as a reference in most papers describing novel QSAR methodologies. The 1024 bit Morgan fingerprints from the RDKit were used as the molecular descriptors.  These fingerprints, which are similar to the popular ECFP, seem to be becoming a standard in the RDKit community.
• Weave is a deep neural network method originally published by Steven Kernes and colleagues at Stanford and Google.  In this method, a molecule is described as a set of "Weave Modules" created by combining (weaving) network layers describing atom features and pair features.  This architecture enables the network to learn more general features of molecules, which will hopefully facilitate generalization of the model.
• Graph Convolutions are another deep learning method inspired by work on image classification.  In the DeepChem implementation, atom-based properties calculated using the RDKit and local graph neighborhoods are combined to create a molecule representation.  This representation is then used with a neural network to generate predictions.
Now that we have all of the pieces in place, we can run the script solubility_comparison.py.  This will generate the file solubility_comparison.csv, which has the experimental and predicted solubilities for the 56 compounds in dls_100_unique.csv.  We can then use the script analyze_results.py to generate the histogram below, which compares the Pearson R2 obtained by comparing the predicted and experimental activities.

Looking at the plot above, we can conclude that ESOL is the best method and a simple linear model outperforms a deep learning model, right?  NO!  Wrong on both counts.  First, ESOL isn't really a simple linear model, the LogP term is parameterized based on a large body of data.

Even more importantly, we have to realize that there are confidence intervals associated with correlation coefficients.  A Pearson correlation coefficient has an associated error, which is a function of both the correlation coefficient and the number of data points.  We will be much more confident in a correlation involving 100 data points than we would in a correlation involving 10 points.  Similarly, we would be more confident in a correlation with a Pearson R2 of 0.8 than we would of a correlation with an R2 of 0.3.  I think the plot below does a good job of conveying the relationship between correlation (as Pearson R2), number of points and the 95% confidence interval.  In the plot below, the 95% confidence interval around R2 is shown as the gray area.  You can see that for smaller numbers of points or a lower values of R2, the confidence interval is wide.  As the correlation and the number of datapoints increases, the confidence interval narrows.  I like this plot so much that I've been thinking about getting t-shirts made, maybe then people will get the message.

I read papers every week where people miss this very important point.  In these papers, the authors run some well accepted method and get a correlation.  They then run their new method, get a slightly larger correlation, and declare victory.  This is just wrong.  If you report correlations, you should report the confidence intervals around those correlations.  No exceptions, reviewers take note.

About a year ago, I released an Open Source package called metk (the model evaluation toolkit).  Among other things, metk contains a set of routines for calculating the confidence interval around a Pearson correlation coefficient.   The figure below was generated using the analyze_results.py script from the GitHub repo associated with this post.  This script uses one of the functions from the metk to calculate confidence intervals for the correlations obtained with the four methods used here.  In the plot below, I've added error bars to the histogram shown above.  As you can see, the error bars overlap and we can't say that one method is superior to another.

Note: The error bars were drawn incorrectly in an earlier version of this post.  This has been corrected and doesn't change the overall conclusion.  The error bars still overlap.

Conclusions
As usual. there's a lot more to say on the topic, but I've probably already rambled on enough.  This is a blog, not a textbook (at least I think so).  I hope you'll take a couple of things away from this rant.

• The correlations we obtain with "standard" datasets don't necessarily reflect how those models will perform on "real" datasets.
• Correlations have confidence intervals.  If you are going to report a correlation, report the associated confidence interval. 1. Really interesting post and clearly applicable to many models. Just thinking about solubility, In drug discovery I've never come across a situation where a compound is too soluble, so perhaps it might be more useful to build a model to simply flag molecules that may have very, very poor solubility?

1. You can certainly build a classification model that will (hopefully) identify insoluble compounds. In practice, I've found that people tend to prefer regression models which can provide a trajectory between where you are and where you want to go. I'm planning blog about building, and evaluating the performance of, classification models in the near future.

2. Definitely an interesting perspective. But I am always surprised by the low level of physicochemical insight which people from the "brute force QSAR" use for building their models. I am teaching my students that one should first tink a bit about the property to predict, and what kinds of descriptors may make sense at all. O.K, you at least anayzed the complication from crystal structure, which makes solubility really hard to predict. Other problems may be the hydrate crystals, i.e. that some compüounds may not crystalize to the pure crystal but to hydrates. A big problem for all QSAR is the intrinscally non-linear character of log-solubility, because in one of the two phases the molecule indluences its free energy not only as solute, but also as the solvent (or crystal) embedding the molecules. This is very different from the situation for logP, where the siolute is in infinte dilution on both side and only is solute. Regarding the suitability of descriptors: I tech my students that almost every QSAR for a physicochemical property using MW as a desriptor shows that people used the blind approach. The molecular weight almost never really influences the property. Gravity or inertia play no role for almost all the properties. Hence MW can only correlate accidentally, because it is typically itself somehow correlated to size, e.g. volume of surface area.

3. Excellent post. Rot bond counts are very variable (I've seen amide bonds counted as rotatable...). The rot bond definitions (SMARTS) used in the ESOL method were:-

[!X1]-,=[\$([C;X4])]-&!@[\$([C;X4])]-[!X1]
[!X1]:c-&!@[\$([C;X4])]-[!X1]
[!X1]-,=C-&!@[\$([N;X4])]-[!X1]
[!X1]-[\$([C;X4])]-&!@[\$([N;X3])]-[!X1]
[!X1]-[\$([C;X4])]-&!@[\$([O;X2])]-[!X1]

There was a bit of Fortran wrapped around to dereplicate for the count.

4. So what do we conclude if we cannot say that none of the methods are significantly better? If we need to use one particular model out of the four, which one should we choose?

1. If the performance of all of the methods is roughly equivalent, I'd go with the one I can best understand. In many cases this is the simplest of the available models. In this case, ESOL is a simple linear model and it's easy to look at coefficients as well as the descriptor values and understand the reasoning behind a prediction. If you are interested in how particular substructures contribute to solubility, you may want to consider a method like RandomForest, which can provide a degree of interpretability. However, note that as Bob Sheridan pointed out in a recent paper, structural interpretation of models is tricky. https://pubs.acs.org/doi/10.1021/acs.jcim.8b00825

2. Thank you! What if it was a comparison between models of the same algorithm but with different sets of descriptors or fingerprints?