Getting Real with Molecular Property Prediction


Introduction
If you believe everything you read in the popular press, this AI business is easy. Just ask ChatGPT, and the perfect solution magically appears. Unfortunately, that's not the reality. In this post, I'll walk through a predictive modeling example and demonstrate that there are still a lot of subtleties to consider. In addition, I want to show that data is critical to building good machine learning (ML) models. If you don't have the appropriate data, a simple empirical approach may be better than an ML model. 

A recent paper from Cheng Fang and coworkers at Biogen presents prospective evaluations of machine learning models on several ADME endpoints.  As part of this evaluation, the Biogen team released a large dataset of measured in vitro assay values for several thousand commercially available compounds.  One component of this dataset is 2,173 solubility values measured at pH 6.8 using chemiluminescent nitrogen detection (CLND), a technique currently considered to be state-of-the-art.  This dataset, which I'll refer to as the Biogen Solubility Dataset (BSD), presents an interesting opportunity to evaluate several newer and older methods for predicting aqueous solubility.   In this post, we'll begin with some exploratory data analysis, then use the BSD to assess the performance of several predictive models.  Along the way, I'll highlight some of the tricks, traps, and pitfalls one encounters when dealing with real data.

We'll look at several approaches to predicting aqueous solubility, starting with simple models and progressing to more sophisticated techniques.  

All the code and data used in this analysis are available on GitHub for those who like to play along.  In this case, there's a lot of code. 

Exploratory Data Analysis
Before building models, we should begin by examining the data.   The Biogen dataset in their GitHub repo has the solubility values listed as the log of the solubility in mg/ml.  I'm accustomed to working with solubility data in µM, so I used a bit of freshman chemistry to do the unit conversion.  
The µM solubility was then converted to LogS, the log of the molar solubility. 
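For anyone following along, here's a minimal sketch of that conversion. The file and column names are placeholders, not the actual names in the Biogen repo; the only chemistry is that mg/mL equals g/L, so dividing by molecular weight gives mol/L.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# hypothetical file and column names for the Biogen solubility data
bsd_df = pd.read_csv("biogen_solubility.csv")
mw = [Descriptors.MolWt(Chem.MolFromSmiles(smi)) for smi in bsd_df.SMILES]

# log10(mg/mL) -> log10(mol/L): mg/mL is g/L, so subtract log10(MW)
bsd_df["LogS"] = bsd_df["LOG_SOLUBILITY_mg_ml"] - np.log10(mw)
# micromolar solubility, if you prefer to think in µM
bsd_df["Sol_uM"] = 10 ** bsd_df["LogS"] * 1e6
```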

Let's begin our exploratory data analysis by looking at the data distribution. 
There are a couple of things to notice here.  First, the dynamic range of the data is small, spanning only 3 logs.  This is much smaller than the dynamic range of the solubility datasets in the MoleculeNet and Therapeutic Data Commons benchmarks, which span 10-12 logs.  The BSD is much more representative of the data encountered in drug discovery programs.  The BSD's limited dynamic range also impacts our predictive model evaluation.  A smaller dynamic range typically leads to lower correlation coefficients (harder) and a lower mean absolute error (easier).  Another challenging aspect of this dataset is its skewed distribution; 58% of the LogS values are between -4 and -3.5.  If we simply guessed that the LogS of every compound was -4 (100µM), we would have a mean absolute error (MAE) of 0.46.  Of course, we don't know the distribution of the data a priori, but I find that a simple analysis like this helps to put the results of predictive modeling efforts in the appropriate context. 
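This naive baseline takes two lines of code, assuming the dataframe built in the conversion sketch above:

```python
import numpy as np

# naive baseline: predict LogS = -4 (100 µM) for every compound
baseline_mae = np.mean(np.abs(bsd_df.LogS - (-4)))
print(f"Constant-prediction MAE: {baseline_mae:.2f}")
```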

How Well Should We Be Able to Model the Data? 
An important first step in any modeling exercise is estimating the impact of the experimental error on our predictive models.  One way of doing this is to add simulated experimental error to the data and examine the impact of this simulated error on the correlation.  I described this method in a previous blog post, so I won't go into detail here. When modeling experimental error, we first need an estimate of the standard deviation in the experimental measurements.  While previous publications estimated the experimental error in solubility measurements at 0.6 logs, a more recent paper by Avdeef revised this estimate to 0.17 logs.  In the figure below, I performed 1,000 trials, selecting random LogS values from the BSD and adding noise drawn from a normal distribution centered at 0 with standard deviations of 0.17, 0.5, and 0.6.  I then calculated the correlation between the true values and the values from the simulation. To be consistent with the Biogen paper, I'll report correlations as Pearson's r. 
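Here's a minimal sketch of this kind of noise simulation; the exact resampling scheme behind the figure may differ slightly, and the LogS column comes from the conversion sketch above.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
logS = bsd_df.LogS.values

# add simulated experimental noise and see how much correlation we could expect
for sd in (0.17, 0.5, 0.6):
    r_values = []
    for _ in range(1000):
        noisy = logS + rng.normal(0, sd, size=len(logS))
        r_values.append(pearsonr(logS, noisy)[0])
    print(f"sd={sd:.2f}  mean Pearson r = {np.mean(r_values):.2f}")
```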



As we can see, experimental error can dramatically impact our ability to model the data.  If the experimental error is 0.6 logs, we can't expect a Pearson's r greater than 0.77.

Modeling Aqueous Solubility with ESOL 
John Delaney published one of the simpler methods for predicting aqueous solubility in 2004.    This method, ESOL, was trained using linear regression of molecular descriptors calculated on an experimental dataset containing 2,874 measured solubilities.  The original ESOL paper defined this equation for calculating solubility.

LogS = 0.16 -  0.63 cLogP - 0.0062 MW + 0.066 RB - 0.74 AP
  • LogS - log of the molar aqueous solubility
  • cLogP - calculated log of the octanol-water partition coefficient, a measure of lipophilicity
  • MW - molecular weight
  • RB - number of rotatable bonds
  • AP - aromatic proportion - number of aromatic atoms/total number of atoms
Some of the descriptors used in the original ESOL paper were calculated with software that is no longer available.  To account for differences in descriptor implementations, I refit the ESOL equation using descriptors calculated with the RDKit.  This is described in more detail in a previous post. 

LogS = 0.26 -  0.74 LogP - 0.0066 MW + 0.0034 RB - 0.42 AP
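The refit equation is easy to reproduce with the RDKit. Here's a minimal sketch; note that I compute the aromatic proportion over heavy atoms, which is an assumption about the descriptor definition.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

def esol_refit(mol):
    """Estimate LogS with the refit ESOL coefficients shown above."""
    logp = Crippen.MolLogP(mol)
    mw = Descriptors.MolWt(mol)
    rb = Lipinski.NumRotatableBonds(mol)
    # aromatic proportion = aromatic atoms / heavy atoms
    ap = sum(a.GetIsAromatic() for a in mol.GetAtoms()) / mol.GetNumHeavyAtoms()
    return 0.26 - 0.74 * logp - 0.0066 * mw + 0.0034 * rb - 0.42 * ap

print(esol_refit(Chem.MolFromSmiles("c1ccccc1O")))  # phenol
```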

If we use ESOL to predict LogS for the BSD, we observe the correlation shown below.   We can see that the large block of data with experimental LogS between -4 and -3.5 is predicted to be anywhere between -2 and -6.5.   The correlation isn't great, and the mean absolute error (MAE) is 7.6 fold.   The values in brackets are bootstrapped 95% confidence intervals.  





The Solubility Forecast Index
The Solubility Forecast Index (SFI), published in 2010 by Alan Hill and Robert Young at GSK, provides a simple, elegant equation for estimating aqueous solubility.  

SFI = cLogD(pH 7.4) + #Ar

In the equation above, cLogD(pH 7.4) is the calculated distribution coefficient of a molecule's neutral and ionized species between pH 7.4 buffer and an organic phase, and #Ar is the number of aromatic rings.  For the analysis below, I used an ML model for LogD that I built based on calculated values in the ChEMBL database.  This model is described in more detail in an earlier post.  A recent publication from Duan and coworkers showed good performance on LogD predictions by pre-training a neural network on the ChEMBL data I used and fine-tuning with experimental data.  I'll take a closer look at this approach in a future post. 
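Once you have a cLogD estimate, the rest of the SFI calculation is one RDKit call. The sketch below falls back to the Crippen LogP as a crude stand-in when no cLogD model is available, which ignores ionization and is not how the analysis in this post was done.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, rdMolDescriptors

def sfi(mol, clogd=None):
    """Solubility Forecast Index = cLogD(pH 7.4) + number of aromatic rings."""
    if clogd is None:
        # crude stand-in; a real cLogD(pH 7.4) model is preferable
        clogd = Crippen.MolLogP(mol)
    return clogd + rdMolDescriptors.CalcNumAromaticRings(mol)

print(sfi(Chem.MolFromSmiles("c1ccc2ccccc2c1")))  # naphthalene
```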

One of the most useful aspects of the original paper by Hill and Young is a figure like the one below.  In this figure, the data is color-coded by aqueous solubility.  More soluble molecules are on the left, and less soluble molecules are on the right. 
  • Red: <30 µM (LogS < -4.5)
  • Yellow: 30-200 µM (LogS -4.5 to -3.7)
  • Green: >200 µM (LogS > -3.7)
This color coding estimates the probability of a molecule being soluble given a specific value of SFI.  The figure below shows binned SFI scores for the BSD, colored by experimental solubility.  


Modeling Aqueous Solubility Using the General Solubility Equation
One of the most well-known relationships between aqueous solubility and other physical properties is the General Solubility Equation (GSE), published by Ran and Yalkowsky in 2001.  In this paper, the authors established a method for estimating the aqueous solubility of a compound based on its melting point and octanol-water partition coefficient.  

LogS =  0.5 - 0.01(MP - 25) - log Kow

In the GSE, MP is the melting point in degrees Celsius, and log Kow is the octanol-water partition coefficient.  The primary practical limitation of the GSE is the requirement for two experimentally determined values.  While reliable methods for estimating the octanol-water partition coefficient have existed for over a decade, predicting a molecule's melting point has been challenging.  A recent paper by Zhu attempts to overcome this limitation by using 58,100 experimental melting point values from the Cambridge Crystallographic Data Centre (CCDC) to generate an ML model for predicting melting point.  In cross-validation studies, this model predicted test set melting points with an average error of 30 degrees.  The authors subsequently used their melting point model, coupled with the Crippen LogP model in the RDKit, to predict the solubility of molecules from 6 different test sets, and reported an MAE between 0.85 and 1.18 logs.  How should we expect the GSE to perform with a predicted melting point? If we assume our LogP predictions are perfect and treat the 30 degree figure as a standard deviation, the melting point error translates to an average (mean absolute) error of approximately 0.24 logs.  If we combine this melting point error with the 0.677 log standard deviation reported for the Crippen LogP, we would expect an average LogS error of 0.59 logs. 
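The arithmetic behind these numbers is simple error propagation. Here's a sketch, assuming the reported figures can be treated as standard deviations of independent, normally distributed errors, in which case the MAE of a zero-mean normal is sd × √(2/π).

```python
import numpy as np

mp_sd_logs = 0.01 * 30     # 30 degree melting point error -> 0.3 logs (as 1 sd)
logp_sd_logs = 0.677       # reported sd for the Crippen LogP model

mae_mp_only = mp_sd_logs * np.sqrt(2 / np.pi)      # ~0.24 logs
combined_sd = np.hypot(mp_sd_logs, logp_sd_logs)   # add the errors in quadrature
mae_combined = combined_sd * np.sqrt(2 / np.pi)    # ~0.59 logs
print(f"MP-only MAE: {mae_mp_only:.2f} logs, combined MAE: {mae_combined:.2f} logs")
```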

The figure below shows the correlation between the experimental and GSE-predicted LogS for the BSD.  In this case, the GSE correlation is within the 95% confidence interval of the ESOL correlation, and the MAE is slightly larger than that obtained with ESOL.  We could probably improve the MAE for the GSE approach by refitting the coefficients to another experimental dataset. On this dataset, I'd consider the ESOL and GSE approaches to be roughly equivalent.  Given that both give close to 10-fold error, their practical utility is limited.  


One can argue that it would be more appropriate to use LogD rather than LogP when calculating LogS with the GSE.  On the plus side, LogD captures the ionization state of the molecule.  On the minus side, a LogD calculation involves an error-prone estimate of pKa.  In the interest of completeness, I repeated the GSE calculation with LogD and ended up with a minimal difference in correlation and a higher MAE.  Again, we may have been able to marginally reduce the MAE by refitting the coefficients in the GSE equation. 



Building a Machine Learning Model Using Literature Data 
People new to ML in drug discovery frequently ask me whether generating reliable ML models using literature data is possible.  My noncommittal answer is usually, "It depends".   As practical experience and literature have shown, the performance of a model typically depends on the similarity between the molecules being predicted and the molecules used to train the model.  The BSD presents an interesting opportunity to test this premise.   In this section, we'll use the solubility data available from the Therapeutic Data Commons (TDC) to build an ML model, then use that model to predict the solubility of the molecules in the BSD.  There is a wide range of molecular descriptors and ML algorithms available for model building.  In my experience, while molecular descriptors can impact model quality, choosing a specific ML algorithm is less important.  For this particular ML modeling exercise, we'll use the RDKit 2D descriptors, which have been successfully used to model physical properties, including aqueous solubility.  ML models will be trained and tested using LightGBM, a method that is both fast and reliable. 

Examining the Overlap Between the TDC and BSD Solubility Sets
One of the first things we should do is look at molecules with values in both the TDC and BSD solubility datasets.  As shown in the Jupyter notebook accompanying this post, we can use InChI keys to find that nine molecules are common to both sets.  The plot below compares the TDC and BSD LogS values for these compounds.  Since the number of molecules is small, it's hard to say whether there is a linear relationship between the two sets.  In addition, the difference in dynamic range is troubling.  The TDC values span 8 logs, while the corresponding BSD values span slightly more than 1 log.  Should I be worried?  With so few compounds, it's hard to tell. 
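The matching step looks something like this; tdc_df and bsd_df are placeholder names for the two dataframes, each assumed to have a SMILES column.

```python
from rdkit import Chem

def add_inchi_key(df, smiles_col="SMILES"):
    """Add an InChI key column so structures can be matched across datasets."""
    df = df.copy()
    df["inchi_key"] = [Chem.MolToInchiKey(Chem.MolFromSmiles(s)) for s in df[smiles_col]]
    return df

tdc_df = add_inchi_key(tdc_df)   # TDC solubility set (placeholder dataframe)
bsd_df = add_inchi_key(bsd_df)   # Biogen solubility set (placeholder dataframe)
common = set(tdc_df.inchi_key) & set(bsd_df.inchi_key)
print(len(common), "molecules in common")
```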



Evaluating the Applicability of the Training Data
Before building our ML model, we should evaluate the relationship between the BSD molecules whose solubility we want to predict and the TDC molecules in the training set. This will give us an idea of whether the BSD molecules lie within the applicability domain of an ML model trained on the TDC.  One of the most common ways of evaluating whether a molecule is within the applicability domain of the model is to calculate its similarity to the training set molecules.  A 2015 paper by Sheridan proposed the mean of the Tanimoto similarities to the five nearest training set neighbors (SIM_5) as an estimate of model applicability.   The plot below shows the distribution of SIM_5 for the molecules in the BSD when compared with the molecules in the TDC training dataset.   The similarity was calculated using Morgan Fingerprints with a radius of 2 (MFP2 aka ECFP4), as implemented in the RDKit.  In short, this doesn't look good.  As shown in the figure below, the mean SIM_5 between the BSD and TDC sets was 0.11.  A recent blog post from Greg Landrum is useful when interpreting a plot like the one below.  In this post, Greg compared the similarity distributions of molecules from the same paper with similarity distributions calculated from randomly selected molecules.   These distributions were then used to establish a relationship between Tanimoto similarity and the probability that two molecules will share a common activity.  According to this analysis, the minimum Tanimoto similarity considered significant with MFP2 is 0.27.  A quick scan of the figure below shows that only a very small fraction of the BSD molecules have SIM_5 > 0.27 when compared to the TDC set.
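Here's a minimal sketch of the SIM_5 calculation, reusing the placeholder dataframes from above; the fingerprint length is an assumption.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Morgan fingerprints with radius 2 (MFP2, aka ECFP4)
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def fp_list(smiles_list):
    return [fpgen.GetFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]

train_fps = fp_list(tdc_df.SMILES)
test_fps = fp_list(bsd_df.SMILES)

sim_5 = []
for fp in test_fps:
    sims = DataStructs.BulkTanimotoSimilarity(fp, train_fps)
    sim_5.append(np.mean(sorted(sims, reverse=True)[:5]))  # mean of 5 nearest neighbors
print(f"mean SIM_5 = {np.mean(sim_5):.2f}")
```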

Cleaning the Training Set
One thing that bothered me about the analysis above was that the similarities weren't calculated using the same descriptors we're using for the ML model.  We used Morgan fingerprints to calculate the similarities but used the RDKit 2D descriptors for the ML model.  Ideally, we should assess the applicability domain using the same descriptors we use for the ML model.  Before we talk more about the applicability domain of our ML model, we should take a brief digression into the molecular descriptors and the training data we're using to build the ML model.

As mentioned above, we'll use the 2D descriptors from the RDKit to build our model.  These 209 descriptors capture information about a molecule's shape, topology, functional groups, and charge distribution and encode that information into a vector.  You may have noticed that I mentioned charge.  Many of the descriptors in the RDKit set depend on calculations of atomic charge.  The RDKit has methods for calculating atomic charges for most organic molecules, but you'll notice that descriptor calculations fail and return "nan" (not a number) for more exotic molecules.  When building models, it's important to check for these values and to remove the corresponding molecules from the training set.  The TDC solubility set contained 105 molecules for which descriptors couldn't be calculated.  Most of these were anticancer agents containing transition metals.

In the interest of simplicity, I also wanted to remove salts and hydrates from the training set.  I could have standardized the molecules and removed the salts, but I chose not to do this.  As highlighted in this paper, salts can have a huge influence on aqueous solubility, and I didn't think it was appropriate to include them in the training set.  Instead, I filtered salts and hydrates by removing any molecule containing more than one fragment.  A scan of the TDC solubility dataset found 1,098 molecules with more than one fragment.  There was a hydrate in the set with 37 fragments!  After this simple cleanup, the TDC dataset was reduced from 9,982 molecules to 8,884, still a reasonably large training set. 
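Here's a minimal sketch of this cleanup, again using the placeholder dataframe names from above. The descriptor count returned by the RDKit varies a bit between versions, and applying the same pipeline to the BSD is my assumption for keeping the later sketches self-consistent.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def clean_and_describe(df, smiles_col="SMILES"):
    """Drop multi-fragment structures (salts/hydrates) and molecules whose
    RDKit 2D descriptors contain nan/inf values; return the kept rows and descriptors."""
    keep, desc_rows = [], []
    for i, smi in enumerate(df[smiles_col]):
        mol = Chem.MolFromSmiles(smi)
        if mol is None or len(Chem.GetMolFrags(mol)) > 1:
            continue  # skip unparseable or multi-fragment molecules
        desc = [fn(mol) for _, fn in Descriptors.descList]
        if np.any(~np.isfinite(desc)):
            continue  # skip molecules with nan/inf descriptor values
        keep.append(i)
        desc_rows.append(desc)
    return df.iloc[keep].copy(), np.array(desc_rows)

tdc_clean, tdc_desc = clean_and_describe(tdc_df)
bsd_clean, bsd_desc = clean_and_describe(bsd_df)  # same descriptor pipeline for the test set
```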

TSNE Projections Have Limited Utility for Assessing Applicability Domain
After that brief digression, let's get back to looking at the applicability domain of our model.  One of the things I see a lot in the literature is people using t-distributed stochastic neighbor embedding (TSNE) plots to show the relationship between a training set and a test set.  I'd urge those interested in learning more about TSNE plots to consult my earlier post on the topic.  In the plot below, I used principal component analysis (PCA) to reduce the 209 RDKit descriptors to 50 dimensions, then projected the 50 principal components (PCs) into two dimensions using TSNE.  In this plot, each point represents a molecule.  Similar molecules are close together, and less similar molecules are farther apart.  To be honest, I don't find plots like this useful.  I can see that the BSD set occupies a subset of the space occupied by the TDC set, but it's hard to tell if the sets overlap or if one is filling the "white space" in the other.  I also don't have a good feel for how much the TSNE projection distorted the structure of the original high-dimensional data. 
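For reference, here's a rough sketch of the projection using the descriptor matrices from the cleanup step. Scaling the descriptors before PCA is my own choice here, not something stated above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# stack the TDC (training) and BSD (test) descriptor matrices built above
X = np.vstack([tdc_desc, bsd_desc])
is_bsd = np.array([0] * len(tdc_desc) + [1] * len(bsd_desc))

# scale, reduce to 50 principal components, then project to 2D with TSNE
X_scaled = StandardScaler().fit_transform(X)
X_50 = PCA(n_components=50).fit_transform(X_scaled)
X_2d = TSNE(n_components=2, init="pca", random_state=42).fit_transform(X_50)
```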

Adversarial Validation
One technique I like for assessing the applicability of a training set to a particular test set is adversarial validation.  The concept here is simple.  We begin with two sets of descriptors and labels.  In this case, we'll have a set of descriptors calculated from the TDC molecules with the label "TDC" and another set of descriptors calculated from the BSD molecules with the label "BSD".  We then combine the two datasets, shuffle the order, and split them into training and test sets. 

A classification model is then trained to learn the training set labels from the training set descriptors.  The similarity of the training set to the test set is then assessed based on the ability of the classification model to predict the test set labels.  If the training and test sets are similar, we should end up with a poorly performing classification model.  This is one of the few cases where you want to have a bad model.  This may initially seem counterintuitive, but it makes sense.  If the classification model can easily distinguish the training set from the test set, the two descriptor spaces are different, and it will be difficult to interpolate test set properties from the training set.   
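Here's a minimal sketch of adversarial validation using the descriptor matrices from the earlier sketches; LightGBM hyperparameters are left at their defaults.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.vstack([tdc_desc, bsd_desc])
y = np.array([0] * len(tdc_desc) + [1] * len(bsd_desc))  # 0 = TDC, 1 = BSD

auc_list = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    clf = LGBMClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    auc_list.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# An AUROC near 0.5 means the sets are hard to tell apart (good news for the model);
# an AUROC near 1.0 means they occupy different regions of descriptor space.
print(f"mean AUROC = {np.mean(auc_list):.2f}")
```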

Following the procedure above, I combined the TDC and BSD sets, shuffled the data, split it into training and test sets, built a classification model, and used the model to predict the test set labels.  After repeating this process ten times, I ended up with a mean area under the receiver operating characteristic curve (AUROC) of 0.95.  This means the classification model is very good at distinguishing the BSD set from the TDC set.  This is not a good thing.  The classification results show us that using the TDC solubility data to predict the BSD is probably not going to lead to a good result.  

I like adversarial validation for a few reasons.  
  • It operates in the same descriptor space we use to build the regression model.
  • We can use similar model architectures for applicability estimation and regression model building.  For instance, in this case, I used a LightGBM classifier to estimate applicability and a LightGBM regressor for regression model building. 
  • It doesn't require a "y" variable.   We can estimate the applicability with descriptor sets and labels. 
Predicting Aqueous Solubility Based on Literature Data
After that very long preamble, let's build the model and see how well it works.  My ML model-building process was as follows (a minimal sketch of the workflow appears after the list). 
  • Calculate RDKit 2D descriptors for the TDC solubility set.
  • Use LightGBM regression to train an ML model to predict LogS based on the TDC descriptors.
  • Calculate RDKit 2D descriptors for the BSD dataset.
  • Use the LightGBM regression model to predict LogS for the BSD.
  • Compare the experimental and predicted LogS for the BSD.
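Here's that workflow in code, reusing the cleaned TDC descriptors from the earlier sketch and assuming the BSD descriptors were generated the same way; the LogS column names are placeholders.

```python
from lightgbm import LGBMRegressor
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error

# train on the cleaned TDC set and predict the BSD
reg = LGBMRegressor(random_state=42)
reg.fit(tdc_desc, tdc_clean.LogS)

tdc_pred = reg.predict(bsd_desc)
print(f"Pearson r = {pearsonr(bsd_clean.LogS, tdc_pred)[0]:.2f}")
print(f"MAE = {mean_absolute_error(bsd_clean.LogS, tdc_pred):.2f}")
```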
A comparison of experimental and predicted LogS for the BSD  is shown below.  As we can see, the statistics don't look better than what we achieved with ESOL or the GSE.  



The table below compares Pearson's r, MAE, and the associated 95% confidence intervals for the three methods described above.  


To provide a more robust comparison of the results of the three methods, we can use the scikit-posthocs package to compare the distributions of residuals from the models.  A Wilcoxon rank sum test rejects the null hypothesis that the distributions of residuals are the same.  For more information on these comparisons, please see my previous post on the topic. 
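The comparison looks something like this; esol_pred, gse_pred, and tdc_pred are placeholder names for prediction arrays aligned with the BSD LogS values, and scikit-posthocs offers other pairwise tests if a different comparison is preferred.

```python
import pandas as pd
import scikit_posthocs as sp

# long-format table of absolute residuals for each method
res_df = pd.concat([
    pd.DataFrame({"method": "ESOL", "abs_res": (bsd_clean.LogS - esol_pred).abs()}),
    pd.DataFrame({"method": "GSE", "abs_res": (bsd_clean.LogS - gse_pred).abs()}),
    pd.DataFrame({"method": "TDC_ML", "abs_res": (bsd_clean.LogS - tdc_pred).abs()}),
])
print(sp.posthoc_wilcoxon(res_df, val_col="abs_res", group_col="method"))
```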


While there is a statistically significant difference between these models, none of the three provides a practical advantage over the others.  Given the choice, I'd pick ESOL over the GSE or the TDC-trained model.  The results are similar, and ESOL depends on a few simple, intuitive terms.  In practice, my choice of method would depend on the goal I was trying to achieve.  If I only had a few molecules to evaluate, I'd calculate SFI and use the corresponding chart to estimate the probability of the molecules being soluble.  If I had a larger set, I'd calculate ESOL and use that to rank the potential solubility of the molecules. 


Cross-Validating an ML Model With the Biogen Solubility Dataset
When building an ML model, it's always best to predict data taken from the same assay that was used to train the model.  For the examples above, that wasn't the case.  The ML model was fit to datasets taken from the literature.  Datasets like the solubility set in the TDC were assembled from data published by dozens of different labs. As such, there may be inconsistencies.  One advantage to the BSD is that the experimental measurements were all collected in the same lab.  Let's take a look at a model trained by splitting the BSD into training and test sets.  We'll train the model on the training set and then predict on the test set.  To examine the influence of training set composition on model performance, we'll use three different splitting strategies. 
  • Random - randomly assign 80% of the molecules to the training set and 20% to the test set.
  • Scaffold - Some have argued that random splitting is overly optimistic and does not adequately reflect the "real world" performance of an ML model.  One more realistic option is to split the data according to Bemis-Murcko scaffolds so that molecules with the same scaffold are in either the training set or the test set, but not in both (a minimal scaffold-split sketch follows this list).  
  • SIMPD - Perhaps the best way to evaluate performance is to train a model on molecules synthesized before a given date and test on molecules synthesized after that date.  While this method, commonly known as temporal or time-split cross-validation, may be optimal, information on the order in which molecules were synthesized is typically absent from literature datasets.  A recent paper from Greg Landrum and coworkers describes the SIMPD algorithm, which attempts to approximate temporal splits in datasets. 
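Here's a minimal sketch of a scaffold split; the way scaffold groups were assigned to folds in the actual cross-validation may differ.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_frac=0.2):
    """Group molecules by Bemis-Murcko scaffold and fill the test set with whole
    scaffold groups (smallest families first) until it reaches ~test_frac of the data."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smi)].append(i)
    n_test = int(test_frac * len(smiles_list))
    test_idx = []
    for scaffold, idx in sorted(groups.items(), key=lambda kv: len(kv[1])):
        if len(test_idx) >= n_test:
            break
        test_idx.extend(idx)
    test_set = set(test_idx)
    train_idx = [i for i in range(len(smiles_list)) if i not in test_set]
    return train_idx, test_idx

train_idx, test_idx = scaffold_split(list(bsd_clean.SMILES))
```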
The figure below shows the distribution of Pearson's r across 10 cross-validation folds as a function of the splitting method.   The results here are consistent with what we would expect.  The datasets created using a random split are "easier" than those created using scaffold or SIMPD splits.   Interestingly, we can occasionally get lucky with a scaffold split and have a better correlation.  



The box plot below shows the MAE distributions for the same cross-validation folds used above.  Again, the scaffold and SIMPD splits proved more challenging than the random splits.  For the SIMPD splits, the MAE across 10 folds was between 0.33 and 0.50.   This is significantly better than the results we achieved building models with literature data, but we need to take the training data into consideration.  Remember that the model only saw the LogS values for the training set, which lie between -6.7 (200 nM) and -3.2 (630 µM).   Machine learning models tend to only predict values that are within the range specified by the training set.   As we contemplate this result, we must consider whether we could have randomly arrived at a similar MAE. 

Let's say we want to guess the solubility of a molecule from the BSD.  In a prospective situation, we won't know the distribution of solubilities, but we can make a reasonable guess at the range.  If we assign our guesses to random values drawn from a uniform distribution between -7 and -3, we'll have an MAE of around 1.2 logs.  Our model is better than a random guess.  Note that the ESOL, GSE, and TDC models give MAE values that, in the aggregate, are only marginally better than a random guess.  


In the interest of completeness, we'll also include a plot taken from one cross-validation fold using the SIMPD splits.  While the correlation is considerably better than that achieved with the other methods, we have to remember that the training data has a much smaller dynamic range than what we had with the TDC dataset. In the case of the BSD, regressing to the mean will give us a reasonably good result. 


Conclusions
After all this, what have we learned?

Most of the models we used performed poorly on the BSD.  This is in stark contrast to most ML papers in the drug discovery literature that show values for Pearson's r greater than 0.9 on "standard benchmarks".  Why is this?  It's primarily because the BSD is a realistic dataset, and many of the standard benchmarks aren't.  As I mentioned above, the solubility datasets in MoleculeNet and TDC span a ridiculously large dynamic range.  As such, it's very easy to achieve a good correlation.  Simply using LogP to predict the TDC solubility set gets you a Pearson's r of 0.77.   The problems with "accepted" benchmarks aren't limited to solubility datasets.  Many of the widely used benchmarks are highly flawed.  As I pointed out five years ago, most of the active molecules in the MoleculeNet HIV dataset are likely assay artifacts.  I've become increasingly frustrated when I read papers where people tout the performance of their latest LLM or GNN on benchmarks that don't reflect reality.  Realistic, high-quality datasets are crucial to advancing the application of ML in drug discovery. Consistently measured sets like the BSD are a step in the right direction.  I want to commend Cheng Fang and his coworkers, as well as Biogen, for releasing the data.  This is a tremendous resource for the community. I hope that additional datasets like these will appear and people will begin to use these higher-quality datasets for benchmarking.  

For the BSD, the performance of an ML model trained on literature data wasn't better than that of simple baseline models.  The solubility model trained on the TDC had an average error of 10-fold on the BSD.  This is no better than the 8-10 fold errors obtained with ESOL and the GSE.  Given the results of the analyses we did to determine whether the molecules from the BSD are within the applicability domain of the TDC set, this was to be expected.  Incorporating a melting point model into the GSE provided minimal improvement over the simple baselines.  In addition to the melting point, the GSE also depends on a predicted value for LogP. A better model for LogP or LogD could improve the results.  This may be something I'll look at in the future.  Although my friend PK has objections to binning data, I find the SFI to be a useful tool for estimating the probability of a molecule being soluble.  The SFI analysis of the BSD bears this out.  While the method isn't quantitative, it can provide a quick qualitative estimate of whether a particular modification to a molecule will improve its solubility.  

As I mentioned at the beginning of this post, the key to building good ML models is having appropriate data.  If you don't have the right data, the fanciest GNN or LLM won't save you.  The molecules used to train the model should cover the relevant chemical space, and the data should cover the dynamic range you want to predict.  Using different splitting strategies (random, scaffold, SIMPD) can provide some intuition on how the model will perform in practice.  The MAE and Pearson's r values for the scaffold and SIMPD splits on the BSD are equivalent to the prospective predictions reported by Fang and coworkers.  We might have eked out more model performance by employing the GNN approach used in the Fang paper, but I'm not convinced that it would have been worth it. 

This post ended up being a lot longer than I had originally anticipated.  At each step, I came across new things I wanted to try.  Much more could be done with these datasets, but it's time for me to stop.  The code and data are on GitHub, so please experiment if you're so inclined.   

Acknowledgements
I'd like to thank Greg Landrum for helpful discussions and comments on this post.   I'd also like to thank the many authors of the open source tools I used in this analysis. 

Comments

  1. It is always pleasing to see a post from you in which you not only share honest opinions but also provide future directions and things to look out for. Much appreciated.

  2. Thanks Pat, this is a great post -- it's helpful to see your thought process laid out so thoroughly. My team and I were also really excited to see this paper and the dataset released by Cheng Fang and the Biogen team. In testing our own models, we observed something that can be seen in all of the scatter plots here too: the prominent bump upwards on the right side of the plot, where the predicted solubility extends much higher than the measured solubility. We were wondering if this could be due to a sensitivity cutoff of the assay. Although the N is very small, this would also be consistent with what you observed by looking at the 9 molecules that overlap between the TDC and BSD sets: there is a cluster of molecules where the BSD LogS is between -4 and -3.7 while the TDC LogS ranges from -3 to -1. Do you think that something like this might be contributing to the difficulty that some of these models are having?
