Machine Learning Models Don’t Extrapolate


Introduction 
One thing that newcomers to machine learning (ML) and many experienced practitioners often don’t realize is that ML models don’t extrapolate. After training an ML model on compounds with µM potency, people frequently ask why none of the molecules they designed were predicted to have nM potency. If you're new to drug discovery, note that 1 nM = 0.001 µM, and a lower potency value is usually better. It’s important to remember that a model can only predict values within the range of its training set. If we’ve trained a model on compounds with IC50s between 5 and 100 µM, the model won’t be able to predict an IC50 of 0.1 µM. I’d like to illustrate this with a simple example. As always, all the code accompanying this post is available on GitHub.

A Simple Experiment
Let’s examine one of the simplest models we can create: one that predicts a molecule’s molecular weight (MW) from its chemical structure. The models will be trained on molecules with molecular weights ranging from 0 to 400. After training, we will evaluate each model’s performance on two test sets. First, we will assess performance on a separate set of molecules within a similar MW range, which we will call TEST_LT_400. Next, we will conduct a more challenging test using molecules with molecular weights from 500 to 800, referred to as TEST_GT_500. The box plot below compares the molecular weight distributions for the training set and the two test sets.
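As a sketch of how such a split could be constructed (the SMILES list here is a placeholder; the post’s actual molecules come from ChEMBL), we can filter molecules by their RDKit-computed molecular weight:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Placeholder molecules; the actual data comes from ChEMBL
smiles_list = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "C" * 40]

train, test_gt_500 = [], []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:  # skip unparsable SMILES
        continue
    mw = Descriptors.MolWt(mol)
    if mw < 400:
        train.append((smi, mw))
    elif 500 <= mw <= 800:
        test_gt_500.append((smi, mw))

print(len(train), len(test_gt_500))
```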



We will employ two distinct methods to build the model, to avoid bias arising from any single model architecture. First, we will use Morgan count fingerprints calculated with the RDKit as descriptors, constructing a model with LightGBM, a gradient-boosting ensemble method that generates multiple decision trees. Second, we will develop a model with ChemProp, which uses a message-passing neural network (MPNN) to learn a molecular representation and a feed-forward neural network (FFNN) for training and inference. As a control, we will also include linear regression with the same fingerprints used by LightGBM. Since linear regression simply fits a set of coefficients for a given set of variables, it should be able to predict values outside the training range.

Testing on Similar Data Distributions
All models were trained on the same training set: 750 molecules randomly selected from the ChEMBL database with molecular weights below 400. We observed reasonably good performance when testing on the 250 molecules in TEST_LT_400; in every case, our models achieved a Pearson r greater than 0.70. This is logical, as the molecular weights of the training and test sets have similar distributions.
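The agreement metric here, the Pearson correlation coefficient, can be computed with SciPy. The values below are made-up illustrations, not the post’s actual results:

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up actual vs. predicted molecular weights, for illustration only
actual = np.array([120.5, 250.3, 310.7, 180.2, 390.1])
predicted = np.array([135.0, 240.8, 295.5, 200.4, 370.9])

r, _ = pearsonr(actual, predicted)
print(f"Pearson r = {r:.2f}")
```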



Testing on Dissimilar Distributions
However, the results are not particularly encouraging when we use our models, which were trained on molecules with molecular weights below 400, to predict the molecular weights of the 250 molecules in TEST_GT_500. First, let’s examine the predicted molecular weights for TEST_GT_500. As shown in the histograms below, LightGBM and ChemProp predict values only within the range present in the training set. The few predicted values slightly above 400 can be ascribed to model variability. Note, however, that linear regression does predict values outside the training set.

As expected from the distributions of the predicted values for TEST_GT_500, both LightGBM and ChemProp demonstrate poor performance. In these cases, there is no correlation between the actual and predicted molecular weights. Conversely, linear regression can effectively extrapolate into the higher molecular weight range covered by TEST_GT_500. Does this imply that we should disregard more advanced ML methods and rely solely on linear regression? Probably not. Unfortunately, linear regression is not equipped to address the non-linear relationships that modern ML methods can capture. That said, it’s always beneficial to begin with simpler methods. You might sometimes be pleasantly surprised by the results.
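The core effect is easy to reproduce without any chemistry at all. In this synthetic sketch, a single feature equals the target plus noise; a random forest (standing in for any tree ensemble, such as the gradient-boosted trees in LightGBM) caps its predictions near the training maximum, while ordinary least squares extrapolates:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Train on [0, 400]: the target is the feature itself plus a little noise
X_train = rng.uniform(0, 400, size=(500, 1))
y_train = X_train.ravel() + rng.normal(0, 5, size=500)

# Test well outside the training range, on [500, 800]
X_test = rng.uniform(500, 800, size=(100, 1))

tree_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
lin_model = LinearRegression().fit(X_train, y_train)

tree_preds = tree_model.predict(X_test)
lin_preds = lin_model.predict(X_test)

# The tree ensemble's predictions plateau near 400; OLS tracks the test range
print(f"Tree max prediction:   {tree_preds.max():.1f}")
print(f"Linear max prediction: {lin_preds.max():.1f}")
```

This mirrors the histograms above: tree-based models can only average values they saw in training, while a linear fit follows its coefficients wherever the inputs lead.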


Conclusion

This post may seem obvious, but I frequently encounter these misconceptions when discussing ML with individuals from chemistry or biology backgrounds. It’s crucial to recognize that our models are limited by the range of values present in the training set. For instance, if we create a model to predict yields based on reaction conditions, and all the reactions in our training set have yields below 50%, the model won’t predict any conditions that can achieve a 90% yield. This doesn’t imply that there aren’t conditions capable of producing a 90% yield; it simply indicates that the model has only seen yields below 50%. Similarly, if we only train a model on compounds with poor pharmacokinetics (PK), it is unlikely that the model will predict molecules with good PK. To effectively utilize ML models in drug discovery, we must understand their capabilities and limitations. By setting realistic expectations, ML can be a valuable tool to enhance drug discovery projects. 


Acknowledgments
Thanks to Brian Goldman and Mark Murcko for helpful feedback. 




