Machine Learning Models Don’t Extrapolate
Introduction
One thing that newcomers to machine learning (ML), and many experienced practitioners, often don’t realize is that ML doesn’t extrapolate. After training an ML model on compounds with µM potency, people frequently ask why none of the molecules they designed were predicted to have nM potency. If you're new to drug discovery: 1 nM = 0.001 µM, and a lower value indicates a more potent compound. It’s important to remember that a model can only predict values within the range of its training set. If we’ve trained a model on compounds with IC50s between 5 and 100 µM, the model won’t be able to predict an IC50 of 0.1 µM. I’d like to illustrate this with a simple example. As always, all the code accompanying this post is available on GitHub.
A Simple Experiment
Let’s examine one of the simplest modeling tasks we can set up: predicting a molecule’s molecular weight (MW) from its chemical structure. Each model will be trained on molecules with molecular weights ranging from 0 to 400. After training, we will evaluate each model’s performance on two test sets. First, we will assess performance on a separate set of molecules within a similar MW range, which we will call TEST_LT_400. Next, we will conduct a more challenging test using molecules with molecular weights from 500 to 800, referred to as TEST_GT_500. The box plot below compares the molecular weight distributions for the training set and the two test sets.
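The split described above can be sketched as follows. The post uses real ChEMBL molecules; here, synthetic MW values stand in for them, purely to illustrate how the three sets are partitioned by molecular weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for ChEMBL molecular weights (the post uses real molecules;
# these are synthetic values drawn only to illustrate the split).
mw = rng.uniform(0, 800, size=5000)

below_400 = mw[mw < 400]
train = below_400[:750]            # training set: MW 0-400, 750 molecules
test_lt_400 = below_400[750:1000]  # in-range test set, 250 molecules
test_gt_500 = mw[(mw >= 500) & (mw <= 800)][:250]  # out-of-range test set

print(f"train max MW: {train.max():.1f}")
print(f"TEST_GT_500 min MW: {test_gt_500.min():.1f}")
```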
For all models, we used the same training set: 750 molecules randomly selected from the ChEMBL database with molecular weights below 400. We observed reasonably good performance when testing on the 250 molecules in TEST_LT_400. In every case, our models achieved a Pearson r greater than 0.70. This is logical, as the molecular weights of the training and test sets have similar distributions.
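The in-range evaluation can be mimicked with a small synthetic example. Here, 5 atom-count-like features stand in for the real molecular descriptors, and ordinary least squares stands in for LightGBM/ChemProp; the exact Pearson r values in the post come from the real data and models, not this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: "molecules" described by 5 atom-count-like features;
# MW is a weighted sum of the counts plus a little noise.
atom_weights = np.array([12.0, 1.0, 16.0, 14.0, 32.0])  # C, H, O, N, S

def make_set(n_molecules):
    counts = rng.integers(0, 20, size=(n_molecules, 5)).astype(float)
    mw = counts @ atom_weights + rng.normal(0, 10, n_molecules)
    return counts, mw

X_train, y_train = make_set(750)  # training set
X_test, y_test = make_set(250)    # in-range test set (TEST_LT_400 analogue)

# Ordinary least squares in place of the trained models
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_test @ coef

pearson_r = np.corrcoef(y_test, pred)[0, 1]
print(f"Pearson r on the in-range test set: {pearson_r:.2f}")
```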
However, the results are not particularly encouraging when we use our models, which were trained on molecules with molecular weights less than 400, to predict the molecular weights of the 250 molecules in TEST_GT_500. First, let’s examine the predicted molecular weights for TEST_GT_500. As shown in the histograms below, LightGBM and ChemProp predict values solely within the range present in the training set; a few predicted values slightly above 400 can be ascribed to model variability. Linear regression, in contrast, does predict values outside the training range.
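The tree-versus-linear contrast can be reproduced on a one-dimensional toy problem, where the target simply equals the input. A single decision tree stands in here for LightGBM/ChemProp (tree ensembles behave the same way: a leaf can only return a value seen in training), while linear regression extrapolates freely.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy problem: the target equals the input, with training inputs capped
# at 400 (mirroring the MW < 400 training set).
X_train = np.arange(0, 400, dtype=float).reshape(-1, 1)
y_train = X_train.ravel()

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# Ask both models about an out-of-range input (a "MW 700" molecule).
X_new = np.array([[700.0]])
tree_pred = tree.predict(X_new)[0]      # clamped near the training maximum
linear_pred = linear.predict(X_new)[0]  # extrapolates beyond the range

print(f"tree:   {tree_pred:.1f}")
print(f"linear: {linear_pred:.1f}")
```

The tree's prediction never exceeds the largest target it saw during training, which is exactly the behavior the histograms above show for LightGBM and ChemProp.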
Conclusion
This post may seem obvious, but I frequently encounter these misconceptions when discussing ML with people from chemistry or biology backgrounds. It’s crucial to recognize that our models are limited by the range of values present in the training set. For instance, if we create a model to predict yields from reaction conditions, and all the reactions in our training set have yields below 50%, the model won’t predict a 90% yield for any set of conditions. This doesn’t imply that no conditions capable of producing a 90% yield exist; it simply reflects the fact that the model has only seen yields below 50%. Similarly, if we only train a model on compounds with poor pharmacokinetics (PK), it is unlikely that the model will predict molecules with good PK. To use ML models effectively in drug discovery, we must understand their capabilities and limitations. With realistic expectations, ML can be a valuable tool for advancing drug discovery projects.