How Good Could (Should) My Models Be?

One factor that is often overlooked when building a predictive model is the impact of experimental error on the model’s performance.  In this post, we will examine a technique for estimating the impact of experimental error on correlation, and for estimating the maximum correlation that can be achieved with a particular dataset.

This post was motivated by an excellent series, called Bucket List Papers, that my friends at MedChemica recently initiated.  In this series, they are highlighting 100 important papers in Medicinal and Computational Chemistry.  All of the papers they’ve featured so far have been classics, and I’d recommend the Bucket List as essential reading for anyone involved in Cheminformatics, Computer-Aided Drug Design, or Medicinal Chemistry.

I’m not quite so ambitious, but I would like to use this post to highlight an important and frequently overlooked paper.  As usual, we’ll also look at some code that implements the method described in the paper.  The paper, entitled “Healthy skepticism: assessing realistic model performance”, was written in 2009 by Scott Brown, Steve Muchmore, and Phil Hajduk, who were then at Abbott Labs.  In the paper, the authors develop a simple simulation method to estimate the impact of experimental error on the correlation that can be achieved with a predictive model.

If you stop and think about it, this makes intuitive sense.  If our data has a large experimental error, it should be unlikely that any model built on that data will achieve a high correlation.  On the other hand, if our data has relatively “tight” error bars, we should be able to generate a model with a higher correlation.  In the paper above, the authors develop an elegant, simple method for estimating the maximum correlation that can be achieved given a specific experimental error.

In this method, we begin with a set of experimental observables.  As an example, we’ll use the DLS100 dataset, a set of aqueous solubilities for 100 drug-like molecules compiled by John Mitchell and James McDonagh at the University of St Andrews.  We then add normally distributed error, with a mean of 0 and a standard deviation equal to the error in the experiment, and measure the correlation between the original experimental data and the data with the added error.  In the figure below we can see the impact of 3-fold, 5-fold, and 10-fold error.  As expected, an increase in the experimental error reduces the correlation.
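To make this concrete, here is a minimal sketch of the noising step.  The file name dls_100.csv and the column name LogS are hypothetical placeholders; adjust them to match your copy of the dataset.  The key detail is that a k-fold error on the linear scale corresponds to a standard deviation of log10(k) on the log scale.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# hypothetical file/column names; adjust to your copy of the DLS100 set
df = pd.read_csv("dls_100.csv")
log_sol = df["LogS"].values

rng = np.random.default_rng(42)
for fold_error in [3, 5, 10]:
    # a k-fold error on the linear scale is a standard deviation
    # of log10(k) in log units
    noise = rng.normal(loc=0.0, scale=np.log10(fold_error), size=len(log_sol))
    noisy = log_sol + noise
    r, _ = pearsonr(log_sol, noisy)
    print(f"{fold_error}-fold error: r = {r:.2f}")
```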



At this point our analysis has been purely qualitative; let’s take a look at how we can make it quantitative.  One way to estimate an upper limit for the correlation, given an experimental error, is to repeat the procedure described above many times.  In this case we’ll run 1000 trials each at 3-fold, 5-fold, and 10-fold error. For a dataset X with N values, the procedure can be outlined as:

For 1000 trials:
  • Generate N normally distributed random variables R with a mean of 0 and a standard deviation of log10(A), where A is the fold error.
  • Add R to X to create a new vector RX.
  • Calculate the correlation between X and RX.

We can depict the results of these simulations with the box plots below. Based on these box plots, we can estimate the maximum correlation that can be achieved given a specific experimental error.  We can also estimate how much our model would improve if we were able to reduce the error in the experiment.
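Here is a minimal sketch of the simulation, reusing the same hypothetical file and column names as above (the notebook on GitHub differs in detail):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# hypothetical file/column names, as in the previous snippet
x = pd.read_csv("dls_100.csv")["LogS"].values

rng = np.random.default_rng(42)
records = []
for fold_error in [3, 5, 10]:
    sigma = np.log10(fold_error)  # k-fold error -> sd of log10(k)
    for _ in range(1000):
        rx = x + rng.normal(0.0, sigma, size=len(x))
        r, _ = pearsonr(x, rx)
        records.append({"fold error": f"{fold_error}-fold", "r": r})

sim_df = pd.DataFrame(records)
sns.boxplot(data=sim_df, x="fold error", y="r")
plt.ylabel("Pearson r (X vs RX)")
plt.show()
```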



At a recent conference, I had a number of great conversations with readers of this blog.  Several people mentioned that they prefer violin plots to box plots.  I'm a bit torn on this.  I like violin plots because they enable me to examine the shape of a distribution.  However, I like the way that a box plot enables me to quickly assess the middle 50% of a distribution.   Yes, the middle 50% is there in the violin plot, but it's somewhat buried.   Of course, it's a simple code change to get Seaborn to generate a violin plot.  So, for the violin plot fans, here it is.
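For reference, the change really is a single call, using the sim_df from the simulation sketch above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# sim_df comes from the simulation sketch above; only the plotting call changes
sns.violinplot(data=sim_df, x="fold error", y="r")
plt.ylabel("Pearson r (X vs RX)")
plt.show()
```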


Perhaps the most interesting aspect of the Brown, Muchmore, and Hajduk paper is its analysis of the correlations reported in 16 recent life science publications. The authors found that, in half of the cases, the reported correlation exceeded the maximum expected correlation given a typical experimental error.  The take-home message is that it’s worth performing this simple analysis to determine whether the correlation you achieve with a model is consistent with what would be expected given the error in your experiments.

As usual, the code to implement this method and to generate all of the plots above is on GitHub.  If you'd like to run the Jupyter Notebook without having to install anything, you can use this link to run the notebook on Google Colab. 


