Some Thoughts on Evaluating Predictive Models

I'd like to use this post to provide a few suggestions for those writing papers that report the performance of predictive models.  This isn't meant to be a definitive checklist, just a few techniques that can make it easier for the reader to assess the performance of a method or model.

As is often the case, this post was motivated by a number of papers in the recent literature. I'll use one of these papers to demonstrate a few things that I believe should be included, as well as a few that I believe should be avoided. My intent is not to malign the authors or the work; I simply want to illustrate a few points that I consider to be important. As usual, all of the code I used to create this analysis is on GitHub.

For this example, I'll use a recent paper on ChemRxiv that compares the performance of two methods for carrying out free energy calculations.

1. Avoid putting multiple datasets on the same plot, especially if the combined performance is not relevant.  In the paper mentioned above, the authors display the results for 8 different datasets, representing correlations for 8 different targets, in the same plot.  This practice appears to have become commonplace, and I've seen similar plots in numerous other papers.   The plots below are my recreations from the supporting material available with the paper (kudos to the authors for enabling others to reanalyze their results).  The plots compare the experimental ΔG (binding free energy) with the ΔG calculated using two different flavors of free energy calculations.  Lines are drawn at 2 kcal/mol above and below the unity line to highlight calculated values that fall outside the range of predictions typically considered to be "good". 


If we look at the plots above, we can see that the plot on the right appears to have a few more points outside the 2 kcal/mol window, but I find it hard to discern anything else.  The inclusion of multiple datasets, each of which spans a portion of the overall dynamic range, can also give the illusion that the fit is more linear than it actually is.  
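
To make that last point concrete, here is a small sketch using made-up numbers for three hypothetical targets (these are not the data from the paper). Within each target, the calculated values are completely uncorrelated with experiment, yet pooling the targets produces a respectable-looking R², simply because each target occupies a different part of the overall range. The `pearson_r` helper and the variable names are my own, written only for this illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up (experimental, calculated) binding free energies in kcal/mol for
# three hypothetical targets. Every calculated value is within 1 kcal/mol
# of experiment, but there is no correlation within any single target.
data = {
    "target_A": ([-12.0, -11.5, -11.0, -10.5], [-11.0, -12.0, -10.5, -11.5]),
    "target_B": ([-10.0, -9.5, -9.0, -8.5], [-9.0, -10.0, -8.5, -9.5]),
    "target_C": ([-8.0, -7.5, -7.0, -6.5], [-7.0, -8.0, -6.5, -7.5]),
}

# Squared correlation computed separately for each target.
per_target_r2 = {
    name: pearson_r(exp, calc) ** 2 for name, (exp, calc) in data.items()
}

# Squared correlation computed after pooling all targets into one dataset.
all_exp = [v for exp, _ in data.values() for v in exp]
all_calc = [v for _, calc in data.values() for v in calc]
pooled_r2 = pearson_r(all_exp, all_calc) ** 2

for name, r2 in per_target_r2.items():
    print(f"{name}: R2 = {r2:.2f}")
print(f"pooled:   R2 = {pooled_r2:.2f}")
```

With these numbers, each target has an R² of 0.00 while the pooled R² is about 0.80. The example is deliberately extreme, but it shows why a combined plot can look more linear than any of its parts.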

An alternative approach is to trellis the data and plot the results side by side for each target. Note that I've included the coefficient of determination (R²) on each plot.


2. Include the appropriate metrics to assess the predictions. I've seen quite a few papers that report only the root mean squared error (RMSE) or mean absolute error (MAE) and omit any correlation statistics. Some people argue that correlation statistics will be unrealistically pessimistic for datasets with a small dynamic range. I could equally argue that RMSE or MAE will be unrealistically optimistic for such datasets. At the end of the day, people have differing views on which statistics should be included, so why not include them all? My recommendation would be to include the metrics listed below. There is open source code in scikit-learn or SciPy to calculate all of these.

Rather than reporting the Pearson r, I prefer to use its squared value, the coefficient of determination (R²). The R² value expresses the fraction of the variance in the dependent variable that is predictable from the independent variable(s).
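
As a minimal sketch of what reporting error and correlation statistics side by side might look like, the snippet below computes RMSE, MAE, Pearson r and its square for a single dataset. I've written it in plain Python so that it stands alone; the function names and the example values are made up for illustration. In practice the same quantities are available from `sklearn.metrics.mean_squared_error`, `sklearn.metrics.mean_absolute_error` and `scipy.stats.pearsonr`.

```python
from math import sqrt


def rmse(exp, calc):
    """Root mean squared error between experimental and calculated values."""
    return sqrt(sum((c - e) ** 2 for e, c in zip(exp, calc)) / len(exp))


def mae(exp, calc):
    """Mean absolute error between experimental and calculated values."""
    return sum(abs(c - e) for e, c in zip(exp, calc)) / len(exp)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def summarize(exp, calc):
    """Collect error and correlation statistics in one dictionary."""
    r = pearson_r(exp, calc)
    # Note: "r2" here is the squared Pearson r. sklearn.metrics.r2_score
    # computes 1 - SS_res/SS_tot, which is a different quantity.
    return {"rmse": rmse(exp, calc), "mae": mae(exp, calc),
            "pearson_r": r, "r2": r * r}


# Made-up values in kcal/mol, for illustration only.
experimental = [-9.0, -8.0, -7.0, -6.0]
calculated = [-8.5, -8.5, -6.0, -6.5]

stats = summarize(experimental, calculated)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Putting all of the statistics in one table lets readers who prefer error metrics and readers who prefer correlation metrics both find what they need.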

3. Include error bars on your plots. Every month I see new papers that compare the performance of a variety of predictive models. In most cases, the performance metric, which is usually an RMSE or a correlation coefficient, is plotted as a bar graph. These graphs rarely have error bars. It's useful, and pretty easy, to calculate error bars for your correlations or for a measure like RMSE. This can be done by bootstrapping or by analytically calculating a confidence interval for the correlation. I talked about this a bit in a previous post, but the best discussion I know of is in this paper by Anthony Nicholls. In the plot below, I display R² for each of the datasets as a bar graph with error bars drawn at the 95% confidence interval. Note that the 95% confidence intervals for these correlations are very wide. This is because the confidence interval is a function of two factors: the correlation and the number of datapoints used to calculate it. If the correlation is low or the number of datapoints is small, the confidence interval will be broader.
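
Both routes to a confidence interval can be sketched in a few lines. The analytic version below uses the Fisher z-transform, which is one standard approximation (I'm assuming it here rather than quoting it from a specific source), and the bootstrap version uses simple percentile intervals. The function names, the number of resamples and the example data are all illustrative choices.

```python
import random
from math import atanh, sqrt, tanh


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for Pearson r via the Fisher z-transform (n > 3)."""
    z = atanh(r)
    se = 1.0 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson r, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # Skip degenerate resamples where the correlation is undefined.
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        samples.append(pearson_r(xs, ys))
    samples.sort()
    lo = samples[int((alpha / 2) * len(samples))]
    hi = samples[int((1 - alpha / 2) * len(samples)) - 1]
    return lo, hi


# Same correlation, different dataset sizes: the interval narrows with n.
for n in (10, 20, 50):
    lo, hi = fisher_ci(0.7, n)
    print(f"r = 0.70, n = {n:2d}: 95% CI = ({lo:.2f}, {hi:.2f})")

# Made-up values in kcal/mol, for illustration only.
experimental = [-11.2, -10.4, -9.8, -9.1, -8.7, -8.0, -7.4, -6.9]
calculated = [-10.1, -10.9, -9.0, -9.9, -7.8, -8.6, -7.9, -6.2]
boot_lo, boot_hi = bootstrap_ci(experimental, calculated)
print(f"bootstrap 95% CI for r: ({boot_lo:.2f}, {boot_hi:.2f})")
```

These intervals are for r. When both bounds have the same sign, squaring them gives a corresponding interval for R².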


4. Calculate the effect size for the difference in the correlations. As Anthony Nicholls points out in the paper mentioned above, when we compare two methods on the same dataset, the errors are not independent. As such, we cannot determine whether an overlap in the error bars indicates that the methods are not different at the 95% confidence level. One widely used method of comparing models is to calculate the effect size. A useful discussion of the effect size, along with some examples and code, can be found on this page. One of the most commonly used measures of effect size is Cohen's d, which, for dependent samples like the ones we're dealing with, can simply be calculated as the mean of the differences in the metric being calculated, divided by the standard deviation of those differences.
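
Written out, the definition in the previous sentence looks like the following. The symbols are my own choice for illustration: the two m terms are the values of the metric for methods A and B on dataset i, n is the number of datasets, and I'm assuming the sample standard deviation (with n − 1 in the denominator).

```latex
\Delta_i = m_i^{(A)} - m_i^{(B)}, \qquad
\bar{\Delta} = \frac{1}{n}\sum_{i=1}^{n}\Delta_i, \qquad
s_{\Delta} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(\Delta_i - \bar{\Delta}\right)^2}, \qquad
d = \frac{\bar{\Delta}}{s_{\Delta}}
```
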
Note that the formula above is slightly different from the one on the aforementioned web page. That particular example treats independent samples, where a pooled standard deviation is used. That page has a nice description of Cohen's d and its interpretation, which I will quote here.

Cohen’s d measures the difference between the mean from two Gaussian-distributed variables. It is a standard score that summarizes the difference in terms of the number of standard deviations. Because the score is standardized, there is a table for the interpretation of the result, summarized as:

  • Small Effect Size: d=0.20
  • Medium Effect Size: d=0.50
  • Large Effect Size: d=0.80

If we calculate Cohen's d for the example above, we get a value of 1.2, indicating a very large effect and a significant difference between the methods.
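
For readers who want to try this on their own results, here is a minimal sketch of the dependent-samples calculation described above. The function name and the per-dataset R² values are hypothetical and are not the numbers from the paper, so the resulting d is not the 1.2 quoted here.

```python
from statistics import mean, stdev


def cohens_d_paired(metric_a, metric_b):
    """Cohen's d for dependent samples.

    Mean of the per-dataset differences divided by the sample standard
    deviation of those differences.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    return mean(diffs) / stdev(diffs)


# Hypothetical R2 values for two methods on the same four datasets.
r2_method_a = [0.60, 0.70, 0.50, 0.80]
r2_method_b = [0.50, 0.50, 0.40, 0.60]

d = cohens_d_paired(r2_method_a, r2_method_b)
print(f"Cohen's d = {d:.2f}")
```

Because the two lists are paired by dataset, the order of the entries matters: element i of each list must come from the same dataset.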

We can also use the equation below to translate Cohen's d into a probability between 0 and 1. 
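
As a sketch, one common way to make this conversion is the "probability of superiority", Φ(d/√2), where Φ is the standard normal cumulative distribution function. That this is the right form here is an assumption; I chose it because it maps a d of 1.2 to roughly 0.8. The function names below are illustrative.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def probability_of_superiority(d):
    """Translate Cohen's d into a probability, assuming Phi(d / sqrt(2))."""
    return normal_cdf(d / sqrt(2.0))


for d in (0.2, 0.5, 0.8, 1.2):
    print(f"d = {d:.1f} -> probability = {probability_of_superiority(d):.2f}")
```

A d of zero gives a probability of 0.5, meaning neither method is favored, and the probability approaches 1 as d grows.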


If we plug a Cohen's d of 1.2 into the equation above, we end up with a probability of 0.8. Based on the 8 datasets we compared, FEP should outperform TI 8 times out of 10. 

This post provides an overview of a few factors that should be considered by authors and by reviewers of computational papers. There is a lot more to be discussed, and I hope I can address related topics in future posts. Those wanting to know more about this very important topic are urged to read these three papers by Anthony Nicholls, as well as this one by Ajay Jain and Anthony Nicholls.










