Even More Thoughts on ML Method Comparisons
Introduction
A few things motivated this post.
- Some recent discussions about the virtues of LightGBM vs XGBoost
- Posts on TabPFN by Jonathan Swain and Chris Swain
- The release of Osmordred by Guillaume Godin
With new methods and descriptor calculation tools emerging in the blogosphere, I wanted to compare a few of them to see if they would be useful in my work. Readers of this blog know that I'm passionate about using appropriate statistical tests when comparing machine learning methods. At the end of last year, a few of us wrote a preprint titled "Practically Significant Method Comparison Protocols for Machine Learning in Small Molecule Drug Discovery," which outlines statistical tests for method comparison. I'm pleased with what we wrote, but I'm still searching for better ways to visualize the model comparisons. In this post, I'll showcase a few tools I've recently come across and how these tools can be used to visualize ML model performance.
The following work compares eight combinations of ML models and descriptors using the Polaris biogen/adme-fang-v1 dataset. For this analysis, I conducted 5x5-fold cross-validation as outlined in our preprint. The benchmark was executed using the run_benchmark.py script from my adme_comparison Git repository.
lgbm_prop - LightGBM with RDKit properties
lgbm_osm - LightGBM with Osmordred properties
lgbm_morgan - LightGBM with RDKit Morgan fingerprints
xgb_prop - XGBoost with RDKit properties
xgb_osm - XGBoost with Osmordred properties
xgb_morgan - XGBoost with RDKit Morgan fingerprints
tabpfn - TabPFN with RDKit properties
chemprop - ChemProp with default parameters
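To make the cross-validation setup concrete, here is a minimal sketch of a 5x5-fold loop for one of these combinations. This is not the actual run_benchmark.py code; the LightGBM model, the synthetic placeholder data, and the use of scikit-learn's RepeatedKFold are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import r2_score
from lightgbm import LGBMRegressor

# Placeholder data; in practice X holds descriptors (e.g. RDKit properties)
# and y the measured ADME endpoint from the Polaris dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, 0] + rng.normal(scale=0.5, size=500)

# 5x5-fold cross-validation: 5 folds, repeated 5 times -> 25 R^2 values per method
rkf = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = []
for train_idx, test_idx in rkf.split(X):
    model = LGBMRegressor()  # default parameters, as in the benchmark
    model.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(f"{len(scores)} folds, mean R^2 = {np.mean(scores):.2f}")
```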
Things I Wish People Wouldn't Do
As a refresher, most papers and blogs that compare ML model performance present what I call the "dreaded bold table." This table puts datasets in columns and methods in rows, with each cell showing the mean of a performance metric (e.g., R²) across multiple cross-validation folds. In this context, a higher R² indicates better performance, so the highest value in each column is highlighted in bold.
Many authors choose a bar plot that visually represents the "dreaded bold table."
We can add error bars to the plot above; however, they don’t help us determine whether the differences between methods are statistically significant. In our preprint, we recommend using boxplots to illustrate the variability in the prediction statistics.
While this approach is effective, an additional heatmap is needed to demonstrate statistical significance. I found it confusing to switch back and forth between the boxplot and the heatmap, so I wanted to combine the prediction statistics with an indication of statistical significance in a single plot. Annotations can be incorporated into a boxplot to signify statistical significance. In the plot below, "ns" denotes that the differences are not statistically significant.
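The sketch below shows one way to produce this kind of annotated boxplot. It is a hand-rolled illustration with synthetic per-fold R² values and a manually drawn "ns" bracket, not the exact code behind the figure; libraries such as statannotations can automate the annotation step.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# Synthetic per-fold R^2 values for two methods (method names are illustrative)
df = pd.DataFrame({
    "method": ["lgbm_osm"] * 25 + ["xgb_osm"] * 25,
    "r2": np.concatenate([rng.normal(0.60, 0.05, 25), rng.normal(0.59, 0.05, 25)]),
})

ax = sns.boxplot(data=df, x="method", y="r2")
# Draw a bracket between the two boxes and label it "ns" (not significant)
y_max = df["r2"].max() + 0.02
ax.plot([0, 0, 1, 1], [y_max, y_max + 0.01, y_max + 0.01, y_max], color="black")
ax.text(0.5, y_max + 0.015, "ns", ha="center", va="bottom")
ax.set_ylabel("R²")
plt.show()
```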
This works well when we compare two or three methods. However, as the number of methods increases, the plots become unreadable.
Things I Wish People Would Do
The plots below illustrate a concise method for comparing predictions from ML models. These plots utilize Tukey's Honest Significant Difference (HSD) test and are based on code from statsmodels.
- The ML method with the highest mean R² is displayed in blue. We could easily substitute other statistical metrics.
- Methods equivalent to the "best" model are represented in grey.
- Methods that show a statistically significant difference from the "best" model are indicated in red.
- The bars associated with each method illustrate the confidence intervals adjusted for multiple comparisons.
- Dashed vertical lines surround the confidence intervals for the "best" method.
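Here is a minimal sketch of the statsmodels machinery behind these plots, using synthetic per-fold R² values in place of the real benchmark output. The pairwise_tukeyhsd call and its plot_simultaneous method provide the multiplicity-adjusted intervals and the reference lines; the blue/grey/red coloring described above is a custom layer on top, and the choice of lgbm_osm as the comparison method is only for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)
methods = ["lgbm_prop", "lgbm_osm", "lgbm_morgan", "xgb_prop",
           "xgb_osm", "xgb_morgan", "tabpfn", "chemprop"]

# Synthetic stand-in: 25 per-fold R^2 values per method (5x5-fold CV)
df = pd.DataFrame(
    [{"method": m, "r2": rng.normal(0.55 + 0.02 * i, 0.05)}
     for i, m in enumerate(methods) for _ in range(25)]
)

# Tukey HSD on the per-fold R^2 values, grouped by method
tukey = pairwise_tukeyhsd(endog=df["r2"], groups=df["method"], alpha=0.05)
print(tukey.summary())

# Simultaneous (multiplicity-adjusted) confidence intervals for each method,
# highlighted relative to one comparison method
tukey.plot_simultaneous(comparison_name="lgbm_osm")
plt.show()
```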
Here's an example comparing eight models built with the same train/test splits across 5x5-fold cross-validation, using the Human Plasma Protein Binding data from the Polaris biogen/adme-fang-v1 dataset.
In the figure below, we compare eight methods across six datasets. This visualization offers a simple and effective way to evaluate model performance. However, the Blogger platform limits the figure resolution, so the plots below don't do the method justice. Check out the Jupyter Notebook for this post to see how we can fit twelve easily readable plots into a compact space. While I used R² in these plots, they work just as well with your preferred metric.
In a discussion with the Polaris Small Molecule Steering Group, Nils Weskamp noted that the plots above may not provide enough detail to compare two methods effectively. In this situation, I prefer paired plots like the ones below. These plots display boxplots of method performance, comparing the R² values for LightGBM with Osmordred descriptors against the corresponding values for TabPFN with RDKit descriptors. The lines connecting the points represent R² values for the same cross-validation fold. If a line is green, the method on the right performs better; if it's red, the method on the left does. The titles of the plots below indicate statistical significance based on Student's t-test and are colored accordingly: if the difference is statistically significant (p ≤ 0.05), the title appears in red with an arrow pointing to the method with the higher R²; if the p-value exceeds 0.05, the title is black.
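Because the two methods are scored on exactly the same cross-validation folds, a paired test is a natural way to compute the p-value shown in the plot title. The sketch below uses SciPy's ttest_rel on synthetic per-fold R² values; the paired form of the t-test and the method names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Synthetic per-fold R^2 values for two methods scored on the same 25 CV folds
r2_left = rng.normal(0.60, 0.05, size=25)             # e.g. lgbm_osm
r2_right = r2_left + rng.normal(0.03, 0.02, size=25)  # e.g. tabpfn

t_stat, p_value = ttest_rel(r2_left, r2_right)
winner = "right" if r2_right.mean() > r2_left.mean() else "left"
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
if p_value <= 0.05:
    print(f"Statistically significant; the method on the {winner} has the higher mean R^2")
else:
    print("Not statistically significant (the title would be black)")
```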
A Few Parting Thoughts
I read a lot of papers and blog posts on ML in drug discovery. Nothing, except the lack of source code or the use of poor-quality datasets like MoleculeNet, raises my blood pressure more than papers where the authors only include a bar plot or the dreaded bold table when comparing methods. I may be tilting at windmills, but I'll keep writing about this until the situation changes.
A few caveats regarding this analysis: I used all these methods with their default parameters and did not perform any hyperparameter tuning. Someone more knowledgeable about a specific method than I am could probably improve its performance significantly. The code and data are available, and I would love to see how performance could be improved. My intent here isn't to demonstrate that one method is the best, although TabPFN is certainly impressive. I wanted to share some visualization ideas and continue a conversation that Anthony Nicholls initiated almost a decade ago.
The README file for the adme_comparison repository outlines how to use the benchmarking framework. If you have any corrections or suggestions for improvements, please feel free to reach out or submit a pull request. Lastly, I wish to thank Greg Landrum, Nils Weskamp, and Guillaume Godin for their valuable discussions on this post.
As usual, all the code for this post is available on GitHub.
Comments
Pat - as in our previous discussion in the comments of your blog post "Comparing Classification Models - You’re Probably Doing It Wrong", I think a consideration of aleatoric uncertainty and/or measurement error is important. For example, if you're trying to build a model of receptor functional activity, where the measurements in the training set can have errors of +/- 3-fold, differences in the average cross-validation errors that, considered in isolation, are statistically different can quickly become irrelevant. This is surely less severe for biochemical assays and other modalities, but the principle is the same.
More controversially, considering that model performance has a complex (and chaotic) relationship with both the nature and particular instance of the data, is it worth such intense statistical scrutiny? Isn't this like cutting your lawn with a scalpel? Aren't the relatively big differences the only ones that are of practical concern?
Thanks for the comment, Lee. I've written about the impact of experimental error on model performance in the past. https://practicalcheminformatics.blogspot.com/2019/07/how-good-could-should-my-models-be.html In that post, I demonstrated that good models can still be achieved despite typical experimental error. While I agree that error must be considered when comparing models, I don't believe it negates the need for thorough statistical comparison.
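For a rough sense of the effect, here is a back-of-the-envelope simulation in the spirit of that earlier post; the 3-log-unit dynamic range, the Gaussian error model, and the use of log10(3) ≈ 0.48 as the standard deviation for a 3-fold assay error are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# "True" pIC50-like values spanning three log units
true_vals = rng.uniform(5, 8, size=1000)

# Simulate a 3-fold assay error as Gaussian noise of ~0.48 log units
noise_sd = np.log10(3)
measured = true_vals + rng.normal(scale=noise_sd, size=true_vals.size)

# Correlation between "truth" and the noisy measurements bounds what a model can achieve
r, _ = pearsonr(true_vals, measured)
print(f"R^2 between truth and noisy measurement: {r**2:.2f}")
```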
Maybe "intense scrutiny" was a poor choice of words. I wasn't implying at all that all comparisons should be abandoned; rather, I was suggesting that you didn't go far enough, and that there are at least two reasons (i.e. the capricious nature of data and their effect on models, and experimental error) why one needs to be circumspect when estimating model performance. Cutoffs for the models considered to be in the top performing group may have to be expanded when one determines which models should be considered functionally equivalent (i.e. the grey ones in your plots). Cross-validation or bootstrapping may be a practical (lower) limit for estimating the first effect, shifts in the data (since we don't know what we don't know), but the second one is usually ignored.
A very simple first step could be to use the same number of significant figures for model performance as should be used (realistically!) to report the experimental results themselves.