Even More Thoughts on ML Method Comparisons
Introduction
A few things motivated this post.
- Some recent discussions about the virtues of LightGBM vs XGBoost
- Posts on TabPFN by Jonathan Swain and Chris Swain
- The release of Osmordred by Guillaume Godin
With new methods and descriptor calculation tools emerging in the blogosphere, I wanted to compare a few of them to see if they would be useful in my work. Readers of this blog know that I'm passionate about using appropriate statistical tests when comparing machine learning methods. At the end of last year, a few of us wrote a preprint titled "Practically Significant Method Comparison Protocols for Machine Learning in Small Molecule Drug Discovery," which outlines statistical tests for method comparison. I'm pleased with what we wrote, but I'm still searching for better ways to visualize the model comparisons. In this post, I'll showcase a few tools I've recently come across and how these tools can be used to visualize ML model performance.
The following work compares eight combinations of ML models and descriptors using the Polaris biogen/adme-fang-v1 dataset. For this analysis, I conducted 5x5-fold cross-validation as outlined in our preprint. The benchmark was executed using the run_benchmark.py script from my adme_comparison Git repository.
lgbm_prop - LightGBM with RDKit properties
lgbm_osm - LightGBM with Osmordred properties
lgbm_morgan - LightGBM with RDKit Morgan fingerprints
xgb_prop - XGBoost with RDKit properties
xgb_osm - XGBoost with Osmordred properties
xgb_morgan - XGBoost with RDKit Morgan fingerprints
tabpfn - TabPFN with RDKit properties
chemprop - ChemProp with default parameters
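To make the cross-validation setup concrete, here is a minimal sketch of a 5x5-fold loop for one of these combinations. This is not the actual run_benchmark.py code; the LightGBM model, the synthetic placeholder data, and the use of scikit-learn's RepeatedKFold are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import r2_score
from lightgbm import LGBMRegressor

# Placeholder data; in practice X holds descriptors (e.g. RDKit properties)
# and y the measured ADME endpoint from the Polaris dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, 0] + rng.normal(scale=0.5, size=500)

# 5x5-fold cross-validation: 5 folds, repeated 5 times -> 25 R^2 values per method
rkf = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = []
for train_idx, test_idx in rkf.split(X):
    model = LGBMRegressor()  # default parameters, as in the benchmark
    model.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(f"{len(scores)} folds, mean R^2 = {np.mean(scores):.2f}")
```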
Things I Wish People Wouldn't Do
As a refresher, most papers and blogs that compare ML model performance present what I call the "dreaded bold table." This table puts datasets in columns and methods in rows, with each cell showing the mean of a performance metric (e.g., R²) across multiple cross-validation folds. In this context, a higher R² indicates better performance, so the highest value in each column is highlighted in bold.
Many authors choose a bar plot that visually represents the "dreaded bold table."
We can add error bars to the plot above; however, they don’t help us determine whether the differences between methods are statistically significant. In our preprint, we recommend using boxplots to illustrate the variability in the prediction statistics.
While this approach is effective, an additional heatmap is needed to demonstrate statistical significance. I found it confusing to switch back and forth between the boxplot and the heatmap, so I wanted to combine the prediction statistics with an indication of statistical significance in a single plot. Annotations can be incorporated into a boxplot to signify statistical significance. In the plot below, "ns" denotes that the differences are not statistically significant.
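The sketch below shows one way to produce this kind of annotated boxplot. It is a hand-rolled illustration with synthetic per-fold R² values and a manually drawn "ns" bracket, not the exact code behind the figure; libraries such as statannotations can automate the annotation step.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# Synthetic per-fold R^2 values for two methods (method names are illustrative)
df = pd.DataFrame({
    "method": ["lgbm_osm"] * 25 + ["xgb_osm"] * 25,
    "r2": np.concatenate([rng.normal(0.60, 0.05, 25), rng.normal(0.59, 0.05, 25)]),
})

ax = sns.boxplot(data=df, x="method", y="r2")
# Draw a bracket between the two boxes and label it "ns" (not significant)
y_max = df["r2"].max() + 0.02
ax.plot([0, 0, 1, 1], [y_max, y_max + 0.01, y_max + 0.01, y_max], color="black")
ax.text(0.5, y_max + 0.015, "ns", ha="center", va="bottom")
ax.set_ylabel("R²")
plt.show()
```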
This works well when we compare two or three methods. However, as the number of methods increases, the plots become unreadable.
Things I Wish People Would Do
The plots below illustrate a concise method for comparing predictions from ML models. These plots utilize Tukey's Honest Significant Difference (HSD) test and are based on code from statsmodels.
- The ML method with the highest mean R² is displayed in blue. We could easily substitute other statistical metrics.
- Methods equivalent to the "best" model are represented in grey.
- Methods that show a statistically significant difference from the "best" model are indicated in red.
- The bars associated with each method illustrate the confidence intervals adjusted for multiple comparisons.
- Dashed vertical lines surround the confidence intervals for the "best" method.
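Here is a minimal sketch of the statsmodels machinery behind these plots, using synthetic per-fold R² values in place of the real benchmark output. The pairwise_tukeyhsd call and its plot_simultaneous method provide the multiplicity-adjusted intervals and the reference lines; the blue/grey/red coloring described above is a custom layer on top, and the choice of lgbm_osm as the comparison method is only for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)
methods = ["lgbm_prop", "lgbm_osm", "lgbm_morgan", "xgb_prop",
           "xgb_osm", "xgb_morgan", "tabpfn", "chemprop"]

# Synthetic stand-in: 25 per-fold R^2 values per method (5x5-fold CV)
df = pd.DataFrame(
    [{"method": m, "r2": rng.normal(0.55 + 0.02 * i, 0.05)}
     for i, m in enumerate(methods) for _ in range(25)]
)

# Tukey HSD on the per-fold R^2 values, grouped by method
tukey = pairwise_tukeyhsd(endog=df["r2"], groups=df["method"], alpha=0.05)
print(tukey.summary())

# Simultaneous (multiplicity-adjusted) confidence intervals for each method,
# highlighted relative to one comparison method
tukey.plot_simultaneous(comparison_name="lgbm_osm")
plt.show()
```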
Here's an example comparing eight models built with the same train/test splits across 5x5-fold cross-validation, using the Human Plasma Protein Binding data from the Polaris biogen/adme-fang-v1 dataset.
In the figure below, we compare eight methods across six datasets. This visualization offers a simple and effective way to evaluate model performance. However, the Blogger platform limits the figure resolution, so the plots below don't do the method justice. Check out the Jupyter Notebook for this post to see how we can fit twelve easily readable plots into a compact space. While I used R² in these plots, they work just as well with your preferred metric.
In a discussion with the Polaris Small Molecule Steering Group, Nils Weskamp noted that the plots above may not provide enough detail to compare two methods effectively. In this situation, I prefer paired plots like the ones below. These plots display boxplots of method performance, comparing the R² values for LightGBM with Osmordred descriptors against the corresponding values for TabPFN with RDKit descriptors. The lines connecting the points represent R² values for the same cross-validation fold. If a line is green, the method on the right performs better; if it's red, the method on the left does. The titles of the plots below indicate statistical significance based on Student's t-test and are colored accordingly: if the difference is statistically significant (p ≤ 0.05), the title appears in red with an arrow pointing to the method with the higher R²; if the p-value exceeds 0.05, the title is black.
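Because the two methods are scored on exactly the same cross-validation folds, a paired test is a natural way to compute the p-value shown in the plot title. The sketch below uses SciPy's ttest_rel on synthetic per-fold R² values; the paired form of the t-test and the method names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Synthetic per-fold R^2 values for two methods scored on the same 25 CV folds
r2_left = rng.normal(0.60, 0.05, size=25)             # e.g. lgbm_osm
r2_right = r2_left + rng.normal(0.03, 0.02, size=25)  # e.g. tabpfn

t_stat, p_value = ttest_rel(r2_left, r2_right)
winner = "right" if r2_right.mean() > r2_left.mean() else "left"
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
if p_value <= 0.05:
    print(f"Statistically significant; the method on the {winner} has the higher mean R^2")
else:
    print("Not statistically significant (the title would be black)")
```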
A Few Parting Thoughts
I read a lot of papers and blog posts on ML in drug discovery. Nothing, except the lack of source code or the use of poor-quality datasets like MoleculeNet, raises my blood pressure more than papers where the authors only include a bar plot or the dreaded bold table when comparing methods. I may be tilting at windmills, but I'll keep writing about this until the situation changes.
A few caveats regarding this analysis: I used all these methods with their default parameters and did not perform any hyperparameter tuning. Someone more knowledgeable about a specific method than I am could probably improve its performance significantly. The code and data are available, and I would love to see how performance could be improved. My intent here isn't to demonstrate that one method is the best, although TabPFN is certainly impressive. I wanted to share some visualization ideas and continue a conversation that Anthony Nicholls initiated almost a decade ago.
The README file for the adme_comparison repository outlines how to use the benchmarking framework. If you have any corrections or suggestions for improvements, please feel free to reach out or submit a pull request. Lastly, I wish to thank Greg Landrum, Nils Weskamp, and Guillaume Godin for their valuable discussions on this post.
As usual, all the code for this post is available on GitHub.
Comments
Pat - as in our previous discussion in the comments of your blog post "Comparing Classification Models - You’re Probably Doing It Wrong", I think a consideration of aleatoric uncertainty and/or measurement error is important. For example, if you're trying to build a model of receptor functional activity, where the measurements in the training set can have errors of +/- 3-fold, differences in the average cross-validation errors that, considered in isolation, are statistically different can quickly become irrelevant. This is surely less severe for biochemical assays and other modalities, but the principle is the same.
More controversially, considering that model performance has a complex (and chaotic) relationship with both the nature and particular instance of the data, is it worth such intense statistical scrutiny? Isn't this like cutting your lawn with a scalpel? Aren't the relatively big differences the only ones that are of practical concern?
Thanks for the comment, Lee. I've written about the impact of experimental error on model performance in the past. https://practicalcheminformatics.blogspot.com/2019/07/how-good-could-should-my-models-be.html In that post, I demonstrated that good models can still be achieved despite typical experimental error. While I agree that error must be considered when comparing models, I don't believe it negates the need for thorough statistical comparison.
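For a rough sense of the effect, here is a back-of-the-envelope simulation in the spirit of that earlier post; the 3-log-unit dynamic range, the Gaussian error model, and the use of log10(3) ≈ 0.48 as the standard deviation for a 3-fold assay error are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# "True" pIC50-like values spanning three log units
true_vals = rng.uniform(5, 8, size=1000)

# Simulate a 3-fold assay error as Gaussian noise of ~0.48 log units
noise_sd = np.log10(3)
measured = true_vals + rng.normal(scale=noise_sd, size=true_vals.size)

# Correlation between "truth" and the noisy measurements bounds what a model can achieve
r, _ = pearsonr(true_vals, measured)
print(f"R^2 between truth and noisy measurement: {r**2:.2f}")
```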
Maybe "intense scrutiny" was a poor choice of words. I wasn't implying at all that all comparisons should be abandoned; rather, I was suggesting that you didn't go far enough, and that there are at least two reasons (i.e. the capricious nature of data and their effect on models, and experimental error) why one needs to be circumspect when estimating model performance. Cutoffs for the models considered to be in the top performing group may have to be expanded when one determines which models should be considered functionally equivalent (i.e. the grey ones in your plots). Cross-validation or bootstrapping may be a practical (lower) limit for estimating the first effect, shifts in the data (since we don't know what we don't know), but the second one is usually ignored.
A very simple first step could be to use the same number of significant figures for model performance as should be used (realistically!) to report the experimental results themselves.