Comparing Classification Models - You’re Probably Doing It Wrong
In my last post, I discussed benchmark datasets for machine learning (ML) in drug discovery and several flaws in widely used datasets. In this installment, I’d like to focus on how methods are compared. Every year, dozens, if not hundreds, of papers present comparisons of ML methods or molecular representations. These papers typically conclude that one approach is superior to several others for a specific task. Unfortunately, in most cases, the conclusions presented in these papers are not supported by any statistical analysis. I thought it would be helpful to provide an example that demonstrates some common mistakes and to recommend a few best practices. In this post, I’ll focus on classification models. In a subsequent post, I’ll present a similar example comparing regression models. The Jupyter notebook accompanying this post provides all the code for the examples and plots below. If you’re interested in the short version, check out the Jupyter notebook. If you want to know what’s