I was taken aback by a recent CNBC article entitled “Generative AI will be designing new drugs all on its own in the near future”. I should know better than to pay attention to AI articles in the popular press, but I feel that even scientists working in drug discovery may have a skewed perception of what generative AI can and can’t do. To understand exactly what’s involved, it might be instructive to walk through a typical generative molecular design workflow and point out a few things. First, these programs are far from autonomous. Even when presented with a well-defined problem, generative algorithms produce a tremendous amount of nonsense. Second, domain expertise is essential when sifting through the molecules produced by a generative algorithm. Without a significant medicinal chemistry background, one can’t make sense of the results. Third, while a few nuggets exist in the generative modeling output, a lot of work and good old-fashioned c...
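To make the “tremendous amount of nonsense” point concrete, here is a minimal sketch, entirely my own illustration rather than the workflow from the post, of the mechanical triage that happens long before any medicinal-chemistry judgment is applied: parse each generated SMILES with RDKit, discard anything that fails sanitization, and apply crude property cuts. The SMILES list, the thresholds, and the `triage` helper are hypothetical.

```python
# Hypothetical triage of generative-model output: keep only structures that
# parse cleanly in RDKit and pass simple property filters.
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen

generated_smiles = [
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin, a sane molecule
    "CC(C)(C)(C)C",           # carbon with five bonds, fails valence check
    "not_a_smiles_string",    # outright garbage
]

def triage(smiles_list, max_mw=500.0, max_logp=5.0):
    """Return the subset of SMILES that parse and pass crude property cuts."""
    keepers = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)          # returns None on parse failure
        if mol is None:
            continue
        if Descriptors.MolWt(mol) > max_mw:
            continue
        if Crippen.MolLogP(mol) > max_logp:
            continue
        keepers.append(Chem.MolToSmiles(mol))  # canonical form
    return keepers

print(triage(generated_smiles))
```

A filter like this only removes the obviously broken structures; deciding which of the survivors deserve a chemist’s time is where the domain expertise comes in.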
Most papers describing new methods for machine learning (ML) in drug discovery report some sort of benchmark comparing their algorithm and/or molecular representation with the current state of the art. In the past, I’ve written extensively about statistics and how methods should be compared. In this post, I’d like to focus instead on the datasets we use to benchmark and compare methods. Many papers I’ve read recently use the MoleculeNet dataset, released by the Pande group at Stanford in 2017, as the “standard” benchmark. This is a mistake. Here, I’ll use the MoleculeNet dataset to point out flaws in several widely used benchmarks. Beyond this, I’d like to propose some alternate strategies that could be used to improve benchmarking efforts and help the field to move forward. To begin, let’s examine the MoleculeNet benchmark, which, to date, has been cited more than 1,800 times. The MoleculeNet collection consis...
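As a taste of the kind of check that surfaces such flaws, here is a minimal sketch, assuming a hypothetical `benchmark.csv` with a `SMILES` column rather than the actual MoleculeNet files: canonicalize every structure with RDKit and count the records that fail to parse or appear more than once.

```python
# Hypothetical sanity check on a benchmark file: how many structures fail to
# parse, and how many are duplicates after canonicalization?
from collections import Counter

import pandas as pd
from rdkit import Chem

df = pd.read_csv("benchmark.csv")  # placeholder file with a SMILES column

def canonical(smi):
    """Return the canonical SMILES, or None if RDKit can't parse it."""
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol) if mol is not None else None

df["canonical_smiles"] = df["SMILES"].apply(canonical)

n_bad = df["canonical_smiles"].isna().sum()
counts = Counter(df["canonical_smiles"].dropna())
n_dup = sum(c - 1 for c in counts.values() if c > 1)

print(f"{n_bad} structures failed to parse, {n_dup} duplicate records")
```

Parse failures and duplicates are only the most mechanical problems; the deeper issues lie in how the data were assembled and split.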
In my last post, I discussed benchmark datasets for machine learning (ML) in drug discovery and several flaws in widely used datasets. In this installment, I’d like to focus on how methods are compared. Every year, dozens, if not hundreds, of papers present comparisons of ML methods or molecular representations. These papers typically conclude that one approach is superior to several others for a specific task. Unfortunately, in most cases, the conclusions presented in these papers are not supported by any statistical analysis. I thought providing an example demonstrating some common mistakes and recommending a few best practices would be helpful. In this post, I’ll focus on classification models. In a subsequent post, I’ll present a similar example comparing regression models. The Jupyter notebook accompanying this post provides all the code for the examples and plots below. If you’re interested in the short version, check out the Jupyt...
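For a flavor of what such an analysis can look like, here is a minimal sketch on synthetic data, not the analysis from the accompanying notebook: score two classifiers on the same cross-validation folds and compare the paired fold-level AUCs with a Wilcoxon signed-rank test.

```python
# Sketch of a paired comparison of two classifiers on identical CV folds,
# using synthetic data and a Wilcoxon signed-rank test on the fold AUCs.
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)

# Both models are evaluated on the same folds so the scores can be paired.
auc_rf = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")
auc_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")

stat, p_value = wilcoxon(auc_rf, auc_lr)
print(f"mean AUC RF={auc_rf.mean():.3f}, LR={auc_lr.mean():.3f}, p={p_value:.3g}")
```

Even this is a simplification: repeated cross-validation folds overlap, so the paired test tends to be optimistic, and the size of the difference matters at least as much as the p-value.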