Posts

Even More Thoughts on ML Method Comparisons

Introduction

A few things motivated this post:

- Some recent discussions about the virtues of LightGBM vs XGBoost
- Posts on TabPFN by Jonathan Swain and Chris Swain
- The release of Osmordred by Guillaume Godin

With new methods and descriptor calculation tools emerging in the blogosphere, I wanted to compare a few of them to see if they would be useful in my work. Readers of this blog know that I'm passionate about using appropriate statistical tests when comparing machine learning methods. At the end of last year, a few of us wrote a preprint titled "Practically Significant Method Comparison Protocols for Machine Learning in Small Molecule Drug Discovery," which outlines statistical tests for method comparison. I'm pleased with what we wrote, but I'm still searching for better ways to visualize the model comparisons. In this post, I'll showcase a few tools I've recently come across and how these tools can be used to visualize ML model performance. T...

Some Thoughts on Splitting Chemical Datasets

Introduction

Dataset splitting is one topic that doesn’t get enough attention when discussing machine learning (ML) in drug discovery. The data is typically divided into training and test sets when developing and evaluating an ML model. The model is trained on the training set, and its performance is assessed on the test set. If hyperparameter tuning is required, a validation set is also included. Teams often opt for a simple random split, arbitrarily assigning a portion of the dataset (usually 70-80%) as the training set and the rest (20-30%) as the test set. As many have pointed out, this basic splitting strategy often leads to an overly optimistic evaluation of the model's performance. With random splitting, it's common for the test set to contain molecules that closely resemble those in the training set.

To address this issue, many groups have turned to scaffold splits. This splitting strategy, inspired by the work of Bemis and Murcko, reduces each molecule to a scaffold...
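As a rough illustration of the difference between a random split and a split that keeps related molecules together, here is a minimal sketch using only the Python standard library. The function names `random_split` and `group_split`, the toy `molecules` list, and the scaffold labels are all hypothetical; in practice the scaffold for each molecule would come from a cheminformatics toolkit, which this sketch does not attempt.

```python
import random
from collections import defaultdict


def random_split(items, train_frac=0.8, seed=0):
    """Shuffle the items and cut the list at train_frac."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def group_split(items, group_of, train_frac=0.8, seed=0):
    """Assign whole groups to train or test so that no group spans both."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    keys = sorted(groups)  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(keys)

    target = train_frac * len(items)
    train, test = [], []
    for key in keys:
        # Fill the training set group by group until it reaches the target size.
        if len(train) < target:
            train.extend(groups[key])
        else:
            test.extend(groups[key])
    return train, test


# Hypothetical data: (molecule id, scaffold label) pairs, 5 molecules per scaffold.
molecules = [(f"mol{i}", f"scaffold{i % 10}") for i in range(50)]

train_r, test_r = random_split(molecules)
train_g, test_g = group_split(molecules, group_of=lambda m: m[1])
```

With the random split, the same scaffold label can appear on both sides of the split. With the grouped split, every scaffold lands entirely in one set, which is the property a scaffold split is meant to provide.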

Silly Things Large Language Models Do With Molecules

“Pay no attention to the man behind the curtain” - The Wizard of Oz

Introduction

Recently, a few groups have proposed general-purpose large language models (LLMs) like ChatGPT, Claude, and Gemini as tools for generating molecules. This idea is appealing because it doesn't require specialized software or domain-specific model training. One can provide the LLM with a relatively simple prompt like the one below, and it will respond with a list of SMILES strings.

You are a skilled medicinal chemist. Generate SMILES strings for 100 analogs of the molecule represented by the SMILES CCOC(=O)N1CCC(CC1)N2CCC(CC2)C(=O)N. You can modify both the core and the substituents. Return only the SMILES as a Python list. Don’t put in line breaks. Don't put the prompt into the reply.

However, when analyzing molecules created by general-purpose LLMs, I'm reminded of my undergraduate Chemistry days. My roommates, who majored in liberal arts, would often assemble random pieces from my mole...

Digging Deeper into Thompson Sampling - A Guest Blog Post by Patrick Riley

This week is a special Pat substitution. Pat(rick) Riley is taking over this blog post for a follow-up on our Thompson Sampling paper.

In our recent paper, Thompson Sampling─An Efficient Method for Searching Ultralarge Synthesis on Demand Databases, we showed how you could use the classic Thompson Sampling algorithm to select each reagent in a combinatorial library to conduct an efficient search for a variety of scoring functions. Eagle-eyed readers probably noticed that for ROCS, we presented results searching 0.1% of the library, and for docking we searched 1% because "docking with TS requires more sampling than ... ROCS". But why is docking harder for a Thompson Sampling-based search than ROCS? This post gives an answer to that question.

A Visual Version

Remember that in Thompson Sampling, for each component in a reaction, you track statistics for each possible reagent. Each reagent is associated with a distribution of scores for complete molecules because...
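The bookkeeping described above (per-reagent statistics for each reaction component, sampled to pick the next molecule) can be sketched generically as follows. This is a minimal Gaussian Thompson Sampling loop, not the implementation from the paper. The `ReagentStats` class, the `thompson_search` function, the broad prior used for rarely seen reagents, the toy scoring function, and the "higher is better" convention are all assumptions made for illustration.

```python
import random
import statistics


class ReagentStats:
    """Scores of the complete molecules that used one reagent."""

    def __init__(self):
        self.scores = []

    def sample(self, rng, prior_mean=0.0, prior_std=1.0):
        # With fewer than two observations, draw from a broad prior (assumed).
        if len(self.scores) < 2:
            return rng.gauss(prior_mean, prior_std)
        mean = statistics.fmean(self.scores)
        std_err = statistics.stdev(self.scores) / len(self.scores) ** 0.5
        return rng.gauss(mean, std_err)


def thompson_search(component_sizes, score_fn, n_iter, seed=0):
    """Pick one reagent per component by sampling, score, and update."""
    rng = random.Random(seed)
    stats = [[ReagentStats() for _ in range(n)] for n in component_sizes]
    best_score, best_combo = float("-inf"), None

    for _ in range(n_iter):
        # For each component, draw one value per reagent and keep the top draw.
        combo = tuple(
            max(range(len(comp)), key=lambda i: comp[i].sample(rng))
            for comp in stats
        )
        score = score_fn(combo)
        # Credit the molecule's score to every reagent that went into it.
        for comp, idx in zip(stats, combo):
            comp[idx].scores.append(score)
        if score > best_score:
            best_score, best_combo = score, combo

    return best_combo, best_score


def toy_score(combo):
    # Hypothetical scoring function standing in for ROCS or docking.
    return -abs(combo[0] - 3) - abs(combo[1] - 7)


best_combo, best_score = thompson_search([10, 10], toy_score, n_iter=200)
```

Because each reagent's record mixes scores from molecules built with many different partner reagents, the spread of that record depends on how strongly the score depends on the partners, which is the kind of effect the post goes on to examine.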