Most papers describing new methods for machine learning (ML) in drug discovery report some sort of benchmark comparing their algorithm and/or molecular representation with the current state of the art. In the past, I’ve written extensively about statistics and how methods should be compared. In this post, I’d like to focus instead on the datasets we use to benchmark and compare methods.

Many papers I’ve read recently use the MoleculeNet dataset, released by the Pande group at Stanford in 2017, as the “standard” benchmark. This is a mistake. In this post, I’d like to use the MoleculeNet dataset to point out flaws in several widely used benchmarks. Beyond this, I’d like to propose some alternate strategies that could be used to improve benchmarking efforts and help the field move forward.

To begin, let’s examine the MoleculeNet benchmark, which, to date, has been cited more than 1,800 times. The MoleculeNet collection consis...
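Before leaning on any public benchmark, it is worth running a couple of basic sanity checks on the underlying files. Here is a minimal sketch using pandas and RDKit; the file name benchmark.csv and the smiles column are assumptions, standing in for whichever benchmark file you are examining. It flags unparsable SMILES and duplicate structures, two issues worth checking in any public collection.

# Minimal sanity check for a benchmark file; "benchmark.csv" and the
# "smiles" column name are hypothetical placeholders.
import pandas as pd
from rdkit import Chem

df = pd.read_csv("benchmark.csv")
mols = [Chem.MolFromSmiles(s) for s in df["smiles"]]
n_invalid = sum(m is None for m in mols)

# Canonical SMILES let us spot the same structure recorded more than once
canon = [Chem.MolToSmiles(m) for m in mols if m is not None]
n_dup = len(canon) - len(set(canon))

print(f"{n_invalid} unparsable SMILES, {n_dup} duplicate structures")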
“Pay no attention to the man behind the curtain” - The Wizard of Oz

Introduction

Recently, a few groups have proposed general-purpose large language models (LLMs) like ChatGPT, Claude, and Gemini as tools for generating molecules. This idea is appealing because it doesn't require specialized software or domain-specific model training. One can provide the LLM with a relatively simple prompt like the one below, and it will respond with a list of SMILES strings.

    You are a skilled medicinal chemist. Generate SMILES strings for 100 analogs of the molecule represented by the SMILES CCOC(=O)N1CCC(CC1)N2CCC(CC2)C(=O)N. You can modify both the core and the substituents. Return only the SMILES as a Python list. Don't put in line breaks. Don't put the prompt into the reply.

However, when analyzing molecules created by general-purpose LLMs, I'm reminded of my undergraduate chemistry days. My roommates, who majored in liberal arts, would often assemble random pieces from my mole...
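Before judging the chemistry, the first step is simply checking whether the SMILES the LLM returns can be parsed at all and how many distinct structures it actually produced. The sketch below uses RDKit for this; the example list is a hypothetical stand-in for whatever the model returns, not output from the post.

# Minimal validity/uniqueness check on LLM-generated SMILES; the input
# list is an illustrative placeholder, not actual model output.
from rdkit import Chem

llm_smiles = ["CCOC(=O)N1CCC(CC1)N2CCC(CC2)C(=O)N", "C1CC", "not_a_smiles"]

valid, invalid = [], []
for smi in llm_smiles:
    mol = Chem.MolFromSmiles(smi)
    (valid if mol is not None else invalid).append(smi)

# Canonicalize to count distinct structures, since LLMs often repeat themselves
unique = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in valid}
print(f"{len(valid)} parseable, {len(invalid)} invalid, {len(unique)} unique")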
Introduction

Dataset splitting is one topic that doesn’t get enough attention in discussions of machine learning (ML) in drug discovery. When developing and evaluating an ML model, the data is typically divided into training and test sets. The model is trained on the training set, and its performance is assessed on the test set. If hyperparameter tuning is required, a validation set is also included.

Teams often opt for a simple random split, arbitrarily assigning a portion of the dataset (usually 70-80%) as the training set and the rest (20-30%) as the test set. As many have pointed out, this basic splitting strategy often leads to an overly optimistic evaluation of the model's performance. With random splitting, it's common for the test set to contain molecules that closely resemble those in the training set.

To address this issue, many groups have turned to scaffold splits. This splitting strategy, inspired by the work of Bemis and Murcko, reduces each molecule to a scaffold...
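To make the idea concrete, here is a minimal sketch of a scaffold split built on RDKit's MurckoScaffold module. The three input SMILES, the 80/20 target, and the "largest scaffold groups go to training first" heuristic are illustrative assumptions, not a prescription.

# Minimal Bemis-Murcko scaffold split; inputs and split ratio are placeholders.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles_list = ["CCOC(=O)N1CCC(CC1)N2CCC(CC2)C(=O)N", "c1ccccc1CC(=O)O", "c1ccccc1CCN"]

# Group molecule indices by their Bemis-Murcko scaffold
groups = defaultdict(list)
for idx, smi in enumerate(smiles_list):
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
    groups[scaffold].append(idx)

train_idx, test_idx = [], []
train_target = int(0.8 * len(smiles_list))
# Assign whole scaffold groups, largest first, so no scaffold spans both sets
for scaffold, idxs in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    (train_idx if len(train_idx) < train_target else test_idx).extend(idxs)

print(len(train_idx), "train /", len(test_idx), "test")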