Some Thoughts on Splitting Chemical Datasets
Introduction

Dataset splitting is a topic that doesn’t get enough attention in discussions of machine learning (ML) in drug discovery. When developing and evaluating an ML model, the data is typically divided into training and test sets: the model is trained on the training set, and its performance is assessed on the test set. If hyperparameter tuning is required, a validation set is also set aside. Teams often opt for a simple random split, arbitrarily assigning a portion of the dataset (usually 70-80%) to the training set and the rest (20-30%) to the test set.

As many have pointed out, this basic splitting strategy often leads to an overly optimistic evaluation of the model's performance. With random splitting, it's common for the test set to contain molecules that closely resemble those in the training set. To address this issue, many groups have turned to scaffold splits. This splitting strategy, inspired by the work of Bemis and Murcko, reduces each molecule to a scaffold
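To make the contrast concrete, here is a minimal sketch of the two strategies in plain Python. The function names and the toy data are illustrative, not from any particular library; the scaffold split assumes each molecule's scaffold has already been computed (in practice this would be done with something like RDKit's MurckoScaffold, which is not shown here). The key difference is that the scaffold split assigns whole scaffold groups to one side, so no scaffold appears in both train and test.

```python
import random
from collections import defaultdict

def random_split(items, train_frac=0.8, seed=42):
    """Shuffle a list and split it into train/test by fraction."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]

def scaffold_split(smiles_to_scaffold, train_frac=0.8, seed=42):
    """Group molecules by scaffold, then assign whole scaffold groups
    to the training set until the target fraction is reached; the
    remaining groups go to the test set.

    `smiles_to_scaffold` maps each molecule SMILES to its (precomputed)
    scaffold SMILES.
    """
    groups = defaultdict(list)
    for smi, scaffold in smiles_to_scaffold.items():
        groups[scaffold].append(smi)
    scaffolds = list(groups)
    random.Random(seed).shuffle(scaffolds)
    n_train_target = int(len(smiles_to_scaffold) * train_frac)
    train, test = [], []
    for scaffold in scaffolds:
        # A scaffold group is never divided across the two sets.
        dest = train if len(train) < n_train_target else test
        dest.extend(groups[scaffold])
    return train, test
```

With a random split, two molecules sharing the same scaffold can land on opposite sides of the split; with the scaffold split above, they cannot, which makes the test set a harder (and arguably fairer) assessment of generalization.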