Here’s the first part of my review of some interesting machine learning (ML) papers I read in 2023. As with the previous editions, this shouldn’t be considered a comprehensive review. The papers covered here reflect my research interests and biases, and I’ve certainly overlooked areas that others consider vital. This post is pretty long, so I've split it into three parts, with parts II and III to be posted in the next couple of weeks.
I. Docking, protein structure prediction, and benchmarking
II. Large Language Models, active learning, federated learning, generative models, and explainable AI
III. Review articles
2023 was a bit of a mixed bag for AI in drug discovery. Several groups reported that the deep learning methods for protein-ligand docking weren’t quite what they were initially cracked up to be. AlphaFold2 became pervasive, and people started to investigate, with mixed success, the utility of predicted protein structures. There were reports of significant advances in protein-ligand docking, but no code or supporting methodology was provided. Finally, several benchmarking studies cast doubt on earlier claims that deep learning and foundation models outperformed more traditional ML methods. For the impatient, here’s the structure of Part I.
1. Are Deep Learning Methods Useful for Docking?
1.1 Are the Comparisons Fair?
1.2 Training/test Set Bias
1.3 Structure Quality
1.4 Reporting Scientific Advances in Press Releases
2. Can We Use AlphaFold2 Structures for Ligand Discovery and Design?
2.1 Experimentally Evaluating AlphaFold2 Structures
2.2 Generating Multiple Protein Conformations with AlphaFold2
2.3 Docking into AlphaFold2 Structures
3. Can We Build Better Benchmarks?
3.1 Overviews
3.2 Benchmark Comparisons
3.3 Dataset Splitting
3.4 New Datasets
1. Are Deep Learning Methods Useful for Docking?
2022 saw the emergence of deep learning (DL) methods for docking. These methods, trained on data from the PDB, learned to predict the poses of ligands based on interactions in known protein-ligand complexes. There were papers on DiffDock, EquiBind, TANKBind, and more. In 2023, these methods underwent additional scrutiny, and it turned out that they weren’t quite as good as originally reported. Criticism of DL docking methods fell into three categories: the methods used for comparison, biases in the datasets used for evaluation, and the quality of the generated structures.
1.1 Are the Comparisons Fair?
One potential advantage of DL docking programs is their ability to perform “blind docking”. Unlike conventional docking programs, the DL methods don’t require the specification of a binding site. The DL programs use training data to infer the protein binding site and the ligand pose. In earlier comparative studies, conventional docking programs were simply given an entire protein structure without binding site specifications. Since this is not how they were designed to operate, the conventional methods were slow and inaccurate. A preprint by Yu and coworkers at DP Technology decomposed blind docking into two problems: pocket finding and docking into a predefined pocket. The authors found that DL docking programs excelled at pocket finding but didn’t perform as well as conventional methods when pockets were predefined.
Do Deep Learning Models Really Outperform Traditional Approaches in Molecular Docking?
https://arxiv.org/abs/2302.07134
1.2 Training/test Set Bias
Most DL docking programs were trained and tested on time splits from the PDB. For instance, DiffDock was trained on structures deposited in the PDB before 2019 and tested on structures deposited in 2019 and later. Quite a few structures in the test set are similar to those in the training set. In these cases, prediction becomes a simple table lookup. One way to address this bias is to create train/test splits that don’t contain similar structures.
A paper by Kanakala and coworkers from IIT analyzed several datasets commonly used for affinity prediction, including PDBBind and KIBA, and found that typical splitting methods overestimate model performance. The authors propose a clustered cross-validation strategy that provides more realistic estimates of model performance.
Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets
https://pubs.acs.org/doi/10.1021/acsomega.2c06781
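For readers who want to try a clustered cross-validation on their own data, here’s a minimal sketch of the idea (mine, not the authors’ code): cluster the molecules by fingerprint similarity with the Butina algorithm, then hand the cluster ids to scikit-learn’s GroupKFold so that similar molecules never end up on both sides of a split. The molecules and clustering threshold below are placeholders.

```python
# Minimal sketch of clustered cross-validation: cluster molecules by
# fingerprint similarity, then keep whole clusters together in each fold.
# Illustration of the idea only, not the code from the paper.
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.ML.Cluster import Butina
from sklearn.model_selection import GroupKFold

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N", "CC(=O)O", "CCC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [fpgen.GetFingerprint(m) for m in mols]

# Butina clustering on a condensed distance matrix (1 - Tanimoto)
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)
clusters = Butina.ClusterData(dists, len(fps), 0.6, isDistData=True)

# Map each molecule to its cluster id and use that as the CV group
groups = [0] * len(fps)
for cluster_id, members in enumerate(clusters):
    for idx in members:
        groups[idx] = cluster_id

for train_idx, test_idx in GroupKFold(n_splits=3).split(smiles, groups=groups):
    print("train:", train_idx, "test:", test_idx)
```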
A preprint by Li and coworkers from UC Berkeley described a similar effort. The authors cleaned the PDBBind dataset and divided it into segments that minimized leakage between the training and test sets. This new dataset was then used to retrain and evaluate several widely used scoring functions.
Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction
https://arxiv.org/abs/2308.09639
1.3 Structure Quality
The third problem with many DL docking programs is the quality of the generated structures. To put it technically, the structures were really messed up. Bond lengths and angles were off, and there were often steric clashes with the protein. To address these challenges, Buttenschoen and colleagues from Oxford University developed PoseBusters, a Python package for evaluating the quality of docked poses. PoseBusters performs a series of geometry checks on docked poses and also evaluates intra- and intermolecular interactions. The authors used the Astex Diverse Set and a newly developed PoseBusters benchmark set to evaluate five popular deep learning docking programs and two conventional docking approaches. The conventional docking programs dramatically outperformed the deep learning methods on both datasets. In most cases, more than half of the solutions generated by the DL docking programs failed the PoseBusters validity tests. In contrast, with the conventional docking programs, only 2-3% of the docked poses failed to validate.
PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences
https://pubs.rsc.org/en/content/articlepdf/2024/sc/d3sc04185a
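To make the kind of checks PoseBusters runs a little more concrete, here’s a stripped-down sketch (my illustration, not the package itself) that flags ligand-protein heavy-atom pairs that come closer than the sum of their van der Waals radii minus a tolerance. The file names are hypothetical, and the real package performs a much broader battery of chemistry and geometry tests.

```python
# Illustrative clash check in the spirit of PoseBusters (not the actual
# package): flag ligand/protein heavy-atom pairs that come closer than
# the sum of their van der Waals radii minus a tolerance.
import numpy as np
from rdkit import Chem

def count_clashes(ligand, protein, tolerance=0.5):
    """Count heavy-atom pairs closer than vdW_sum - tolerance (Angstroms)."""
    ptable = Chem.GetPeriodicTable()
    lig_pos = ligand.GetConformer().GetPositions()
    prot_pos = protein.GetConformer().GetPositions()
    lig_rad = np.array([ptable.GetRvdw(a.GetAtomicNum()) for a in ligand.GetAtoms()])
    prot_rad = np.array([ptable.GetRvdw(a.GetAtomicNum()) for a in protein.GetAtoms()])
    # pairwise distances between all ligand and protein atoms
    dists = np.linalg.norm(lig_pos[:, None, :] - prot_pos[None, :, :], axis=-1)
    cutoffs = lig_rad[:, None] + prot_rad[None, :] - tolerance
    return int((dists < cutoffs).sum())

# Hypothetical file names for illustration (hydrogens removed by default)
ligand = Chem.MolFromMolFile("docked_ligand.sdf")
protein = Chem.MolFromPDBFile("receptor.pdb")
print("clashing atom pairs:", count_clashes(ligand, protein))
```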
Many of the same problems encountered with DL methods for docking can also impact generative models that produce structures in the context of a protein binding site. A paper by Harris and coworkers from the University of Cambridge describes PoseCheck, a tool similar to PoseBusters, for identifying unrealistic structures. PoseCheck evaluates steric clashes, ligand strain energy, and intermolecular interactions to identify problematic structures. In addition, structures are redocked with AutoDock Vina to confirm the validity of the proposed binding mode. In evaluating several recently published generative models, the authors identify failure modes that will hopefully influence future work on structure-based generative design.
Benchmarking Generated Poses: How Rational is Structure-based Drug Design with Generative Models?
https://arxiv.org/abs/2308.07413
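Ligand strain is one of the easier checks to approximate yourself. The sketch below (again mine, not PoseCheck’s implementation) estimates strain as the MMFF94 energy of the docked pose minus the energy of a locally minimized copy; the input file name is hypothetical, and a single local minimization is only a crude baseline, so treat the number qualitatively.

```python
# Rough ligand strain estimate in the spirit of PoseCheck (not the actual
# package): MMFF energy of the docked pose minus the energy of a relaxed copy.
from rdkit import Chem
from rdkit.Chem import AllChem

def strain_energy_kcal(mol):
    """Return E(pose) - E(minimized copy) using MMFF94 (kcal/mol)."""
    props = AllChem.MMFFGetMoleculeProperties(mol)
    pose_energy = AllChem.MMFFGetMoleculeForceField(mol, props).CalcEnergy()

    relaxed = Chem.Mol(mol)  # copy so the original pose is untouched
    AllChem.MMFFOptimizeMolecule(relaxed, maxIters=2000)
    props_r = AllChem.MMFFGetMoleculeProperties(relaxed)
    relaxed_energy = AllChem.MMFFGetMoleculeForceField(relaxed, props_r).CalcEnergy()
    return pose_energy - relaxed_energy

# Hypothetical input file for illustration
mol = Chem.MolFromMolFile("docked_ligand.sdf", removeHs=False)
mol = Chem.AddHs(mol, addCoords=True)  # MMFF needs explicit hydrogens
print(f"estimated strain: {strain_energy_kcal(mol):.1f} kcal/mol")
```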
1.4 Reporting Scientific Advances in Press Releases
The other (potentially) significant docking developments in 2023 weren’t reported in preprints or papers; they were published in what can best be described as press releases. In early October, the Baker group at the University of Washington published a short preprint previewing RoseTTAFold All-Atom, the latest incarnation of their RoseTTAFold software for protein structure prediction. In a brief section entitled “Predicting Protein-Small Molecule Complexes”, the authors mention their efforts to generate structures of bound non-covalent and covalent small molecule ligands. On benchmark structures from the CAMEO blind docking competition, RoseTTAFold All-Atom generated high-quality structures (<2Å RMSD) in 32% of cases. This compared favorably to an 8% success rate for the conventional docking program AutoDock Vina.
Generalized Biomolecular Modeling and Design with RoseTTAFold All-Atom
https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1.full.pdf
In late October, the DeepMind group published a blog post entitled “A glimpse of the next generation of AlphaFold,” where, among other things, they made the following statement:
“Our latest model sets a new bar for protein-ligand structure prediction by outperforming the best reported docking methods, without requiring a reference protein structure or the location of the ligand pocket — allowing predictions for completely novel proteins that have not been structurally characterized before.”
The accompanying whitepaper provided impressive performance statistics for the PoseBusters set described above. The AlphaFold method achieved a 73.6% success rate compared to 52.3% for the conventional docking program AutoDock Vina. The AlphaFold performance was even more impressive when considering how the comparison was performed. While Vina was provided protein coordinates and a binding site as input, AlphaFold was only given the protein sequence and a SMILES string for the ligand.
A glimpse of the next generation of AlphaFold
https://deepmind.google/discover/blog/a-glimpse-of-the-next-generation-of-alphafold/
Performance and structural coverage of the latest, in-development AlphaFold model
https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/a-glimpse-of-the-next-generation-of-alphafold/alphafold_latest_oct2023.pdf
Unfortunately, neither the RoseTTAFold All-Atom preprint nor the DeepMind whitepaper contained any details on the methodology. In addition, at the time I’m writing this, neither group has released the code for their methods. Hopefully, papers with details on the methods will appear soon, along with public code releases. It’s safe to assume that others, such as the OpenFold consortium, who tend to be more forthcoming with their code and methods, are working on similar ideas.
Perspective: Like many other areas in AI, DL docking programs began with an initial period of exuberance. The community was excited, and everyone thought the next revolution was imminent. As people started using these methods, they discovered multiple issues that needed to be resolved. We’re not necessarily in the valley of despair, but this is definitely a “hey, wait a second” moment. I’m confident that, with time, these methods will improve. I wouldn’t be surprised to see DL docking methods incorporating ideas from more traditional, physics-inspired approaches. Hopefully, newly developed, unbiased training and test sets and tools like PoseBusters will enable a more rigorous evaluation of docking and scoring methods. With the co-folding approaches in RoseTTAFold All-Atom and AlphaFold, we’ll have to wait and hope for the code to be released so that the community can evaluate the practical utility of these methods.
2. Can We Use AlphaFold2 Structures for Ligand Discovery and Design?
2.1 Experimentally Evaluating AlphaFold2 Structures
Since it took the CASP14 competition by storm in 2020, AlphaFold2 has greatly interested people involved in drug discovery and numerous other fields. In addition to benchmark comparisons with the PDB, there have been several other efforts to experimentally evaluate the structural models generated by AF2. Rather than simply comparing the atomic coordinates of AF2 structures with corresponding PDB structures, a paper by Terwilliger and colleagues from Los Alamos National Laboratory compares AF2 structures with reported crystallographic electron density maps. The authors argue that this approach puts less weight on loops and sidechains that are poorly resolved experimentally. They found that prediction accuracy varied across individual structures, and that regions with a prediction confidence score (pLDDT) > 90 varied by less than 0.6 Å from the deposited model. They suggest that even inaccurate regions of AF2 structures can provide plausible hypotheses for experimental refinement.
AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination
A paper by McCafferty and coworkers from UT Austin used mass spec data from protein cross-linking experiments to evaluate the ability of AF2 to model intracellular protein conformations. The authors compared experimentally observed distances in cross-linked proteins from eukaryotic cilia with corresponding distances from AF2 structures and found an 86% concordance. In 42% of cases, all distances within the predicted structure were consistent with those observed in cross-linking experiments.
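The underlying comparison is straightforward: measure the Cα-Cα distance for each cross-linked residue pair in the predicted model and ask whether it falls within the reach of the cross-linker. Here’s a small Biopython sketch of the idea (mine, with hypothetical residue pairs and a rough 30 Å cutoff typical of DSS-style linkers, not the values from the paper).

```python
# Illustration of comparing a predicted structure to cross-linking data:
# measure CA-CA distances for cross-linked residue pairs and flag any pair
# that exceeds the maximum reach of the cross-linker. The residue pairs and
# the 30 A cutoff here are placeholders, not data from the paper.
from Bio.PDB import PDBParser

MAX_CROSSLINK_DIST = 30.0  # rough upper bound for a DSS-type linker, in Angstroms
crosslinks = [("A", 42, "A", 118), ("A", 75, "B", 15)]  # hypothetical pairs

structure = PDBParser(QUIET=True).get_structure("model", "af2_model.pdb")
model = structure[0]

for chain1, res1, chain2, res2 in crosslinks:
    ca1 = model[chain1][res1]["CA"]
    ca2 = model[chain2][res2]["CA"]
    dist = ca1 - ca2  # Bio.PDB atoms subtract to a distance in Angstroms
    status = "consistent" if dist <= MAX_CROSSLINK_DIST else "violated"
    print(f"{chain1}{res1} - {chain2}{res2}: {dist:.1f} A ({status})")
```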
2.2 Generating Multiple Protein Conformations with AlphaFold2
In 2023, there was great interest in AF2's ability to generate multiple relevant protein conformations. A paper by Wayment-Steele and coworkers showed that clustering the multiple sequence alignment (MSA) used by AF2 enabled the program to generate multiple relevant protein conformations.
Predicting multiple conformations via sequence clustering and AlphaFold2
These ideas have spurred additional investigations and stirred up a bit of controversy. A paper by Chakravarty and coworkers from NCBI and NIH examined the performance of AF2 on 93 fold-switching proteins. The authors found that AF2 only identified the switched conformation in 25% of the proteins in the AF2 training set and 14% of proteins not in the training set.
AlphaFold2 has more to learn about protein energy landscapes
Wayment-Steele and coworkers proposed that their clustering of the MSAs captured the coevolution of related proteins. A subsequent preprint from Porter and coworkers at NCBI challenged this assumption and demonstrated that multiple protein conformations could be generated from single sequences.
ColabFold predicts alternative protein structures from single sequences, coevolution unnecessary for AF-cluster
2.3 Docking into AlphaFold2 Structures
After the publication of the AF2 paper and the subsequent release of the code, many groups began experiments to determine whether structures generated by AF2 and related methods could be used for ligand design. The initial results weren’t promising. Díaz-Rovira and coworkers from the Barcelona Supercomputing Center compared virtual screens using protein crystal structures and structures predicted by AF2 for 11 proteins. The authors found that the average enrichment factor at 1% for the x-ray structures was double that of the AF2 structures.
Are Deep Learning Structural Models Sufficiently Accurate for Virtual Screening? Application of Docking Algorithms to AlphaFold2 Predicted Structures
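Since the enrichment factor at 1% comes up repeatedly in these studies, here’s a quick reminder of how it’s computed: take the top 1% of the ranked list and divide the hit rate in that slice by the hit rate of the whole library. A minimal sketch with made-up scores is below.

```python
# Minimal enrichment factor calculation: EF@x% is the hit rate in the top x%
# of the ranked list divided by the hit rate of the whole library.
# The scores and labels here are made up for illustration.
import numpy as np

def enrichment_factor(scores, is_active, fraction=0.01):
    """EF at the given fraction; higher scores are assumed to rank better."""
    scores = np.asarray(scores)
    is_active = np.asarray(is_active, dtype=bool)
    n_top = max(1, int(round(fraction * len(scores))))
    top_idx = np.argsort(scores)[::-1][:n_top]
    return is_active[top_idx].mean() / is_active.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
labels = rng.random(1000) < 0.02  # ~2% actives in this toy library
print(f"EF1% = {enrichment_factor(scores, labels):.2f}")
```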
Holcomb and coworkers from Scripps took a different approach and compared the performance of AutoDockGPU on AF2 structures with its performance on the corresponding crystal structures from the PDBBind set. The authors noted a significant loss in docking accuracy with the AF2 structures. AutoDockGPU generated poses within 2 Å of the experimental pose in 41% of the cases when docking into the crystal structures. This success rate dropped to 17% for the AF2 structures. On a brighter note, the authors reported that the docking success rate for AF2 structures was better than with corresponding apo structures.
Evaluation of AlphaFold2 structures as docking targets
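The 2 Å success criterion used in these docking comparisons is just a symmetry-aware RMSD between the docked and crystallographic ligand poses, computed without realignment since both share the protein frame. Here’s a minimal RDKit sketch (the file names are hypothetical).

```python
# Docking "success" as usually defined: the symmetry-aware RMSD between the
# docked pose and the crystallographic pose is below 2 Angstroms.
# File names are placeholders for illustration.
from rdkit import Chem
from rdkit.Chem import rdMolAlign

docked = Chem.MolFromMolFile("docked_pose.sdf")
reference = Chem.MolFromMolFile("crystal_ligand.sdf")

# CalcRMS accounts for molecular symmetry but does not realign the pose,
# which is what we want when both structures share the protein frame.
rmsd = rdMolAlign.CalcRMS(docked, reference)
print(f"RMSD = {rmsd:.2f} A -> {'success' if rmsd < 2.0 else 'failure'}")
```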
A paper by Karelina and coworkers from Stanford examined the utility of AF2 for modeling the structures of GPCRs. While the authors found that AF2 could model structures and binding pockets with high fidelity, the docking performance of the models was poor. The results of this study were consistent with those in the papers described above. In this case, the success rate for docking into AF2 structures (16%) was less than half of that for experimentally determined structures (48%). As mentioned above, it was encouraging that the docking performance of the AF2 structures was better than that of structures with other ligands bound.
How accurately can one predict drug binding modes using AlphaFold models?
While the results in the papers above aren’t encouraging, all hope may not be lost. In the last week of 2023, there was a paper from Brian Shoichet, Bryan Roth, and coworkers that reported successful prospective virtual screening results with AF2 structures of the sigma2 and 5-HT2A receptors. The odd bit here is that while the AF2 model performed well prospectively, its retrospective performance on prior screens of the same targets wasn’t good. To demonstrate that they got the right answer for the right reason, the authors solved a cryo-EM structure of one of the new 5-HT2A agonists bound to the receptor and found that the docked pose was consistent with the experimental structure. The authors suggest that AF2 structures may sample the underlying manifold of conformations and posit that retrospective screening studies such as those described above may not predict prospective performance.
AlphaFold2 structures template ligand discovery
Many earlier papers describing the use of AF2 structures for docking suggested that performance could be improved by refining the predicted structures. Zhang and coworkers at Schrödinger compared virtual screening performance using holo structures, apo structures, and AF2 structural models. The authors compared virtual screening performance across 27 targets from the DUD-E set and found that the enrichment factor at 1% (EF1%) on AF2 structures (13%) was similar to that for apo structures (11%). However, EF1% increased to 18% when the AF2 structures were refined using induced fit docking.
Benchmarking Refined and Unrefined AlphaFold2 Structures for Hit Discovery
Perspective: The publication of the AF2 paper and subsequent release of the code has sparked work in numerous areas. There are already more than 17,000 citations of the original AF2 paper. Protein structure prediction has become an integral component of experimental structural biology. Programs like Phenix can generate AF2 structures that can subsequently be fit to experimental data. While there is still work to be done, AF2 may be capable of generating ensembles of relevant protein conformations. It’s exciting to think about how this work will progress as we achieve tighter integration between protein structure prediction and physics-based modeling. The jury is still out on the utility of predicted protein structures for drug design. While the results of retrospective evaluations are somewhat disappointing, the recent prospective success from the Shoichet and Roth labs is encouraging.
3. Can We Build Better Benchmarks?
I spent a lot of time this year ranting about benchmarks. I highlighted severe flaws in commonly used datasets like MoleculeNet and the Therapeutic Data Commons (TDC). In a second rant, I bemoaned the lack of statistical analysis in most papers comparing ML methods and molecular representations. Fortunately, there were some rays of sunlight within the dark clouds. Here are a few benchmarking papers pushing the field in the right direction.
3.1 Overviews
Two recent papers provide insight into some of the challenges associated with benchmarking. A preprint by Green and coworkers from DeepMirror provides an excellent overview of the field and some factors that complicate current benchmarking efforts. The authors compared several molecular representations and ML algorithms in evaluating model accuracy and uncertainty. These evaluations highlighted the strengths of different QSAR modeling and ADME prediction methods. Consistent with other papers published in 2023, 2D descriptors performed best for ADME prediction, while Gaussian Process Regression with fingerprints was the method of choice when predicting biological activity.
Current Methods for Drug Property Prediction in the Real World
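The representations compared in studies like this one are easy to generate side by side. The sketch below (an illustration, not the authors’ pipeline) builds a block of RDKit 2D descriptors and Morgan fingerprints for the same molecules, ready to be fed to whatever learner is being benchmarked.

```python
# Building two of the representations typically compared in these benchmarks:
# a block of RDKit 2D descriptors and Morgan (ECFP4-like) bit fingerprints.
# Illustration only; not the pipeline from the preprint.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, rdFingerprintGenerator
from rdkit.ML.Descriptors import MoleculeDescriptors

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# RDKit 2D descriptors
desc_names = [name for name, _ in Descriptors._descList]
calc = MoleculeDescriptors.MolecularDescriptorCalculator(desc_names)
X_desc = np.array([calc.CalcDescriptors(m) for m in mols])

# Morgan fingerprints
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
X_fp = np.array([fpgen.GetFingerprintAsNumPy(m) for m in mols])

print("descriptor matrix:", X_desc.shape, "fingerprint matrix:", X_fp.shape)
```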
A paper by Janela and Bajorath outlines several limitations in current benchmarking strategies. The authors used sound statistical methodologies to examine the impact of compound potency value distributions on performance metrics associated with regression models. They found that across several different ML algorithms, there was a consistent relationship between model performance and the activity range of the dataset. These findings enabled the authors to define bounds for prediction accuracy. The method used in this paper should be informative to those designing future benchmarks.
Rationalizing general limitations in assessing and comparing methods for compound potency prediction
3.2 Benchmark Comparisons
Three benchmarking papers stood out for me in 2023. These papers check a couple of critical boxes.
- For the most part, the authors used high-quality datasets. In a couple of cases, the papers included some of the MoleculeNet and TDC datasets. However, when these datasets were used, the authors did additional curation to clean up some dataset errors. It was nice to see the paper by Deng and coworkers (see below) point out the folly of trying to predict endpoints in the MoleculeNet ClinTox and SIDER datasets based on chemical structures.
- The authors used statistical tests (cue hallelujah chorus) to determine where method performance was different and where it wasn’t.
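To make the second point concrete, a statistical comparison can be as simple as a paired test on per-fold scores. Here’s a minimal sketch (with made-up numbers) using a Wilcoxon signed-rank test over cross-validation folds evaluated on identical splits.

```python
# Minimal example of comparing two models statistically: a paired
# Wilcoxon signed-rank test on per-fold scores from the same CV splits.
# The fold scores here are made up for illustration.
import numpy as np
from scipy.stats import wilcoxon

# R^2 per fold for two methods evaluated on identical folds (toy numbers)
model_a = np.array([0.61, 0.58, 0.65, 0.60, 0.63, 0.59, 0.62, 0.64, 0.57, 0.66])
model_b = np.array([0.55, 0.57, 0.60, 0.54, 0.58, 0.56, 0.59, 0.61, 0.53, 0.60])

stat, p_value = wilcoxon(model_a, model_b)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")
# A small p-value suggests the per-fold difference is unlikely to be noise;
# a large one means we should not claim that one method "wins".
```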
After numerous papers claiming learned representations and foundation models were the current state of the art, it was refreshing to see careful studies showing this is not the case. In all three papers, the best-performing methods used good old fingerprints and 2D descriptors coupled with gradient-boosting or support vector machines (SVM).
A paper from Fang and coworkers at Biogen introduced several new ADME datasets. Unlike most literature benchmarks, which contain data collected from dozens of papers, these experiments were consistently performed by the same people in the same lab. The authors provided prospective comparisons of several widely used ML methods, including random forest, SVM, XGBoost, LightGBM, and message-passing neural networks (MPNNs) on several relevant endpoints, including aqueous solubility, metabolic stability, membrane permeability, and plasma protein binding.
Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective
One of my favorite papers of 2023 provided a tour de force in method comparison. Deng and coworkers from Stony Brook University compared many popular ML algorithms and representations, curated new datasets, and performed statistical analysis on the results. My only complaint about this paper is that there may have been too much information presented. Each figure contains dozens of subfigures, and it’s easy to get overwhelmed and miss the overall message. The authors used some of the MoleculeNet datasets that I’m not fond of, but they also point out some of the limitations of these datasets. Ultimately, this paper provides one of the best comparisons of ML methods published to date. The authors compare fixed representations, such as molecular fingerprints, with representations learned from SMILES strings and molecular graphs and conclude that, in most cases, the fixed representations provide the best performance. Another interesting aspect of this paper was an attempt to establish a relationship between dataset size and the performance of different molecular representations. While fixed representations performed well on smaller datasets, learned representations didn’t become competitive until between 6K and 100K datapoints were available.
A systematic study of key elements underlying molecular property prediction
A preprint by Kamuntavičius and coworkers at Ro5 examined several widely used ADME datasets to assess the performance of molecular representations, both individually and in combination. The authors found that when tested individually, the RDKit 2D descriptors outperformed fingerprints and representations derived from language models. When examining feature combinations, they found that performance was highly dataset dependent.
Benchmarking ML in ADMET predictions: A Focus on Hypothesis Testing Practices
3.3 Dataset Splitting
In many cases, cross-validation of ML models is performed by randomly splitting a dataset into training and test sets. As many have argued, these random splits can provide overly optimistic estimates of model performance. More recently, many groups have moved away from random splits and are using splits that avoid putting the same scaffold in the training and test sets. While this is an improvement, several subtle issues can confound scaffold splits. To better predict prospective performance, several groups have developed alternate methods for splitting molecular datasets.
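For concreteness, a scaffold split simply groups molecules by their Bemis-Murcko scaffold and assigns whole groups to train or test. Here’s a minimal sketch (mine, not any particular package’s splitter) using RDKit and scikit-learn’s GroupShuffleSplit.

```python
# Minimal scaffold split: group molecules by Bemis-Murcko scaffold and keep
# each scaffold entirely in train or test. Illustration only.
from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.model_selection import GroupShuffleSplit

smiles = ["c1ccccc1O", "c1ccccc1N", "c1ccc2[nH]ccc2c1", "CC(=O)Nc1ccc(O)cc1",
          "CCOc1ccccc1", "c1ccc2ncccc2c1"]
scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in smiles]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(smiles, groups=scaffolds))
print("train scaffolds:", {scaffolds[i] for i in train_idx})
print("test scaffolds: ", {scaffolds[i] for i in test_idx})
```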
A preprint by Tossou and coworkers at Valence Labs (now Recursion) examined several approaches to estimating the performance of ML models in a real-world deployment. The authors evaluated the impact of molecular representations, algorithms, and splitting strategies on the generalization abilities of ML models. In another victory for “classic” algorithms, the authors found that Random Forest was the best option for out-of-domain (OOD) classification, regression, and uncertainty calibration. When comparing representations, they found that 2D and 3D descriptors provided the best uncertainty estimates, while fingerprints provided the best generalization. In a comparison of splitting methods, the authors found that scaffold splits provided the best estimates of generalization, while maximum dissimilarity and random splits provided the best uncertainty estimates.
Real-World Molecular Out-Of-Distribution: Specification and Investigation
Diverse datasets used for hit identification differ significantly from the congeneric datasets encountered during lead optimization. As such, we should tailor our splitting strategies to the task. A preprint from Steshin discusses the differences between hit identification (Hi) and lead optimization (Lo) datasets and proposes different benchmarking strategies for each. When employing a scaffold split on a Hi dataset, the author found that many test set molecules had a neighbor in the training set with an ECFP Tanimoto similarity greater than 0.4. To address this limitation, the author used Integer Linear Programming to develop a splitting method for the Hi benchmarks. For the Lo benchmarks, a clustering strategy was used to divide the data into training and test sets. The author also provides a GitHub repository with the software for generating the splits and preprocessed versions of common (and unfortunately flawed) benchmarks.
Lo-Hi: Practical ML Drug Discovery Benchmark
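The nearest-neighbor check described above is easy to run on your own splits. The sketch below (my illustration, not the Lo-Hi code) reports each test molecule’s maximum ECFP4 Tanimoto similarity to the training set; lots of values above roughly 0.4 suggest a leaky split.

```python
# For each test molecule, compute its maximum Tanimoto similarity to the
# training set; many values above ~0.4 suggest the split is leaky.
# Toy molecules for illustration only, not the Lo-Hi code.
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

train_smiles = ["CC(=O)Nc1ccc(O)cc1", "c1ccc2[nH]ccc2c1", "CCN(CC)CC"]
test_smiles = ["CC(=O)Nc1ccc(OC)cc1", "c1ccccc1O"]

fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
train_fps = [fpgen.GetFingerprint(Chem.MolFromSmiles(s)) for s in train_smiles]

for smi in test_smiles:
    fp = fpgen.GetFingerprint(Chem.MolFromSmiles(smi))
    max_sim = max(DataStructs.BulkTanimotoSimilarity(fp, train_fps))
    print(f"{smi}: nearest-neighbor similarity = {max_sim:.2f}")
```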
As mentioned above, many commonly used dataset-splitting methods allow information from the training set to leak into the test set. When benchmarking docking or activity prediction models, it has been common to split sets of protein-ligand structures from the PDB based on the structure deposition date. Structures deposited before a specific date are used for training, and those deposited after that date are used for validation and/or testing. Unfortunately, this typically results in a test set that contains many structures that are almost identical to those in the training set. Similar issues can impact datasets used for QSAR or ADME modeling. To overcome some of these issues, Joeres and coworkers at the Helmholtz Institute for Pharmaceutical Research Saarland developed Data Splitting Against Information Leakage (DataSAIL). This data-splitting method uses Binary Linear Programming to minimize the overlap between training and test sets. The authors demonstrate that their method scales better than the LoHi splitter, which can bog down when the dataset size approaches 100K.
DataSAIL: Data Splitting Against Information Leakage
As seen above, several methods and software packages exist for molecular dataset splitting. Keeping up with work in the field and installing and learning to apply new methods can be time-consuming. To simplify this process, Burns and coworkers at MIT developed astartes, a Python package that aspires to be the Swiss army knife of dataset splitting. The astartes package currently supports more than a dozen splitting methods using a simple syntax that will be familiar to scikit-learn users.
Machine Learning Validation via Rational Dataset Sampling with astartes
3.4 New Datasets
2023 saw the appearance of a few valuable new benchmark datasets. As mentioned above, the Biogen ADME collection provides high-quality, consistently measured data that will hopefully become a standard for the field.
Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective
While activity cliffs, where small changes in chemical structure bring about large changes in properties or biological activity, are frequently encountered in drug discovery, they are rarely present in benchmark datasets. A paper by van Tilborg and coworkers from the Eindhoven University of Technology set out to remedy this by creating MoleculeACE, a series of datasets designed to evaluate the performance of ML models on data containing activity cliffs. The authors evaluated a range of ML models and representations and found a high correlation between overall performance and the performance on activity cliffs in 25 of the 30 datasets studied.
Exposing the Limitations of Molecular Machine Learning with Activity Cliffs
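A crude way to see whether your own dataset contains activity cliffs is to look for pairs of molecules that are structurally similar but differ substantially in potency. The sketch below (my illustration, with arbitrary thresholds and toy data, not MoleculeACE’s definition) shows the idea.

```python
# Naive activity-cliff finder: flag pairs with high fingerprint similarity
# but a large potency gap. Thresholds and data are arbitrary placeholders,
# not the MoleculeACE definition.
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

data = {  # SMILES -> pIC50 (toy values)
    "CC(=O)Nc1ccc(O)cc1": 6.1,
    "CC(=O)Nc1ccc(OC)cc1": 8.4,
    "c1ccc2[nH]ccc2c1": 5.0,
}

fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = {s: fpgen.GetFingerprint(Chem.MolFromSmiles(s)) for s in data}

SIM_CUTOFF, DELTA_CUTOFF = 0.5, 2.0  # similarity and pIC50 gap (100-fold)
for s1, s2 in combinations(data, 2):
    sim = DataStructs.TanimotoSimilarity(fps[s1], fps[s2])
    delta = abs(data[s1] - data[s2])
    is_cliff = sim >= SIM_CUTOFF and delta >= DELTA_CUTOFF
    print(f"{s1} vs {s2}: sim={sim:.2f}, delta pIC50={delta:.1f}, cliff={is_cliff}")
```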
Perspective: While there were a few benchmarking papers published in 2023 that followed best practices, there were a lot more that didn't. As a field, we must reach a consensus on appropriate datasets and statistical tests for method comparisons. It's great to see so many groups looking at topics like dataset splitting. In the coming year, I hope statistical approaches to comparing methods receive equal attention.