AI in Drug Discovery 2020 - A Highly Opinionated Literature Review

In this post, I present an annotated bibliography of some of the interesting machine learning papers I read in 2020. Please don't be offended if your paper isn't on the list. Leave a comment with other papers you think should be included.

I've tried to organize these papers by topic. Please be aware that the topics, selected papers, and the comments below reflect my own biases. I've endeavored to focus primarily on papers that include source code. Hopefully, this list reflects a few interesting trends I saw this year.

More of a practical focus on active learning
Efforts to address model uncertainty, as well as the admission that it's a very difficult problem
The (re)emergence of molecular representations that incorporate 3D structure
Several interesting strategies for data augmentation
Additional efforts toward model interpretability, coupled with the acknowledgment that this is also a difficult problem
The application of generative models to more practical objectives (e.g. not LogP and QED)

Reviews, Overviews, and Retrospectives

This category is a catch-all collection of reviews, special issues, and overviews.

An issue of Drug Discovery Today: Technologies dedicated to AI edited by Johannes Kirchmair. Lots of great papers here.

https://www.sciencedirect.com/journal/drug-discovery-today-technologies/vol/32/suppl/C

An issue of The Journal of Medicinal Chemistry dedicated to "Artificial Intelligence in Drug Discovery" that I co-edited with Steven Kearns and Jürgen Bajorath.

https://pubs.acs.org/toc/jmcmar/63/16

An overview of methods for molecule generation and chemical space exploration from Connor Coley.

Coley, C. W. Defining and Exploring Chemical Spaces. Trends in Chemistry 2020. https://doi.org/10.1016/j.trechm.2020.11.004.

A practical review from the team at Bayer on their application of machine learning models in drug discovery programs. Essential reading for anyone applying ML to real-world drug discovery.

Göller, A. H.; Kuhnke, L.; Montanari, F.; Bonin, A.; Schneckener, S.; Ter Laak, A.; Wichard, J.; Lobell, M.; Hillisch, A. Bayer’s in Silico ADMET Platform: A Journey of Machine Learning over the Past Two Decades. Drug Discov. Today 2020. https://doi.org/10.1016/j.drudis.2020.07.001.

A description of the AMPL pipeline for ML in drug discovery developed by the group at the ATOM consortium.

Minnich, A. J.; McLoughlin, K.; Tse, M.; Deng, J.; Weber, A.; Murad, N.; Madej, B. D.; Ramsundar, B.; Rush, T.; Calad-Thomson, S.; Brase, J.; Allen, J. E. AMPL: A Data-Driven Modeling Pipeline for Drug Discovery. J. Chem. Inf. Model. 2020, 60 (4), 1955–1968. https://doi.org/10.1021/acs.jcim.9b01053.

A practical review of the role and impact of computer-aided synthesis planning (CASP) in medicinal chemistry from the team at the Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium.

Struble, T. J.; Alvarez, J. C.; Brown, S. P.; Chytil, M.; Cisar, J.; DesJarlais, R. L.; Engkvist, O.; Frank, S. A.; Greve, D. R.; Griffin, D. J.; Hou, X.; Johannes, J. W.; Kreatsoulas, C.; Lahue, B.; Mathea, M.; Mogk, G.; Nicolaou, C. A.; Palmer, A. D.; Price, D. J.; Robinson, R. I.; Salentin, S.; Xing, L.; Jaakkola, T.; Green, W. H.; Barzilay, R.; Coley, C. W.; Jensen, K. F. Current and Future Roles of Artificial Intelligence in Medicinal Chemistry Synthesis. J. Med. Chem. 2020, 63 (16), 8667–8682. https://doi.org/10.1021/acs.jmedchem.9b02120.

Active Learning

As we move toward real and virtual libraries with more than a billion compounds, it becomes computationally expensive to apply methods like docking to an entire library. Active learning provides strategies for efficient screening of subsets of the library. In many cases, we can identify a large portion of the most promising molecules with a fraction of the compute cost.

A nice overview of the topic by Daniel Reker.

Reker, D. Practical Considerations for Active Machine Learning in Drug Discovery. Drug Discov. Today Technol. 2020. https://doi.org/10.1016/j.ddtec.2020.06.001.

An active learning approach to protein-ligand docking. The authors docked 10% of a 1.3 billion molecule library and achieved enrichments of up to 6,000 fold better than random.

Gentile, F.; Agrawal, V.; Hsing, M.; Ton, A.-T.; Ban, F.; Norinder, U.; Gleave, M. E.; Cherkasov, A. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. ACS Cent Sci 2020, 6 (6), 939–949. https://doi.org/10.1021/acscentsci.0c00229.

A preprint from a team at Harvard and MIT covering another active learning workflow for docking that uncovered 88% of the hits by screening 2.4% of the library.

Graff, D. E.; Shakhnovich, E. I.; Coley, C. W. Accelerating High-Throughput Virtual Screening through Molecular Pool-Based Active Learning. arXiv [q-bio.QM], 2020, https://arxiv.org/abs/2012.07127.

Not active learning, per se, but this paper shows how less expensive methods can be used to approximate docking scores. The techniques here could be easily applied to an active learning approach.

Jastrzębski, S.; Szymczak, M.; Pocha, A.; Mordalski, S.; Tabor, J.; Bojarski, A. J.; Podlewska, S. Emulating Docking Results Using a Deep Neural Network: A New Perspective for Virtual Screening. J. Chem. Inf. Model. 2020, 60 (9), 4246–4262. https://doi.org/10.1021/acs.jcim.9b01202.

Machine Learning With 3D Representations

The vast majority of machine learning models for QSAR use 1D and/or 2D representations of molecules. Why? Because it's easy and it works. 3D models should work better, but we lack good ways of capturing conformational ensembles in a representation that's amenable to ML.

In order to generate machine learning models that use a 3D description of molecules, we probably need to have conformational ensembles annotated with accurate strain energies. In this paper, the authors generate conformational ensembles annotated with accurate QM energies. A good starting point for future studies.

Axelrod, S.; Gomez-Bombarelli, R. GEOM: Energy-Annotated Molecular Conformations for Property Prediction and Molecular Generation. arXiv [physics.comp-ph], 2020.

The authors of the paper above use their dataset in conjunction with an ML approach that integrates SchNet with ChemProp. Results on a couple of datasets show performance improvements over 2D methods. One troubling aspect is that the inclusion of multiple conformers doesn't improve the results.

Axelrod, S.; Gomez-Bombarelli, R. Molecular Machine Learning with Conformer Ensembles. arXiv [cs.LG], 2020.

An interesting approach that uses a kernel calculated using SOAP to perform Gaussian process regression (GPR). This paper was good, but there are a couple of areas for improvement. None of the comparisons with other methods had error bars. The authors compared with 2D methods but didn't provide a comparison with GPR using simple Morgan fingerprints. In my hands, GPR with RDKit Morgan2 fingerprints outperformed their reported results.

McCorkindale, W.; Poelking, C.; Lee, A. A. Investigating 3D Atomic Environments for Enhanced QSAR. arXiv [q-bio.QM], 2020, https://arxiv.org/abs/2010.12857.

A very interesting application of multiple instance learning in QSAR. Reported results are competitive with, and in some cases better than, 2D.

Zankov, D. V.; Matveieva, M.; Nikonenko, A.; Nugmanov, R.; Varnek, A.; Polishchuk, P.; Madzhidov, T. QSAR Modeling Based on Conformation Ensembles Using a Multi-Instance Learning Approach. ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.13456277.v1.

Uncertainty

A nice overview of methods for estimating uncertainty from the prolific team at AZ.

Mervin, L. H.; Johansson, S.; Semenova, E.; Giblin, K. A.; Engkvist, O. Uncertainty Quantification in Drug Design. Drug Discov. Today 2020. https://doi.org/10.1016/j.drudis.2020.11.027.

An introduction to conformal prediction, a technique that provides estimates of uncertainty as part of the prediction.

Alvarsson, J.; Arvidsson McShane, S.; Norinder, U.; Spjuth, O. Predicting With Confidence: Using Conformal Prediction in Drug Discovery. J. Pharm. Sci. 2020. https://doi.org/10.1016/j.xphs.2020.09.055.

The use of Gaussian Processes to estimate model uncertainty.

Hie, B.; Bryson, B. D.; Berger, B. A. Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design. cels 2020, 0 (0). https://doi.org/10.1016/j.cels.2020.09.007.

In this paper, the authors evaluate a number of approaches to evaluating model uncertainty for predictions using learned representations and find that no single method provides a consistent estimation of uncertainty.

Hirschfeld, L.; Swanson, K.; Yang, K.; Barzilay, R.; Coley, C. W. Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. J. Chem. Inf. Model. 2020, 60 (8), 3770–3780. https://doi.org/10.1021/acs.jcim.0c00502.

Data and Data Augmentation

What's the single most important factor when building a machine learning model? Data! Having the right data, and an understanding of the uncertainty in that data often makes the difference between useful and useless models.

A good overview of some of the data related issues we have to deal with when building machine learning models using pharmaceutical data.

Rodrigues, T. The Good, the Bad, and the Ugly in Chemical and Biological Data for Machine Learning. Drug Discov. Today Technol. 2020. https://doi.org/10.1016/j.ddtec.2020.07.001.

A few useful tricks for making the most of the data you're using to build structure-based models. The paper also presents some useful techniques for understanding the reasoning behind predictions.

Scantlebury, J.; Brown, N.; von Delft, F.; Deane, C. M. Dataset Augmentation Allows Deep Learning-Based Virtual Screening To Better Generalize To Unseen Target Classes, And Highlight Important Binding Interactions. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00263.

The paucity of available data sometimes limits the quality of the models we can build. This paper describes an approach to increasing the size of a training set by integrating random negative examples.

Cáceres, E. L.; Mew, N. C.; Keiser, M. J. Adding Stochastic Negative Examples into Machine Learning Improves Molecular Bioactivity Prediction. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00565.

Off-Targets

A paper on modeling hERG inhibition from NCATS. The modeling is pretty standard, but the paper includes some new data from a Thallium flux assay that could be useful to those building and validating hERG models.

Siramshetty, V. B.; Nguyen, D.-T.; Martinez, N. J.; Southall, N. T.; Simeonov, A.; Zakharov, A. V. Critical Assessment of Artificial Intelligence Methods for Prediction of hERG Channel Inhibition in the “Big Data” Era. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00884.

Model Interpretability

In order to provide maximum utility, a machine learning model should be able to explain, as well as predict. Ultimately, we'd like to be able to provide insights that will enable drug discovery teams to make better decisions.

A good overview of model interpretability and some of the associated challenges.

Jiménez-Luna, J.; Grisoni, F.; Schneider, G. Drug Discovery with Explainable Artificial Intelligence. Nature Machine Intelligence 2020, 2 (10), 573–584. https://doi.org/10.1038/s42256-020-00236-4.

Adding chemical interpretability to neural network models.

Jimenez-Luna, J.; Skalic, M.; Weskamp, N.; Schneider, G. Coloring Molecules with Explainable Artificial Intelligence for Preclinical Relevance Assessment. https://doi.org/10.26434/chemrxiv.13252286.v1.

This paper was published in 2019, but it's essential reading for anyone interested in model interpretability. Bob presents some systematic examples to assess the impact and validity of methods for highlighting important features.

Sheridan, R. P. Interpretation of QSAR Models by Coloring Atoms According to Changes in Predicted Activity: How Robust Is It? J. Chem. Inf. Model. 2019. https://doi.org/10.1021/acs.jcim.8b00825.

This paper isn't specifically about drug discovery, but it presents some interesting thoughts on the relationship between accuracy and interpretability.

Liu, B.; Udell, M. Impact of Accuracy on Model Interpretations. arXiv [cs.LG], 2020.

An Interesting paper on attribution for graph neural networks from Google Research.

Sanchez-Lengeling, B.; Wei, J.; Lee, B.; Reif, E.; Wang, P. Y.; Qian, W. W.; Mc Closkey, K.; Colwell, L.; Wiltschko, A. Evaluating Attribution for Graph Neural Networks. https://research.google/pubs/pub49909

Generative Models

2020 was the year that generative models exploded. There were dozens of papers published, here are a few that I found interesting.

An application note describing recent work by the team working on the open-source REINVENT package for generative modeling.

Blaschke, T.; Arús-Pous, J.; Chen, H.; Margreitter, C.; Tyrchan, C.; Engkvist, O.; Papadopoulos, K.; Patronov, A. REINVENT 2.0 – an AI Tool for De Novo Drug Design. ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.12058026.v3.

A clever approach that employs "rationales" that are somewhat analogous to matched molecular pairs. The approach provides two significant advances: a degree of interpretability and the ability to more cleanly consider multi-objective optimization.

Jin, W.; Barzilay, R.; Jaakkola, T. Multi-Objective Molecule Generation Using Interpretable Substructures. arXiv [cs.LG], 2020, https://arxiv.org/abs/2002.03244.

An alternative to GuacaMol or MOSES for evaluating the chemical space covered by generated molecules. This one looks at the ability of methods to find functional groups present in GDB.

Zhang, J.; Mercado, R.; Engkvist, O.; Chen, H. Comparative Study of Deep Generative Models on Chemical Space Coverage. ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.13234289.v1.

One of my favorite papers of the year. The GSK group looks at the ability of generative algorithms to reproduce molecules in patents and has medicinal chemists evaluate the output of generative models.

Bush, J. T.; Pogány, P.; Pickett, S. D.; Barker, M.; Baxter, A.; Campos, S.; Cooper, A. W. J.; Hirst, D. J.; Inglis, G.; Nadin, A.; Patel, V. K.; Poole, D.; Pritchard, J.; Washio, Y.; White, G.; Green, D. A Turing Test for Molecular Generators. J. Med. Chem. 2020. https://doi.org/10.1021/acs.jmedchem.0c01148.

Another paper from the Glaxo group with some practical advice on using recurrent neural networks for molecule generation.

Amabilino, S.; Pogany, P.; Pickett, S. D.; Green, D. Guidelines for RNN Transfer Learning Based Molecular Generation of Focussed Libraries. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00343.

Many, if not most, papers on generative models use toy scoring functions like CLogP or QED. It's great to see someone trying to optimize something practical. In this paper, the authors use docking scores as an objective function. While the results aren't great, it's a step in the right direction.

Cieplinski, T.; Danel, T.; Podlewska, S.; Jastrzebski, S. We Should at Least Be Able to Design Molecules That Dock Well. arXiv [q-bio.BM], 2020. https://arxiv.org/abs/2006.16955

An interesting paper where the authors use a generative model for CAVEAT like functionality, also has some clever applications to PROTACs.

Imrie, F.; Bradley, A. R.; van der Schaar, M.; Deane, C. M. Deep Generative Models for 3D Linker Design. J. Chem. Inf. Model. 2020, 60 (4), 1983–1995. https://doi.org/10.1021/acs.jcim.9b01120.

It's great that your generative model came up with a molecule, but at the end of the day, someone has to synthesize it. This paper looks at different ways of integrating synthesizability criteria into generative models.

Gao, W.; Coley, C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00174.

An interesting approach to using generative models to generate close analogs of existing molecules.

Maziarka, Ł.; Pocha, A.; Kaczmarczyk, J.; Rataj, K.; Danel, T.; Warchoł, M. Mol-CycleGAN: A Generative Model for Molecular Optimization. J. Cheminform. 2020, 12 (1), 2. https://doi.org/10.1186/s13321-019-0404-1.

I covered the CReM method for molecule generation in a previous post. In this paper, Pavel, Polishchuk, the author of CReM, presents ways to tune the synthetic feasibility of molecules generated by his method.

Polishchuk, P. Control of Synthetic Feasibility of Compounds Generated with CReM. J. Chem. Inf. Model. 2020, 60 (12), 6074–6080. https://doi.org/10.1021/acs.jcim.0c00792.

An interesting method for molecule generation that doesn't require training or a GPU. Pen had a nice recent post on how to use the method. In my hands, the problem with this method is that it generates a lot of very silly molecules. More on this in an upcoming post.

Nigam, A.; Pollice, R.; Krenn, M.; dos Passos Gomes, G.; Aspuru-Guzik, A. Beyond Generative Models: Superfast Traversal, Optimization, Novelty, Exploration and Discovery (STONED) Algorithm for Molecules Using SELFIES. ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.13383266.v1.

The vast majority of generative models work on 1D text or 2D graph representations of molecules. This is one of the first to use 3D grids as part of the representation.

Ragoza, M.; Masuda, T.; Koes, D. R. Learning a Continuous Representation of 3D Molecular Structures with Deep Generative Models. arXiv [q-bio.QM], 2020, https://arxiv.org/abs/2010.08687.

ML Methodology

This is another catch-all for important methodology related papers.

Gaussian Process Regression (GPR) is a powerful technique for building predictive models. Unfortunately, the method typically scales as N³, where N is the number of molecules, so the application to large datasets can be problematic. This paper from the group at Merck shows how a locally sensitive hashing approach can be used to apply GPR to larger datasets.

DiFranzo, A.; Sheridan, R. P.; Liaw, A.; Tudor, M. Nearest Neighbor Gaussian Process for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00678.

A number of recent papers have shown the use of language models to build molecular representations. These two papers show the application of BERT (Bidirectional Encoder Representations from Transformers), a state of the art natural language processing (NLP) model from Google to QSAR modeling.

Fabian, B.; Edlich, T.; Gaspar, H.; Segler, M.; Meyers, J.; Fiscato, M.; Ahmed, M. Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks. arXiv [cs.LG], 2020, https://arxiv.org/abs/2011.13230.

Chithrananda, S.; Grand, G.; Ramsundar, B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv [cs.LG], 2020, https://arxiv.org/abs/2010.09885.

Methodology Comparison

In numerous posts, I've ranted about the importance of proper validation. Here are a couple of papers that I think are steps in the right direction.

A perspective by Steven Kearnes on Prospective Validation.

Kearnes, S. Pursuing a Prospective Perspective. arXiv [cs.LG], 2020. https://arxiv.org/abs/2009.00707

Recent papers have shown that commonly used benchmark datasets contain significant bias. This paper describes a new method for splitting datasets that reduces bias.

Martin, L. J.; Bowen, M. T. Comparing Fingerprints for Ligand-Based Virtual Screening: A Fast and Scalable Approach for Unbiased Evaluation. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00469.

This isn't an ML paper, but it provides a great example of the right way to compare computational methods.

Hawkins, P. C. D.; Wlodek, S. Decisions with Confidence: Application to the Conformation Sampling of Molecules in the Solid State. J. Chem. Inf. Model. 2020, 60 (7), 3518–3533. https://doi.org/10.1021/acs.jcim.0c00358.

Search This Blog

Practical Cheminformatics

AI in Drug Discovery 2020 - A Highly Opinionated Literature Review

Comments

Post a Comment

Popular posts from this blog

We Need Better Benchmarks for Machine Learning in Drug Discovery

Some Thoughts on Splitting Chemical Datasets

Silly Things Large Language Models Do With Molecules