AI in Drug Discovery 2020 - A Highly Opinionated Literature Review
In this post, I present an annotated bibliography of some of the interesting machine learning papers I read in 2020. Please don't be offended if your paper isn't on the list. Leave a comment with other papers you think should be included.
I've tried to organize these papers by topic. Please be aware that the topics, selected papers, and the comments below reflect my own biases. I've endeavored to focus primarily on papers that include source code. Hopefully, this list reflects a few interesting trends I saw this year.
- More of a practical focus on active learning
- Efforts to address model uncertainty, as well as the admission that it's a very difficult problem
- The (re)emergence of molecular representations that incorporate 3D structure
- Several interesting strategies for data augmentation
- Additional efforts toward model interpretability, coupled with the acknowledgment that this is also a difficult problem
- The application of generative models to more practical objectives (i.e., something other than LogP and QED)
Reviews, Overviews, and Retrospectives
This category is a catch-all collection of reviews, special issues, and overviews.
An issue of Drug Discovery Today: Technologies dedicated to AI edited by Johannes Kirchmair. Lots of great papers here.
An issue of the Journal of Medicinal Chemistry dedicated to "Artificial Intelligence in Drug Discovery" that I co-edited with Steven Kearnes and Jürgen Bajorath.
An overview of methods for molecule generation and chemical space exploration from Connor Coley.
- Coley, C. W. Defining and Exploring Chemical Spaces. Trends in Chemistry 2020. https://doi.org/10.1016/j.trechm.2020.11.004.
A practical review from the team at Bayer on their application of machine learning models in drug discovery programs. Essential reading for anyone applying ML to real-world drug discovery.
- Göller, A. H.; Kuhnke, L.; Montanari, F.; Bonin, A.; Schneckener, S.; Ter Laak, A.; Wichard, J.; Lobell, M.; Hillisch, A. Bayer’s in Silico ADMET Platform: A Journey of Machine Learning over the Past Two Decades. Drug Discov. Today 2020. https://doi.org/10.1016/j.drudis.2020.07.001.
- Minnich, A. J.; McLoughlin, K.; Tse, M.; Deng, J.; Weber, A.; Murad, N.; Madej, B. D.; Ramsundar, B.; Rush, T.; Calad-Thomson, S.; Brase, J.; Allen, J. E. AMPL: A Data-Driven Modeling Pipeline for Drug Discovery. J. Chem. Inf. Model. 2020, 60 (4), 1955–1968. https://doi.org/10.1021/acs.jcim.9b01053.
- Struble, T. J.; Alvarez, J. C.; Brown, S. P.; Chytil, M.; Cisar, J.; DesJarlais, R. L.; Engkvist, O.; Frank, S. A.; Greve, D. R.; Griffin, D. J.; Hou, X.; Johannes, J. W.; Kreatsoulas, C.; Lahue, B.; Mathea, M.; Mogk, G.; Nicolaou, C. A.; Palmer, A. D.; Price, D. J.; Robinson, R. I.; Salentin, S.; Xing, L.; Jaakkola, T.; Green, W. H.; Barzilay, R.; Coley, C. W.; Jensen, K. F. Current and Future Roles of Artificial Intelligence in Medicinal Chemistry Synthesis. J. Med. Chem. 2020, 63 (16), 8667–8682. https://doi.org/10.1021/acs.jmedchem.9b02120.
Active Learning
As we move toward real and virtual libraries with more than a billion compounds, it becomes computationally expensive to apply methods like docking to an entire library. Active learning provides strategies for efficient screening of subsets of the library. In many cases, we can identify a large portion of the most promising molecules with a fraction of the compute cost.
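To make the idea concrete, here's a toy sketch of a pool-based active learning loop. Everything in it is an assumption made for illustration: `expensive_score` is a hypothetical stand-in for a docking run, the "molecules" are random 2D points, and the surrogate is a simple 1-nearest-neighbor lookup rather than the models used in the papers in this section. The loop scores a random seed set, uses the surrogate to rank the unscored molecules, scores the most promising batch, and repeats.

```python
import random


def expensive_score(x):
    """Hypothetical stand-in for a docking calculation (lower is better)."""
    return (x[0] - 0.7) ** 2 + (x[1] - 0.2) ** 2


def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def predict(x, labeled):
    """1-nearest-neighbor surrogate: reuse the score of the closest scored point."""
    return min(labeled, key=lambda item: sq_dist(x, item[0]))[1]


def active_screen(library, n_init=25, batch=25, rounds=5, seed=0):
    """Greedy active learning: score only the molecules the surrogate likes best."""
    rng = random.Random(seed)
    scored = {}  # library index -> expensive score
    for i in rng.sample(range(len(library)), n_init):
        scored[i] = expensive_score(library[i])
    for _ in range(rounds):
        labeled = [(library[i], s) for i, s in scored.items()]
        pool = [i for i in range(len(library)) if i not in scored]
        # Rank the unscored pool by predicted score and pay for the best batch.
        pool.sort(key=lambda i: predict(library[i], labeled))
        for i in pool[:batch]:
            scored[i] = expensive_score(library[i])
    return scored


# Toy "library": 1,000 random points in the unit square.
lib_rng = random.Random(42)
library = [(lib_rng.random(), lib_rng.random()) for _ in range(1000)]
true_scores = [expensive_score(x) for x in library]
top = set(sorted(range(len(library)), key=true_scores.__getitem__)[:50])

active = active_screen(library)
random_pick = set(random.Random(1).sample(range(len(library)), len(active)))
active_hits = len(top & set(active))
random_hits = len(top & random_pick)
print(f"scored {len(active)} of {len(library)} molecules")
print(f"top-50 molecules found: active={active_hits}, random={random_hits}")
```

With a smooth toy objective like this one, the greedy loop should recover far more of the top-scoring points than a random sample of the same size. Real docking landscapes are much rougher, so don't read too much into the numbers.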
A nice overview of the topic by Daniel Reker.
- Reker, D. Practical Considerations for Active Machine Learning in Drug Discovery. Drug Discov. Today Technol. 2020. https://doi.org/10.1016/j.ddtec.2020.06.001.
An active learning approach to protein-ligand docking. The authors docked 10% of a 1.3 billion molecule library and achieved enrichments of up to 6,000-fold over random selection.
- Gentile, F.; Agrawal, V.; Hsing, M.; Ton, A.-T.; Ban, F.; Norinder, U.; Gleave, M. E.; Cherkasov, A. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. ACS Cent Sci 2020, 6 (6), 939–949. https://doi.org/10.1021/acscentsci.0c00229.
- Graff, D. E.; Shakhnovich, E. I.; Coley, C. W. Accelerating High-Throughput Virtual Screening through Molecular Pool-Based Active Learning. arXiv [q-bio.QM], 2020, https://arxiv.org/abs/2012.07127.
Not active learning, per se, but this paper shows how less expensive methods can be used to approximate docking scores. The techniques here could be easily applied to an active learning approach.
- Jastrzębski, S.; Szymczak, M.; Pocha, A.; Mordalski, S.; Tabor, J.; Bojarski, A. J.; Podlewska, S. Emulating Docking Results Using a Deep Neural Network: A New Perspective for Virtual Screening. J. Chem. Inf. Model. 2020, 60 (9), 4246–4262. https://doi.org/10.1021/acs.jcim.9b01202.
Machine Learning With 3D Representations
The vast majority of machine learning models for QSAR use 1D and/or 2D representations of molecules. Why? Because it's easy and it works. 3D models should work better, but we lack good ways of capturing conformational ensembles in a representation that's amenable to ML.
In order to generate machine learning models that use a 3D description of molecules, we probably need to have conformational ensembles annotated with accurate strain energies. In this paper, the authors generate conformational ensembles annotated with accurate QM energies. A good starting point for future studies.
- Axelrod, S.; Gomez-Bombarelli, R. GEOM: Energy-Annotated Molecular Conformations for Property Prediction and Molecular Generation. arXiv [physics.comp-ph], 2020.
- Axelrod, S.; Gomez-Bombarelli, R. Molecular Machine Learning with Conformer Ensembles. arXiv [cs.LG], 2020.
- McCorkindale, W.; Poelking, C.; Lee, A. A. Investigating 3D Atomic Environments for Enhanced QSAR. arXiv [q-bio.QM], 2020, https://arxiv.org/abs/2010.12857.
- Zankov, D. V.; Matveieva, M.; Nikonenko, A.; Nugmanov, R.; Varnek, A.; Polishchuk, P.; Madzhidov, T. QSAR Modeling Based on Conformation Ensembles Using a Multi-Instance Learning Approach. ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.13456277.v1.
Model Uncertainty
A nice overview of methods for estimating uncertainty from the prolific team at AZ.
- Mervin, L. H.; Johansson, S.; Semenova, E.; Giblin, K. A.; Engkvist, O. Uncertainty Quantification in Drug Design. Drug Discov. Today 2020. https://doi.org/10.1016/j.drudis.2020.11.027.
- Alvarsson, J.; Arvidsson McShane, S.; Norinder, U.; Spjuth, O. Predicting With Confidence: Using Conformal Prediction in Drug Discovery. J. Pharm. Sci. 2020. https://doi.org/10.1016/j.xphs.2020.09.055.
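For readers who haven't run into conformal prediction, here's a minimal sketch of one common variant, split (inductive) conformal regression. This is my own illustration, not code from the paper above, and the function names and numbers are made up. The idea is to hold out a calibration set, collect the absolute errors the model makes on it, and use a quantile of those errors as the half-width of the prediction interval for new molecules.

```python
import math


def conformal_half_width(calibration_residuals, alpha=0.1):
    """Half-width of a (1 - alpha) interval from absolute calibration errors.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual, the usual
    finite-sample choice for split conformal regression.
    """
    n = len(calibration_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to support this confidence level.
        return math.inf
    return sorted(calibration_residuals)[rank - 1]


def prediction_interval(y_pred, half_width):
    """Symmetric interval around a point prediction."""
    return (y_pred - half_width, y_pred + half_width)


# Hypothetical absolute errors (in log units) on 19 held-out calibration molecules.
residuals = [0.05 * i for i in range(1, 20)]
q = conformal_half_width(residuals, alpha=0.1)
low, high = prediction_interval(6.5, q)  # 6.5 is a made-up predicted pIC50
print(f"90% interval: {low:.2f} to {high:.2f}")
```

The appeal of the approach is that the interval comes with a coverage guarantee under fairly mild assumptions (exchangeable calibration and test data), regardless of which model produced the point predictions.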
The use of Gaussian Processes to estimate model uncertainty.
- Hie, B.; Bryson, B. D.; Berger, B. A. Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design. Cell Syst. 2020. https://doi.org/10.1016/j.cels.2020.09.007.
In this paper, the authors compare a number of approaches to estimating model uncertainty for predictions made with learned representations and find that no single method provides a consistently reliable estimate.
- Hirschfeld, L.; Swanson, K.; Yang, K.; Barzilay, R.; Coley, C. W. Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. J. Chem. Inf. Model. 2020, 60 (8), 3770–3780. https://doi.org/10.1021/acs.jcim.0c00502.
Data and Data Augmentation
What's the single most important factor when building a machine learning model? Data! Having the right data, and an understanding of the uncertainty in that data, often makes the difference between useful and useless models.
A good overview of some of the data-related issues we have to deal with when building machine learning models using pharmaceutical data.
- Rodrigues, T. The Good, the Bad, and the Ugly in Chemical and Biological Data for Machine Learning. Drug Discov. Today Technol. 2020. https://doi.org/10.1016/j.ddtec.2020.07.001.
A few useful tricks for making the most of the data you're using to build structure-based models. The paper also presents some useful techniques for understanding the reasoning behind predictions.
- Scantlebury, J.; Brown, N.; von Delft, F.; Deane, C. M. Dataset Augmentation Allows Deep Learning-Based Virtual Screening To Better Generalize To Unseen Target Classes, And Highlight Important Binding Interactions. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00263.
The paucity of available data sometimes limits the quality of the models we can build. This paper describes an approach to increasing the size of a training set by integrating random negative examples.
- Cáceres, E. L.; Mew, N. C.; Keiser, M. J. Adding Stochastic Negative Examples into Machine Learning Improves Molecular Bioactivity Prediction. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00565.
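As a rough illustration of the general idea (not the authors' implementation), here's a sketch that pads a set of known actives with randomly chosen compound-target pairs that are assumed to be inactive. The function name, the toy identifiers, and the choice to enumerate every candidate pair are all my own assumptions; for a real dataset you'd want to sample pairs lazily rather than build the full list.

```python
import random


def add_stochastic_negatives(actives, compounds, targets, n_negatives, seed=0):
    """Return (compound, target, label) rows: known actives (1) plus random
    compound-target pairs with no recorded activity, presumed inactive (0)."""
    rng = random.Random(seed)
    known = set(actives)
    # Every pair with no recorded activity is a candidate negative.
    # Fine for a toy example; far too large to enumerate for real libraries.
    candidates = [(c, t) for c in compounds for t in targets if (c, t) not in known]
    negatives = rng.sample(candidates, min(n_negatives, len(candidates)))
    return [(c, t, 1) for c, t in actives] + [(c, t, 0) for c, t in negatives]


# Hypothetical toy data: five compounds, three targets, three known actives.
compounds = ["cpd_1", "cpd_2", "cpd_3", "cpd_4", "cpd_5"]
targets = ["target_A", "target_B", "target_C"]
actives = [("cpd_1", "target_A"), ("cpd_2", "target_B"), ("cpd_3", "target_A")]

rows = add_stochastic_negatives(actives, compounds, targets, n_negatives=6)
for row in rows:
    print(row)
```

The obvious caveat is that some of the "negatives" will actually be active. The bet is that true actives are rare enough that the mislabeled pairs do less harm than the extra training data does good.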
A paper on modeling hERG inhibition from NCATS. The modeling is pretty standard, but the paper includes some new data from a thallium flux assay that could be useful to those building and validating hERG models.
- Siramshetty, V. B.; Nguyen, D.-T.; Martinez, N. J.; Southall, N. T.; Simeonov, A.; Zakharov, A. V. Critical Assessment of Artificial Intelligence Methods for Prediction of hERG Channel Inhibition in the “Big Data” Era. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00884.
Model Interpretability
In order to provide maximum utility, a machine learning model should be able to explain, as well as predict. Ultimately, we'd like to be able to provide insights that will enable drug discovery teams to make better decisions.
A good overview of model interpretability and some of the associated challenges.
- Jiménez-Luna, J.; Grisoni, F.; Schneider, G. Drug Discovery with Explainable Artificial Intelligence. Nature Machine Intelligence 2020, 2 (10), 573–584. https://doi.org/10.1038/s42256-020-00236-4.
Adding chemical interpretability to neural network models.
- Jimenez-Luna, J.; Skalic, M.; Weskamp, N.; Schneider, G. Coloring Molecules with Explainable Artificial Intelligence for Preclinical Relevance Assessment. ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.13252286.v1.
This paper was published in 2019, but it's essential reading for anyone interested in model interpretability. Bob presents some systematic examples to assess the impact and validity of methods for highlighting important features.
- Sheridan, R. P. Interpretation of QSAR Models by Coloring Atoms According to Changes in Predicted Activity: How Robust Is It? J. Chem. Inf. Model. 2019. https://doi.org/10.1021/acs.jcim.8b00825.
- Liu, B.; Udell, M. Impact of Accuracy on Model Interpretations. arXiv [cs.LG], 2020.
- Sanchez-Lengeling, B.; Wei, J.; Lee, B.; Reif, E.; Wang, P. Y.; Qian, W. W.; McCloskey, K.; Colwell, L.; Wiltschko, A. Evaluating Attribution for Graph Neural Networks. https://research.google/pubs/pub49909
Generative Models
2020 was the year that generative models exploded. Dozens of papers were published; here are a few that I found interesting.
An application note describing recent work by the team working on the open-source REINVENT package for generative modeling.
- Blaschke, T.; Arús-Pous, J.; Chen, H.; Margreitter, C.; Tyrchan, C.; Engkvist, O.; Papadopoulos, K.; Patronov, A. REINVENT 2.0 – an AI Tool for De Novo Drug Design. ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.12058026.v3.
- Jin, W.; Barzilay, R.; Jaakkola, T. Multi-Objective Molecule Generation Using Interpretable Substructures. arXiv [cs.LG], 2020, https://arxiv.org/abs/2002.03244.
- Zhang, J.; Mercado, R.; Engkvist, O.; Chen, H. Comparative Study of Deep Generative Models on Chemical Space Coverage. ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.13234289.v1.
One of my favorite papers of the year. The GSK group looks at the ability of generative algorithms to reproduce molecules in patents and has medicinal chemists evaluate the output of generative models.
- Bush, J. T.; Pogány, P.; Pickett, S. D.; Barker, M.; Baxter, A.; Campos, S.; Cooper, A. W. J.; Hirst, D. J.; Inglis, G.; Nadin, A.; Patel, V. K.; Poole, D.; Pritchard, J.; Washio, Y.; White, G.; Green, D. A Turing Test for Molecular Generators. J. Med. Chem. 2020. https://doi.org/10.1021/acs.jmedchem.0c01148.
Another paper from the Glaxo group with some practical advice on using recurrent neural networks for molecule generation.
- Amabilino, S.; Pogany, P.; Pickett, S. D.; Green, D. Guidelines for RNN Transfer Learning Based Molecular Generation of Focussed Libraries. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00343.
Many, if not most, papers on generative models use toy scoring functions like CLogP or QED. It's great to see someone trying to optimize something practical. In this paper, the authors use docking scores as an objective function. While the results aren't great, it's a step in the right direction.
- Cieplinski, T.; Danel, T.; Podlewska, S.; Jastrzebski, S. We Should at Least Be Able to Design Molecules That Dock Well. arXiv [q-bio.BM], 2020. https://arxiv.org/abs/2006.16955
- Imrie, F.; Bradley, A. R.; van der Schaar, M.; Deane, C. M. Deep Generative Models for 3D Linker Design. J. Chem. Inf. Model. 2020, 60 (4), 1983–1995. https://doi.org/10.1021/acs.jcim.9b01120.
It's great that your generative model came up with a molecule, but at the end of the day, someone has to synthesize it. This paper looks at different ways of integrating synthesizability criteria into generative models.
- Gao, W.; Coley, C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00174.
- Maziarka, Ł.; Pocha, A.; Kaczmarczyk, J.; Rataj, K.; Danel, T.; Warchoł, M. Mol-CycleGAN: A Generative Model for Molecular Optimization. J. Cheminform. 2020, 12 (1), 2. https://doi.org/10.1186/s13321-019-0404-1.
I covered the CReM method for molecule generation in a previous post. In this paper, Pavel Polishchuk, the author of CReM, presents ways to tune the synthetic feasibility of molecules generated by his method.
- Polishchuk, P. Control of Synthetic Feasibility of Compounds Generated with CReM. J. Chem. Inf. Model. 2020, 60 (12), 6074–6080. https://doi.org/10.1021/acs.jcim.0c00792.
An interesting method for molecule generation that doesn't require training or a GPU. Pen had a nice recent post on how to use the method. In my hands, the problem with this method is that it generates a lot of very silly molecules. More on this in an upcoming post.
- Nigam, A.; Pollice, R.; Krenn, M.; dos Passos Gomes, G.; Aspuru-Guzik, A. Beyond Generative Models: Superfast Traversal, Optimization, Novelty, Exploration and Discovery (STONED) Algorithm for Molecules Using SELFIES. ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.13383266.v1.
The vast majority of generative models work on 1D text or 2D graph representations of molecules. This is one of the first to use 3D grids as part of the representation.
- Ragoza, M.; Masuda, T.; Koes, D. R. Learning a Continuous Representation of 3D Molecular Structures with Deep Generative Models. arXiv [q-bio.QM], 2020, https://arxiv.org/abs/2010.08687.
Methodology
This is another catch-all for important methodology-related papers.
Gaussian Process Regression (GPR) is a powerful technique for building predictive models. Unfortunately, the method typically scales as O(N³), where N is the number of molecules, so the application to large datasets can be problematic. This paper from the group at Merck shows how a locality-sensitive hashing approach can be used to apply GPR to larger datasets.
- DiFranzo, A.; Sheridan, R. P.; Liaw, A.; Tudor, M. Nearest Neighbor Gaussian Process for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00678.
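To show why restricting a Gaussian process to nearby training points helps with the scaling, here's a minimal pure-Python sketch. It is not the method from the paper: I find neighbors by brute force rather than by hashing, and the RBF kernel, zero prior mean, `length_scale`, and `noise` values are all assumptions chosen for the toy example. The point is only that each prediction solves a k-by-k linear system (cost on the order of k³) instead of an N-by-N one.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * d2 / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def local_gp_predict(x_star, X, y, k=10, noise=1e-6, length_scale=1.0):
    """Zero-mean GP prediction (mean, variance) from the k nearest training points."""
    # Brute-force neighbor search, used here as a stand-in for a hashing scheme.
    order = sorted(range(len(X)), key=lambda i: sum((p - q) ** 2 for p, q in zip(X[i], x_star)))
    Xk = [X[i] for i in order[:k]]
    yk = [y[i] for i in order[:k]]
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(Xk)] for i, a in enumerate(Xk)]
    k_star = [rbf(x_star, a, length_scale) for a in Xk]
    mean = sum(ks * al for ks, al in zip(k_star, solve(K, yk)))
    var = 1.0 - sum(ks * v for ks, v in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)


# Toy 1D data: 200 points sampled from a sine curve.
X = [[0.1 * i] for i in range(200)]
y = [math.sin(x[0]) for x in X]

mean_near, var_near = local_gp_predict([5.05], X, y)
mean_far, var_far = local_gp_predict([100.0], X, y)
print(f"inside the data:  mean={mean_near:.3f}, variance={var_near:.2e}")
print(f"far from the data: mean={mean_far:.3f}, variance={var_far:.2e}")
```

The second query lands far from any training point, so the predictive variance climbs back toward the prior. That behavior is what makes GPR attractive for uncertainty estimates in the first place.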
- Fabian, B.; Edlich, T.; Gaspar, H.; Segler, M.; Meyers, J.; Fiscato, M.; Ahmed, M. Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks. arXiv [cs.LG], 2020, https://arxiv.org/abs/2011.13230.
- Chithrananda, S.; Grand, G.; Ramsundar, B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv [cs.LG], 2020, https://arxiv.org/abs/2010.09885.
Validation
In numerous posts, I've ranted about the importance of proper validation. Here are a couple of papers that I think are steps in the right direction.
A perspective by Steven Kearnes on prospective validation.
- Kearnes, S. Pursuing a Prospective Perspective. arXiv [cs.LG], 2020. https://arxiv.org/abs/2009.00707
- Martin, L. J.; Bowen, M. T. Comparing Fingerprints for Ligand-Based Virtual Screening: A Fast and Scalable Approach for Unbiased Evaluation. J. Chem. Inf. Model. 2020. https://doi.org/10.1021/acs.jcim.0c00469.
- Hawkins, P. C. D.; Wlodek, S. Decisions with Confidence: Application to the Conformation Sampling of Molecules in the Solid State. J. Chem. Inf. Model. 2020, 60 (7), 3518–3533. https://doi.org/10.1021/acs.jcim.0c00358.