AI in Drug Discovery 2020 - A Highly Opinionated Literature Review

In this post, I present an annotated bibliography of some of the interesting machine learning papers I read in 2020.   Please don't be offended if your paper isn't on the list.  Leave a comment with other papers you think should be included.

I've tried to organize these papers by topic.  Please be aware that the topics, selected papers, and the comments below reflect my own biases.  I've endeavored to focus primarily on papers that include source code.   Hopefully, this list reflects a few interesting trends I saw this year. 

  • More of a practical focus on active learning
  • Efforts to address model uncertainty, as well as the admission that it's a very difficult problem
  • The (re)emergence of molecular representations that incorporate 3D structure
  • Several interesting strategies for data augmentation
  • Additional efforts toward model interpretability, coupled with the acknowledgment that this is also a difficult problem
  • The application of generative models to more practical objectives (e.g. not LogP and QED)

Reviews, Overviews, and Retrospectives

This category is a catch-all collection of reviews, special issues, and overviews.  

An issue of Drug Discovery Today: Technologies dedicated to AI edited by Johannes Kirchmair.  Lots of great papers here. 

An issue of The Journal of Medicinal Chemistry dedicated to "Artificial Intelligence in Drug Discovery" that I co-edited with Steven Kearnes and Jürgen Bajorath. 

An overview of methods for molecule generation and chemical space exploration from Connor Coley.

A practical review from the team at Bayer on their application of machine learning models in drug discovery programs.  Essential reading for anyone applying ML to real-world drug discovery. 

  • Göller, A. H.; Kuhnke, L.; Montanari, F.; Bonin, A.; Schneckener, S.; Ter Laak, A.; Wichard, J.; Lobell, M.; Hillisch, A. Bayer’s in Silico ADMET Platform: A Journey of Machine Learning over the Past Two Decades. Drug Discov. Today 2020.

A description of the AMPL pipeline for ML in drug discovery developed by the group at the ATOM consortium.

  • Minnich, A. J.; McLoughlin, K.; Tse, M.; Deng, J.; Weber, A.; Murad, N.; Madej, B. D.; Ramsundar, B.; Rush, T.; Calad-Thomson, S.; Brase, J.; Allen, J. E. AMPL: A Data-Driven Modeling Pipeline for Drug Discovery. J. Chem. Inf. Model. 2020, 60 (4), 1955–1968.

A practical review of the role and impact of computer-aided synthesis planning (CASP) in medicinal chemistry from the team at the Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium.

  • Struble, T. J.; Alvarez, J. C.; Brown, S. P.; Chytil, M.; Cisar, J.; DesJarlais, R. L.; Engkvist, O.; Frank, S. A.; Greve, D. R.; Griffin, D. J.; Hou, X.; Johannes, J. W.; Kreatsoulas, C.; Lahue, B.; Mathea, M.; Mogk, G.; Nicolaou, C. A.; Palmer, A. D.; Price, D. J.; Robinson, R. I.; Salentin, S.; Xing, L.; Jaakkola, T.; Green, W. H.; Barzilay, R.; Coley, C. W.; Jensen, K. F. Current and Future Roles of Artificial Intelligence in Medicinal Chemistry Synthesis. J. Med. Chem. 2020, 63 (16), 8667–8682.

Active Learning

As we move toward real and virtual libraries with more than a billion compounds, it becomes computationally expensive to apply methods like docking to an entire library.  Active learning provides strategies for efficient screening of subsets of the library.  In many cases, we can identify a large portion of the most promising molecules with a fraction of the compute cost. 
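
To make the idea concrete, here's a minimal sketch of a greedy, pool-based active learning loop in plain Python.  Everything in it is an assumption for illustration: `expensive_score` is a toy stand-in for docking (higher is better here), each molecule is reduced to a single made-up descriptor, and the surrogate is a simple k-nearest-neighbor average rather than the models used in the papers below.

```python
import random


def expensive_score(x):
    """Toy stand-in for a costly calculation such as docking (higher is better)."""
    return -(x - 0.7) ** 2


def knn_predict(x, training, k=3):
    """Cheap surrogate: mean score of the k nearest already-scored molecules."""
    nearest = sorted(training, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)


def active_learning_screen(pool, n_init=50, batch_size=50, n_rounds=4, seed=0):
    """Score a random starting batch, then repeatedly score the molecules
    the surrogate ranks highest (greedy acquisition)."""
    rng = random.Random(seed)
    unscored = set(range(len(pool)))
    scored = {}  # molecule index -> expensive score

    def evaluate(indices):
        for i in indices:
            scored[i] = expensive_score(pool[i])
            unscored.discard(i)

    evaluate(rng.sample(sorted(unscored), n_init))
    for _ in range(n_rounds):
        training = [(pool[i], s) for i, s in scored.items()]
        ranked = sorted(
            unscored,
            key=lambda i: knn_predict(pool[i], training),
            reverse=True,
        )
        evaluate(ranked[:batch_size])
    return scored


# Hypothetical library: 2,000 molecules, each with one random descriptor.
pool_rng = random.Random(42)
pool = [pool_rng.random() for _ in range(2000)]

scored = active_learning_screen(pool)

# Ground truth is only available here because the "expensive" score is a toy.
true_top = set(
    sorted(range(len(pool)), key=lambda i: expensive_score(pool[i]), reverse=True)[:20]
)
recall = len(true_top & set(scored)) / len(true_top)
fraction_screened = len(scored) / len(pool)
print(f"screened {fraction_screened:.1%} of the library, recovered {recall:.0%} of the top 20")
```

With these toy settings, only 250 of the 2,000 molecules are ever scored with the expensive function; `recall` reports how many of the true top 20 ended up in that subset.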

A nice overview of the topic by Daniel Reker. 

An active learning approach to protein-ligand docking.  The authors docked 10% of a 1.3-billion-molecule library and achieved enrichments of up to 6,000-fold over random selection. 

  • Gentile, F.; Agrawal, V.; Hsing, M.; Ton, A.-T.; Ban, F.; Norinder, U.; Gleave, M. E.; Cherkasov, A. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. ACS Cent Sci 2020, 6 (6), 939–949.

A preprint from a team at Harvard and MIT covering another active learning workflow for docking that uncovered 88% of the hits by screening 2.4% of the library. 

  • Graff, D. E.; Shakhnovich, E. I.; Coley, C. W. Accelerating High-Throughput Virtual Screening through Molecular Pool-Based Active Learning. arXiv [q-bio.QM], 2020.

Not active learning, per se, but this paper shows how less expensive methods can be used to approximate docking scores.  The techniques here could be easily applied to an active learning approach. 

  • Jastrzębski, S.; Szymczak, M.; Pocha, A.; Mordalski, S.; Tabor, J.; Bojarski, A. J.; Podlewska, S. Emulating Docking Results Using a Deep Neural Network: A New Perspective for Virtual Screening. J. Chem. Inf. Model. 2020, 60 (9), 4246–4262.

Machine Learning With 3D Representations

The vast majority of machine learning models for QSAR use 1D and/or 2D representations of molecules. Why? Because it's easy and it works.  3D models should work better, but we lack good ways of capturing conformational ensembles in a representation that's amenable to ML. 

In order to generate machine learning models that use a 3D description of molecules, we probably need to have conformational ensembles annotated with accurate strain energies.  In this paper, the authors generate conformational ensembles annotated with accurate QM energies.  A good starting point for future studies. 

  • Axelrod, S.; Gomez-Bombarelli, R. GEOM: Energy-Annotated Molecular Conformations for Property Prediction and Molecular Generation. arXiv [physics.comp-ph], 2020.

The authors of the paper above use their dataset in conjunction with an ML approach that integrates SchNet with ChemProp.  Results on a couple of datasets show performance improvements over 2D methods.  One troubling aspect is that the inclusion of multiple conformers doesn't improve the results. 

  • Axelrod, S.; Gomez-Bombarelli, R. Molecular Machine Learning with Conformer Ensembles. arXiv [cs.LG], 2020.

An interesting approach that uses a kernel calculated using SOAP to perform Gaussian process regression (GPR).  This paper was good, but there are a couple of areas for improvement.  None of the comparisons with other methods had error bars.  The authors compared with 2D methods but didn't provide a comparison with GPR using simple Morgan fingerprints.  In my hands, GPR with RDKit Morgan2 fingerprints outperformed their reported results. 

A very interesting application of multiple instance learning in QSAR.  Reported results are competitive with, and in some cases better than, 2D.  

  • Zankov, D. V.; Matveieva, M.; Nikonenko, A.; Nugmanov, R.; Varnek, A.; Polishchuk, P.; Madzhidov, T. QSAR Modeling Based on Conformation Ensembles Using a Multi-Instance Learning Approach. ChemRxiv, 2020.


Model Uncertainty

A nice overview of methods for estimating uncertainty from the prolific team at AZ. 

An introduction to conformal prediction, a technique that provides estimates of uncertainty as part of the prediction. 
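
As a rough illustration of the idea, here's a minimal sketch of split (inductive) conformal prediction for regression in plain Python.  The calibration pairs are made-up pIC50-like numbers, the function names are hypothetical, and `alpha = 0.25` (a 75% target coverage) is chosen only to keep the arithmetic easy to follow.  Real implementations wrap an actual trained model and often use more refined nonconformity scores.

```python
import math


def conformal_half_width(calibration, alpha):
    """Interval half-width from held-out (predicted, observed) pairs.

    Uses absolute residuals as nonconformity scores and the usual
    finite-sample rank ceil((n + 1) * (1 - alpha)).
    """
    residuals = sorted(abs(pred - obs) for pred, obs in calibration)
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return residuals[rank - 1]


def predict_interval(prediction, half_width):
    """Symmetric prediction interval around a new point prediction."""
    return (prediction - half_width, prediction + half_width)


# Hypothetical calibration set: (model prediction, measured value).
calibration = [
    (6.0, 6.25), (7.5, 7.0), (5.25, 5.5), (8.0, 7.25), (6.5, 6.5), (7.0, 8.0),
    (5.5, 5.0), (6.75, 6.5), (7.25, 7.75), (6.0, 7.5), (5.0, 5.25),
]

half_width = conformal_half_width(calibration, alpha=0.25)
interval = predict_interval(6.5, half_width)
print(half_width, interval)  # 0.75 (5.75, 7.25)
```

The appeal is that the interval width comes from how wrong the model was on held-out data, not from assumptions baked into the model itself.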

The use of Gaussian Processes to estimate model uncertainty. 

In this paper, the authors compare a number of approaches to estimating model uncertainty for predictions made with learned representations and find that no single method provides consistently reliable uncertainty estimates.  

  • Hirschfeld, L.; Swanson, K.; Yang, K.; Barzilay, R.; Coley, C. W. Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. J. Chem. Inf. Model. 2020, 60 (8), 3770–3780.

Data and Data Augmentation

What's the single most important factor when building a machine learning model? Data!  Having the right data, and an understanding of the uncertainty in that data, often makes the difference between useful and useless models.

A good overview of some of the data-related issues we have to deal with when building machine learning models using pharmaceutical data. 

A few useful tricks for making the most of the data you're using to build structure-based models.  The paper also presents some useful techniques for understanding the reasoning behind predictions. 

  • Scantlebury, J.; Brown, N.; von Delft, F.; Deane, C. M. Dataset Augmentation Allows Deep Learning-Based Virtual Screening To Better Generalize To Unseen Target Classes, And Highlight Important Binding Interactions. J. Chem. Inf. Model. 2020.

The paucity of available data sometimes limits the quality of the models we can build.  This paper describes an approach to increasing the size of a training set by integrating random negative examples. 

  • Cáceres, E. L.; Mew, N. C.; Keiser, M. J. Adding Stochastic Negative Examples into Machine Learning Improves Molecular Bioactivity Prediction. J. Chem. Inf. Model. 2020.
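
The general idea is easy to sketch.  The snippet below is not the authors' protocol, just a minimal illustration under stated assumptions: compound-target pairs that were never measured are sampled at random and labeled inactive (0).  The function name, compound IDs, and targets are all hypothetical.

```python
import random


def add_stochastic_negatives(measured, compounds, targets, n_negatives, seed=0):
    """Return a copy of `measured` plus randomly chosen presumed inactives.

    measured: dict mapping (compound, target) -> label (1 active, 0 inactive).
    Assumes every measured pair is drawn from `compounds` x `targets`.
    """
    n_unmeasured = len(compounds) * len(targets) - len(measured)
    if n_negatives > n_unmeasured:
        raise ValueError("not enough unmeasured pairs to sample from")

    rng = random.Random(seed)
    augmented = dict(measured)
    while len(augmented) < len(measured) + n_negatives:
        pair = (rng.choice(compounds), rng.choice(targets))
        if pair not in augmented:  # never overwrite a real measurement
            augmented[pair] = 0
    return augmented


# Hypothetical toy data.
measured = {
    ("cpd_A", "hERG"): 1,
    ("cpd_B", "hERG"): 0,
    ("cpd_C", "CDK2"): 1,
}
compounds = ["cpd_A", "cpd_B", "cpd_C", "cpd_D"]
targets = ["hERG", "CDK2", "EGFR"]

augmented = add_stochastic_negatives(measured, compounds, targets, n_negatives=4)
print(len(measured), "measured pairs ->", len(augmented), "training pairs")
```

The obvious caveat is that some of the sampled "negatives" will really be untested actives, so this trades a little label noise for a much larger training set.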


A paper on modeling hERG inhibition from NCATS.  The modeling is pretty standard, but the paper includes some new data from a thallium flux assay that could be useful to those building and validating hERG models.

  • Siramshetty, V. B.; Nguyen, D.-T.; Martinez, N. J.; Southall, N. T.; Simeonov, A.; Zakharov, A. V. Critical Assessment of Artificial Intelligence Methods for Prediction of hERG Channel Inhibition in the “Big Data” Era. J. Chem. Inf. Model. 2020.

Model Interpretability

In order to provide maximum utility, a machine learning model should be able to explain, as well as predict.  Ultimately, we'd like to be able to provide insights that will enable drug discovery teams to make better decisions. 

A good overview of model interpretability and some of the associated challenges. 

Adding chemical interpretability to neural network models.

This paper was published in 2019, but it's essential reading for anyone interested in model interpretability.  Bob presents some systematic examples to assess the impact and validity of methods for highlighting important features. 

This paper isn't specifically about drug discovery, but it presents some interesting thoughts on the relationship between accuracy and interpretability. 

An interesting paper on attribution for graph neural networks from Google Research. 

  • Sanchez-Lengeling, B.; Wei, J.; Lee, B.; Reif, E.; Wang, P. Y.; Qian, W. W.; Mc Closkey, K.; Colwell, L.; Wiltschko, A. Evaluating Attribution for Graph Neural Networks.

Generative Models

2020 was the year that generative models exploded.  Dozens of papers were published; here are a few that I found interesting. 

An application note describing recent work by the team working on the open-source REINVENT package for generative modeling. 

  • Blaschke, T.; Arús-Pous, J.; Chen, H.; Margreitter, C.; Tyrchan, C.; Engkvist, O.; Papadopoulos, K.; Patronov, A. REINVENT 2.0 – an AI Tool for De Novo Drug Design. ChemRxiv, 2020.

A clever approach that employs "rationales" that are somewhat analogous to matched molecular pairs.  The approach provides two significant advances:  a degree of interpretability and the ability to more cleanly consider multi-objective optimization. 

  • Jin, W.; Barzilay, R.; Jaakkola, T. Multi-Objective Molecule Generation Using Interpretable Substructures. arXiv [cs.LG], 2020.

An alternative to GuacaMol or MOSES for evaluating the chemical space covered by generated molecules.  This one looks at the ability of methods to find functional groups present in GDB. 

One of my favorite papers of the year.  The GSK group looks at the ability of generative algorithms to reproduce molecules in patents and has medicinal chemists evaluate the output of generative models. 

  • Bush, J. T.; Pogány, P.; Pickett, S. D.; Barker, M.; Baxter, A.; Campos, S.; Cooper, A. W. J.; Hirst, D. J.; Inglis, G.; Nadin, A.; Patel, V. K.; Poole, D.; Pritchard, J.; Washio, Y.; White, G.; Green, D. A Turing Test for Molecular Generators. J. Med. Chem. 2020.

Another paper from the Glaxo group with some practical advice on using recurrent neural networks for molecule generation. 

  • Amabilino, S.; Pogany, P.; Pickett, S. D.; Green, D. Guidelines for RNN Transfer Learning Based Molecular Generation of Focussed Libraries. J. Chem. Inf. Model. 2020.

Many, if not most, papers on generative models use toy scoring functions like CLogP or QED.  It's great to see someone trying to optimize something practical.   In this paper, the authors use docking scores as an objective function.  While the results aren't great, it's a step in the right direction. 

  • Cieplinski, T.; Danel, T.; Podlewska, S.; Jastrzebski, S. We Should at Least Be Able to Design Molecules That Dock Well. arXiv [q-bio.BM], 2020.

An interesting paper where the authors use a generative model for CAVEAT-like functionality.  It also has some clever applications to PROTACs. 

It's great that your generative model came up with a molecule, but at the end of the day, someone has to synthesize it.  This paper looks at different ways of integrating synthesizability criteria into generative models. 

An interesting approach to using generative models to generate close analogs of existing molecules. 

  • Maziarka, Ł.; Pocha, A.; Kaczmarczyk, J.; Rataj, K.; Danel, T.; Warchoł, M. Mol-CycleGAN: A Generative Model for Molecular Optimization. J. Cheminform. 2020, 12 (1), 2.

I covered the CReM method for molecule generation in a previous post.  In this paper, Pavel Polishchuk, the author of CReM, presents ways to tune the synthetic feasibility of molecules generated by his method. 

An interesting method for molecule generation that doesn't require training or a GPU.  Pen had a nice recent post on how to use the method.  In my hands, the problem with this method is that it generates a lot of very silly molecules.  More on this in an upcoming post. 

  • Nigam, A.; Pollice, R.; Krenn, M.; dos Passos Gomes, G.; Aspuru-Guzik, A. Beyond Generative Models: Superfast Traversal, Optimization, Novelty, Exploration and Discovery (STONED) Algorithm for Molecules Using SELFIES. ChemRxiv, 2020.

The vast majority of generative models work on 1D text or 2D graph representations of molecules.  This is one of the first to use 3D grids as part of the representation. 

  • Ragoza, M.; Masuda, T.; Koes, D. R. Learning a Continuous Representation of 3D Molecular Structures with Deep Generative Models. arXiv [q-bio.QM], 2020.

ML Methodology

This is another catch-all for important methodology-related papers. 

Gaussian Process Regression (GPR) is a powerful technique for building predictive models.  Unfortunately, the method typically scales as N^3, where N is the number of molecules, so applying it to large datasets can be problematic.  This paper from the group at Merck shows how a locality-sensitive hashing approach can be used to apply GPR to larger datasets. 

  • DiFranzo, A.; Sheridan, R. P.; Liaw, A.; Tudor, M. Nearest Neighbor Gaussian Process for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2020.
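
For anyone wondering where the cubic cost comes from, the standard textbook expression for the exact GPR prediction is sketched below.  The notation is an assumption for illustration and isn't taken from the paper, and the local variant in the comments is the generic nearest-neighbor idea, not necessarily the exact scheme the authors use.

```latex
% Exact GPR posterior mean for a test molecule x_*, given N training molecules
% with kernel matrix K, noise variance \sigma_n^2, and measured values y.
\mu(x_*) = \mathbf{k}_*^{\top} \left( K + \sigma_n^{2} I \right)^{-1} \mathbf{y},
\qquad K \in \mathbb{R}^{N \times N}

% Factorizing the N x N matrix (e.g. by Cholesky decomposition) takes
% O(N^3) time and O(N^2) memory.
% A local model built from only the m nearest neighbors of x_* instead costs
% O(m^3) per prediction, with m \ll N.
```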

A number of recent papers have shown the use of language models to build molecular representations.  These two papers apply BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art natural language processing (NLP) model from Google, to QSAR modeling. 

  • Fabian, B.; Edlich, T.; Gaspar, H.; Segler, M.; Meyers, J.; Fiscato, M.; Ahmed, M. Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks. arXiv [cs.LG], 2020.
  • Chithrananda, S.; Grand, G.; Ramsundar, B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv [cs.LG], 2020.

Methodology Comparison

In numerous posts, I've ranted about the importance of proper validation.  Here are a couple of papers that I think are steps in the right direction. 

A perspective by Steven Kearnes on Prospective Validation. 

Recent papers have shown that commonly used benchmark datasets contain significant bias. This paper describes a new method for splitting datasets that reduces bias.

  • Martin, L. J.; Bowen, M. T. Comparing Fingerprints for Ligand-Based Virtual Screening: A Fast and Scalable Approach for Unbiased Evaluation. J. Chem. Inf. Model. 2020.

This isn't an ML paper, but it provides a great example of the right way to compare computational methods. 

  • Hawkins, P. C. D.; Wlodek, S. Decisions with Confidence: Application to the Conformation Sampling of Molecules in the Solid State. J. Chem. Inf. Model. 2020, 60 (7), 3518–3533.



