AI in Drug Discovery 2022 - A Highly Opinionated Literature Review

Here’s a roundup of some of the papers I found interesting in 2022. This list is heavily slanted to my interests, which lean toward the application of machine learning (ML) in drug design.  I’ve added commentary to most of the papers to explain why I found them compelling.  I’ve done my best to arrange the papers according to themes.  If I omitted a paper, please let me know.  I’d be happy to update this summary.   This review ended up being longer than I had anticipated, and there are several topics I didn’t cover.  If I have some time, this post may get a sequel. 


1. Are Deep Neural Networks Better for QSAR?
2. Deep Learning Methods Provide New Approaches to Protein-Ligand Docking
3. Protein Structure Prediction - Pushing AlphaFold2 in New Directions
4. Model Interpretability
5. QM Methods
6. Ultralarge Chemical Libraries
7. Active Learning
8. Molecular Representation

1. Are Deep Neural Networks Better for QSAR?  

Based on papers I read and reviewed in 2022, there seems to be a perception that Deep Neural Networks have become ubiquitous in QSAR modeling.  In fact, the introduction to a recent special issue of JCIM contained this statement.    

“However, after the Kaggle Merck Molecular Activity Challenge 2013  and the Tox21 Data Challenge 2015, DNNs have emerged as the method of choice for QSAR applications in drug discovery.”  

While I’ve found DNNs useful in some circumstances, I don’t believe they’ve become “the method of choice.”  Many of the literature examples I’ve seen where deep neural networks (DNNs) outperform more traditional methods like random forest (RF) involve very large datasets containing tens of thousands of molecules.  Here are some historical examples for context. 

Analyzing Learned Molecular Representations for Property Prediction

Improvement in ADMET Prediction with Multitask Deep Featurization

A few papers published in 2022 provided interesting comparisons between DNNs and other methods. At the end of the day, it’s difficult to call a clear winner.  A paper from AstraZeneca evaluated the ability of several machine learning methods, including partial least squares (PLS), random forest regression (RF), support vector regression (SVR), gradient boosted trees (XGBoost), and a deep neural network (DNN), to predict non-additive SAR.  The authors used a method previously published by Kramer to identify matched molecular pairs with additive and non-additive SAR.  The pairs were then used to construct subsets of varying predictive difficulty.  The analysis considered datasets for four assays: LogD, solubility in DMSO, clearance in liver microsomes, and permeability (cell line not specified).  In most cases, the DNN outperformed the other methods on both the additive and non-additive subsets. 

Simple nearest-neighbour analysis meets the accuracy of compound potency predictions using complex machine learning models

One of my favorite papers of 2022 was published by a group from Eindhoven University of Technology.  The intent of this paper was somewhat similar to that of the paper above.  The authors constructed datasets containing activity cliffs and evaluated the ability of ML models to predict the activity of these realistic but challenging sets.  The analysis included many traditional ML methods, including RF, gradient boosting machines (GBM), SVR, and kNN.  To benchmark both algorithms and representations, the authors coupled several fingerprints and descriptor sets with the algorithms mentioned earlier.  In addition, the authors included a wide array of DNN methods, including message-passing neural networks (MPNN), graph convolutional neural networks (GCN), graph attention networks (GAT), and attentive fingerprints (AFP).  The methods were assessed based on their ability to predict bioactivity using 30 datasets from ChEMBL.  The authors reported the root mean squared error (RMSE) and RMSEcliff, calculated over activity cliff molecules.  In a result that some may find surprising, the authors reported that SVR, GBM, and RF coupled with ECFPs provided the best performance on both the overall and the activity cliff datasets. 

Exposing the Limitations of Molecular Machine Learning with Activity Cliffs
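The evaluation protocol in these benchmarking papers — train a traditional model on fingerprints, then report RMSE both overall and restricted to the activity-cliff compounds — can be sketched in a few lines. This is a minimal illustration with synthetic data standing in for ECFP bit vectors and pIC50 values; the array shapes, the random cliff flags, and the train/test split are my own choices, not those of the authors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for ECFP bit vectors and measured pIC50 values.
X = rng.integers(0, 2, size=(500, 256)).astype(float)
y = X @ rng.normal(size=256) * 0.05 + rng.normal(scale=0.3, size=500)
cliff = rng.random(500) < 0.2  # hypothetical activity-cliff flags

# Train on the first 400 molecules, evaluate on the remaining 100.
train, test = np.arange(400), np.arange(400, 500)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[train], y[train])
pred = model.predict(X[test])

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

overall = rmse(y[test], pred)                               # RMSE, all test molecules
on_cliffs = rmse(y[test][cliff[test]], pred[cliff[test]])   # RMSEcliff, cliff subset only
```

With real data, `X` would come from a fingerprint generator and `cliff` from a matched-molecular-pair analysis; the two-metric reporting is the part worth copying.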

Perspective - These papers were important for two reasons.  First, they helped to dispel the perception that DNNs are always the best choice for QSAR models.  Second, they introduced new, more realistic benchmark datasets and strategies that will hopefully supplant some of the flawed benchmarks in everyday use. 

2. Deep Learning Methods Provide New Approaches to Protein-Ligand Docking

In 2022 we saw the advent of a new approach to protein-ligand docking.  Until recently, most docking programs used an empirical or physics-based scoring function to search binding poses within a predefined protein region, typically a box around the binding site.  A new generation of docking programs instead uses the structures of existing protein-ligand complexes to learn relationships between ligands and protein binding sites and to search the entire protein surface.  In essence, these programs solve two problems simultaneously: identifying a binding site and determining the docked pose of a ligand.  

One of the first examples of this approach was the EquiBind docking program from MIT, which uses deep learning to align sets of points on a protein with corresponding points on a ligand.  

EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction

A group from Galixir Technologies extended this approach with a docking program called TANKBind that evaluates docking poses across multiple sites on the protein and chooses the highest-scoring pose. 

TANKBind: Trigonometry-Aware Neural NetworKs for Drug-Protein Binding Structure Prediction

Researchers from the Mila - Quebec AI Institute and the University of Montreal published a preprint describing E3Bind, a docking approach inspired by AlphaFold2.  This approach utilizes a combination of three embeddings describing a protein graph, a ligand graph, and a protein-ligand graph which is iteratively refined to generate docking poses. 

E3Bind: An End-to-End Equivariant Network for Protein-Ligand Docking

The team that created EquiBind has developed a new generative approach to the docking problem. Their method, DiffDock, uses diffusion to iteratively search a space of translations, rotations, and torsional variations. The search process in DiffDock is guided by a novel confidence score that enables a choice among multiple poses. 

DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

Graph Neural Networks (GNNs) have gained widespread use in various subfields of drug discovery. One example is the MedusaGraph method, which uses GNNs to predict protein-ligand docking. This method involves two GNNs: the Pose Prediction GNN, which suggests possible binding poses, and the Pose Selection GNN, which performs binary classification to evaluate the quality of a given pose.

Predicting Protein–Ligand Docking Structure with Graph Neural Network

Perspective - For almost 30 years, docking has been used in structure-based drug design. Until recently, there have been few changes to how docking algorithms work. These algorithms usually generate a set of poses, which are then evaluated using scoring functions that combine physics-based and empirical terms. Recently, advances in deep neural networks (DNNs) have led to the development of new docking programs that use data from the Protein Data Bank (PDB) to train functions that can identify binding sites and propose and evaluate binding poses. While these methods are powerful, they can be difficult to benchmark because it is hard to know if the method is discovering new interactions or simply transferring information from similar binding sites and ligands. It will be interesting to see how these methods perform when faced with novel ligands and binding sites that have not been seen before. There is also the possibility of combining these approaches with existing physics-based and empirical methods.

3. Protein Structure Prediction - Pushing AlphaFold2 in New Directions

Without a doubt, the highest-profile application of ML in 2021 was AlphaFold2.  The DeepMind group stunned the protein structure prediction (PSP) world by dominating the CASP14 challenge.  Following the publication of the original AlphaFold2 paper, the field has exploded.  Every week, we see another paper describing [insert name of] Fold.  I’m sure someone (not me) will write an entire review of the advances in PSP during 2022.  Instead, I’ll focus on a few specific applications of PSP to molecular modeling. 

Several papers have shown that one can generate multiple protein conformations by modifying the multiple sequence alignment (MSA) that AlphaFold2 uses as an initial step in structure generation.  In a preprint from late 2022, Wayment-Steele described an approach that clustered the MSA by sequence similarity and used the clusters to generate multiple protein conformational states.  An implementation of this method, known as AF-Cluster, is available on GitHub. 

Prediction of multiple conformational states by combining sequence clustering with AlphaFold2

While several groups have demonstrated the biological relevance of the alternate protein conformations generated by AlphaFold2, more work must be done to assess the energetics and relative populations of these conformational ensembles.  A recent paper from the Tiwary group addresses this need by using ML-augmented molecular dynamics to generate a Boltzmann-weighted ensemble of protein conformations.  

AlphaFold2-RAVE: From sequence to Boltzmann ensemble

When presented with the AlphaFold2 results from CASP14, one of the first questions computational chemists asked was whether structures generated by PSP could be used for molecular modeling.  While we don’t have definitive answers, a few papers provide some necessary first steps.  

Free Energy Perturbation (FEP) calculations have become a mainstay of lead optimization efforts.  The prevailing wisdom is that a high-resolution cocrystal structure is necessary to achieve a good correlation between predicted and experimental binding affinity.  A paper by Beuming and coworkers challenges this assumption by examining the utility of substituting an AlphaFold2 structure for an x-ray structure in FEP calculations. 

Are Deep Learning Structural Models Sufficiently Accurate for Free-Energy Calculations? Application of FEP+ to AlphaFold2-Predicted Structures 

Based on the paper above, it appears that structures from AlphaFold2 can have some utility in FEP calculations, where molecular dynamics simulations allow for some sidechain rearrangement. However, when comparing AlphaFold2 structures to protein crystal structures for docking, a group at Scripps found that AlphaFold2 structures do not have the necessary resolution in the side chains for accurate docking calculations.  Perhaps not surprisingly, the success rate when docking into AlphaFold2 structures (17%) was less than the success rate docking into holo structures (41%).   While this result was less than spectacular, it was considerably better than the docking success rate (10%) the authors achieved with apo x-ray structures. 

Evaluation of AlphaFold2 structures as docking targets

In many therapeutic areas, including oncology and genetic disease, the ability to understand the structural impact of missense mutations can facilitate the design of therapeutics.  Over the last year or two, there have been differing views on the ability of AlphaFold2 to reliably model missense mutations.  A group from the NCI compared AlphaFold2 structures with x-ray structures for three systems where x-ray structures of the wild-type protein were available and specific structure-disrupting mutations existed.  In all three cases, AlphaFold2 predicted similar structures for mutant and wild-type, failing to recognize the structure-disrupting mutation. 

Can AlphaFold2 predict the impact of missense mutations on structure?

This finding is consistent with a 2021 paper from the Skolkovo Institute of Science and Technology, which found that AlphaFold2 structures did not reproduce experimentally observed changes in protein stability or fluorescence associated with single mutations. 

Using AlphaFold to predict the impact of single mutations on protein stability and function

However, all may not be lost.  A recent preprint from the Baker group provides promising evidence that RosettaFold can predict the structural impact of protein mutations. 

Accurate Mutation Effect Prediction using RoseTTAFold

Perspective - 2022 was the year of AlphaFold in PSP.  While the team from DeepMind didn’t compete in CASP15, all of the best-performing entries were variants on AlphaFold2.  It’s been great to see how AlphaFold2 has been extended and applied to a wide variety of problems in drug discovery.  I look forward to seeing what develops in 2023. 

4. Model Interpretability

While ML models can efficiently select and prioritize molecules for synthesis or purchase, most models operate as “black boxes” that take chemical structures as input and generate predictions as output.  Ideally, we’d like to have interpretable models that provide insights to motivate the design of subsequent compounds.  Interpretable models could provide several advantages. 
  • Engender confidence among the team.  Experimentalists will be much more likely to “buy in” if they understand the reasoning behind a prediction.  
  • Enable the debugging of models.  If we understand the reasoning behind a prediction, we may be able to make adjustments to improve the model. 
  • Facilitate an understanding of the underlying science.  Ultimately we would like to understand the links between chemical structure and a physical or biological endpoint.  An interpretable model could potentially help to illuminate the underlying physical processes.
A recent paper from Sanofi-Aventis and Matthias Rarey provides an extensive overview of the application of explainable artificial intelligence (XAI) to lead optimization datasets.  Several XAI methods are demonstrated, along with a heatmap visualization highlighting features that are critical for activity. 

Interpretation of Structure−Activity Relationships in Real-World Drug Design Data Sets Using Explainable Artificial Intelligence

Some of my favorite work this year came from Andrew White’s group at Rochester.  In one paper, they applied counterfactuals, a technique used to explain machine learning models in several areas, including credit risk assessment.  

Model agnostic generation of counterfactual explanations for molecules

In a second paper, the White group used a language model to create text-based explanations for machine learning model predictions.  While the approach is interesting, the examples in the paper are not quite at a level that would convince a medicinal chemist. 

Explaining molecular properties with natural language

Finally, the White group published a perspective preprint, released at the end of 2022, which provides a comprehensive overview of interpretable models in QSAR and other fields.  

A Perspective On Explanations Of Molecular Prediction Models

Shapley values have been used to assess the contributions of features in machine learning models.  Two recent papers from the Bajorath group demonstrate how this technique can be applied to machine learning for molecules. 

EdgeSHAPer: Bond-centric Shapley value-based explanation method for graph neural networks

Calculation of exact Shapley values for support vector machines with Tanimoto kernel enables model interpretation
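The Bajorath group’s exact-Shapley paper prompted me to include a sketch of what an exact Shapley computation looks like. The brute-force coalition enumeration below is the generic textbook formulation, not their Tanimoto-kernel derivation, and it is only feasible for a handful of features; the model, instance, and baseline are toy values of my own.

```python
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating every feature coalition.

    f: model mapping a feature vector to a scalar prediction.
    x: the instance to explain.
    baseline: reference feature values ("feature absent")."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Standard Shapley weight for a coalition of size |S|.
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                z_with, z_without = baseline.copy(), baseline.copy()
                z_with[list(S) + [i]] = x[list(S) + [i]]
                z_without[list(S)] = x[list(S)]
                phi[i] += weight * (f(z_with) - f(z_without))
    return phi

# Sanity check: for a linear model, feature i's Shapley value
# is w_i * (x_i - baseline_i).
w = np.array([1.0, -2.0, 0.5])
f = lambda z: float(z @ w)
phi = shapley_values(f, np.ones(3), np.zeros(3))
```

The exponential cost of this enumeration is exactly why closed-form results like the Tanimoto-kernel derivation, or approximations like SHAP, matter in practice.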

To effectively associate chemical structure with ML model predictions, we need software tools that will enable us to visualize the mapping of model predictions onto chemical structures.  A paper from Bayer describes an open-source tool for interpreting ML models and visualizing atomic contributions. 

ChemInformatics Model Explorer (CIME): exploratory analysis of chemical model explanations

Perspective - While model interpretability has become a component of several research efforts, we have yet to arrive at readily actionable models.  Hopefully, as the field progresses, we’ll reach a point where insights from interpretable models will provide clear directions for optimization. 

5. QM Methods 

Over the past few years, several groups have developed machine learning methods to rapidly reproduce quantum chemical potentials.  While these methods were scientifically interesting, their practical application was somewhat limited.  A new package, Auto3D, from the Isayev group at Carnegie Mellon, could change this situation by making learned quantum chemical potentials very easy to use.  Auto3D accepts SMILES as input, generates and evaluates ensembles of 3D conformations, and provides an energetic ranking for tautomers and stereoisomers.

Auto3D: Automatic Generation of the Low-Energy 3D Structures with ANI Neural Network Potentials.

Perspective - While QM methods are a bit outside my wheelhouse, I was excited to see an implementation that made it easy to perform critical calculations.

6. Ultralarge Chemical Libraries

One of the biggest game changers in virtual screening has been the availability of synthesis-on-demand libraries like Enamine REAL, WuXi GalaXi, Otava CHEMriya, and eMolecules eXplore.  These libraries, consisting of billions of molecules available for rapid delivery (a few weeks) at a reasonable cost, have caused many of us to rethink our approaches to virtual screening.  Brute-force approaches, applied to datasets containing millions of molecules, are no longer relevant when considering libraries containing tens of billions. 

I’d recommend watching the videos from the NIH Symposium on Ultra-large Chemical Libraries for those looking for a good overview of the field.  I realize this symposium occurred in December 2020, but many people may have missed it.  If you only have time to watch one talk from this symposium, check out the one from Roger Sayle.    

NIH Symposium on Ultra-large Chemical Libraries                                                                      

Wendy Warr published a detailed set of notes covering the NIH symposium. 

In addition, this collaborative paper from several presenters covers much of the work presented at the NIH symposium and provides an excellent overview of the field.  

Exploration of Ultralarge Compound Collections for Drug Discovery

As the sizes of chemical libraries get into the tens of billions, simple tasks like determining which molecules are common to two libraries become cumbersome and time-consuming.  Rather than performing pairwise comparisons of billions of product molecules, one can compare the chemical building blocks used to construct the libraries.  However, since multiple chemical routes can lead to the same set of products, it’s crucial to employ a method that uses an appropriate fragmentation strategy.  A paper from Matthias Rarey’s group used such a fragmentation strategy to develop a software tool called SpaceCompare, which they subsequently used to compare the REAL, GalaXi, and CHEMriya databases.  One surprising conclusion from this work is the very low overlap between molecules in the three databases.  The largest overlap between any two databases was less than 2% of the total. 

Comparison of Combinatorial Fragment Spaces and Its Application to Ultralarge Make-on-Demand Compound Catalogs
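The idea of comparing libraries at the building-block level rather than the product level can be shown with a toy example. The building-block IDs below are made up, and real tools like SpaceCompare use a much more careful fragmentation scheme to handle products reachable by multiple routes, but the set arithmetic is the essential trick:

```python
# Represent each library as a set of products, where a product is the
# (unordered) combination of building blocks used to make it, rather than
# an enumerated structure. All building-block IDs here are hypothetical.
def products(bb_site1, bb_site2):
    return {frozenset((a, b)) for a in bb_site1 for b in bb_site2}

lib_one = products({"amine_1", "amine_2", "amine_3"}, {"acid_1", "acid_2"})
lib_two = products({"amine_2", "amine_3", "amine_4"}, {"acid_2", "acid_3"})

# Overlap falls out of ordinary set intersection, with no pairwise
# structure comparison over the enumerated products.
shared = lib_one & lib_two
overlap = len(shared) / min(len(lib_one), len(lib_two))
```

At real scale, the same comparison runs over thousands of building blocks per site instead of billions of products, which is what makes it tractable.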

The seemingly simple calculation of physical properties can be impractical when dealing with billions of molecules.  Another recent publication from the Rarey group describes a method called SpaceProp that derives property distributions of large enumerated sets from the properties of the constituent topological fragments. 

Calculating and Optimizing Physicochemical Property Distributions of Large Combinatorial Fragment Spaces
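The core observation behind fragment-space property calculations is that, for additive properties, statistics of the product distribution follow from the fragments without enumeration. Here is a toy demonstration using molecular weight with random fragment values; the real method handles full distributions, attachment-point corrections, and non-additive properties, none of which are captured here.

```python
import numpy as np

rng = np.random.default_rng(1)
mw_a = rng.uniform(100.0, 300.0, size=50)  # hypothetical fragment MWs, site A
mw_b = rng.uniform(80.0, 250.0, size=60)   # hypothetical fragment MWs, site B

# Brute force: enumerate all 50 * 60 = 3000 products and sum fragment weights.
enumerated = (mw_a[:, None] + mw_b[None, :]).ravel()

# Fragment-space shortcut: moments of the product distribution follow
# directly from the moments of the fragment sets.
mean_shortcut = mw_a.mean() + mw_b.mean()
var_shortcut = mw_a.var() + mw_b.var()
```

For a full Cartesian product the cross terms cancel exactly, so the shortcut mean and variance match the enumerated values, and the cost scales with the number of fragments rather than the number of products.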

Perspective - Over the last five years, we’ve seen the number of commercially available molecules grow from 1 billion to more than 60 billion.  To keep up, the field needs to develop new methods that allow structure-based or ligand-based searches of these ultra-large libraries.  The work published in 2022 was a good start, but there’s much more to do. 

7. Active Learning

Active learning is an iterative technique that enables researchers to search through large chemical spaces efficiently.  The approach uses a machine learning model to select which datapoints to label next, progressively exploring a particular chemical space.  In earlier work by Yang, Berenger, and others, active learning was used to direct docking calculations on large chemical libraries.  The process begins with an initial sample selected at random or through some other means, such as clustering.  This initial subset of molecules is docked, and the chemical structures and docking scores are used to train a machine learning model.  The model then generates predictions for the larger database, and these predictions are used to select the next set of molecules to dock.  After a few iterations, the active learning process identifies the molecules to be carried to the next step.  
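The loop described above can be sketched in a few lines. A random-forest surrogate is trained on everything "docked" so far and used to pick the next batch; the docking oracle here is a synthetic function, and the batch sizes and greedy top-scoring selection rule are illustrative choices of mine, not those of any particular paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic library: bit-vector "fingerprints" and a made-up docking oracle.
X = rng.integers(0, 2, size=(2000, 128)).astype(float)
true_score = X @ rng.normal(size=128) + rng.normal(scale=0.5, size=2000)

def dock(indices):
    """Stand-in for an expensive docking calculation."""
    return true_score[indices]

# 1. Start from a random initial sample and dock it.
selected = list(rng.choice(2000, size=100, replace=False))
scores = list(dock(selected))

for _ in range(4):
    # 2. Train a surrogate model on everything docked so far.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[selected], scores)
    # 3. Predict the rest of the library, then dock the top-predicted batch.
    remaining = np.setdiff1d(np.arange(2000), selected)
    preds = model.predict(X[remaining])
    batch = remaining[np.argsort(preds)[-100:]]
    selected.extend(batch.tolist())
    scores.extend(dock(batch).tolist())
```

After five rounds only 500 of 2,000 molecules have been "docked"; in a real campaign the oracle is the docking (or FEP) engine, and the selection rule usually blends exploitation with some exploration.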

While machine learning is orders of magnitude faster than methods like docking, inference on a table with billions of rows is non-trivial.  This paper by the Coley group at MIT provides a method of pruning a large dataset and efficiently reducing the number of molecules to be predicted by a machine learning algorithm. 

Self-Focusing Virtual Screening with Active Design Space Pruning

Free energy perturbation (FEP) calculations have become a mainstay of computationally driven structure-based drug discovery programs.  While these calculations are powerful, they are also computationally expensive, with a single calculation taking several hours to complete.  One way of overcoming these computational limitations is to use active learning coupled with FEP to search through large chemical libraries. Following on the heels of a 2019 paper by Konze, several groups have explored the application of active learning to free energy calculations.  

Chemical Space Exploration with Active Learning and Alchemical Free Energies

Active Learning Guided Drug Design Lead Optimization Based on Relative Binding Free Energy Modeling

Optimizing active learning for free energy calculations

Another aspect of active learning that has received some attention is the sampling strategy used to select molecules.  A few groups have reported alternative approaches to guide selections.  A team from Exscientia published a method known as Coverage Score that uses Bayesian optimization and information entropy to balance exploration and exploitation in the active learning process.  

Coverage Score: A Model Agnostic Method to Efficiently Explore Chemical Space

One challenge in the early stages of drug discovery is deciding which compounds to progress based on somewhat noisy primary assay data.  A team from the University of Cambridge published a Bayesian active learning approach that considers the inherent noise in assay data. 

Batched Bayesian Optimization for Drug Design in Noisy Environments

Perspective - The advent of synthesis-on-demand libraries like Enamine REAL, WuXi GalaXi, and Otava’s CHEMriya has expanded the scope of virtual screening.  Unfortunately, even with the availability of inexpensive cloud computing resources, virtual screens with billions of molecules can be quite expensive.  Active learning provides an efficient means of docking ultra-large databases.  In addition, active learning can enable FEP calculations on libraries of thousands of molecules.  As work on these techniques progresses, they will become commonplace and be integrated into experimental workflows. 

8. Molecular Representation

I believe a successful ML effort consists of three elements: the data, the representation, and the algorithm.  While a great deal of recent work has focused on algorithms, molecular representation has received limited attention.  In early applications of ML in drug discovery, molecules were represented by fingerprints where positions in a vector represented the presence, absence, or count of a particular molecular feature.  The advent of CNNs and GNNs led to the emergence of learned molecular representations.  While these learned representations have the potential to outperform fingerprint-based models, their superiority has yet to be demonstrated.  As mentioned in the first section of this review, recent results have shown that more traditional ML methods using fingerprint representations provide performance equivalent and sometimes superior to that of more sophisticated techniques. 

A recent paper from Deng and coworkers at Stony Brook University provides an excellent overview of the three prevailing approaches to molecular representation: fingerprints and self-supervised representations based on SMILES and molecular graphs.  The authors thoroughly review molecular representation and several confounding factors that must be considered when comparing representations and algorithms.

Taking a Respite from Representation Learning for Molecular Property Prediction.

One of the key ML advances of 2022 was the emergence of large language models (LLMs).  The viral status of software tools like ChatGPT brought LLMs into the public consciousness.  Several groups published papers showing how LLMs can be trained on libraries of SMILES strings to produce chemical language models for QSAR and generative applications.  To date, however, the performance of these models on molecular property prediction has been less than spectacular: on a few somewhat flawed benchmarks, they have been roughly equivalent to more widely used techniques.  The application of LLMs to molecular property prediction is still in its infancy, and it’s hoped that future developments will lead to new approaches to molecular representation.
ChemBERTa-2: Towards Chemical Foundation Models

BARTSmiles: Generative Masked Language Models for Molecular Representations

Large-Scale Chemical Language Representations Capture Molecular Structure and Properties

Infusing Linguistic Knowledge of SMILES into Chemical Language Models
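Chemical language models like those above start by tokenizing SMILES strings. A common approach in this literature is a regular expression that keeps bracket atoms and two-letter elements intact; the pattern below is a simplified variant of my own, not the tokenizer from any of these papers.

```python
import re

# Regex tokenizer for SMILES: bracket atoms, two-letter halogens, ring-bond
# numbers, and single-character tokens. A simplified sketch; real chemical
# language models often use more complete vocabularies.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|%\d{2}|\d|[()=#\-+/\\.@:~*$])"
)

def tokenize(smiles: str) -> list[str]:
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: every character must belong to some token.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

print(tokenize("Clc1ccc(Br)cc1"))   # aromatic ring with two halogens
print(tokenize("C[C@H](N)C(=O)O"))  # alanine; the stereocenter stays one token
```

The resulting token sequences are what get fed to a transformer in place of natural-language words, which is why tokenization choices can quietly affect downstream QSAR performance.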

Perspective - Over the last decade, we’ve seen the advent of several neural network approaches to molecular representation.  While these approaches promise to provide an additional level of abstraction, there has yet to be a clear demonstration of their superiority to earlier approaches that use molecular descriptors and fingerprints.  As this field continues to develop, it is hoped that incorporating 3D information will enable representations that more fully capture the underlying molecular interactions.  The new benchmarks discussed in Section 1 should provide a more rigorous means of comparing methods than existing benchmark sets. 


