AI in Drug Discovery 2023 - A Highly Opinionated Literature Review (Part II)


Picking up where we left off in Part I, this post covers several other ML in drug discovery topics that interested me in 2023.  Some areas, like large language models, are new, and most of the work is at the proof-of-concept stage.  Others, like active learning, are more mature, and several groups are starting to explore nuances of the methods.   Here’s the structure of Part II. 

4. Large Language Models
5. Active Learning
6. Federated Learning
7. Generative Models
8. Explainable AI
9. Other Stuff

4. Large Language Models

The emergence of GPT-4 and ChatGPT brought considerable attention to large language models (LLMs) in 2023.  In November and December, several large pharmas held “AI Day” presentations featuring LLM applications for clinical trial data analysis. Many of these groups demonstrated the ability of LLMs to ingest large bodies of unstructured clinical data and subsequently generate tables and reports based on natural language queries.  Aside from some very brief demos on code generation and literature searches, mentions of LLM applications in preclinical research were scarce. 

Most of the LLM activity in the drug discovery space in 2023 was reported as preprints from academic groups. A preprint from Microsoft AI Research provided a broad catalog of potential LLM applications in the physical and life sciences.  Most of the drug discovery examples were underwhelming.  The authors provided examples where GPT-4 could supply a SMILES string, IUPAC name, or descriptive text for a marketed drug.  This is interesting, but a simple Google search can perform the same tasks. GPT-4 could provide the sequence and binding sites for the SARS-CoV-2 main protease, but this information can also be readily accessed through the PDB.  When given the prompt “Please estimate the binding affinity between the drug Afatinib and target EGFR”, GPT-4 suggested a docking study, which is definitely not a wise choice.  I found it puzzling when the Microsoft team referred to GPT-4’s “understanding of fundamental drug discovery concepts”.  The responses seemed more like a “stochastic parrot” that could grab information and key phrases from online documents. 

The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
https://arxiv.org/abs/2311.07361

In an editorial in Nature Chemistry, White proposed that “The Future of Chemistry is Language”.  He pointed out that LLMs like GPT-4 can translate between molecular formats, look up chemical names and properties, and provide input files for quantum chemical calculations.  He described how LLMs can summarize complex documents and streamline workflows. White also highlighted the challenges arising from the tendency of LLMs to hallucinate and generate incorrect text.  He suggested several strategies for addressing hallucinations, including providing the LLM access to domain-specific background information and restructuring how prompts are constructed.  

The future of chemistry is language
https://www.nature.com/articles/s41570-023-00502-0

As mentioned above, one of the primary drawbacks of LLMs is their tendency to hallucinate and produce incorrect results. This problem is compounded by the LLM’s inability to provide attribution for answers.  ChatGPT will happily provide citations when asked for references supporting its assertions.  The problem is that these references, while appearing valid, often point to papers that don’t exist.  A paper by Lála and coworkers from FutureHouse and the Francis Crick Institute addresses these limitations with PaperQA, a Retrieval-Augmented Generation (RAG) agent for the scientific literature.  PaperQA begins by constructing LLM search queries from a set of keywords.  The results of these searches are aggregated into a vector database and combined with a pre-trained LLM to create a summary of the search results. In benchmark comparisons, the differences between answers provided by PaperQA and human evaluators were similar to differences between individual human evaluators.  Encouragingly, unlike many other LLMs, PaperQA didn’t hallucinate citations. 

PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
https://arxiv.org/abs/2312.07559
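
Since retrieval is doing much of the heavy lifting in a RAG system, a toy sketch of that step may help.  This is a schematic, not PaperQA's actual implementation: embed the documents and the query, rank documents by cosine similarity, and prepend the top hits to the prompt.  The random vectors below stand in for the output of a real embedding model.

```python
import numpy as np

rng = np.random.default_rng(42)
docs = ["paper A abstract", "paper B abstract", "paper C abstract"]
doc_vecs = rng.normal(size=(len(docs), 128))   # stand-in document embeddings
query_vec = rng.normal(size=128)               # stand-in query embedding

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank documents by similarity to the query and keep the top two.
scores = [cosine(query_vec, v) for v in doc_vecs]
top_k = np.argsort(scores)[::-1][:2]
context = "\n".join(docs[i] for i in top_k)

# The retrieved text is prepended to the question sent to the LLM,
# grounding the answer in real sources rather than model memory.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
print(prompt)
```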

One means of enhancing the ability of LLMs to perform domain-specific tasks is to provide a set of “helper” programs that the LLM can call. A paper by Bran and coworkers from EPFL and the University of Rochester presented ChemCrow, a system for integrating Chemistry capabilities into LLMs. ChemCrow provides software tools for performing domain-specific tasks, including web searches, file format conversions, and similarity searches.  Compared with GPT-4, ChemCrow provided superior performance on tasks like synthetic route planning.  The authors also point to potential misuse of LLMs and suggest mitigation strategies. 

ChemCrow: Augmenting large-language models with chemistry tools
https://arxiv.org/abs/2304.05376

Given the complexity of programming multiple instruments and identifying appropriate experimental conditions, laboratory automation is another area that can benefit from applying LLMs.  A paper by Boiko and coworkers from Carnegie Mellon University presented Coscientist, an LLM-based system for designing and executing organic syntheses.  Coscientist consists of four components designed to search the web, write Python code, extract information from documentation, and program laboratory robotics.  The authors test Coscientist using several open and closed-source LLMs and present examples of the system's ability to plan and execute simple organic syntheses. 

Autonomous chemical research with large language models
https://www.nature.com/articles/s41586-023-06792-0

Perspective:  It’s early days for LLMs, and I think it’s a stretch to say that GPT-4 or any other LLM understands Chemistry.  At this point, LLMs seem to have two general use cases.  The first is summarization and information retrieval.  LLMs can parse vast collections of text, which can be queried using natural language.  These information retrieval capabilities have many applications, from writing computer code and collating clinical trial results to summarizing papers on a specific topic.  While there are still issues with LLMs hallucinating and providing incorrect information, tools and strategies are being developed to ensure the validity of LLM responses.  The other area where LLMs appear to be making inroads is workflow management.  Many activities in drug discovery, whether computational or experimental, require long sequences of steps, which can be tedious to orchestrate.  While it is often possible to script the execution of these steps, scripting requires a detailed knowledge of each step.  LLMs have the potential to simplify this process and carry out multi-step procedures given only a set of initial conditions and a final objective.  While the amount of progress the field has made in a short time is impressive, I don’t see LLMs replacing scientists any time soon. 

5.  Active Learning

Active learning (AL) provides a means of prioritizing computationally expensive calculations and enabling the processing of large datasets.  In a typical AL workflow, a machine learning model is used as a surrogate for a more expensive calculation.  For example, consider free energy calculations, which are accurate but typically require several hours of calculation to estimate the binding affinity of one molecule.   One can begin by selecting a small subset (let’s say 100) from a larger library of molecules and running free energy calculations on the molecules in the subset.  Molecular descriptors derived from the subset molecules can then be used to build an ML model that relates these descriptors to a relative or absolute binding free energy.  The ML model can predict the remaining molecules' free energies and prioritize the next set of free energy calculations.  By repeating this process several times, many of the most promising molecules can be identified by evaluating a small fraction of a larger dataset.  In 2022, several papers, including one of ours, showed the promise of active learning approaches in drug discovery.  
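
For readers who prefer code, here is a minimal sketch of one flavor of this loop, with a cheap synthetic function standing in for the expensive free energy calculation (higher values are better here) and a scikit-learn random forest as the surrogate.  The greedy acquisition shown is the simplest possible choice; the papers discussed below compare several alternatives.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 32))              # descriptors for the full library

def expensive_oracle(x):
    # Stand-in for a free energy calculation (higher = better).
    return x[:, :4].sum(axis=1) + rng.normal(scale=0.1, size=len(x))

selected = list(rng.choice(len(X), size=100, replace=False))   # initial subset
y_known = expensive_oracle(X[selected])

for cycle in range(5):
    # Train a surrogate on everything evaluated so far.
    model = RandomForestRegressor(n_estimators=100).fit(X[selected], y_known)
    preds = model.predict(X)
    preds[selected] = -np.inf                  # never re-select known molecules
    batch = np.argsort(preds)[::-1][:100]      # greedy: top predicted scores
    selected.extend(batch)
    y_known = np.concatenate([y_known, expensive_oracle(X[batch])])
```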

In 2023, several papers described active learning applications.  Beyond the applications, a preprint by van Tilborg and Grisoni investigated the methodology itself.  The authors compared six active learning approaches on three benchmark datasets and concluded that the acquisition function is critical to AL performance. When comparing molecular representations, they found that fingerprints generalized better than graph neural networks.  Consistent with previous studies, they found that the choice of the initial training set had little impact on the outcome of an AL run. 

Traversing Chemical Space with Active Deep Learning
https://chemrxiv.org/engage/chemrxiv/article-details/655f22eecf8b3c3cd7ef43d7

A preprint by Gorantla and coworkers from the University of Edinburgh and Exscientia took a similar approach. They evaluated two machine learning models, Gaussian Process (GP) regression and the message-passing neural network ChemProp (CP), with different AL initialization strategies and acquisition functions.  These methods were evaluated on four different datasets from the literature.  The authors found that the diversity of the dataset was the most critical factor for the success of AL methods.  Datasets with small numbers of scaffolds and large numbers of substitutions tended to produce the best results.  With more diverse datasets, they found that increasing the batch size on each AL cycle improved the recovery of the best molecules. Interestingly, the authors found that while the performance of CP was degraded when synthetic noise was added to the data, GP could still identify the most promising regions of chemical space.

Benchmarking active learning protocols for ligand binding affinity prediction
https://www.biorxiv.org/content/10.1101/2023.11.24.568570v1

One factor driving the growth of AL is the emergence of synthesis-on-demand libraries like Enamine REAL and WuXi GalaXi.  As these libraries have grown to billions of molecules, it is no longer cost-effective to exhaustively evaluate every molecule.  A review by Kuan and coworkers from the University of British Columbia and the University of Ottawa provides an overview of several approaches applied to the virtual screening of synthesis-on-demand libraries.  The authors discuss fragment-based methods, ML emulators, and active learning.  The paper is well-referenced and provides a good introduction for those new to the field. 

Keeping pace with the explosive growth of chemical libraries with structure-based virtual screening
https://wires.onlinelibrary.wiley.com/doi/10.1002/wcms.1678

A paper by Sivula and colleagues from the University of Eastern Finland and Orion Pharma describes applying their active learning method, HASTEN, to dock the 1.56 billion molecule Enamine REAL lead-like library into two different protein structures.  In both cases, HASTEN identified 90% of the top 100 hits by docking only 1% of the library.  In addition to making their code available on GitHub, the authors provided an enumerated set of conformers for the Enamine library prepared for the Glide docking program. 

Machine Learning-Boosted Docking Enables the Efficient Structure-Based Virtual Screening of Giga-Scale Enumerated Chemical Libraries
https://pubs.acs.org/doi/10.1021/acs.jcim.3c01239

A different type of AL approach, called PyRMD2Dock, was reported in a paper by Roggia and colleagues from the University of Campania Luigi Vanvitelli.  Their approach is similar to other active learning methods in the initial stage.  A set of 1 million molecules is sampled from an ultra-large database, and this subset is docked using AutoDock Vina.  A docking score threshold is then used to classify molecules as active or inactive, and this data is used to train PyRMD, a ligand-based virtual screening tool developed by the authors. The trained PyRMD model screens the ultra-large database and selects a subset of molecules to be docked. Finally, the molecules selected by PyRMD are docked, and the corresponding scores are used to prioritize compound purchases.  

Streamlining Large Chemical Library Docking with Artificial Intelligence: the PyRMD2Dock Approach
https://pubs.acs.org/doi/10.1021/acs.jcim.3c00647

Several recent papers have shown that AL isn’t particularly sensitive to the ML model used.  In our 2022 paper, we evaluated five different ML models and saw minimal differences in performance.  A paper by Marin and coworkers from the Moscow Institute of Physics and Technology took this idea to the extreme and demonstrated that simple linear regression performed well when predicting docking scores from molecular fingerprints.  The authors found that linear regression performed as well as or better than methods like random forest or SVM regression.   Moreover, linear regression was more than 100 times faster than these more sophisticated ML methods.  These speed gains become significant when performing multiple cycles of inference on databases with billions of molecules.

Regression-Based Active Learning for Accessible Acceleration of Ultra-Large Library Docking
https://pubs.acs.org/doi/10.1021/acs.jcim.3c01661
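
A sketch of the approach, with invented docking scores standing in for the results of a first docking round (in practice X would hold fingerprints for thousands of docked molecules):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LinearRegression

# Linear regression on Morgan fingerprint bits as a docking surrogate.
smiles = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
X = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024))
              for m in mols])
y = np.array([-6.1, -7.3, -8.2, -5.9])        # made-up docking scores

model = LinearRegression().fit(X, y)
# Inference is a single matrix multiply plus an intercept, which is
# why this scales so cheaply to billions of molecules.
print(model.predict(X))
```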

Perspective: Active learning is a simple yet powerful technique that enables computational access to large and ultra-large (I’m not sure where the cutoff is) collections of molecules.  Papers published in 2023 showed how robust the method is.  Multiple studies showed that while the choice of the acquisition function matters, the method is largely insensitive to the composition of the initial training set.  These papers also showed that AL can be applied to various ML methods and molecular representations.  As the field progresses, it will be interesting to see whether similar approaches can be applied to high-throughput experimentation. 

6. Federated Learning

An editorial in Nature by Mock and coworkers from Amgen reflected on some recent advances brought about by ML in drug discovery and pointed out that the success of ML models hinges on the availability of high-quality data.  While large pharmaceutical companies have data on thousands or even millions of compounds, this data is rarely shared due to intellectual property concerns.  The authors suggest that federated learning provides one potential solution to the data-sharing conundrum.  With federated learning, chemical structures and assay data from multiple companies are encoded by a neutral third party, and the resulting molecular representation is subsequently used for model building.  In principle, this enables multiple groups to share data without compromising intellectual property. 

AI can help to speed up drug discovery — but only if we give it the right data
https://www.nature.com/articles/d41586-023-02896-9

In the largest example of federated learning reported to date, ten European pharmaceutical companies joined forces in the MELLODDY consortium.  Data from 21 million molecules and more than 40,000 assays was securely combined into a multitask model.  For classification models, the participants reported a median 2.5-7.5% increase in AUC-PR over internal models.  For regression models, the median gain was a more modest 1.8%.  While the participants reported a 10% increase in the applicability domain based on a conformal metric, the direct benefits of this improvement are difficult to measure. 

MELLODDY: Cross-pharma Federated Learning at Unprecedented Scale Unlocks Benefits in QSAR without Compromising Proprietary Information
https://pubs.acs.org/doi/10.1021/acs.jcim.3c00799

Conformal efficiency as a metric for comparative model assessment befitting federated learning
https://www.sciencedirect.com/science/article/pii/S2667318523000144

Perspective: While I agree that high-quality, relevant data is the most critical element in building ML models, I don’t think federated learning is the answer.  The performance gains from the MELLODDY consortium can, at best, be considered modest.  Any effort like MELLODDY, where data privacy is the prime concern, will have to compromise on the molecular representation used.  No member company will agree to use a representation that can potentially be reverse-engineered to reveal the chemical structures of its molecules. In addition, since the data being used is blinded, similar assays from different companies can’t be easily combined.  Further, private consortia do little to advance the field.  Talented scientists in academia or non-consortium companies will not have the opportunity to contribute, and there will be limited opportunities to compare notes on what works and what doesn’t.  It would be far more productive to privately or publicly fund an effort to generate large, relevant datasets and make the data publicly accessible.  As I’ve previously pointed out, most public datasets used in our field are terrible.  To progress, the field needs access to high-quality benchmark datasets generated through consistent experiments.  It’s difficult to believe we can improve molecular representations or algorithms without suitable datasets. 

7. Generative Models

Work on generative models seems to have shifted a bit in 2023.  In prior years, much of the published work focused on dynamically updating a generative model to meet specific objectives.  For example, consider a generative model designed to optimize the docking score for a molecule.  The model generates a SMILES string, converts the SMILES to a set of 3D conformers, docks the conformers, and returns the best docking score for the molecule.  These generative models typically begin by training an initial model with SMILES from a large database, like ChEMBL.  The model learns which characters tend to follow other characters and implicitly learns the relationships between functional groups.  Once the initial model is trained, new molecules are generated and docked.  The docking scores are then used to update probability distributions associated with SMILES characters in subsequent molecule generation steps.   This sounds great, but in many cases, the molecule generator simply learns to exploit the scoring function and generates ridiculous molecules.  One can sometimes overcome these limitations using a multi-objective scoring function, but balancing the objectives can be tricky.  
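
To make the generate-score-update loop concrete, the toy sketch below shrinks the generator to a first-order Markov chain over SMILES characters (real systems use RNNs or transformers) and substitutes molecular weight for a docking score.  The mechanics are the same: sample strings, keep the valid ones, and shift the character transition probabilities toward high scorers.

```python
import numpy as np
from rdkit import Chem, RDLogger
from rdkit.Chem import Descriptors

RDLogger.DisableLog("rdApp.*")                 # silence invalid-SMILES warnings

seeds = ["CCO", "CCN", "CCCC", "c1ccccc1", "CC(C)O", "CCOC"]
chars = sorted({c for s in seeds for c in s}) + ["$"]   # "$" = end token
idx = {c: i for i, c in enumerate(chars)}

counts = np.ones((len(chars), len(chars)))     # transition pseudocounts
for s in seeds:                                # "train" on the seed SMILES
    for a, b in zip(s, s[1:] + "$"):
        counts[idx[a], idx[b]] += 1

rng = np.random.default_rng(1)

def sample():
    smi, c = "", "C"
    while c != "$" and len(smi) < 20:
        smi += c
        p = counts[idx[c]] / counts[idx[c]].sum()
        c = rng.choice(chars, p=p)
    return smi

for _ in range(200):                           # generate / score / update
    smi = sample()
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue                               # skip invalid strings
    reward = Descriptors.MolWt(mol) / 100.0    # stand-in for a docking score
    for a, b in zip(smi, smi[1:] + "$"):
        counts[idx[a], idx[b]] += reward       # reinforce rewarded transitions
```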

Ren and coworkers from the Chinese Academy of Sciences published a preprint addressing one of the scoring limitations of structure-based generative models. The authors noticed that high-scoring molecules produced by generative models tended to be promiscuous, scoring well against a range of target proteins.  To address this challenge, they developed a new metric, Delta Score, which uses a panel of unrelated proteins to correct docking scores and identify “selective” molecules.  It should be noted that this isn’t a new idea.  A 2004 paper by Vigers and Rizzi proposed a similar strategy for docking studies.  What’s old is new again. 

Delta Score: Improving the Binding Assessment of Structure-Based Drug Design Methods
https://arxiv.org/abs/2311.12035
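
The essence of the metric is easy to sketch with invented numbers (the paper's exact formulation may differ): subtract each molecule's average score against the decoy panel from its score against the target, so molecules that score well against everything are penalized.

```python
import numpy as np

# More negative docking scores are better.  Molecule 1 scores well
# against everything (promiscuous); molecule 2 only against the target.
score_vs_target = np.array([-9.5, -9.4, -7.0])
score_vs_panel = np.array([[-9.3, -9.6, -9.2],   # molecule 1 vs. 3 decoys
                           [-6.1, -5.8, -6.4],   # molecule 2
                           [-6.9, -7.1, -7.0]])  # molecule 3

delta = score_vs_target - score_vs_panel.mean(axis=1)
print(delta)   # molecule 2 has the most favorable (most negative) delta
```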

Most molecule generation benchmarks, such as GuacaMol and MOSES, focus on the diversity and validity of the generated molecules.  Few efforts have examined how well generative approaches approximate the selections made during a medicinal chemistry program.   A preprint by Handa and coworkers at the University of Cambridge and the Teijin Institute for Bio-medical Research examined the ability of a molecule generator to reproduce the medicinal chemistry trajectory of actual and simulated drug discovery projects.   The authors defined a clustering procedure that simulated a compound optimization process and also drew on compounds from six drug discovery projects at Teijin Pharma.  Starting with early-stage hits, they assessed the ability of the open source generative molecular design program REINVENT to generate molecules from the middle and later stages of the drug discovery programs.   The results weren't good.  REINVENT rediscovered only 0.0 to 1.6% of the molecules synthesized for the drug discovery programs.  The authors point out that retrospective studies like these do not capture the multiple objectives and challenges that arise throughout a drug discovery program and may not provide an appropriate validation. 

On The Difficulty of Validating Molecular Generative Models Realistically: A Case Study on Public and Proprietary Data
https://chemrxiv.org/engage/chemrxiv/article-details/655aa0416e0ec7777f4a3682

Most of the generative modeling papers and talks I saw this year abandoned using a scoring function to augment probability distributions and instead focused on generating analogs around an existing hit or lead molecule.  In a typical workflow, 1-10K analogs were generated based on an input molecule.  These analogs were then docked into a protein binding site, and more rigorous methods like free energy calculations were used to prioritize analogs for synthesis or purchase.  These approaches are similar in spirit to CReM and related methods, which use matched molecular pairs to suggest functional group replacements and generate new molecules. 

Noutahi and colleagues at Valence Labs (Recursion) developed a novel molecular representation called Sequential Attachment-based Fragment Embedding (SAFE), which can be used for molecule generation. The SAFE embeddings represent molecules in a manner similar to the “dot disconnected SMILES” representation proposed by Ho and Marshall in the 1990s.  For instance, instead of representing ethane as the SMILES “CC”, we can also express it as “C1.C1”.  While the second SMILES appears to be two disconnected fragments, the fragments are connected by a “fake” ring closure.  By representing molecules in this way, we can create a natural segmentation between functional groups and ring systems.  By reducing molecules to the SAFE representation, functional group interchanges like those in matched molecular pairs can be readily identified.  The authors then used 1.1 billion SAFE strings to train a generative model they call SAFE-GPT.  

Gotta be SAFE: A New Framework for Molecular Design
https://arxiv.org/abs/2310.10773
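
The "fake ring closure" trick is easy to verify with the RDKit.  Note that this only demonstrates the underlying representation; the actual SAFE tokenization is handled by the authors' library.

```python
from rdkit import Chem

# Ring-closure digits pair across the dot, so "C1.C1" is just ethane.
print(Chem.MolToSmiles(Chem.MolFromSmiles("C1.C1")))    # -> CC

# A larger example: aspirin written as two dot-separated fragments
# joined by the %99 closure (split at the ester oxygen-aryl bond).
print(Chem.CanonSmiles("CC(=O)O%99.O=C(O)c1ccccc1%99") ==
      Chem.CanonSmiles("CC(=O)Oc1ccccc1C(=O)O"))        # -> True
```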

A blog post by Edward Williams from Terray Therapeutics introduced Contrastive Optimization for Accelerated Therapeutic Inference (COATI). This novel embedding considers both the topological and three-dimensional structure of molecules.  Williams showed how the COATI embedding can be used to build predictive models and decoded to regenerate the input molecule.  Analogs can be generated by adding small amounts of random noise to the COATI embedding.  An example in the Terray GitHub repository shows how COATI embeddings can be coupled with a regressor to design carbonic anhydrase inhibitors. 

A Tutorial on Encoding and Generating Small Molecules with COATI
https://portal.valencelabs.com/blogs/post/a-tutorial-on-encoding-and-generating-small-molecules-with-coati-NYULbnTs8oc4x4u

COATI: multi-modal contrastive pre-training for representing and traversing chemical space
https://chemrxiv.org/engage/chemrxiv/article-details/64e8137fdd1a73847f73f7aa

The team at AstraZeneca released version 4 of their software package REINVENT for generative design.  This version augments the de novo design capabilities of previous versions with new R-group replacement and molecule-linking functionality. 

REINVENT4: Modern AI–Driven Generative Molecule Design
https://chemrxiv.org/engage/chemrxiv/article-details/65463cafc573f893f1cae33a

Perspective:  As generative models become more pervasive, the focus seems to have shifted from the exotic to the practical.  Rather than using multiple generative cycles to optimize proposed molecules computationally, many groups have shifted to using generative models as practical idea generators.  These ideas are subsequently evaluated using physics-based methods and ML models to select molecules for synthesis.  As the field progresses, it will be interesting to see if we can overcome the tendency of generative models to exploit certain aspects of scoring functions.  Some of the multi-parameter optimization ideas discussed below may prove useful in improving the quality of the molecules produced by generative molecular design.  

8. Explainable AI

Most ML models we use in practice treat the prediction task as a black box.  These models take some input, typically derived from a chemical structure, and produce a predicted value.  While the model generates some real or categorical value, it doesn’t provide a rationale for the prediction.  This lack of interpretability creates a conundrum for the molecule designer, who can only use the model to evaluate human or machine-generated ideas.  Instead, it would be beneficial to have models that can provide human-interpretable explanations that can be used to drive subsequent design cycles.  Three reviews published in 2023 provide excellent introductions to the current state of the art in explainable AI (XAI) in drug discovery.  A paper by Wellawatte and coworkers from the University of Rochester provides a systematic overview of several approaches to explaining machine learning models for molecules.  The authors discuss attribution methods, surrogate models, and counterfactual examples.  

A Perspective on Explanations of Molecular Prediction Models
https://pubs.acs.org/doi/10.1021/acs.jctc.2c01235

A review by Wu and coworkers from Zhejiang University creates a different taxonomy for XAI and provides an overview of the strengths and limitations of several methods.  The authors focus on XAI in Chemistry and divide currently applied methods into five categories.  

  • Self-explaining methods like linear regression or decision trees intrinsically assign weights to specific features.  
  • Gradient-based methods use the derivatives of a neural network's output with respect to the input features to assign feature importance.  
  • Perturbation-based methods, where a model is retrained with specific features perturbed or masked out.  The magnitude of the change in the model output provides some sense of a feature's importance (see the sketch after this list).  
  • Surrogate models like SHAP and LIME use simpler, more intuitive models to explain the behavior of more complex models.  
  • Counterfactual explanations use large differences in predicted values between closely related molecules to provide plausible explanations. 

From Black Boxes to Actionable Insights: A Perspective on Explainable Artificial Intelligence for Scientific Discovery
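
As a small example of the perturbation-based category, the sketch below uses scikit-learn's permutation importance, a lightweight relative of the retraining approach described above (it shuffles one feature at a time in the evaluation data rather than retraining the model).  The binary matrix stands in for fingerprint bits, with "activity" driven entirely by bit 0.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(500, 16))       # fake fingerprint bits
y = X[:, 0]                                  # activity depends only on bit 0

model = RandomForestClassifier(n_estimators=50).fit(X, y)
# Shuffle each feature 10 times and measure the drop in model accuracy.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean.argmax())      # -> 0, the causal bit
```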

While not specifically related to identifying important molecular features, a review by Karim from ALDI SÜD and coworkers from various institutions provides a comprehensive overview of explainable AI and its applications in Bioinformatics.  The authors discuss why interpretable AI is important and catalog various methods for interpreting ML models.  In addition to the paper, the authors have a GitHub repository with links to dozens of related papers and code libraries.  This paper is one of the best of the year and will interest anyone applying ML in drug discovery. 

Explainable AI for Bioinformatics: Methods, Tools and Applications
https://academic.oup.com/bib/article/24/5/bbad236/7227172

When attempting to understand ML models in drug discovery, one common approach is to rank feature importance using a method like Shapley values or LIME, then project this feature importance onto the structures of training or test set molecules.  The XSMILES toolkit developed by Heberle and colleagues at Bayer AG provides the ability to color and highlight important features from ML models.  The authors have developed a novel heatmap visualization that integrates a molecule's chemical structure and the SMILES string.  The code available on GitHub can be easily integrated into KNIME workflows or Jupyter notebooks.

XSMILES: interactive visualization for molecules, SMILES and XAI attribution scores
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00673-w

One downside to methods that project feature importance onto molecular structures is that, in many cases, the features are not chemically intuitive.   The projection of features is typically indirect.  A method like SHAP or LIME identifies important features, which map to fingerprint bits.  These fingerprint bits then map to atom environments or bond paths, which often don’t correspond to widely accepted functional group definitions.  This disconnect can sometimes make it difficult to understand precisely what a model is learning.  An alternate approach, published by Rao and coworkers from TCS Research, is to use a fragmenting scheme like BRICS or RECAP that decomposes molecules into intuitive building blocks.  One can then calculate feature importance for the fragments and hopefully come to a more intuitive interpretation.  Sadly, this paper concluded with one of my least favorite phrases, “The code used to generate results shown in this study is available from the corresponding author upon request for academic use only”. 

pBRICS: A Novel Fragmentation Method for Explainable Property Prediction of Drug-Like Small Molecules
https://pubs.acs.org/doi/epdf/10.1021/acs.jcim.3c00689
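
pBRICS itself is only available from the authors, but the underlying BRICS decomposition the paper builds on ships with the RDKit and gives a feel for the fragment-level view:

```python
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")     # acetaminophen
print(sorted(BRICS.BRICSDecompose(mol)))
# Fragments carry numbered dummy atoms (e.g., something like [1*]C(C)=O)
# marking the cut points, so importance can be assigned per fragment.
```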

Perspective:  I can’t say I saw a lot of progress in explainable AI for molecules this year.  More papers showed how feature importance can be used as a tool for model interpretability, but in most cases, the explanations were superficial and not particularly helpful.   As I’ve mentioned in the past, even benchmarking explainability is tricky.  In many ways, interpretability is limited by the representations we use for molecules in ML models.  Hashed fingerprint bits often don’t map to intuitive substructures.  In addition, a single fingerprint bit can sometimes map to different substructures in different molecules due to hash key collisions.  The situation with neural networks and learned representations can be even worse.  While techniques like integrated gradients can help interpret neural networks, mapping network weights to chemical structures is an open problem.  We may have to move to more physically inspired molecular representations to arrive at genuinely interpretable models. 

9. Other Stuff

This final category is a catch-all for papers that didn’t fit neatly into the other categories.  First up is multi-parameter optimization (MPO), or multi-objective optimization (MOO).  Drug discovery is inherently an MPO problem.  In addition to optimizing potency against a target of interest, we must design selective compounds with optimal pharmacokinetic profiles and a host of other desirable properties.  While MPO is a critical component of drug design, it has received limited attention in the computer-aided drug design (CADD) community.  A review by Fromer and Coley provides an excellent overview of approaches to MPO in drug discovery.   The authors define multi-objective optimization and explain how it fits into drug discovery.  They discuss several approaches, including Pareto and Bayesian optimization strategies.  Examples are provided from virtual library optimization and generative design.  In a separate preprint, the same authors provide three example applications of MPO for searching virtual libraries containing millions of molecules.  In each example, the authors use a modified version of their open source package MolPAL to design molecules with a specified selectivity profile. 

Pareto Optimization to Accelerate Multi-Objective Virtual Screening
https://arxiv.org/abs/2310.10598

Computer-aided multi-objective optimization in small molecule discovery
https://www.cell.com/patterns/fulltext/S2666-3899(23)00001-6
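
Pareto-front identification, the core operation behind several of the strategies reviewed, fits in a few lines.  Here is a sketch over invented scores where both objectives (say, potency and selectivity) are maximized:

```python
import numpy as np

rng = np.random.default_rng(3)
scores = rng.normal(size=(200, 2))           # 200 molecules, 2 objectives

def pareto_front(scores):
    n = len(scores)
    dominated = np.zeros(n, dtype=bool)
    for i in range(n):
        # i is dominated if some j is >= on every objective and > on one
        better = ((scores >= scores[i]).all(axis=1) &
                  (scores > scores[i]).any(axis=1))
        dominated[i] = better.any()
    return np.where(~dominated)[0]

print(pareto_front(scores))                  # indices of non-dominated molecules
```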

I mentioned the next paper in a LinkedIn post last year, but it’s worth mentioning again. People frequently ask me why machine learning (ML) in QSAR doesn’t take advantage of 3D structures. Unfortunately, a fundamental disconnect exists between traditional ML and the appropriate use of 3D molecular structures. Traditional ML learns a relationship between a single instance (a chemical structure) and a single label (a property). It doesn’t provide a facility for mapping multiple instances (an ensemble of conformers) to a single label. There has recently been renewed interest in multiple instance learning (MIL), a technique developed over 30 years ago. MIL provides a framework that enables the mapping of conformational ensembles to properties. A recent review by Zankov of Hokkaido University and coworkers provides an excellent overview of the challenges and opportunities associated with MIL in QSAR, genomics, and several other areas. The paper also provides links to several software packages for building MIL models.

Chemical complexity challenge: Is multi-instance machine learning a solution?
https://wires.onlinelibrary.wiley.com/doi/10.1002/wcms.1698
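
A minimal sketch of the MIL setup: each "bag" is a conformer ensemble, each instance a per-conformer descriptor vector, and a single label applies to the whole bag.  The simplest baseline, shown here on synthetic data, pools the instances into one vector per bag and hands the result to an ordinary regressor; real MIL methods learn the pooling (for example, with attention).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n_bags, n_conf, n_feat = 50, 10, 8
bags = rng.normal(size=(n_bags, n_conf, n_feat))    # conformer descriptors
labels = bags[:, :, 0].mean(axis=1) + rng.normal(scale=0.1, size=n_bags)

X = bags.mean(axis=1)                               # mean-pool each ensemble
model = Ridge().fit(X, labels)
print(model.score(X, labels))                       # bag-level R^2
```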

Perspective: I hope these two important topics receive more attention in the coming year. 
