Posts

Silly Things Large Language Models Do With Molecules

“Pay no attention to the man behind the curtain” - The Wizard of Oz

Introduction

Recently, a few groups have proposed general-purpose large language models (LLMs) like ChatGPT, Claude, and Gemini as tools for generating molecules. This idea is appealing because it doesn't require specialized software or domain-specific model training. One can provide the LLM with a relatively simple prompt like the one below, and it will respond with a list of SMILES strings.

You are a skilled medicinal chemist. Generate SMILES strings for 100 analogs of the molecule represented by the SMILES CCOC(=O)N1CCC(CC1)N2CCC(CC2)C(=O)N. You can modify both the core and the substituents. Return only the SMILES as a Python list. Don’t put in line breaks. Don't put the prompt into the reply.

However, when analyzing molecules created by general-purpose LLMs, I'm reminded of my undergraduate Chemistry days. My roommates, who majored in liberal arts, would often assemble random pieces from my mole…
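To see how brittle these replies can be in practice, here is a minimal, hypothetical sketch (not code from the post) that takes a reply like the one above, already parsed into a Python list, and uses RDKit to separate parseable SMILES from the garbage that general-purpose LLMs frequently emit. The example strings are made up for illustration.

from rdkit import Chem

# Hypothetical LLM reply parsed into a Python list; the entries below are
# illustrative placeholders, not actual model output from the post.
llm_reply = [
    "CCOC(=O)N1CCC(CC1)N2CCC(CC2)C(=O)N",    # the seed molecule from the prompt
    "CCOC(=O)N1CCC(CC1)N2CCC(CC2)C(=O)NC",   # a plausible analog
    "C1CC1C(=O)N(",                          # the kind of broken string an LLM sometimes emits
]

valid, invalid = [], []
for smi in llm_reply:
    mol = Chem.MolFromSmiles(smi)
    (valid if mol is not None else invalid).append(smi)

print(f"{len(valid)} parseable SMILES, {len(invalid)} rejected by RDKit")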

Digging Deeper into Thompson Sampling - A Guest Blog Post by Patrick Riley

This week is a special Pat substitution. Pat(rick) Riley is taking over this blog post for a follow-up on our Thompson Sampling paper.

In our recent paper Thompson Sampling─An Efficient Method for Searching Ultralarge Synthesis on Demand Databases, we showed how you could use the classic Thompson Sampling algorithm to select each reagent in a combinatorial library to conduct an efficient search for a variety of scoring functions. Eagle-eyed readers probably noticed that for ROCS, we presented results searching 0.1% of the library and for docking we searched 1% because "docking with TS requires more sampling than ... ROCS". But why is docking harder for a Thompson Sampling-based search than ROCS? This post gives an answer to that question.

A Visual Version

Remember that in Thompson Sampling, for each component in a reaction, you track statistics for each possible reagent. Each reagent is associated with a distribution of scores for complete molecules because the reagent c…
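As a rough illustration of that bookkeeping (a hypothetical sketch, not the code from the paper): each reagent carries a distribution summarizing the scores of the complete molecules it has appeared in, and a Thompson Sampling step draws one sample per reagent and picks the reagent with the best draw.

import random
import statistics

class ReagentStats:
    """Track the scores of completed molecules containing one reagent (illustrative only)."""

    def __init__(self, prior_mean=0.0, prior_std=1.0):
        self.scores = []
        self.prior_mean = prior_mean
        self.prior_std = prior_std

    def sample(self):
        # Draw from a normal summary of the scores seen so far; fall back to the
        # prior until enough observations have accumulated.
        if len(self.scores) < 2:
            return random.gauss(self.prior_mean, self.prior_std)
        return random.gauss(statistics.mean(self.scores), statistics.stdev(self.scores))

    def update(self, score):
        self.scores.append(score)

def choose_reagent(reagent_stats):
    # One Thompson Sampling step: sample once per candidate reagent and return
    # the index of the reagent with the best draw (assuming higher scores are better).
    draws = [(stats.sample(), idx) for idx, stats in enumerate(reagent_stats)]
    return max(draws)[1]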

Generative Molecular Design Isn't As Easy As People Make It Look

I was taken aback by a recent CNBC article entitled “Generative AI will be designing new drugs all on its own in the near future”. I should know better than to pay attention to AI articles in the popular press, but I feel that even scientists working in drug discovery may have a skewed perception of what generative AI can and can’t do. To understand exactly what’s involved, it might be instructive to walk through a typical generative molecular design workflow and point out a few things. First, these programs are far from autonomous. Even when presented with a well-defined problem, generative algorithms produce a tremendous amount of nonsense. Second, domain expertise is essential when sifting through the molecules produced by a generative algorithm. Without a significant medicinal chemistry background, one can’t make sense of the results. Third, while a few nuggets exist in the generative modeling output, a lot of work and good old-fashioned cheminformatics are required to ext…
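To make the "sifting" step concrete, here is a minimal, hypothetical example (not taken from the workflow in the post) of the kind of baseline cheminformatics triage typically applied to generative output before a chemist ever looks at it: parse each SMILES with RDKit, drop duplicates, and apply crude property cutoffs. The max_mw and max_logp values are placeholder assumptions, not recommendations from the article.

from rdkit import Chem
from rdkit.Chem import Descriptors

def basic_triage(smiles_list, max_mw=500, max_logp=5):
    """Toy triage of generative output: validity, deduplication, crude property cutoffs."""
    seen, keep = set(), []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # reject strings RDKit can't parse
        canonical = Chem.MolToSmiles(mol)
        if canonical in seen:
            continue  # reject duplicates of molecules already kept or seen
        seen.add(canonical)
        if Descriptors.MolWt(mol) > max_mw or Descriptors.MolLogP(mol) > max_logp:
            continue  # reject molecules outside simple property ranges
        keep.append(canonical)
    return keep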

AI in Drug Discovery - A Highly Opinionated Literature Review (Part III)

Following up on Part I and Part II, the third post in this series is a collection of review articles published in 2023 that I found helpful.

Property Prediction
Machine Learning Methods for Small Data Challenges in Molecular Science
https://pubs.acs.org/doi/full/10.1021/acs.chemrev.3c00189
Practical guidelines for the use of gradient boosting for molecular property prediction
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00743-7
Application of message passing neural networks for molecular property prediction
https://www.sciencedirect.com/science/article/pii/S0959440X23000908?via%3Dihub

Molecular Similarity
Molecular Similarity: Theory, Applications, and Perspectives
https://chemrxiv.org/engage/chemrxiv/article-details/655f59b15bc9fcb5c9354a43

Molecular Representation
From intuition to AI: evolution of small molecule representations in drug discovery
https://academic.oup.com/bib/article/25/1/bbad422/7455245

Docking and Scoring
The Impact of Supervised Learning Method…

AI in Drug Discovery - A Highly Opinionated Literature Review (Part II)

Picking up where we left off in Part I, this post covers several other ML in drug discovery topics that interested me in 2023. Some areas, like large language models, are new, and most of the work is at the proof-of-concept stage. Others, like active learning, are more mature, and several groups are starting to explore nuances of the methods.

Here’s the structure of Part II.
4. Large Language Models
5. Active Learning
6. Federated Learning
7. Generative Models
8. Explainable AI
9. Other Stuff

4. Large Language Models

The emergence of GPT-4 and ChatGPT brought considerable attention to large language models (LLMs) in 2023. In November and December, several large pharmas held “AI Day” presentations featuring LLM applications for clinical trial data analysis. Many of these groups demonstrated the ability of LLMs to ingest large bodies of unstructured clinical data and subsequently generate tables and reports based on natural language queries. Aside from some very brief demos on co…