Posts

Showing posts from 2021

Picking the Highest Scoring Molecule(s) From Each Cluster

Here's a quick post based on a conversation with a friend who wanted to be able to cluster a set of docked molecules based on fingerprints and select the highest scoring molecule(s) from each cluster. As usual, Pandas made this super easy. The code for this example can be run on Colab and is also available as a Gist.
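The selection step described above can be sketched with a Pandas sort followed by a groupby. This is a minimal illustration, not the code from the Colab notebook or Gist: the column names (`name`, `cluster`, `score`), the toy values, and the helper `top_n_per_cluster` are all assumptions, and the cluster labels are assumed to have been assigned already by a fingerprint-based clustering step.

```python
import pandas as pd

# Hypothetical data: cluster labels are assumed to come from an earlier
# fingerprint-based clustering step; scores are made-up numbers.
df = pd.DataFrame({
    "name": ["mol_a", "mol_b", "mol_c", "mol_d", "mol_e", "mol_f"],
    "cluster": [0, 0, 1, 1, 1, 2],
    "score": [7.1, 8.4, 5.0, 9.2, 6.3, 4.8],
})


def top_n_per_cluster(frame, n=1, ascending=False):
    """Return the n best-scoring rows from each cluster.

    ascending=False treats larger scores as better. Many docking programs
    report more negative scores as better; pass ascending=True in that case.
    """
    # Sort by cluster, then by score within each cluster, best first.
    ranked = frame.sort_values(["cluster", "score"], ascending=[True, ascending])
    # head(n) on the grouped frame keeps the first n rows of every cluster.
    return ranked.groupby("cluster").head(n).reset_index(drop=True)


best = top_n_per_cluster(df, n=1)
top_two = top_n_per_cluster(df, n=2)
print(best)
```

Sorting once and then taking `head(n)` per group keeps the whole row for each selected molecule, which is usually what you want when the next step is to look at the structures.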

Exploratory Data Analysis With mols2grid and Bemis-Murcko Frameworks

One of the most common tasks in Cheminformatics is exploratory data analysis (EDA). Given a new dataset, we often need to rapidly explore the chemistry in a set containing hundreds, or even thousands, of molecules. One useful technique for EDA is the Bemis-Murcko framework. This technique, originally published by Guy Bemis and Mark Murcko, provides a simple but elegant means of grouping molecules. Bemis-Murcko frameworks (also known as scaffolds) are created by successively removing monovalent atoms until only ring atoms and linker atoms remain. There are a few nuances having to do with the removal of exocyclic double bonds and the maintenance of aromaticity, but the method itself is very easy to understand. There are two versions of the Bemis-Murcko framework, which are sometimes confused. In the first version, illustrated in the top row of the figure below, monovalent atoms are removed until only ring atoms and linker atoms remain. ...
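The pruning idea behind the framework can be sketched in plain Python on a heavy-atom graph. This is a toy illustration under stated assumptions, not the code from the post: the molecule is a made-up list of bonds between integer atom indices, hydrogens are implicit, and the nuances mentioned above (exocyclic double bonds, aromaticity) are ignored. The function names are hypothetical. In practice a cheminformatics toolkit such as RDKit (its `MurckoScaffold` module) handles those details.

```python
def build_adjacency(bonds):
    """Turn a list of (atom, atom) bonds into an adjacency dict of sets."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def murcko_framework_atoms(adjacency):
    """Return the atoms left after repeatedly pruning terminal atoms.

    An atom with at most one remaining neighbor is removed; its neighbor may
    then become terminal and is pruned in turn. What survives is the set of
    ring atoms plus the linker atoms connecting rings.
    """
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    terminal = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in graph:  # already pruned via another path
            continue
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            if len(graph[nbr]) <= 1:
                terminal.append(nbr)
    return set(graph)


# Hypothetical molecule: two six-membered rings joined by a one-atom linker,
# with a one-atom substituent on ring A and a two-atom chain on ring B.
bonds = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),        # ring A
    (0, 6), (6, 7),                                          # linker atom 6
    (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 7),  # ring B
    (3, 13),                                                 # substituent
    (10, 14), (14, 15),                                      # side chain
]
framework = murcko_framework_atoms(build_adjacency(bonds))
print(sorted(framework))
```

Note that the two-atom chain disappears one atom at a time: removing atom 15 makes atom 14 terminal, which is why the pruning has to be repeated until nothing changes.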

Similarity Search and Some Cool Pandas Tricks

In this post, we're going to take a look at molecular similarity searches. Molecular similarity is central to a lot of what we do in Cheminformatics. It's important for identifying analogs and understanding SAR. Molecular similarity is also at the core of many clustering methods that we use to understand datasets or design screening libraries. In this example, we'll be using the chemfp package by Andrew Dalke. Chemfp has both free and paid tiers. With the free tier, you can perform similarity searches on smaller datasets, like the one we're using here. For larger datasets, you need to purchase the paid version. Chemfp is a great package. If you're using it for production drug discovery, you should buy a license. In addition to performing searches with chemfp, we'll also go over a few Pandas tricks that will enable us to rapidly process the output from chemfp. Here's a link to the tutorial notebook on Google C...
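To make the idea of a similarity search concrete, here is a small sketch that ranks a library against a query by Tanimoto similarity and keeps the hits above a threshold. It deliberately does not use chemfp, and it is not the code from the tutorial notebook: fingerprints are represented as plain Python sets of "on" bit indices, and the data, the 0.5 threshold, and the function names are all made up for illustration.

```python
import pandas as pd


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def similarity_search(frame, query_fp, threshold=0.5):
    """Score every row against the query and return hits, most similar first."""
    scored = frame.assign(
        tanimoto=frame["fp"].apply(lambda fp: tanimoto(fp, query_fp))
    )
    hits = scored[scored["tanimoto"] >= threshold]
    return hits.sort_values("tanimoto", ascending=False).reset_index(drop=True)


# Hypothetical library: names and bit sets are toy values, not real molecules.
library = pd.DataFrame({
    "name": ["ref_1", "ref_2", "ref_3", "ref_4"],
    "fp": [
        frozenset({1, 4, 7, 9}),
        frozenset({1, 4, 8}),
        frozenset({2, 3, 5}),
        frozenset({1, 4, 7, 9, 12}),
    ],
})
query = frozenset({1, 4, 7, 9})

hits = similarity_search(library, query, threshold=0.5)
print(hits[["name", "tanimoto"]])
```

A row-by-row `apply` like this is fine for a toy table but slow for real libraries, which is the gap a dedicated package like chemfp is meant to fill.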

Building a multiclass classification model

 A pointer to the fastpages site. 

Practical Cheminformatics - The Directory

In no particular order, here's a hopefully useful, topical organization of the posts I've written over the past few years.
Resources and Reviews:
  A Highly Opinionated List of Open Source Cheminformatics Resources
  AI in Drug Discovery 2020 - A Highly Opinionated Literature Review
Clustering:
  Viewing Clustered Chemical Structures in a Jupyter Notebook
  Clustering 2.1 Million Compounds for $5 With a Little Help From Amazon & Facebook
  Self-Organizing Maps - 90s Fad or Useful Tool? (Part 1)
  Self-Organizing Maps - The Code (Part 2)
Molecule Generation:
  Automatic Analog Generation With Common R-group Replacements
Predictive Models:
  Predicting Aqueous Solubility - It's Harder Than It Looks
  Assessing Interpretable Models
High-Performance Computing:
  Fast Parallel Cheminformatics Workflows With Dask
  Wicked Fast Cheminformatics with NVIDIA RAPIDS
Databases:
  What Do Molecules That Look Like This Tend To Do?
  Adding Chemical Structures to a Recent COVID-19 Drug Repurposing Dataset
Filtering ...

Viewing Clustered Chemical Structures in a Jupyter Notebook

In Cheminformatics, we frequently run into cases where we want to look at leader/follower relationships between chemical structures. For instance, if we've clustered a set of molecules, we might want to start by looking at a table with one example structure for each cluster. We'd then like to be able to select one or more "interesting" clusters and drill down to the cluster members. While this is a frequent workflow, I'm not aware of commercially or freely available tools that do a great job of supporting the exploration of leader/follower relationships with chemical structures. In this post, we'll look at one way of connecting a couple of open source libraries to view cluster representatives and cluster members. As usual, the code and data associated with this post are available on GitHub. Rather than trying to explain this more fully, let's consider an example. In this example, we'll look at a set of 1,495 drugs from the ChEMBL database...

Automatic Analog Generation With Common R-group Replacements

Another pointer to the fastpages site.

Assessing Interpretable Models

This is just a pointer to the fastpages blog post.

Fast Parallel Cheminformatics Workflows With Dask

This is just a pointer to my new fastpages blog site.

AI in Drug Discovery 2020 - A Highly Opinionated Literature Review

In this post, I present an annotated bibliography of some of the interesting machine learning papers I read in 2020. Please don't be offended if your paper isn't on the list. Leave a comment with other papers you think should be included. I've tried to organize these papers by topic. Please be aware that the topics, selected papers, and the comments below reflect my own biases. I've endeavored to focus primarily on papers that include source code. Hopefully, this list reflects a few interesting trends I saw this year.
More of a practical focus on active learning
Efforts to address model uncertainty, as well as the admission that it's a very difficult problem
The (re)emergence of molecular representations that incorporate 3D structure
Several interesting strategies for data augmentation
Additional efforts toward model interpretability, coupled with the acknowledgment that this is also a difficult problem
The application of generative model...