Posts

Exploratory Data Analysis With mols2grid and Bemis-Murcko Frameworks

One of the most common tasks in Cheminformatics is exploratory data analysis (EDA). Given a new dataset, we often need to rapidly explore the chemistry in a set containing hundreds, or even thousands, of molecules. One useful technique for EDA is the Bemis-Murcko framework. This technique, originally published by Guy Bemis and Mark Murcko, provides a simple but elegant means of grouping molecules. Bemis-Murcko frameworks (also known as scaffolds) are created by successively removing monovalent atoms until only ring atoms and linker atoms remain. There are a few nuances having to do with the removal of exocyclic double bonds and the maintenance of aromaticity, but the method itself is very easy to understand. There are two versions of the Bemis-Murcko framework, which are sometimes confused. In the first version, illustrated in the top row of the figure below, monovalent atoms are removed until only ring atoms and linker atoms remain. In the second version, a generic framework
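To make the pruning step concrete, here's a minimal sketch in plain Python. It treats a molecule as a heavy-atom graph given as a list of bonds and repeatedly strips atoms that have only one neighbor. The function name `prune_to_framework`, the integer atom indices, and the example molecule are all hypothetical, and the sketch deliberately ignores the nuances around exocyclic double bonds and aromaticity. On real molecules, a toolkit such as the RDKit would do this work.

```python
def prune_to_framework(bonds):
    """Strip monovalent atoms until only ring and linker atoms remain.

    bonds: iterable of (atom_a, atom_b) pairs over heavy atoms.
    Returns a dict mapping each surviving atom to its set of neighbors.
    """
    # Build an adjacency map for the heavy-atom graph.
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Start with every atom that has exactly one neighbor.
    leaves = [atom for atom, nbrs in neighbors.items() if len(nbrs) == 1]
    while leaves:
        atom = leaves.pop()
        for nbr in neighbors.pop(atom):
            neighbors[nbr].discard(atom)
            # Removing an atom can expose a new monovalent atom.
            if len(neighbors[nbr]) == 1:
                leaves.append(nbr)
    return neighbors


# Hypothetical example: two six-membered rings joined through one linker atom.
bonds = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),        # ring A
    (0, 6), (6, 7),                                        # linker atom 6
    (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 7),  # ring B
    (6, 13),                                               # substituent on the linker
    (10, 14), (14, 15),                                    # two-atom chain on ring B
]
framework = prune_to_framework(bonds)
print(sorted(framework))
```

Atoms 13, 14, and 15 are pruned, while linker atom 6 survives because it still connects the two rings. A purely acyclic input prunes down to nothing, which matches the idea that a molecule without rings has an empty framework.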

Similarity Search and Some Cool Pandas Tricks

In this post, we're going to take a look at molecular similarity searches. Molecular similarity is central to a lot of what we do in Cheminformatics. It's important for identifying analogs and understanding SAR. Molecular similarity is also at the core of many clustering methods that we use to understand datasets or design screening libraries. In this example, we'll be using the chemfp package by Andrew Dalke. Chemfp has both free and paid tiers. With the free tier, you can perform similarity searches on smaller datasets, like the one we're using here. For larger datasets, you need to purchase the paid version. Chemfp is a great package. If you're using it for production drug discovery, you should buy a license. In addition to performing searches with chemfp, we'll also go over a few Pandas tricks that will enable us to rapidly process the output from chemfp. Here's a link to the tutorial notebook on Google Colab and on GitHub. I'd like
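For readers who haven't seen a similarity search before, here's a minimal sketch of the underlying idea. This is not the chemfp API and not the code from the notebook. It's a plain-Python stand-in that represents each fingerprint as a set of "on" bit positions and ranks a small library by Tanimoto similarity, a common similarity measure for fingerprints. The molecule names, bit sets, threshold, and `k` are all made up for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def similarity_search(query_fp, library, threshold=0.5, k=3):
    """Return up to k (name, score) hits with score >= threshold, best first."""
    hits = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in hits if score >= threshold]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:k]


# Hypothetical fingerprints: each set holds the positions of the bits that are on.
library = {
    "mol_a": frozenset({1, 4, 9, 16, 25}),
    "mol_b": frozenset({1, 4, 9, 16}),
    "mol_c": frozenset({2, 3, 5, 7}),
    "mol_d": frozenset({1, 4, 9, 36, 49}),
}
query = frozenset({1, 4, 9, 16, 25})

hits = similarity_search(query, library)
for name, score in hits:
    print(f"{name}\t{score:.2f}")
```

With these made-up fingerprints, only `mol_a` and `mol_b` clear the 0.5 threshold. A dedicated package like chemfp is built to run this kind of search efficiently on real fingerprint files.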

Building a multiclass classification model

A pointer to the fastpages site.

Practical Cheminformatics - The Directory

In no particular order, here's a hopefully useful, topical organization of the posts I've written over the past few years.
Resources and Reviews:
  A Highly Opinionated List of Open Source Cheminformatics Resources
  AI in Drug Discovery 2020 - A Highly Opinionated Literature Review
Clustering:
  Viewing Clustered Chemical Structures in a Jupyter Notebook
  Clustering 2.1 Million Compounds for $5 With a Little Help From Amazon & Facebook
  Self-Organizing Maps - 90s Fad or Useful Tool? (Part 1)
  Self-Organizing Maps - The Code (Part 2)
Molecule Generation:
  Automatic Analog Generation With Common R-group Replacements
Predictive Models:
  Predicting Aqueous Solubility - It's Harder Than It Looks
  Assessing Interpretable Models
High-Performance Computing:
  Fast Parallel Cheminformatics Workflows With Dask
  Wicked Fast Cheminformatics with NVIDIA RAPIDS
Databases:
  What Do Molecules That Look Like This Tend To Do?
  Adding Chemical Structures to a Recent COVID-19 Drug Repurposing Dataset
Filtering

Viewing Clustered Chemical Structures in a Jupyter Notebook

In Cheminformatics, we frequently run into cases where we want to look at leader/follower relationships between chemical structures. For instance, if we've clustered a set of molecules, we might want to start by looking at a table with one example structure for each cluster. We'd then like to be able to select one or more "interesting" clusters and drill down to the cluster members. While this is a frequent workflow, I'm not aware of commercially or freely available tools that do a great job of supporting the exploration of leader/follower relationships with chemical structures. In this post, we'll look at one way of connecting a couple of open source libraries to view cluster representatives and cluster members. As usual, the code and data associated with this post are available on GitHub. Rather than trying to explain this more fully, let's consider an example. In this example, we'll look at a set of 1,495 drugs from the ChEMBL database. If

Automatic Analog Generation With Common R-group Replacements

Another pointer to the fastpages site.