Exploratory Data Analysis With mols2grid and Bemis-Murcko Frameworks

One of the most common tasks in cheminformatics is exploratory data analysis (EDA). Given a new dataset, we often need to rapidly explore the chemistry in a set containing hundreds, or even thousands, of molecules. One useful technique for EDA is the Bemis-Murcko framework. This technique, originally published by Guy Bemis and Mark Murcko, provides a simple but elegant means of grouping molecules. Bemis-Murcko frameworks (also known as scaffolds) are created by successively removing monovalent atoms until only ring atoms and linker atoms remain. There are a few nuances having to do with the removal of exocyclic double bonds and the maintenance of aromaticity, but the method itself is very easy to understand.

There are two versions of the Bemis-Murcko framework, which are sometimes confused. In the first version, illustrated in the top row of the figure below, monovalent atoms are removed until only ring atoms and linker atoms remain. In the second version, a generic framework is generated by taking the result of the first procedure and converting all atoms to carbon atoms and all bonds to single bonds.

While both methods can be useful, I tend to use the first method, which maintains the atom types and bond orders from the input structure.
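To make the difference between the two versions concrete, here is a minimal sketch using RDKit's `MurckoScaffold` module. The helper name `murcko_frameworks` and the example SMILES are my own illustrations, not taken from the notebook, and RDKit's handling of exocyclic double bonds may differ in detail from the original publication.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def murcko_frameworks(smiles):
    """Return (framework, generic framework) as SMILES, or None if the input can't be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Version 1: strip side chains, keep atom types and bond orders
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    # Version 2: convert every atom to carbon and every bond to a single bond
    generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)
    return Chem.MolToSmiles(scaffold), Chem.MolToSmiles(generic)


# Hypothetical example molecule (not from the ChEMBL set used in the notebook)
framework, generic_framework = murcko_frameworks("CCc1ccc(CCN2CCOCC2)cc1")
print(framework, generic_framework)
```

Here the ethyl group is stripped away, leaving the phenyl ring, the two-carbon linker, and the morpholine. The generic version of the same framework replaces the nitrogen and oxygen with carbon and drops the aromaticity. Note that an acyclic molecule has no ring atoms, so its framework comes back as an empty SMILES string.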

In the Jupyter notebook associated with this post, we take a set of structures from the ChEMBL database, calculate the Bemis-Murcko frameworks, then use mols2grid to display the frameworks.  By itself, this would be kind of cool, but Cedric Bouysset made some updates to mols2grid to make this even cooler.  In the latest version of mols2grid, we can create a callback function that is called whenever the user clicks on a structure in the grid.  In his talk at the RDKit User Group Meeting, Cedric showed a nice example where clicking on a molecule in the grid displays a 3D structure below the grid using py3Dmol.  I modified Cedric's code a bit to enable a click on the grid to display another grid.  This provides an easy way to look at things like scaffold->molecule or cluster center->cluster member relationships.  
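The scaffold->molecule workflow can be sketched roughly as follows. This is not the code from the notebook or from Cedric's talk, just an outline under a few assumptions: the small dataframe is a made-up stand-in for the ChEMBL data, `members_of` and `show_scaffold_grid` are hypothetical helper names, and I'm assuming the dictionary mols2grid hands to a Python callback contains the column passed as `smiles_col`. The grid part needs a Jupyter session with mols2grid and ipywidgets installed.

```python
import pandas as pd
from rdkit.Chem.Scaffolds import MurckoScaffold

# Hypothetical stand-in for the ChEMBL structures used in the notebook
df = pd.DataFrame({"SMILES": [
    "CCc1ccc(CCN2CCOCC2)cc1",
    "Cc1ccc(CCN2CCOCC2)cc1",
    "CC(=O)Nc1ccc(O)cc1",
    "Oc1ccccc1",
]})

# One framework per molecule, stored as a canonical SMILES string
df["scaffold"] = df["SMILES"].apply(
    lambda smi: MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
)

# One row per framework, with the number of molecules that share it
scaffold_df = (
    df.groupby("scaffold").size().reset_index(name="count")
    .sort_values("count", ascending=False)
)


def members_of(scaffold_smiles):
    """Return the rows of df whose framework matches the given scaffold SMILES."""
    return df[df["scaffold"] == scaffold_smiles]


def show_scaffold_grid():
    """Show the scaffold grid; clicking a scaffold shows its molecules in a second grid."""
    # Imported lazily so the dataframe code above runs without a notebook
    import ipywidgets as widgets
    import mols2grid
    from IPython.display import display

    output = widgets.Output()

    def on_click(data):
        # Assumption: data is a dict for the clicked cell, keyed by column name
        with output:
            output.clear_output(wait=True)
            display(mols2grid.display(
                members_of(data["scaffold"]), smiles_col="SMILES", name="members"
            ))

    display(mols2grid.display(
        scaffold_df, smiles_col="scaffold", subset=["img", "count"],
        callback=on_click, name="scaffolds",
    ))
    display(output)
```

Calling `show_scaffold_grid()` in a notebook cell should draw the scaffold grid with an empty output area below it that fills in on each click. The same pattern works for cluster center->cluster member views: swap the `scaffold` column for a cluster identifier.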



The notebook also shows how we can use SMARTS to search the grid or the underlying dataframe.   I've found this sort of workflow to be incredibly useful when I'm exploring a new dataset or trying to understand a set of assay results.  Hopefully, you'll find this useful too. 
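For the dataframe side of the SMARTS search, a minimal sketch might look like the one below. The function name `smarts_filter`, the column name `SMILES`, and the example molecules are assumptions for illustration; the notebook may do this differently.

```python
import pandas as pd
from rdkit import Chem


def smarts_filter(frame, smarts, smiles_col="SMILES"):
    """Return the rows of frame whose structure matches the SMARTS pattern."""
    pattern = Chem.MolFromSmarts(smarts)
    if pattern is None:
        raise ValueError(f"Could not parse SMARTS: {smarts}")

    def matches(smi):
        mol = Chem.MolFromSmiles(smi)
        # Unparsable SMILES are treated as non-matches
        return mol is not None and mol.HasSubstructMatch(pattern)

    return frame[frame[smiles_col].apply(matches)]


# Hypothetical example molecules (not from the notebook)
molecules = pd.DataFrame({"SMILES": [
    "CC(=O)Nc1ccc(O)cc1",
    "Oc1ccccc1",
    "CCc1ccc(CCN2CCOCC2)cc1",
]})

# Phenols: a hydroxyl oxygen attached to an aromatic carbon
phenols = smarts_filter(molecules, "c[OX2H]")
print(phenols)
```

Parsing every SMILES on each query is fine for a few thousand molecules. For larger sets, it would make sense to store the RDKit molecule objects in a column once and reuse them.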

The Jupyter notebook is available on GitHub and can also be run directly on Google Colab.




