Dissecting the Hype With Cheminformatics
A recent paper in Nature Biotechnology reported the use of AI in the discovery of inhibitors of DDR1, a kinase that has been implicated in fibrosis. Many who have worked in kinase drug discovery may have noticed that the most prominent compound (Compound 1) from this paper bears a striking resemblance to a certain marketed drug. Let's assume the compound appears familiar, but we can't specifically place it. How can we use cheminformatics to find drugs similar to the compound highlighted in this paper? Let's take a look.
In case you're interested in running the code, there is a Jupyter notebook on GitHub. If you don't like code, skip to the bottom of this post for the punch line. If you want a really easy way to run the code, try this link to a Binderized version, which lets you run the notebook without downloading anything. Thanks to Peter Rose for showing me how to do this.
First, we will import the Python libraries that we'll need. We'll also set a couple of flags to make the chemical structures prettier.
from chembl_webresource_client.new_client import new_client
from rdkit import Chem
from rdkit.Chem.Draw import MolsToGridImage
from rdkit.Chem.Draw import IPythonConsole
import pandas as pd
from rdkit.Chem import rdFMCS
from tqdm import tqdm
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem
from rdkit.Chem import rdDepictor
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit import DataStructs
rdDepictor.SetPreferCoordGen(True)
IPythonConsole.ipython_useSVG = True
We will start with Compound 1 from the paper.
compound_1_smiles = "Cc1ccc2c(Nc3cccc(c3)C(F)(F)F)noc2c1C#Cc1cnc2cccnn12"
compound_1_mol = Chem.MolFromSmiles(compound_1_smiles)
A quick check to ensure that we have the correct SMILES.
compound_1_mol
Now let's use the newly released ChEMBL API to grab the SMILES for all of the small molecule drugs. Once we have the SMILES and ChEMBL IDs for the drugs, we'll put them into a Pandas dataframe.
molecule = new_client.molecule
# max_phase=4 selects approved drugs
approved_drugs = molecule.filter(max_phase=4)
small_molecule_drugs = [x for x in approved_drugs if x['molecule_type'] == 'Small molecule']
struct_list = [(x['molecule_chembl_id'], x['molecule_structures']) for x in small_molecule_drugs if x]
# Skip records that have no structure attached
smiles_list = [(a, b['canonical_smiles']) for (a, b) in struct_list if b]
smiles_df = pd.DataFrame(smiles_list)
smiles_df.columns = ['ChEMBL_ID', 'SMILES']
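To make the filtering logic above easier to follow without calling the API, here is a minimal sketch that applies the same steps to a few made-up records. The record IDs and the helper name `extract_smiles` are hypothetical; only the field names (`molecule_chembl_id`, `molecule_type`, `molecule_structures`, `canonical_smiles`) come from the code in this post. It assumes that a record without a structure carries a falsy value such as `None`, which is what the `if b` check guards against.

```python
# Made-up records shaped like the fields this post reads from ChEMBL.
# The IDs are placeholders, not real ChEMBL identifiers.
records = [
    {"molecule_chembl_id": "EXAMPLE_1", "molecule_type": "Small molecule",
     "molecule_structures": {"canonical_smiles": "CCO"}},
    {"molecule_chembl_id": "EXAMPLE_2", "molecule_type": "Small molecule",
     "molecule_structures": None},  # small molecule, but no structure
    {"molecule_chembl_id": "EXAMPLE_3", "molecule_type": "Antibody",
     "molecule_structures": None},  # not a small molecule
]


def extract_smiles(records):
    """Return (id, SMILES) pairs for small molecules that have a structure."""
    small = [r for r in records if r["molecule_type"] == "Small molecule"]
    pairs = [(r["molecule_chembl_id"], r["molecule_structures"]) for r in small]
    # Drop entries whose structure is missing (None or empty).
    return [(cid, s["canonical_smiles"]) for cid, s in pairs if s]


print(extract_smiles(records))  # [('EXAMPLE_1', 'CCO')]
```

Only the first record survives: the second has no structure and the third is not a small molecule.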
Let's add a molecule column to the dataframe to make it easier to view the chemical structures.
PandasTools.AddMoleculeColumnToFrame(smiles_df,'SMILES','Mol')
Now we'll add a fingerprint column to our Pandas table so that we can do a similarity search.
smiles_df['fp'] = [FingerprintMols.FingerprintMol(x) for x in smiles_df.Mol]
Next, we can generate a fingerprint for Compound 1 and use that to do a similarity search.
compound_1_fp = FingerprintMols.FingerprintMol(compound_1_mol)
smiles_df['fp_sim'] = [DataStructs.TanimotoSimilarity(compound_1_fp, x) for x in smiles_df.fp]
smiles_df.sort_values("fp_sim", inplace=True, ascending=False)
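If the Tanimoto coefficient is unfamiliar, it is simply the number of bits set in both fingerprints divided by the number of bits set in either one. The sketch below illustrates the idea on plain Python sets of "on" bit positions. The bit positions are made up for illustration, and the `tanimoto` helper is a stand-in written for this post, not RDKit's implementation, which operates on bit vectors.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


# Hypothetical on-bit positions for a query and a candidate drug.
query_bits = {1, 4, 7, 9}
drug_bits = {1, 4, 8, 9, 12}

# 3 shared bits out of 6 distinct bits
print(tanimoto(query_bits, drug_bits))  # 0.5
```

A score of 1.0 means the two fingerprints are identical; a score of 0 means they share no bits.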
Let's look at the 5 most similar compounds.
top5_sim_df = smiles_df.head()
MolsToGridImage(top5_sim_df.Mol, molsPerRow=5, legends=["%.2f" % x for x in top5_sim_df.fp_sim])

Hmmm, the first compound above looks a lot like Compound 1. Just to be certain that we've found what we need, why don't we try an alternate method of calculating similarity? In this case, we'll calculate the number of atoms in the maximum common subgraph (MCS) for Compound 1 and each of the small molecule drugs. MCS calculations are time-consuming, so this isn't the sort of thing we want to do with a large database. However, in this case we only have a few thousand drugs, so the calculation isn't prohibitive. On my MacBook Pro, this takes about a minute and a half.
We'll start by defining a function that calculates the number of atoms in the MCS for two molecules.
def mcs_size(mol1, mol2):
    mcs = rdFMCS.FindMCS([mol1, mol2])
    return mcs.numAtoms
Next, we will run this over the small molecule drugs in our dataframe.
mcs_list = []
for mol in tqdm(smiles_df.Mol):
    mcs_list.append(mcs_size(compound_1_mol, mol))
We can add the number of atoms in the MCS to our dataframe.
smiles_df['mcs'] = mcs_list
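One thing to keep in mind when ranking by a raw atom count is that larger drugs have more opportunity to share atoms with the query. This post ranks by the raw count, but a size-aware alternative is to divide the shared atoms by the total number of distinct atoms across the two molecules, in the same spirit as the Tanimoto coefficient. The sketch below is an optional variation, not what the notebook does. The function name `mcs_tanimoto` and the atom counts are hypothetical; in practice the counts could come from `mcs.numAtoms` and RDKit's `mol.GetNumAtoms()`.

```python
def mcs_tanimoto(n_common, n_atoms_a, n_atoms_b):
    """Size-normalized MCS score: shared atoms / atoms in either molecule."""
    if n_common < 0 or n_common > min(n_atoms_a, n_atoms_b):
        raise ValueError("n_common must be between 0 and the smaller atom count")
    denom = n_atoms_a + n_atoms_b - n_common
    if denom == 0:
        return 0.0  # both molecules empty; convention chosen for this sketch
    return n_common / denom


# Hypothetical counts: the same 20-atom MCS against a small and a large drug.
print(round(mcs_tanimoto(20, 30, 25), 3))  # 20 / 35
print(round(mcs_tanimoto(20, 30, 80), 3))  # 20 / 90
```

With the same 20 shared atoms, the smaller drug scores higher because a larger fraction of it overlaps with the query.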
Now we sort the dataframe by the number of atoms in the MCS and take a look at the 5 compounds with the largest MCS.
smiles_df.sort_values("mcs", inplace=True, ascending=False)
top5_mcs_df = smiles_df.head()
MolsToGridImage(top5_mcs_df.Mol, molsPerRow=5, legends=["%d" % x for x in top5_mcs_df.mcs])

While the two most similar molecules are the same as those we identified using the similarity search, we can see that the MCS search uncovers a few different molecules. Let's learn a bit more about the drug that is most similar to Compound 1. We can start by getting its ChEMBL identifier.
top5_mcs_df.ChEMBL_ID.to_list()[0]
We can use the ChEMBL API to get the names associated with this molecule.
molecule = new_client.molecule
m1 = molecule.get('CHEMBL1171837')
pd.DataFrame([(x['molecule_synonym'], x['syn_type']) for x in m1['molecule_synonyms']], columns=['molecule_synonym', 'syn_type'])
This molecule is Ponatinib. A quick Google search shows us that this is a marketed drug originally developed as an inhibitor of BCR-ABL. Ponatinib is also a promiscuous inhibitor of a number of other kinases, including DDR1. In fact, extensive SAR around the activity of Ponatinib analogs against DDR1 was reported in a 2013 paper in J. Med. Chem.
Let's generate a side-by-side visualization of Compound 1 and Ponatinib with the MCS highlighted.
ponatinib_mol = top5_mcs_df.Mol.to_list()[0]
compound_1_mcs = rdFMCS.FindMCS([compound_1_mol, ponatinib_mol])
mcs_query = Chem.MolFromSmarts(compound_1_mcs.smartsString)
# Lay out the MCS once, then align both molecules to that layout
AllChem.Compute2DCoords(mcs_query)
for m in [compound_1_mol, ponatinib_mol]:
    AllChem.GenerateDepictionMatching2DStructure(m, mcs_query)
# Atom indices that match the MCS, used for highlighting
compound_1_match = compound_1_mol.GetSubstructMatch(mcs_query)
ponatinib_match = ponatinib_mol.GetSubstructMatch(mcs_query)
MolsToGridImage([compound_1_mol, ponatinib_mol], highlightAtomLists=[compound_1_match, ponatinib_match], subImgSize=(400, 400))

Given the similarity of Compound 1 to a marketed drug with 9 nM biochemical activity against DDR1 and good pharmacokinetics (PK), the activity and PK profiles of Compound 1 are not particularly surprising.

