Using the Structure-Activity Landscape Index (SALI) to Analyze Data From the SARS-CoV-2 MPro Screen

Last week the good folks at the COVID Moonshot Project released their first set of screening data for compounds designed based on the fragment crystal structures released by the Diamond Light Source. Like many others in the community, I was eager to see the data. I thought it might be useful to share some of my thoughts and some of the techniques that I use to sift through screening data and decide what to do next. All of the code used to perform the analyses in this post is available in a Jupyter notebook on GitHub.

Introduction
In this post, I'm going to focus on a metric known as the Structure-Activity Landscape Index (SALI). This technique was first published in 2008 by Rajarshi Guha and Jonn Van Drie and provides a simple means of identifying pairs of compounds where a small change in chemical structure brings about a large change in biological activity or physical properties. These changes can often help us to identify the parts of the molecule that are most important for activity. We can then synthesize additional molecules that either preserve or change these parts of the molecule. If we are doing structure-based design, we can take a look at crystal structures and try to understand the interactions being made by these parts of the molecules.

As I mentioned above, SALI is a simple but elegant way of looking at data. The metric was defined by Ghua and Van Drie as

Where SALI for compounds i and j is defined as the absolute value of the difference in the activity (A) divided by 1 minus the similarity between molecules i and j. When dealing with IC50 data, like the data in the MPro assay, I typically formulate SALI as

Where pIC50 is the negative log of the molar IC50 and Tanimoto is the Tanimoto coefficient for the fingerprint similarity between the two molecules. One thing to be aware of is that two molecules with a Tanimoto similarity of 1.0 (e.g. stereoisomers for most fingerprints) will cause a divide by zero error. In order to get around this, I typically add 0.001 to the denominator.

Preliminary Analysis and Cleanup of the Data

First off, a quick look at the data indicates that a number of rows in the csv file provided on the COVID Moonshot website are duplicates. We have a number of cases where the same compound has two different identifiers, which are in the column CID. We can check this using the groupby function in the Pandas data analysis library for Python.

for k,v in data_df.groupby("SMILES"):

if len(v) > 1:

print(k,v['IC50 (uM)'].values)

The command above gives this result.

C(=O)N1CCN(CC(=O)Nc2cccc(S(N)(=O)=O)c2)CC1 [nan nan]
CC(C(=O)O)c1ccc2c(c1)[nH]c1ccc(Cl)cc12 [nan nan nan nan]
CC(C)N(C)C(=O)C1CCN(C(=O)CCl)CC1 [nan nan]
CNS(=O)(=O)Cc1ccc2[nH]cc(CCN(C)C)c2c1 [nan nan nan]
COc1ccc2[nH]cc(CCNC(C)=O)c2c1 [nan nan]
CS(=O)(=O)Nc1cccc(C(=O)Nc2cccc(N3CCCC3=O)c2)c1 [nan nan]
Cc1ccc(C)c(S(=O)(=O)N2CCN(C(=O)CCl)CC2)c1 [nan nan]
Cc1ccncc1NC(=O)Nc1cccc(Cl)c1 [68. 68.]
Cc1ccncc1NC(=O)Nc1ccccc1 [nan nan]
Cc1ccncc1NC(=O)c1cccc(NC(=O)Nc2cccnc2)c1 [nan nan]
N#Cc1ccc(NC(=O)Nc2cccnc2)cc1 [nan nan]
N#Cc1cccc(NC(=O)Nc2cccnc2)c1 [nan nan nan nan]
O=C(CCl)N1CCC(C(=O)N2CCCCC2)CC1 [nan nan]
O=C(CCl)N1CCN(Cc2cccc(Cl)c2)CC1 [nan nan nan]
O=C(CCl)N1CCN(Cc2cccc3ccccc23)CC1 [nan nan]
O=C(CCl)N1CCN(Cc2cccs2)CC1 [5.66 5.66]
O=C(CCl)N1CCN(Cc2ccsc2)CC1 [3.27 3.27 3.27]
O=C(CCl)N1CCN(S(=O)(=O)c2c(F)cccc2F)CC1 [3.43 3.43]
O=C(CCl)N1CCN(S(=O)(=O)c2cccc(F)c2)CC1 [4.1 4.1]
O=C(CCl)N1CCN(S(=O)(=O)c2cccs2)CC1 [2.07 2.07]
O=C(CCl)Nc1cccc(N2CCCC2=O)c1 [72. 72. 72.]
O=C(c1cc(=O)[nH]c2ccccc12)N1CCN(c2ccccc2)CC1 [nan nan]

As we can see, the IC50 values are the same for the duplicated rows. According to a post on the forum, this is because multiple individuals submitted the same idea. In order to make our lives simpler, we're going to remove the duplicated rows.

data_df.drop_duplicates("SMILES",inplace=True)

Now, on to the IC50 data. For those unfamiliar with IC50s, the experiment typically takes place in two stages.

Compounds are tested at a single concentration and the % inhibition is measured
The compounds showing a response in step 1 are tested at a range of concentrations and the IC50 is determined.

In this case, compounds were initially tested at 20 and 50uM then followed up with dose-response. As a result, we have a number of compounds that don't have IC50 values. This complicates the SALI calculation, so we'll use 999uM as a surrogate value for compounds that don't have an IC50 value. Fortunately Pandas makes it easy to do this.

data_df.fillna({'IC50 (uM)': 999},inplace=True)

This dataset contains both covalent and non-covalent compounds. When looking at SAR, I usually like to separate the two sets. It can sometimes be useful to combine the sets to examine the impact of covalent binding, but for an initial analysis I'm going to treat the two separately. In taking a quick look at the data, we can see that the covalent molecules contain either chloromethylketones or acrylamides as the covalent warhead. We can use a couple of SMARTS queries to label these compounds in our Pandas dataframe.

chloromethylketone = Chem.MolFromSmarts('C(=O)[CH2]Cl')

acrylamide = Chem.MolFromSmarts('C(=O)[CH]=[CH2]')

data_df['covalent'] = [x.HasSubstructMatch(acrylamide) or x.HasSubstructMatch(chloromethylketone) for x in data_df.mol]

Now that we've labeled the compounds, we'll make a quick boxplot to get a rough idea of the IC50 distributions.

Performing the SALI Analysis

Before we start looking at the data, I need to point out a significant caveat to any analysis of this type. This dataset is the first round of follow-up from a fragment screen. As such, even the best compounds have single-digit uM activity. Most of the IC50s for these compounds are at or near the compound's solubility limit. In most of the pairs of compounds we'll be considering, one will have uM activity and the other will be inactive (dead). This lack of activity may be genuine, or may simply be due to the fact that the compound wasn't soluble in the assay buffer. At this point, we don't want to overinterpret the data, but we also don't want to miss any nuggets of SAR.

When I'm looking at early data like this, I'm trying to do a few things.

Form an initial hypothesis about the features that are critical for binding.
Look for interactions that show up more than once. This does a couple of things. First, it provides some evidence that what I'm looking at is real and not an artifact. Second, it may give me some ideas on what I can "mix and match" between molecules.
Since the starting point for this effort was a crystallographic fragment screen, I want to go back to the structures and see if I can rationalize the increases and decreases in activity.

Ultimately I want to decide a few things.

Which of these compounds do I want to try to get crystal structures of? Often the most informative structures come from cases where the SAR doesn't make sense.
Which compounds do I want to synthesize to better understand the SAR?
Which compounds do I want to synthesize to improve activity?

Looking at pairs of similar compounds whose activities differ can help us to answer all three of the questions above. In the GitHub repo, I have code that loops over all pairs of compounds and calculates the SALI. I then add this data to a Pandas dataframe and sort by SALI to identify interesting pairs. As I mentioned above, the SALI analysis highlights pairs of compounds. One simple trick when dealing with pairs of compounds is to align them on their maximum common subgraph (MCS). It's really annoying to have to do mental gymnastics to determine the correspondence between a pair of compounds when they're drawn like this.

It's much easier, at least for me, when the structures are aligned, like this.

If we want to get really fancy, we can highlight the parts that are common to both molecules. I'm still trying to decide whether the structure highlighting is helpful or distracting.

One note on this part. You'll notice that in the code, I precalculate the MCS. This is because this calculation takes a bit of time and I wanted to make the interactive component in the notebook (see below) as responsive as possible. Speaking of that component, here's a movie showing how it works. If you click on the knob in the slider, you can use the right arrow key to move quickly through the pairs and pick out interesting insights. In the viewer, the more active compound is always shown on the left. The viewer shows the structures, the associated compound names and IC50s, as well as the magnitude of the fold change between the two structures. Remember that we set the IC50 to 999uM for the inactive compounds so take this value with a large grain (block) of salt.

I built the viewer using the ipywidgets library, which I keep finding new uses for. If you use Jupyter notebooks, it's definitely worth checking out. Note that the viewer isn't perfect, while pairs of molecules are oriented the same way, the same scaffold can have different orientations (e.g. the ureas) in different pairs. It would probably be useful to add some additional logic for scaffold alignment.

A Few Observations
The SALI analysis enables us to pull out a few quick trends in the data. Let's look at the non-covalent compounds. It looks like substituents at the 3-position on the phenyl group are beneficial.

It also appears that the methyl group at the 4-position on the pyridyl group is required for activity.

The structures above are an example of what I was talking about earlier, I should have added some additional logic to align all of the structures with the same scaffold the same way. I could go on picking out interesting pairs all day, but this post has already gotten pretty long. All of the code is available on GitHub so please use it to examine the data and design some compounds. As always, please let me know if you have any questions.

Search This Blog

Practical Cheminformatics

Using the Structure-Activity Landscape Index (SALI) to Analyze Data From the SARS-CoV-2 MPro Screen

Comments

Post a Comment

Popular posts from this blog

We Need Better Benchmarks for Machine Learning in Drug Discovery

Silly Things Large Language Models Do With Molecules

Generative Molecular Design Isn't As Easy As People Make It Look