Examining the Data From the ChEMBL SARS-CoV-2 Drug Repurposing Screens

One interesting dataset in the ChEMBL 27 release is a compilation of several drug repurposing screens for SARS-CoV-2. Given recent comments around the lack of consistency in these screens, I was eager to take a look at the data. I thought it might also be interesting to share some of the techniques I used to explore the screening results. It's my hope that readers will find some of the ways I manipulate data useful for their analyses. As usual, all of the code associated with this post is in a Jupyter notebook on GitHub.

1. Getting the data from ChEMBL
As a first step, we need to construct an appropriate SQL query to extract the data we want to examine. I'm using MySQL to access the ChEMBL data, not because it's the world's greatest database, but because I've been using it for a long time, and I have all of the necessary commands memorized. I'm not going to use a fancy Object Relational Mapper (ORM); this is a one-off analysis, so plain old SQL is just fine.

select cs.canonical_smiles, cs.molregno, usan_tbl.synonyms, act.assay_id, act.standard_value, act.standard_type, act.standard_units from activities act
join compound_structures cs on cs.molregno = act.molregno
join assays a on act.assay_id = a.assay_id
join source s on a.src_id = s.src_id
left join (select molregno, synonyms from molecule_synonyms where syn_type = 'USAN') as usan_tbl on usan_tbl.molregno = cs.molregno
where src_description = 'SARS-CoV-2 Screening Data'

The query I used is shown above, it's pretty straightforward, the only tricky bit is the subquery to join the drug names in the molecule synonyms table. The core of the query is this part that joins the four tables necessary to grab the data.

select cs.canonical_smiles, cs.molregno, act.assay_id, act.standard_value, act.standard_type, act.standard_units
   from activities act
   join compound_structures cs on cs.molregno = act.molregno
   join assays a on act.assay_id = a.assay_id
   join source s on a.src_id = s.src_id
   where src_description = 'SARS-CoV-2 Screening Data

The four tables we're joining are

compound_structures - contains SMILES for the molecules, linked by molregno
activities - contains assay values, result type (IC50, etc.), result units, linked by assay_id
assays - contains assay information, linked by assay_id and source_id
sources - contains assay set description linked by src_id

Now for a small amount of ninja SQL. We want to link the United States Adopted Name (USAN) for the drugs. The problem here is that not all of the compounds assayed have a USAN. If we simply joined the molecule_synonyms table, we would lose the vast majority of our data. The trick is to use this bit of code to create a temporary table with the USAN data from the molecule_synonyms table.

left join (select molregno, synonyms from molecule_synonyms where syn_type = 'USAN') as usan_tbl on usan_tbl.molregno = cs.molregno

Note that we're doing a left join here. This will keep all of the data from the initial query and add the USAN for molecules having it and null for the molecules that don't have a USAN.

Fortunately, Pandas has a nifty read_sql method that enables us to pass a database handle and a query and get back a dataframe.

df = pd.read_sql(query,con=con)

After executing the query, we have a dataframe that looks like this.

Note that the synonyms column has "None" for all of the molecules that don't have a USAN. This isn't really what I want. I'd prefer to have a column with the USAN if it exists; otherwise, I'd like to have the molregno. Fortunately, Pandas has a function fillna, that can take care of this for us.

df.synonyms = df.synonyms.fillna(df.molregno)

If we look at the dataframe again, we've filled in all of the missing values, magic!

2. Limiting the analysis to dose-response data

There is a lot of data in this table and, as a first pass, I only want to look at the dose-response data, so we'll do a query to extract the data where the units are "nM." Yes, I realize that we don't want to directly compare IC50, CC50, etc., but stick with me, we'll take care of that soon. We can extract this data using the Pandas query function.

df_nM = df.query("standard_units == 'nM'")

3. Averaging replicates

We also have several cases where there are multiple replicates for a single compound. I'd like to have the table contain the mean value when multiple replicates are present. We can use the Pandas groupby function to calculate the mean value for each assay value.

unique_list = []

grouping_cols = ["canonical_smiles","molregno","synonyms","assay_id","standard_type"]

for k,v in df_nM.groupby(grouping_cols):

val = v.standard_value.mean()

unique_list.append(list(k)+[-np.log10(val*1e-9)])

unique_df = pd.DataFrame(unique_list,columns=grouping_cols+["pX50"])

Note that we've also converted the mean activity to -log10(activity) so we're displaying the pIC50, pKi, etc. There are some subtle factors to combining replicates that we're ignoring here. For instance, let's say that we have three replicates with values of 30uM, 33uM, and >40uM. What's the mean value? There's a lot to unpack there so we'll save that for another day.

Now that we've only considered dose-response data and averaged the replicates, our dataframe is considerably smaller.

4. Pivoting the data

We've made it easier to process the data, but the table is not in an appropriate format for viewing. When the table comes from the database, it is in what is commonly known as a tall narrow format. Each compound is represented multiple times, once for each assay value. This format is optimal for storage and searching, but it is not great for viewing. To view the data, we need to pivot it so that we have a new table in a short wide format with one row for each compound and one column for each assay.

Again, Pandas comes to our rescue with the pivot function.

pivot_df = unique_df.pivot(index='synonyms',columns="assay_id",values="pX50")

Now, let's sort the data so that the compounds that have been run in the largest number of assays are at the top.

pivot_df['num_null'] = pivot_df.isnull().sum(axis=1).values

pivot_df.sort_values("num_null",inplace=True)

pivot_df.drop('num_null', axis=1, inplace=True)

After pivoting and sorting, we have a new dataframe that looks like this.

5. Visualizing the data

To get a quick overview of this data, we can use seaborn to create a heatmap where rows represent compounds, columns indicate assays, and the color represents the pX50.

cmap = sns.cm.rocket_r

sns.set(rc={'figure.figsize': (20, 12)})

sns.heatmap(pivot_df,cmap=cmap)

As we can see, there is a significant disparity among the assay results.

6. Drilling down to the details

Finally, we can use the interact capability in ipywidgets to create an interactive widget that will allow us to see all of the assay results for a given compound.

synonym_list = sorted(unique_df.synonyms.unique())

@interact(name=synonym_list)

def show_data(name):

query_str = f"synonyms == '{name}'"

return df_nM.query(query_str)

In this example, I hope I've shown you some of the ways that we can use Pandas and SQL queries to explore a dataset. Over the last few years, Pandas has become my swiss army knife for data manipulation, and I'm still discovering new and exciting ways that it makes my life easier.

Search This Blog

Practical Cheminformatics

Examining the Data From the ChEMBL SARS-CoV-2 Drug Repurposing Screens

Comments

Post a Comment

Popular posts from this blog

We Need Better Benchmarks for Machine Learning in Drug Discovery

Silly Things Large Language Models Do With Molecules

Some Thoughts on Splitting Chemical Datasets