Examining the Data From the ChEMBL SARS-CoV-2 Drug Repurposing Screens

 One interesting dataset in the ChEMBL 27 release is a compilation of several drug repurposing screens for SARS-CoV-2.  Given recent comments around the lack of consistency in these screens, I was eager to take a look at the data.  I thought it might also be interesting to share some of the techniques I used to explore the screening results.  It's my hope that readers will find some of the ways I manipulate data useful for their analyses.  As usual, all of the code associated with this post is in a Jupyter notebook on GitHub

1. Getting the data from ChEMBL
As a first step, we need to construct an appropriate SQL query to extract the data we want to examine.  I'm using MySQL to access the ChEMBL data, not because it's the world's greatest database, but because I've been using it for a long time, and I have all of the necessary commands memorized.  I'm not going to use a fancy Object Relational Mapper (ORM); this is a one-off analysis, so plain old SQL is just fine.  

select cs.canonical_smiles, cs.molregno, usan_tbl.synonyms, act.assay_id, act.standard_value, act.standard_type, act.standard_units from activities act
    join compound_structures cs on cs.molregno = act.molregno
    join assays a on act.assay_id = a.assay_id
    join source s on a.src_id = s.src_id
    left join (select molregno, synonyms from molecule_synonyms where syn_type = 'USAN') as usan_tbl on usan_tbl.molregno = cs.molregno
    where src_description = 'SARS-CoV-2 Screening Data'

The query I used is shown above, it's pretty straightforward, the only tricky bit is the subquery to join the drug names in the molecule synonyms table.  The core of the query is this part that joins the four tables necessary to grab the data. 

select cs.canonical_smiles, cs.molregno, act.assay_id, act.standard_value, act.standard_type, act.standard_units
    from activities act
    join compound_structures cs on cs.molregno = act.molregno
    join assays a on act.assay_id = a.assay_id
    join source s on a.src_id = s.src_id
    where src_description = 'SARS-CoV-2 Screening Data

The four tables we're joining are 

  • compound_structures - contains SMILES for the molecules, linked by molregno
  • activities - contains assay values, result type (IC50, etc.), result units, linked by assay_id
  • assays - contains assay information, linked by assay_id and source_id
  • sources - contains assay set description linked by src_id
Now for a small amount of ninja SQL.  We want to link the United States Adopted Name (USAN) for the drugs.  The problem here is that not all of the compounds assayed have a USAN.  If we simply joined the molecule_synonyms table, we would lose the vast majority of our data.  The trick is to use this bit of code to create a temporary table with the USAN data from the molecule_synonyms table. 

left join (select molregno, synonyms from molecule_synonyms where syn_type = 'USAN') as usan_tbl on usan_tbl.molregno = cs.molregno

Note that we're doing a left join here.  This will keep all of the data from the initial query and add the USAN for molecules having it and null for the molecules that don't have a USAN.  

Fortunately, Pandas has a nifty read_sql method that enables us to pass a database handle and a query and get back a dataframe.   

df = pd.read_sql(query,con=con)

After executing the query, we have a dataframe that looks like this.


Note that the synonyms column has "None" for all of the molecules that don't have a USAN.   This isn't really what I want.  I'd prefer to have a column with the USAN if it exists; otherwise, I'd like to have the molregno.   Fortunately, Pandas has a function fillna, that can take care of this for us.  

df.synonyms = df.synonyms.fillna(df.molregno)

If we look at the dataframe again, we've filled in all of the missing values, magic! 



2. Limiting the analysis to dose-response data
There is a lot of data in this table and, as a first pass, I only want to look at the dose-response data, so we'll do a query to extract the data where the units are "nM."  Yes, I realize that we don't want to directly compare IC50, CC50, etc., but stick with me, we'll take care of that soon.   We can extract this data using the Pandas query function.  

df_nM = df.query("standard_units == 'nM'") 

3. Averaging replicates 
We also have several cases where there are multiple replicates for a single compound.   I'd like to have the table contain the mean value when multiple replicates are present.  We can use the Pandas groupby function to calculate the mean value for each assay value. 

unique_list = []
grouping_cols = ["canonical_smiles","molregno","synonyms","assay_id","standard_type"]
for k,v in df_nM.groupby(grouping_cols):
    val = v.standard_value.mean()
    unique_list.append(list(k)+[-np.log10(val*1e-9)])
unique_df = pd.DataFrame(unique_list,columns=grouping_cols+["pX50"])

Note that we've also converted the mean activity to -log10(activity) so we're displaying the pIC50, pKi, etc.   There are some subtle factors to combining replicates that we're ignoring here.  For instance, let's say that we have three replicates with values of 30uM, 33uM, and >40uM.  What's the mean value?  There's a lot to unpack there so we'll save that for another day.  

Now that we've only considered dose-response data and averaged the replicates, our dataframe is considerably smaller. 


4. Pivoting the data
We've made it easier to process the data, but the table is not in an appropriate format for viewing.  When the table comes from the database, it is in what is commonly known as a tall narrow format.  Each compound is represented multiple times, once for each assay value.   This format is optimal for storage and searching, but it is not great for viewing.  To view the data, we need to pivot it so that we have a new table in a short wide format with one row for each compound and one column for each assay. 


Again, Pandas comes to our rescue with the pivot function. 

pivot_df = unique_df.pivot(index='synonyms',columns="assay_id",values="pX50")

Now, let's sort the data so that the compounds that have been run in the largest number of assays are at the top.  

pivot_df['num_null'] = pivot_df.isnull().sum(axis=1).values
pivot_df.sort_values("num_null",inplace=True)
pivot_df.drop('num_null', axis=1, inplace=True)

After pivoting and sorting, we have a new dataframe that looks like this. 

 

5. Visualizing the data
To get a quick overview of this data, we can use seaborn to create a heatmap where rows represent compounds, columns indicate assays, and the color represents the pX50. 

cmap = sns.cm.rocket_r
sns.set(rc={'figure.figsize': (20, 12)})
sns.heatmap(pivot_df,cmap=cmap)


As we can see, there is a significant disparity among the assay results. 

6. Drilling down to the details 
Finally, we can use the interact capability in ipywidgets to create an interactive widget that will allow us to see all of the assay results for a given compound. 

synonym_list = sorted(unique_df.synonyms.unique())
@interact(name=synonym_list)
def show_data(name):
    query_str = f"synonyms == '{name}'"
    return df_nM.query(query_str)


In this example, I hope I've shown you some of the ways that we can use Pandas and SQL queries to explore a dataset.  Over the last few years, Pandas has become my swiss army knife for data manipulation, and I'm still discovering new and exciting ways that it makes my life easier. 



Comments

Popular posts from this blog

Generative Molecular Design Isn't As Easy As People Make It Look

We Need Better Benchmarks for Machine Learning in Drug Discovery

AI in Drug Discovery 2023 - A Highly Opinionated Literature Review (Part I)